Extracting Unique Twitter handles from Custom Jekyll Include Tags

[ bash jekyll regex ]

While I work mainly on Windows, I’ve also been using Debian Linux for quite a while, first dabbling in the sarge era and then gaining more prolonged, ongoing experience with the recent wheezy/jessie/stretch sequence (list of all Debian versions). I’m currently maintaining a dual quad-core Xeon Dell PowerEdge (purportedly a cast-off from a Facebook server farm, interestingly…) that’s running stretch, which I use for development and for running some odds-and-ends quantum chemical calculations. In the course of finalizing my PyCon 2019 recap post, I had reason to use some of the Linux/Unix CLI tooling in a way I hadn’t had to before. I imagine most of this post will not be particularly novel to many, but I was pleased at how quickly I was able to get the info I needed.

As part of adapting Barry Clark’s jekyll-now to make this blog, I’ve created a number of custom includes to streamline the writing process. For example, the GitHub repo link icon above was created using a gh.html include; I have similar ones for Twitter, PyPI, YouTube, and others. As part of the PyCon recap post, I tried to connect everyone I mentioned there to their profile on some sort of social media platform, or on GitHub. Since most everyone mentioned was on Twitter, I wanted to follow up the announcement of the post with a couple of ‘acknowledgment Tweets’. Since those Twitter mentions were all achieved using tw.html, that {% include %} syntax provided a handy way of generating a sorted, duplicate-free list of handles for me.

The tw.html include is set up to take a Twitter handle as input using the typical parameter syntax, via user="handle", which provides a natural pattern to key off of for a regex search of the post’s Markdown source. A bash one-liner, inspired by this SO answer, handed these to me quite neatly:

$ grep tw.html 2019-06-26-PyCon-2019-Recap.md | sed -E 's/^.+user="([^ ]+)".+$/\1/' | sort -u
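To make the one-liner easy to try out, here is a minimal sketch that runs the same pipeline against a small throwaway file. The file contents and the handles alice_dev and bob_codes are made up for illustration and are not taken from the actual recap post.

```shell
# Hypothetical sample post; the handles and surrounding text are invented.
post="$(mktemp)"
cat > "$post" <<'EOF'
Thanks to {% include tw.html user="alice_dev" %} for the great talk.
A line with no include at all.
I also chatted with {% include tw.html user="bob_codes" %} at lunch.
And {% include tw.html user="alice_dev" %} again at the sprints.
EOF

# Same pipeline as above: pick the include lines, reduce each to its handle,
# then sort and drop duplicates.
handles="$(grep tw.html "$post" | sed -E 's/^.+user="([^ ]+)".+$/\1/' | sort -u)"
printf '%s\n' "$handles"

rm -f "$post"
```

The repeated alice_dev include should show up only once in the printed list.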

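I didn’t need any real regex features in the initial “grep tw.html ...”, since I was just trying to pick out the lines that had the tw.html includes in them. Strictly speaking, though, grep always treats its pattern as a regex (basic by default; -E switches to extended, and only -F turns regex matching off), so the unescaped period there matches any character. That’s harmless here, but to be precise it would’ve been best to handle the period specially: grep 'tw[.]html'. Surrounding it in square brackets turns it into a character class; a period within a character class is not a metacharacter.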
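As a quick illustration of how grep treats the period, the sketch below counts matches against two made-up lines, only one of which contains the literal text tw.html. It relies on the fact that plain grep interprets its pattern as a basic regex; the sample lines and variable names are purely illustrative.

```shell
# Made-up input: only the first line contains the literal string 'tw.html'.
sample="$(printf '%s\n' 'include tw.html here' 'include twxhtml here')"

# Unescaped '.' is a regex metacharacter, so it also matches the 'x' in 'twxhtml'.
default_count="$(printf '%s\n' "$sample" | grep -c 'tw.html')"

# Inside a character class, the period is literal.
class_count="$(printf '%s\n' "$sample" | grep -c 'tw[.]html')"

# -F treats the whole pattern as a fixed string, with no regex at all.
fixed_count="$(printf '%s\n' "$sample" | grep -cF 'tw.html')"

echo "unescaped: $default_count, [.] class: $class_count, -F: $fixed_count"
```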

All of the regex power was brought to bear in the sed command, in the form of the search-and-replace 's/.../.../' syntax; see here. The -E flag to sed activates ‘extended regex’. Relative to the default ‘basic’ syntax, the main difference is that +, ?, |, parentheses, and curly braces act as metacharacters without needing backslashes (neither flavor supports lookaround). I basically always want extended regex.
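For comparison, here is a sketch of the same kind of substitution written both ways, applied to a made-up fragment of an include line. In basic syntax the group parentheses and the interval braces need backslashes, whereas with -E they do not. The input string and the variable names are illustrative only.

```shell
# Hypothetical fragment of an include line.
line='{% include tw.html user="alice_dev" %}'

# Basic regex (no -E): \( \) delimit the group, \{1,\} means 'one or more'.
bre="$(printf '%s\n' "$line" | sed 's/^.*user="\([^ ]\{1,\}\)".*$/\1/')"

# Extended regex (-E): plain ( ) and + do the same job.
ere="$(printf '%s\n' "$line" | sed -E 's/^.*user="([^ ]+)".*$/\1/')"

echo "basic: $bre, extended: $ere"
```

Both forms should extract the same handle; the extended version is simply easier to read.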

Here’s what each piece of the regex search-and-replace does:

s/          # Start the search/replace
^           # Start matching at the start of the line
.+          # Match one or more arbitrary non-newline characters
user="      # Match the explicit string 'user="'
(           # Open a capturing group, for later use
[^ ]+       # Character class to match at least one *non*-space character; '^' makes it a negated character class
)           # Close the capturing group
"           # Explicit double-quote, closing the passed 'user' parameter value
.+          # Match one or more arbitrary non-newline characters
$           # Match the end of the line
/           # Switch from the search pattern to the replace pattern
\1          # Replace the entire line (since I used '^.+' and '.+$') with the single capturing group, representing the desired Twitter handle
/           # Close the sed search-and-replace directive

This regex is not perfectly constructed; in particular, I possibly should have used .* instead of .+ in both places the latter occurred, just on the off-chance that the includes were adjacent to the start or end of a line. In practice, though, the “{% include tw.html ” text always sits before user= and the “ %}” always follows the closing quote, so each .+ has something to match even then. I also almost never write my includes that way, so I was pretty confident I didn’t have to worry about those cases.
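One case the sed expression does not cover is two includes sharing a single line: the greedy ^.+ consumes everything up to the last user=", so only the final handle on that line survives. The sketch below shows that behavior on a made-up line, along with one possible alternative (not what was run for this post) that uses grep -o to emit every match separately. The -o flag is available in GNU and BSD grep, and the handles are hypothetical.

```shell
# Hypothetical line carrying two includes.
line='Met {% include tw.html user="alice_dev" %} and {% include tw.html user="bob_codes" %} there.'

# Original expression: the greedy '^.+' leaves only the last handle on the line.
sed_out="$(printf '%s\n' "$line" | sed -E 's/^.+user="([^ ]+)".+$/\1/')"

# Alternative sketch: 'grep -o' prints each user="..." match on its own line,
# and 'cut' keeps the text between the double quotes.
grep_out="$(printf '%s\n' "$line" | grep -oE 'user="[^" ]+"' | cut -d'"' -f2)"

echo "sed found:  $sed_out"
echo "grep found: $(printf '%s' "$grep_out" | tr '\n' ' ')"
```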

The final sort -u was the key thing I took from the SO answer: sort, naturally, sorts the lines, but the -u argument (for ‘unique’) also removes any duplicates, which provided a handy list for transferring into Twitter.
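As a small sketch of that last step, the snippet below de-duplicates a made-up list of handles and then joins them into a single line of @-mentions. The @ prefixing and the joining are assumptions about how one might paste the list into a Tweet, not something the original one-liner did.

```shell
# Hypothetical handles, unsorted and with one duplicate.
unsorted="$(printf '%s\n' bob_codes alice_dev bob_codes carol_py)"

# sort -u sorts the lines and drops repeated ones in a single step.
deduped="$(printf '%s\n' "$unsorted" | sort -u)"

# Assumed follow-up: prefix each handle with '@' and join with spaces.
mentions="$(printf '%s\n' "$deduped" | sed 's/^/@/' | paste -sd' ' -)"

echo "$mentions"
```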

To be reasonably sure I hadn’t missed anyone, I needed to check that I hadn’t split any {% include %} tags over multiple lines in a way that would break the detection. I thus compared my original search:

$ grep tw.html 2019-06-26-PyCon-2019-Recap.md | wc -l

to a similar search, but set up to also find the closing curly braces of the relevant include tags:

$ grep -E 'tw.html.+}' 2019-06-26-PyCon-2019-Recap.md | wc -l

(I actually forgot to escape/character-class the “.” in tw.html in that second search, but it shouldn’t really matter.) Here, I’m using wc with the -l flag to count the total number of lines in the search results. Since these two searches returned the same number of results, I was confident that I was catching all of the occurrences of user="handle" that I cared about. (Note that if it were possible that any of the includes had been written with internal Liquid tags, and thus might have had an extra “}” within the overall {% include %}, then the second search would not have been reliable. Since I knew I hadn’t done this, though, this search was sufficient to satisfy me that I had found all of the Twitter handles.)
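To show what this comparison catches, here is a sketch using a made-up file in which one include is deliberately split across two lines. It uses grep -c, which counts matching lines directly and is equivalent to piping into wc -l; the file contents and handles are hypothetical.

```shell
# Hypothetical post where the second include is split over two lines.
post="$(mktemp)"
cat > "$post" <<'EOF'
Thanks to {% include tw.html user="alice_dev" %} for the talk.
Also {% include tw.html
user="bob_codes" %} joined us.
EOF

# Lines mentioning tw.html at all.
all_count="$(grep -c 'tw[.]html' "$post")"

# Lines where the include also closes on the same line.
closed_count="$(grep -cE 'tw[.]html.+}' "$post")"

rm -f "$post"

if [ "$all_count" -ne "$closed_count" ]; then
    echo "mismatch: $all_count include lines, but only $closed_count closed on the same line"
fi
```

A mismatch like this one signals that at least one include spans multiple lines, so the sed extraction would miss its handle.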

Still, being the ever-so-slightly compulsive nerd that I am, I went ahead and did a negative control check, by searching for a similar regex that should never match. AFAIK Jekyll/Liquid does not use carets in its syntax, so the following grep should return zero results.

$ grep -E 'tw.html.+\^' 2019-06-26-PyCon-2019-Recap.md | wc -l

And indeed, it did. All good!

This post was written with StackEdit.

Written on July 13, 2019