Twitter threads to plain HTML

Recently, I wrote two long Twitter threads describing the process behind some intaglio printmaking techniques. I never get tired of explaining this stuff to my friends and other people, but I hadn't yet tried to write it down. Lockdown boredom provided the perfect motivation. The first thread is about aquatint, the second about mezzotint, and both are in French.

Some tools involved in the mezzotint process.

I had never written such long threads (64 and 56 tweets respectively), on such complex topics, but I found the experience interesting. For many years I've had trouble writing non-trivial texts, even on the many subjects I'd love to talk about. Hence the poor publication rhythm of this blog. I feel like I've lost the ability to lay words on a page in a simple and fluent way. I constantly lose focus and try to refine other parts before they are even written. In contrast, the step-by-step format imposed by Twitter threads helped me write in short, well-organized ideas, while keeping track of the larger goal. Being unable to edit a tweet once it's been posted sure helped a lot.

The end result is no great literature. It's systematic and quite jerky, sounding like a long staccato of paragraphs of 280 characters at most. So I don't want to make it a habit. Still, it's perfectly readable. And given the amount of effort and information I put into these two threads, it was worth making them more persistent and less dependent on Twitter. So I planned to publish their content as articles on Patterns in the Ivy, my other blog dedicated to my artworks. It uses the Dotclear blog engine, similar to WordPress, so I needed simple HTML to paste into the post editor. Something like a <p> for each tweet, and images as <figure> blocks.

You probably guessed what followed. I could have done it manually, in an hour or so of copy-pasting and reformatting. But of course I tried to find an automated way.

Obligatory XKCD

Saving myself half the work

Many concepts of the Twitter platform have emerged organically over its history. Retweets, for example, were originally done by prefixing a copy-pasted tweet with 'RT', before becoming a built-in feature. Twitter's acknowledgement of threads as a real feature is very limited. For writers, there is a way to post multiple chained tweets at the same time and a button to make sure you continue the chain from the right one. But readers don't get much. There's no easy way to see if a single tweet is actually the start of a long chain, so authors have to mention it in the content. And even though you often come across a thread through a tweet in the middle of it, there's no easy way to go back to the start from any tweet in the chain. You have to scroll to the top of the feed and wait for the previous tweets to be fetched, often multiple times as tweets are retrieved in limited batches.

Of course, Twitter prefers you stay on their platform, so there is no official way of exporting a tweet's content. You can 'embed' individual tweets in other sites, but that's not possible for a chain of tweets. And anyway, I didn't want a Twitter widget in my blog articles. I just wanted the raw content of my threads, stripped of their twitter-ness.

Using the full-fledged Twitter API for something so simple was out of the question. I was thinking more of scraping and transforming the HTML of a fully unrolled thread page. Too bad the Twitter HTML markup is the usual mess of <div>s with human-unreadable class names generated by CSS-in-JS tools. Working from this base material can be done, but it's not pleasant, and code based on it could break at any time. So I found a way to ease the task first, by using the Thread Reader web app.

Thread Reader works like the reader mode included in most browsers, but for Twitter threads. Given the URL of the first tweet of a thread, it 'unrolls' the chain and presents the full content in a clean, easy-to-read page. Here is the result for my aquatint thread. Each tweet appears as a separate paragraph, with its images or videos displayed below, and that's it. No avatar, no like and retweet buttons, no separator. That's a really neat tool.

What's also great is that the markup generated by Thread Reader is as clean as its content presentation. So it's relatively straightforward to write a parser to transform it into even simpler HTML.

For text-only threads

The older I grow, the more I love single-purpose command-line programs. Commands with well-defined I/O, that individually do one task very well and unleash their true power when composed with each other. Not exactly a groundbreaking opinion, as this principle has been there since the first days of Unix. But it never hurts to rediscover the genius behind some designs. Among my all-time favorite tools, which have saved me a lot of time in the past, are the JSON processor jq and the TopoJSON suite.

I tried this CLI-only approach here. I quickly got something working for text-only cases, by combining three tools in a pretty minimalist result:

curl -s https://threadreaderapp.com/thread/1257997429347098624.html \
      | pup '.t-main .content-tweet text{}' \
      | pandoc

How it works:

  1. First we fetch the content of a given Thread Reader page using the ubiquitous curl.
  2. We pipe this content to pup, a command-line HTML parser inspired by jq, to select the tweet elements using CSS selectors and then extract their text content. This generates plain text lines separated by empty lines…
  3. … which happen to match the Markdown syntax for paragraphs, so we can finally feed this result to any Markdown-to-HTML converter, for example here the default mode of the pandoc format converter.

Et voilà, we have our chain of tweets as clean <p> HTML paragraphs.
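For the curious, here is a rough JavaScript sketch of what that last step boils down to for plain paragraphs. It is not what pandoc does internally (pandoc handles the whole Markdown syntax), and the function names are mine, for illustration only:

```javascript
// Escape the characters that have a special meaning in HTML text.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Turn blocks of text separated by empty lines into <p> paragraphs.
// Hypothetical helper: one block is assumed to be one tweet.
function textToParagraphs(text) {
  return text
    .split(/\n\s*\n/)
    .map((block) => block.trim())
    .filter((block) => block.length > 0)
    .map((block) => `<p>${escapeHtml(block)}</p>`)
    .join("\n");
}

const sample = "First tweet of the thread.\n\nSecond tweet, with <angle brackets>.\n";
console.log(textToParagraphs(sample));
```

Unlike a real Markdown converter, this leaves characters such as asterisks or backticks untouched, which may or may not be what you want for tweets.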

If your threads are made only of text, or you don't care about including images, this solution works perfectly well. In my case, I had included a lot of images and videos in these two threads and I wanted to keep them, so it wasn't enough. The source and target markups are different and more complex for these elements, so it proved impossible, or too convoluted for my taste, to do the transformation in one pass with such tools. So I resigned myself to opening my code editor instead of just my terminal.

With images and videos

I won't dive into details, but the result is a small Node.js project, thread-reader-reader.

I turned to JavaScript by habit and thus for efficiency. Now I think it was kind of a missed opportunity to try another language I practice less. But no big deal, next time! As usual for such small projects, it was fun to write simple, vanilla code and not to bother about the latest React hook or Webpack plugin.

This project simply takes a Thread Reader page URL (so you have to submit your Twitter thread to it first), fetches its content and transforms the main part to even simpler HTML, ready to be pasted in a blog post or anywhere else. The parser uses only simple DOM selections and manipulations, so I could basically code everything in my browser dev tools on Thread Reader before turning it into a real program.

The output is a series of <p> blocks for tweets, with their optional internal links, and of <figure> blocks wherever images or videos are encountered. This is how I format the articles on my blog, along with some special classes and attributes on image tags.
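To give an idea of that output format, here is a minimal sketch, not the actual code of thread-reader-reader. It assumes the tweets have already been parsed into a hypothetical intermediate shape (`html`, `images`, `videos`), and it leaves out the blog-specific classes and attributes:

```javascript
// Escape a value so it can be placed safely inside a double-quoted attribute.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;");
}

// Hypothetical tweet shape (an assumption for this sketch):
//   { html: string, images: [{ src, alt }], videos: [{ src }] }
// `html` is the tweet text, already valid HTML (it may contain <a> links).
function renderTweet(tweet) {
  const blocks = [`<p>${tweet.html}</p>`];
  for (const image of tweet.images || []) {
    const alt = escapeAttribute(image.alt || "");
    blocks.push(`<figure><img src="${escapeAttribute(image.src)}" alt="${alt}"></figure>`);
  }
  for (const video of tweet.videos || []) {
    blocks.push(`<figure><video controls src="${escapeAttribute(video.src)}"></video></figure>`);
  }
  return blocks.join("\n");
}

// A thread is just its tweets rendered one after the other.
function renderThread(tweets) {
  return tweets.map(renderTweet).join("\n");
}

console.log(renderThread([
  { html: "First tweet." },
  { html: "Second tweet.", images: [{ src: "https://example.com/plate.jpg", alt: "A copper plate" }] },
]));
```

Keeping the rendering apart from the parsing like this is what would make templating options easy to add later.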

For now it's still a command-line program, with the specific formatting logic hardcoded in it. I even had to remove emojis because there was a database problem on my blog with some of them. But the parser itself is generic and clearly separated from the rest, with no dependency. So the main entry point can easily be reworked to support templating options and such. Everything could be included in something more practical, like a web interface with a form and options. Maybe later if somebody thinks it's useful.
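As for the emoji removal mentioned above, one possible way to do it in modern JavaScript is with Unicode property escapes. This is a rough sketch under my own assumptions, not necessarily what the project does, and the pattern is not exhaustive (flag emojis, for instance, are not covered):

```javascript
// Matches pictographic characters, skin tone modifiers, and the invisible
// joiner / variation selector characters that glue emoji sequences together.
const EMOJI_PATTERN = /[\p{Extended_Pictographic}\p{Emoji_Modifier}\u{FE0F}\u{200D}]/gu;

// Hypothetical helper: remove emojis, then tidy up the spaces left behind.
function stripEmojis(text) {
  return text
    .replace(EMOJI_PATTERN, "")
    .replace(/ {2,}/g, " ")
    .trim();
}

console.log(stripEmojis("Aquatint 🎨 step 1"));
```

Digits and punctuation are left alone, so numbered steps in a thread survive the cleanup.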

The main drawback is that image and video resources are still hosted on Twitter. So, apart from the text, the resulting articles are not really independent from this platform. But fetching images and importing them somewhere else would be a much more difficult thing to do. So I decided it was not worth it. After all, I just wanted to publish two articles without spending too much time on it. (Update 2020-05-27: I added the option to download the resources from Twitter and rewrite URLs, so it's easy to store files anywhere you want on your site. That way, articles don't depend on Twitter anymore.)
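The URL rewriting part of that update can be sketched as follows. This is only an illustration of the idea, with hypothetical function names and paths. It assumes each resource URL appears verbatim in the generated HTML, and it leaves out the download itself:

```javascript
// Derive a local file name from a remote resource URL (its last path segment).
function localFileName(resourceUrl) {
  const { pathname } = new URL(resourceUrl);
  return pathname.split("/").filter(Boolean).pop();
}

// Replace every listed remote URL found in the HTML with baseUrl + file name.
// `baseUrl` is wherever you decided to store the downloaded files.
function rewriteResourceUrls(html, resourceUrls, baseUrl) {
  const prefix = baseUrl.endsWith("/") ? baseUrl : `${baseUrl}/`;
  return resourceUrls.reduce(
    (result, url) => result.split(url).join(prefix + localFileName(url)),
    html
  );
}

const html = '<figure><img src="https://pbs.twimg.com/media/EXAMPLE.jpg"></figure>';
console.log(
  rewriteResourceUrls(html, ["https://pbs.twimg.com/media/EXAMPLE.jpg"], "/public/aquatint")
);
```

The same `localFileName` helper can name the downloaded files, so the rewritten URLs and the stored files stay in sync.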

And here they are:

… but still in French ^^