Mike Kasberg

Husband. Father. Software engineer. Ubuntu Linux user.


Better Related Posts in Jekyll Using AI

23 Apr 2024

On any blog, it’s really common to link to related posts near the end of an article. It keeps readers on your website by linking to another post they might be interested in, and it can help with SEO. For a long time, Jekyll has provided site.related_posts as a convenient way to link to related posts. Unfortunately, the default implementation just lists the ten most recent posts (which might not actually be that closely related). Jekyll does offer a better implementation using Latent Semantic Indexing (LSI) with classifier-reborn. This plugin tries to populate related_posts with posts that are actually related, but it’s difficult to install and doesn’t always produce the best results.

I was aware of many of the problems and challenges with the existing approach in classifier-reborn since I updated classifier-reborn for Ruby 3 back in 2022. Classifier-reborn isn’t too bad (and was useful enough to me that I updated it for Ruby 3), but I’ve wished for a long time it was easier to use and produced better results. More recently, with the rapid growth of ChatGPT and LLMs, I’ve been wanting to try a personal project that could make use of modern AI. I read an interesting blog post from Hacker News about how Embeddings are a good starting point, and it occurred to me that embeddings from OpenAI would be a great way to get better related post functionality into Jekyll! I decided to try building my own Jekyll plugin for related posts to see if AI would work well here, and I got some really great results!

Solution Design

OpenAI offers an Embeddings API that’s very easy to use. You provide some input text, and the API returns a vector embedding from OpenAI’s LLM. These vectors can be compared outside an LLM using simple vector similarity algorithms like cosine similarity. With a little searching, I found a SQLite plugin that extends SQLite with vector database functionality. This seemed like a great solution to me! I could cache vector embeddings from OpenAI in a small SQLite database, and use the same database (with the plugin) to perform a vector similarity search to find related posts!
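To make the comparison step concrete, here is a minimal sketch of cosine similarity in plain Ruby. It is not the plugin's code (the plugin delegates the search to SQLite), and the three-dimensional vectors are made up for illustration; real embeddings have far more dimensions.

```ruby
# Cosine similarity between two equal-length numeric vectors.
# Returns a value in [-1, 1]; values near 1 mean the vectors point the same way.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  raise ArgumentError, "vectors must have non-zero magnitude" if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Toy vectors, invented for this example (not real OpenAI embeddings).
printing_a = [0.9, 0.1, 0.0]
printing_b = [0.8, 0.2, 0.1]
taxes      = [0.0, 0.2, 0.9]

puts cosine_similarity(printing_a, printing_b) # close to 1: similar direction
puts cosine_similarity(printing_a, taxes)      # close to 0: nearly unrelated
```

Because the score depends only on the angle between the vectors, it can be computed anywhere once the embeddings are stored, with no further calls to the model.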

Jekyll has a rich plugin ecosystem, and provides hooks that plugins can use to integrate with various steps in the build process. I designed my plugin as a generator plugin.

Generators run after Jekyll has made an inventory of the existing content, and before the site is generated.

When my plugin gets called during site generation, the first thing it does is ensure that we’ve cached a vector embedding for every post. If any posts are missing an embedding, we make a request to the OpenAI API to get it. Then, with embeddings for all posts in our database, we perform a vector similarity search for each post in SQLite, making this data available to use in the post itself (via a Liquid template) as ai_related_posts in the page data. The approach is very simple, and turned out to work great!
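The two steps described above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the plugin's implementation: a Hash stands in for the SQLite cache, `fake_embedding` is a hypothetical offline substitute for the OpenAI API call, and the helper names and post texts are invented for the example.

```ruby
# Hypothetical stand-in for the Embeddings API: counts a few keywords so the
# example runs offline. A real plugin would make an HTTP request instead.
KEYWORDS = %w[printing wifi taxes].freeze

def fake_embedding(text)
  words = text.downcase.scan(/\w+/)
  KEYWORDS.map { |keyword| words.count(keyword).to_f }
end

def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norms = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norms.zero? ? 0.0 : dot / norms
end

# Step 1: make sure every post has a cached embedding, fetching only the
# missing ones. Returns how many embeddings had to be fetched.
def ensure_embeddings(posts, cache)
  fetched = 0
  posts.each do |title, text|
    next if cache.key?(title)

    cache[title] = fake_embedding(text)
    fetched += 1
  end
  fetched
end

# Step 2: rank every other post by similarity to the given one, keep the top few.
def related_posts(title, cache, limit: 3)
  target = cache.fetch(title)
  cache
    .reject { |other, _| other == title }
    .sort_by { |_, vector| -cosine_similarity(target, vector) }
    .first(limit)
    .map(&:first)
end

# Made-up post bodies, keyed by title.
posts = {
  "3D Printing the Strava Logo" => "3D printing a logo with multi-color printing",
  "3 Months of 3D Printing" => "what I learned from printing for three months",
  "Home Wifi Coverage" => "testing wifi signal and wifi speed at home",
  "Doing My Own Taxes" => "filing taxes by hand"
}

cache = {}
puts ensure_embeddings(posts, cache) # 4: first run fetches everything
puts ensure_embeddings(posts, cache) # 0: second run is served from the cache
p related_posts("3D Printing the Strava Logo", cache, limit: 1)
```

The second call to `ensure_embeddings` fetches nothing, which is the property that keeps rebuilds fast and API usage low once the cache is populated.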

Accuracy

One of my biggest concerns when designing this plugin was accuracy. Could I design a solution that would produce better results than classifier-reborn?

I think I was successful. Let’s look at some examples from my own blog, mikekasberg.com.

Example 1

Here’s an example from one of my recent blog posts, 3D Printing Map Figurines with GPS. The table below shows the related posts produced by each approach.

| classifier-reborn | ai_related_posts |
| --- | --- |
| 3 Months of 3D Printing | 3D Printing the Strava Logo |
| How to Dual-Boot Ubuntu (20.04 - 23.10) and Windows (10 or 11) with Encryption | 3 Months of 3D Printing |
| I Did My Own Taxes By Hand (and You Can Too!) | 3D Printing with OpenSCAD |

It seems obvious to me that the posts on the right are much better related posts. All the posts generated by ai_related_posts are about 3D printing. Very relevant! In contrast, classifier-reborn only produced one related post about 3D printing. I’m sure there’s some reason the LSI approach thought the posts on the left might be related, but they seem somewhat random!

Example 2

Let’s look at another example, Home WiFi Upgrades: Adding an Access Point with Wired Backhaul.

| classifier-reborn | ai_related_posts |
| --- | --- |
| Learning to Solder: A WLED Project | How to Test and Optimize Your Home Wifi Coverage |
| How to Test and Optimize Your Home Wifi Coverage | I Installed My Own Coax Cable for My Internet Modem (and You Can Too) |
| Buying Used Computers: A Story and Some Advice | Learning to Solder: A WLED Project |

These results are interesting because two out of the three results are the same (but in a different order), and I don’t think either set of results is bad. But I do think the results on the right are, again, definitely better than those on the left. The most closely related article to “Adding an Access Point with Wired Backhaul” is “How to Test and Optimize Your Home Wifi Coverage”, and the AI plugin got this right! I also think the article about installing coax cable is indeed the next most closely related article, and the AI plugin got this right too while classifier-reborn missed this completely!

With evidence like the above, it seems clear to me that my AI plugin is producing good results – much more accurate than classifier-reborn, which I was previously using on my blog. I could find many other examples where the AI approach produced better results, but I think the examples above illustrate the point.

Performance

Another concern I had was performance. LSI is compute intensive, but when it uses computing libraries like Numo (a Ruby interface for LAPACK) it works fairly quickly. A Jekyll build on my machine using LSI with Numo averages about 3.5 seconds.

When I tested my AI plugin, the first Jekyll site build was very slow. But this was expected since it needed to fetch embeddings for every post for the first time. My blog currently has 84 posts, and this took 40 seconds (or about 0.5s per post). While not ideal, this is fine for a first run, and because we cache the embeddings the performance is much better after that. Any subsequent run takes about 4 seconds total. (Even with the cached embeddings, we perform a vector similarity search for each post on every build, for now.) So the AI plugin isn’t faster than LSI, but it’s at least not noticeably slower. At only 4 seconds for a full site build with 84 posts, I’m happy with the performance, and it feels like a win to get better results than classifier-reborn in about the same amount of time!

Cost

Classifier-reborn is an open source plugin, so it’s free to use. My AI related posts plugin is also open source and free to use, but requires calling OpenAI APIs, which aren’t free. Fortunately, since we only need to call the API once per post and we cache the results, the costs are minimal. I paid $5 for OpenAI API access to get off the free plan and get higher rate limits. It turns out I might not have even needed to do this – I got embeddings for all 84 posts in my blog for $0.00 in API fees, using 1,277 tokens on the text-embedding-3-small model. So while you do need an API key, it doesn’t seem like cost will be a prohibitive factor to using the AI plugin. For most blogs, you can get embeddings for all your posts from the OpenAI API for a few pennies!
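As a rough sanity check on that $0.00 figure, the arithmetic can be sketched in a few lines of Ruby. The price used here ($0.02 per million tokens for text-embedding-3-small) is an assumption that is not stated in this post, so check OpenAI's pricing page for the current value; the token count is the one reported above.

```ruby
# ASSUMPTION: USD price per one million input tokens for
# text-embedding-3-small. Not taken from this post; verify before relying on it.
PRICE_PER_MILLION_TOKENS = 0.02

# Estimated API cost in USD for a given number of tokens.
def embedding_cost(tokens, price_per_million: PRICE_PER_MILLION_TOKENS)
  tokens * price_per_million / 1_000_000.0
end

tokens_used = 1_277 # token count reported in the post
cost = embedding_cost(tokens_used)

puts format("$%.6f", cost) # a few thousandths of a cent
puts format("$%.2f", cost) # rounds to $0.00, matching the reported fee
```

Under that assumed price, even a blog that used a thousand times as many tokens would still only spend a few cents on embeddings.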

Onward 🚀

I’m excited that relatively new AI technologies allowed me to build a plugin, with relatively little code, that produces better related posts than the LSI plugin that’s been used with Jekyll for a long time. And I’m excited to already be using it to make the related posts on my own blog better!

The plugin is open source on GitHub, and I’d love to see others start using it. I’d also like to collaborate to make it better! While it already produces great results, I think there’s potential to make the results even better, and to add integrations with other models and APIs besides OpenAI. (The approach should work with any model that can produce an embedding vector.) It’s exciting to see advancements coming out rapidly in the AI field and to think about how we might use them in the future!

About the Author

Mike Kasberg

👋 Hi, I'm Mike! I'm a husband, I'm a father, and I'm a senior software engineer at Strava. I use Ubuntu Linux daily at work and at home. And I enjoy writing about Linux, open source, programming, 3D printing, tech, and other random topics.
