---
url: 'https://altinity.com/blog/bluehoover-analyzing-bluesky-posts-with-clickhouse'
title: 'BlueHoover: Analyzing BlueSky posts with ClickHouse®'
author:
  name: Josh Lee
  url: 'https://altinity.com/author/josh/'
date: '2024-11-27T10:47:15-08:00'
modified: '2025-04-08T15:23:32-07:00'
type: post
summary: Have you ever wanted to know whether cats or dogs are more popular on a social network? The BlueHoover project uses the power of ClickHouse® to see what people are talking about on Bluesky.
categories:
  - Blog
image: 'https://altinity.com/wp-content/uploads/2024/11/blue_hoover_cover_graphic.png'
published: true
---

# BlueHoover: Analyzing BlueSky posts with ClickHouse®

Have you ever wanted to know whether cats or dogs are more popular on a social network? Now, with the power of ClickHouse, you can answer that and many other questions about what people are talking about on [Bluesky](https://bsky.app/).

This is all thanks to [BlueHoover](https://github.com/JosephRedfern/bluehoover), an awesome new hobby project by [Joe Redfern](https://www.linkedin.com/in/josephredfern/) of [SenseOn](https://www.senseon.io/). The project uses python scripts to hoover up a firehose of every Bluesky post and store them in a ClickHouse database. You can easily run it yourself (it’s dockerized) on pretty much any hardware. I’ve left it running on an 8-core MiniPC, and I forgot I had it running.

This is just awesome – and I don’t use that word lightly. It reminds me of the earlier days of the internet when places like Twitter and Reddit were considered public forums, with open APIs available to all. What’s new is with the relative power of modern mini-PCs and the small but growing volume on BlueSky, this kind of knowledge-graph recording is more accessible than ever before.

Let’s take a look at how the project works and leverages ClickHouse to achieve this awesome performance.

If you love ClickHouse®, Altinity can simplify your experience with our **fully managed service** in your cloud (BYOC) or our cloud and **24/7 expert support. [Learn more](https://altinity.com/plans-and-features-clickhouse/).**

## **Architecture**

I’ll let Joe describe the overall [architecture](https://github.com/JosephRedfern/bluehoover):

> BlueHoover consumes the Bluesky [Jetstream](https://docs.bsky.app/blog/jetstream) websocket feed, extracting post events, and writing them to a posts table in ClickHouse in batches.
> 
> 
> 
> On the ClickHouse side, the tokenizer Materialized View is attached to the posts table and updates a `tokens_by_interval` table with the tokens found in the posts. `tokens_by_interval` is a `SummingMergeTree`, with an ordering key on `(token, period)`, where `period` is the 10-minute interval in which the post was created. Tokenization is currently very basic, simply splitting on non-alphanumeric characters with ClickHouse’s `splitByNonAlpha()` function, and then lowercasing.
> 
> 
> 
> Trends are identified by comparing the count of a token in the last hour with the average count of that token in the previous 24 hours. This is performed by the `trends_1hr_mv` Refreshable Materialized View, refreshing hourly, and updating the `trends_1hr` table with the results.

A basic [FastAPI](https://fastapi.tiangolo.com/) webapp is provided for viewing the trends, and visualizing token rates.

The [Jetstream Websocket feed](https://docs.bsky.app/blog/jetstream) is a lightweight alternative to running a full Bluesky relay, and there are several instances hosted by the Bluesky team.

## **Visualizing the Data Yourself**

Getting started yourself is extremely straightforward: [clone the repo](https://github.com/SenseOn/BlueHoover) and run `docker compose up` – since the project includes a reverse proxy, you can access the web UI at [locahost:80](http://locahost:80/).

We can see the results:

![](https://altinity.com/wp-content/uploads/2024/12/visualization-1-1024x644.jpg)*Screenshot*

Since you’re ingesting your own copy of the BlueSky firehose, it will take some time for the visualizations to fill in – although here we can see that I have some results after just a few minutes. I recommend letting it run for at least an hour before coming back to play with it more. You can also backfill by editing `cursor.txt` to a timestamp 20-24 hours in the past. Here’s what things look like once they are filled in:

![](https://altinity.com/wp-content/uploads/2024/12/visualization-2.jpeg)*Seems like somebody is starting a conversation about badgers?*

### **The Schema**

This is pretty cool! While we’re waiting for data, let’s take a look at the schema:

### **Example Queries**

With the architecture and schema in place, we can start exploring the wealth of data BlueHoover has collected. Here are a few example queries to get you started:

```
CREATE TABLE IF NOT EXISTS posts (
    cid String,
    uri String,
    created_at DateTime64(3),
    author String,
    text String,
    reply_parent_uri String,
    insertion_time DateTime64(3) DEFAULT now64()
) ENGINE = MergeTree()
ORDER BY (created_at, cid)

CREATE TABLE IF NOT EXISTS tokens_by_interval (
    `period` DateTime64(3),
    `token` String,
    `count` UInt64
) ENGINE = SummingMergeTree
ORDER BY (period, token)

CREATE MATERIALIZED VIEW IF NOT EXISTS tokenizer TO tokens_by_interval (
    `period` DateTime,
    `token` String,
    `count` UInt8
) AS SELECT
    toStartOfInterval(created_at, toIntervalMinute(10)) AS period,
    arrayJoin(splitByNonAlpha(lower(text))) AS token,
    1 AS count
FROM posts
```

#### **Most Active Authors**

If you’re interested in knowing which authors are the most active on BlueSky, you can query the posts table:

```
SELECT    author,    COUNT(*) AS post_countFROM    postsWHERE    created_at >= now() - INTERVAL 1 DAYGROUP BY    authorORDER BY    post_count DESCLIMIT 10
```

This will give you the top 10 authors who have posted the most in the last week.

#### **Word Frequency Over Time**

To analyze how the usage of specific words changes over time, you can use the `tokens_by_interval` table:

```
SELECT    period,    sum(count) AS total_countFROM    tokens_by_intervalWHERE    token = 'clickhouse'    AND period >= now() - INTERVAL 30 DAYGROUP BY    periodORDER BY    period ASC
```

This query tracks the usage frequency of the word “clickhouse” over the past month, broken down into 10-minute intervals.

## **Performance and Scalability**

One of the standout features of ClickHouse is its exceptional performance, especially when dealing with large volumes of data. BlueHoover leverages ClickHouse’s columnar storage and parallel processing capabilities to ensure that queries are executed swiftly, even as the dataset grows.

## **Conclusion**

BlueHoover exemplifies the synergy between open-source projects and powerful data platforms like ClickHouse. By enabling real-time ingestion and analysis of social media data from BlueSky, it opens up endless possibilities for understanding trends, user behavior, and the dynamics of online conversations.

Whether you’re a data enthusiast eager to explore social media analytics or a business looking to gain insights into your audience, BlueHoover provides the tools and flexibility to achieve your goals. 

Kudos to Joe Redfern for developing such an interesting project. 

Ready to dive in? Head over to [BlueHoover’s GitHub repository](https://github.com/JosephRedfern/bluehoover) to get started.

