Analyzing DockerHub Pull Counts with Clickhouse and Altinity.Cloud

Our marketing team is often asking questions about market adoption of Altinity open source products. This is not an easy thing to know. One may look at GitHub stars and number of forks. But there are also interesting statistics on DockerHub for image pulls. In this article, we will create a very simple ClickHouse app that extracts data from DockerHub and makes it available for reporting.

DockerHub API

It is well known that DockerHub provides APIs for managing images. It is less known that it also collects detailed statistics and makes them available for customers. Unfortunately, in order to use the most informative and detailed API, you need to be a part of the Docker Verified Publisher program. However, there is one undocumented API that can give us enough data. You can usually find it in StackOverflow answers. I do not know the official name, so let’s call it ‘repositories v2’.

The API is very simple:

https://hub.docker.com/v2/repositories/altinity

It returns a JSON object that looks like the following:

{"count":39,"next":"https://hub.docker.com/v2/repositories/altinity/?page=2","previous":null,"results":[{"name":"demo","namespace":"altinity","repository_type":"image","status":1,"status_description":"active","description":"Altinity Demo Appliance","is_private":false,"star_count":0,"pull_count":237,"last_updated":"2018-10-23T09:22:03.485397Z","date_registered":"2017-11-10T09:49:36.680003Z","affiliation":"","media_types":["application/vnd.docker.container.image.v1+json"],"content_types":["image"]},{"name":"clickhouse-operator","namespace":"altinity","repository_type":"image","status":1,"status_description":"active","description":"ClickHouse Operator for Kubernetes","is_private":false,"star_count":2,"pull_count":2600789,"last_updated":"2023-02-16T07:03:27.050279Z","date_registered":"2019-01-24T14:37:54.103794Z","affiliation":"","media_types":["application/vnd.docker.container.image.v1+json","application/vnd.docker.distribution.manifest.list.v2+json","application/vnd.oci.image.index.v1+json"],"content_types":["image"]},

The response object contains row count, link to the next page, and result array. Let’s take a look at one result element.  I’ve formatted it to make it more readable:

{
  "name": "clickhouse-operator",
  "namespace": "altinity",
  "repository_type": "image",
  "status": 1,
  "status_description": "active",
  "description": "ClickHouse Operator for Kubernetes",
  "is_private": false,
  "star_count": 2,
  "pull_count": 2600789,
  "last_updated": "2023-02-16T07:03:27.050279Z",
  "date_registered": "2019-01-24T14:37:54.103794Z",
  "affiliation": "",
  "media_types": [
    "application/vnd.docker.container.image.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.index.v1+json"
  ],
  "content_types": [
    "image"
  ]
}

As you can see, there is some basic information about the repository itself, and it also contains a pull_count field! This is exactly what we were looking for. Apparently, it shows a number of image pulls from the beginning of time. And it changes in real time! If we want to measure the usage of the image, we can simply compare pull count at different time intervals.

So let’s create a simple ClickHouse application that will import this data into ClickHouse, and analyze how pull_count changes over time. This will help us to understand the real usage of the project.

Loading into ClickHouse

Let’s start with a simple table:

CREATE TABLE dockerhub_repos (
  timestamp DateTime,
  date ALIAS toDate(timestamp),
  name String,
  namespace String,
  is_private Int8,
  last_updated DateTime,
  star_count UInt32,
  pull_count UInt32
 ) Engine = MergeTree
 PARTITION BY toYYYYMM(timestamp)
 ORDER BY (namespace, name, toDate(timestamp));

In order to load data, we will use the ClickHouse URL function as follows:

SELECT * FROM url('https://hub.docker.com/v2/repositories/altinity/?page_size=100')

Note the use of ‘page_size’ parameter. It is a feature of the DockerHub API that allows us to fetch all repositories for an organization with a single request. In order to parse the DockerHub response we will use the power of JSONExtract functions: 

WITH
  (SELECT * FROM url('https://hub.docker.com/v2/repositories/altinity/?page_size=100', 'RawBLOB', 's String')) as raw,
  assumeNotNull(JSONExtractRaw(raw, 'results')) as results_raw,
  JSONExtractArrayRaw(results_raw) as results_array,
  arrayJoin(results_array) as row
SELECT
  now() timestamp,
  JSONExtractString(row, 'name') name,
  JSONExtractString(row, 'namespace') namespace,
  JSONExtractInt(row, 'is_private') is_private,
  parseDateTimeBestEffort(JSONExtractString(row, 'last_updated')) last_updated,
  JSONExtractInt(row, 'star_count') star_count,
  JSONExtractInt(row, 'pull_count') pull_count

Note the use of the arrayJoin function that converts ‘results’ array into rows.

Now we can insert it into our table as simply as:

INSERT INTO dockerhub_repos
<previous select query>

The only thing that remains to be done is running this query several times per hour to update the data. Of course we do not want to run it manually. We will use ClickHouse “internal cron” – dictionaries. 

ClickHouse “cron job”

In our recent article “ClickHouse as the Looking Glass for OLTP data” we explained how to use executable dictionaries for regular ETL jobs. We will use the same approach here, but make it even simpler. We will run a SQL script using an executable dictionary directly. 

Let’s add the following configuration settings to ClickHouse. I will use the YAML notation of the ‘clickhouse-operator’ CRD for this purpose. In Altinity.Cloud it can be configured via cluster settings:

    dictionaries_config: /etc/clickhouse-server/config.d/*_dictionary.xml</dictionaries_config
    dictionaries_lazy_load: false
    config.d/dockerhub_task_dictionary.xml: |-
        <clickhouse>
            <dictionary>
                <name>dockerhub_task</name>
                <source>
                    <executable>
                        <command>clickhouse-client --queries-file=/etc/clickhouse-server/config.d/dockerhub_update.sql</command>
                        <format>CSV</format>
                    </executable>
                </source>
                <lifetime>
                    <min>900</min>
                        <max>900</max>
                    </lifetime>
                    <layout>
                        <flat />
                    </layout>
                    <structure>
                        <id>
                            <name>id</name>
                        </id>
                    </structure>
            </dictionary>
        </clickhouse>
    config.d/dockerhub_update.sql: |-
        insert into dockerhub_repos
        with
          (select * from url('https://hub.docker.com/v2/repositories/altinity/?page_size=100', 'RawBLOB', 's String')) as raw,
          assumeNotNull(JSONExtractRaw(raw, 'results')) as results_raw,
          JSONExtractArrayRaw(results_raw) as results_array,
          arrayJoin(results_array) as row
        select
          now() timestamp,
          JSONExtractString(row, 'name') name,
          JSONExtractString(row, 'namespace') namespace,
          JSONExtractInt(row, 'is_private') is_private,
          parseDateTimeBestEffort(JSONExtractString(row, 'last_updated')) last_updated,
          JSONExtractInt(row, 'star_count') star_count,
          JSONExtractInt(row, 'pull_count') pull_count;

This way we deployed a SQL script and a dictionary to execute it. Note the <command> part of the dictionary definition. It performs the trick of executing our insert script every 15 minutes! Now we need to leave it for a few hours. Then we can look at stats.

Querying the Data

Once we have data collected for a few hours, we can run a query to check hourly pulls, for example:

SELECT t, runningDifference(pulls) FROM (
  SELECT toStartOfHour(timestamp) t, max(pull_count) pulls
  FROM dockerhub_repos
 WHERE name = 'clickhouse-operator'
 GROUP BY 1 ORDER BY 1
)

┌───────────────────t─┬─runningDifference(pulls)─┐
│ 2023-02-16 11:00:00 │                        0 │
│ 2023-02-16 12:00:00 │                     2262 │
│ 2023-02-16 13:00:00 │                     3450 │
│ 2023-02-16 14:00:00 │                     2606 │
│ 2023-02-16 15:00:00 │                     2484 │
│ 2023-02-16 16:00:00 │                     2571 │
│ 2023-02-16 17:00:00 │                     2801 │
│ 2023-02-16 18:00:00 │                     2622 │
│ 2023-02-16 19:00:00 │                     2407 │

As we can see, the clickhouse-operator image is pulled 2500-3000 times per hour. Let’s also visualize it using ClickHouse built-in chart capabilities.

ClickHouse has a built-in ‘play’ interface that allows users to run interactive queries in the browser. For Altinity demo server the ‘play’ interface can be found as follows: 

https://github.demo.altinity.cloud:8443/play?user=demo , use password ‘demo’.

You can run the query above, but where is the chart? This is undocumented ClickHouse feature. You need to run a slightly modified query:

SELECT t, runningDifference(pulls) FROM (
  SELECT toStartOfHour(timestamp)::INT t, max(pull_count) pulls
  FROM dockerhub_repos
 WHERE name = 'clickhouse-operator'
 GROUP BY 1 ORDER BY 1
)
FORMAT JSONCompactColumns

The trick here is the ‘JSONCompactColumns’ format. If and only if this format is used, ClickHouse ‘play’ interface interprets the results as a chart. It is very basic and does not support multiple series. But it is very quick: 

As you can see the pull rate is pretty constant over time. We do not know the reason; it’s likely some CI/CD pipelines are running somewhere.

What if we want to collect statistics about other organizations? We can extend the dockerhub_update.sql file and fetch queries from multiple endpoints. ‘--queries-file’ parameter of clickhouse-client allows us to run multiple queries. For example, we can add one more INSERT and collect statistics for ‘clickhouse’ using the following URL:

https://hub.docker.com/v2/repositories/clickhouse/?page_size=100

Since we have multiple organizations now, we need to add a namespace filter to queries. For example, we can check ‘clickhouse-server’ pulls using the query below:

SELECT t, runningDifference(pulls) FROM (
  SELECT toStartOfHour(timestamp) t, max(pull_count) pulls
  FROM dockerhub_repos
 WHERE name = 'clickhouse-server' AND namespace = 'clickhouse'
 GROUP BY 1 ORDER BY 1
)
FORMAT JSONCompactColumns

Quite surprisingly, ClickHouse docker pulls vary between 500 and 2000 in different hours, with a clear drop over the weekend.

Conclusion

As we showed in this article, it is very easy to start collecting data into ClickHouse that is provided by public HTTP APIs, like DockerHub. It took me only 30 minutes to set this up in the Altinity.Cloud demo instance. After that I had to be patient to let the data be collected for a few days. The data confirmed the popularity of clickhouse-operator and clickhouse-server docker images. Deeper analysis requires a different DockerHub account with more access to data. We will explore that in the future.

Share