Distributed Tracing with ClickHouse® & OpenTelemetry

Recorded: June 18 @ 08:00 am PDT
Presenters: Josh Lee & Maciej Bak

In this webinar, Josh Lee (Open Source Advocate, Altinity) and Maciej Bak (Support Engineer, Altinity) cover two complementary topics: using ClickHouse® as a backend for OpenTelemetry observability data, and using ClickHouse®’s own built-in OpenTelemetry tracing to observe ClickHouse® itself.

Josh opens with a concise introduction to distributed tracing, explaining why it fills the gap between metrics and logs, what OpenTelemetry is and why it matters, and why ClickHouse is an excellent data store for telemetry at scale. He covers ClickHouse’s columnar storage, time-friendly ordering, TTL support, write-once-read-many architecture, and its ability to handle extreme cardinality without sampling. He then walks through a live demo of the OpenTelemetry Collector ClickHouse exporter sending traces to ClickHouse, visualized in Grafana using the Grafana-maintained ClickHouse data source plugin.

Maciej then explains ClickHouse’s internal query execution, covering how inserts create immutable parts, how parallel threads build those parts in memory before flushing to disk, and how aggregation parallelizes across threads using hash tables and partial aggregates that are later merged. He then delivers the second demo: distributed traces emitted by ClickHouse’s own built-in OpenTelemetry instrumentation, visualized in Grafana, showing real inter-node communication across a two-shard, two-replica cluster. He also demonstrates the same tracing capability against a Project Antalya query that delegates reads to a four-node swarm cluster reading Bitcoin transaction data from S3.

The Q&A covers preventing too many small parts with batch processing, handling 20,000 requests per second with head and tail sampling, how ClickHouse handles cardinality through its sparse index architecture, the out-of-the-box suitability of the default OpenTelemetry schema, and best practices for ordering map keys for compression.


Key Moments (Timestamps)

Key moments generated with AI assistance.

00:06 – Welcome and introductions
00:44 – Agenda overview: two demos, distributed tracing primer, ClickHouse for observability
02:17 – Why logs are the foundation: request logs, metrics, and the golden signals
03:38 – How distributed tracing works: trace IDs, span IDs, parent span IDs, and waterfall graphs
05:23 – The three pillars of observability: metrics, traces, and logs
07:14 – Why OpenTelemetry exists: the need for a vendor-neutral open standard
09:16 – What OpenTelemetry provides: OTLP, semantic conventions, W3C trace context, SDKs, and the collector
13:37 – ClickHouse for observability: columnar OLAP, extreme cardinality, write-once-read-many, TTL
18:06 – Integrations: OpenTelemetry Collector exporter, Grafana plugin, Kafka, SigNoz, Coroot, ClickStack
20:13 – OpenTelemetry Collector architecture and the ClickHouse exporter configuration
25:13 – Demo 1: Sending traces to ClickHouse and visualizing them in Grafana
31:51 – ClickHouse architecture: how inserts create parts and parallel threads
36:04 – ClickHouse architecture: how GROUP BY aggregation parallelizes using hash tables
39:56 – Demo 2: Built-in ClickHouse distributed tracing across a two-shard, two-replica cluster
42:52 – Demo 2 continued: Tracing a Project Antalya query across four swarm nodes reading from S3
45:28 – Final thoughts and summary
46:41 – Q&A: batch processing, sampling at scale, cardinality, and schema tuning

Webinar Transcript

[00:06] – Welcome and Introductions

Josh: Hello, everybody. Welcome to our webinar on ClickHouse® and OpenTelemetry, where we are going to show how to store distributed traces in ClickHouse and then use distributed tracing to trace ClickHouse itself and show some of its inner workings. I am very excited for this webinar. Thanks again for joining us.

I am Josh Lee. I am the open source advocate here at Altinity. I know just enough to get myself in trouble, which is why I am here, supported by Maciej. Maciej, would you like to introduce yourself?

Maciej: Hello, everyone. I am Maciej, and I am a support engineer at Altinity.

Josh: All right. So quickly, our agenda. We have an hour today, and we are going to save a good part of it for the demos. We have two demos for you today: one fairly short one in the middle, and then a longer one at the end where we really dive into some of the stuff and the code behind it. Before we get to that, we are going to do a very brief version of what distributed tracing is, what OpenTelemetry is, and why ClickHouse is a good choice for distributed tracing with OpenTelemetry. If you have already seen me talk, it might be a little bit repetitive. I am going to go through it quickly. If you have not seen me talk before and you feel like I am going too fast, go ahead and find any of the YouTube videos where I have covered this. This five-minute intro has also been done as a 20- or 45-minute talk.

Then we are going to talk about ClickHouse for observability and why ClickHouse makes a great data store for all of your telemetry signals, especially traces and logs. We are going to show a short demo of sending OpenTelemetry traces to ClickHouse and visualizing them with Grafana. Maciej is going to talk to us about the architecture of ClickHouse and the tools built into ClickHouse that we can use for observing ClickHouse itself. Then we are going to go into a longer demo showing distributed traces from an actual running ClickHouse cluster across the various nodes. And possibly, if we have time, we will show some of our exciting new work with Project Antalya, which is all around connecting ClickHouse to the data lake for infinite scalability.

[02:17] – Why Logs Are the Foundation: The Golden Signals

Josh: The way I usually start this talk is by making fun of logs. This usually gets a laugh when I am doing it live because logs are an easy punching bag. We have all been frustrated by reading way too many logs trying to find that needle in a haystack. But I think maybe we are too hard on logs, because they are actually the fundamental building block of everything we are doing with observability.

Here is a basic request log that I am sure we have all seen many times. Typically this will be in structured format in your system, but just for readability here we have it in plain text: timestamp, duration, HTTP verb, endpoint, and response code. It turns out if we have all of those things for every request inside our system and we can count them, we can build an awesome dashboard with all of our RED metrics: requests, errors, duration, or the golden signals as Google likes to call them. So we can get really valuable metrics out of those request logs.
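As a rough illustration of that idea, the sketch below derives per-endpoint request counts, error rates, and average durations from plain-text request logs. The log layout and the sample lines are invented for this example (they are not the log shown on the slide), and treating only 5xx responses as errors is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical layout: "<timestamp> <duration_ms> <verb> <endpoint> <status>"
SAMPLE_LOGS = """\
2025-06-18T15:00:01Z 120 GET /cart 200
2025-06-18T15:00:02Z 340 POST /checkout 500
2025-06-18T15:00:02Z 95 GET /cart 200
2025-06-18T15:00:03Z 210 POST /checkout 200
"""


def red_metrics(log_text):
    """Group request logs by endpoint and compute requests, errors, duration."""
    by_endpoint = defaultdict(list)
    for line in log_text.splitlines():
        _ts, duration_ms, _verb, endpoint, status = line.split()
        by_endpoint[endpoint].append((int(duration_ms), int(status)))

    metrics = {}
    for endpoint, rows in by_endpoint.items():
        # Assumption: only 5xx responses count as errors.
        errors = sum(1 for _, status in rows if status >= 500)
        metrics[endpoint] = {
            "requests": len(rows),
            "error_rate": errors / len(rows),
            "avg_duration_ms": mean(duration for duration, _ in rows),
        }
    return metrics


if __name__ == "__main__":
    for endpoint, values in red_metrics(SAMPLE_LOGS).items():
        print(endpoint, values)
```

In a real system the same grouping would typically run as a query in the telemetry store rather than in application code.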

With the combination of metrics and request logs, we can kind of answer the questions “is something wrong?” and maybe “what went wrong?” if we can find the right log message. But that is the key point: can we actually find the right log message? And that is where I think distributed tracing can be really helpful.

To anyone wondering: yes, the recording will be shared. We will be following up with an email with the recording for everyone who signed up, as well as links to code samples and follow-up blog posts. So no worries if you have to drop.

[03:38] – How Distributed Tracing Works: Trace IDs, Span IDs, and Waterfall Graphs

Josh: If we take those logs and simply add a request ID as a first step, that is really useful because then we can tie each log message back to the end user action that it is responsible for. But that gives us just a bag of logs that we then have to correlate imperfectly using timestamps and things like that.

What tracing does is encode all of that useful information about hierarchy and what called what directly into the log messages themselves. We are really only adding two fields to our log: a span ID and a trace ID. For child spans, we also add a parent span ID. That allows us to structure our logs into a tree. We get this amazing tree structure, this waterfall graph, that tells us all about the latency of an individual request and the exact order of what called what and when. If there is an error in one of these calls, we can reverse the path and see exactly what led to that error.
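A minimal sketch of how those three IDs turn a flat list of records into a waterfall: the spans below are invented for illustration, and the field names simply mirror the IDs described above rather than any particular SDK's data model.

```python
from collections import defaultdict

# Invented spans for a single trace. The root span has no parent.
SPANS = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None,
     "name": "GET /checkout", "start_ms": 0, "end_ms": 90},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a",
     "name": "auth.verify", "start_ms": 5, "end_ms": 20},
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "a",
     "name": "db.query", "start_ms": 25, "end_ms": 80},
    {"trace_id": "t1", "span_id": "d", "parent_span_id": "c",
     "name": "db.connect", "start_ms": 26, "end_ms": 40},
]


def waterfall(spans):
    """Return indented lines showing what called what, ordered by start time."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_span_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            duration = span["end_ms"] - span["start_ms"]
            lines.append(f'{"  " * depth}{span["name"]} ({duration} ms)')
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # start from the root span(s)
    return lines


if __name__ == "__main__":
    print("\n".join(waterfall(SPANS)))
```

Walking the same structure upward from a failing span, by following parent_span_id, gives the chain of calls that led to the error.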

This graph is immediately apparent in how valuable it is for understanding latency or understanding the chain of events leading up to an error message. But I think distributed tracing is also much more powerful when we think about traces in aggregate, because then we start to understand entire pieces of our system: the architecture and the topology. But observability is not any one signal. We need them all.

[05:23] – The Three Pillars: Metrics, Traces, and Logs

Josh: The fundamental characteristic of metrics is that they are aggregable. I would argue that right now most are pre-aggregated, but if we had a good enough storage engine that could compress traces and store all of them in an unsampled fashion cost-effectively, we would not necessarily need to pre-aggregate our metrics. We could slice and dice them after the fact based on any new dimension that we might not have thought of and that is embedded in our code and binaries.

So we have our metrics, whether pre-aggregated or post-aggregated. We have our traces, which help us zero in on where the problem is. And all of that is in service of finding that log message with the error. Distributed tracing is the killer app for this workflow. It lets you understand the complete request flow. When combined and aggregated, we can create a real-time map of our system topology and all of the dependencies. We can derive metrics across any number of dimensions with essentially unlimited cardinality, unlike with metrics where we have to be very careful to guard cardinality. And finally we can make our logs so much easier to find by attaching them directly to a trace message. That is something you can do with the OpenTelemetry specification, where you can have all of the relevant log messages attached to the relevant traces, making them so much easier to find.

[07:14] – Why OpenTelemetry Exists: The Need for a Vendor-Neutral Standard

Josh: One of the main reasons the OpenTelemetry project exists is the need for an open standard for distributed tracing. For logging and metrics, which have been the traditional ways to do monitoring, the standards are essentially text. Logs are just text lines you can feed into other systems. Metrics are very simply a number and a bunch of tags. Prometheus has become essentially an open standard in itself. But with distributed tracing we did not have anything like that.

We needed a way to propagate the trace and span IDs from incoming requests all the way down into all outgoing requests. We needed something that could hook into the actual internal state of our applications. We needed some kind of embedded SDK, and without OpenTelemetry we were left to use vendor SDKs and actually embed those SDKs in our artifacts. At a large organization that might mean tens of thousands of code artifacts.

Now we do not have to go and put vendor code in all of our code artifacts. All of our code artifacts can have vendor-neutral, open-source instrumentation that speaks a vendor-neutral, open-source language that we can then use with any tool, or any number of tools, at the same time. That is why we have OpenTelemetry.

[09:16] – What OpenTelemetry Provides: Specifications, SDKs, Collector, and Extensions

Josh: So what actually is OpenTelemetry? The most important part is the specifications themselves. OTLP is the wire protocol for how telemetry signals are transmitted over the network. Semantic conventions are the consistent names for things so that one team does not use http_requests while another uses requestsCount for the exact same metric. That inconsistency can consume all of the time of a central observability team. Then there is the W3C trace context, the open specification for how trace ID and span ID get put into HTTP headers and propagated to downstream services. If you put those three things together, any tool that is compatible with all three is OpenTelemetry compatible, even if it is not using any of the OpenTelemetry-provided code from the community.
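To make the W3C trace context part concrete, here is a small sketch that parses a `traceparent` header and builds the header a service might send downstream. The header layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) follows the W3C Trace Context specification; the function names are hypothetical, and a real service would normally let an OpenTelemetry SDK handle this.

```python
import re
import secrets

# version - trace-id - parent-id - trace-flags, all lowercase hex
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

# Example value in the format used by the W3C specification.
EXAMPLE = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"


def parse_traceparent(header):
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace or span IDs are invalid")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header):
    """Keep the trace ID, but mint a new span ID for the outgoing call."""
    ctx = parse_traceparent(header)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f'{ctx["version"]}-{ctx["trace_id"]}-{new_span_id}-{ctx["flags"]}'


if __name__ == "__main__":
    print(parse_traceparent(EXAMPLE))
    print(child_traceparent(EXAMPLE))
```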

Of course the OpenTelemetry community does provide language-specific SDKs and APIs. Those are what get dropped into our programs to let them emit tracing. There are also instrumentation libraries. For example, the OpenTelemetry instrumentation library for Express knows how Express works, how to observe its HTTP calls, and how to inject headers into outgoing HTTP calls, so you can get automatic instrumentation of Express just by adding a single Node.js library. There are hundreds if not thousands of instrumentation libraries available for the 11 languages supported by OpenTelemetry.

Then there is the OpenTelemetry Collector. This is something we are using in our demo. Think of it as an alternative to a vendor agent. It can work as a node or cluster agent gathering node and cluster-level metrics and logs, and it can also act as a proxy for telemetry from your applications forwarding it to the backend of your choice. It is extremely extensible and at this point compatible with pretty much every monitoring standard under the sun through its extension ecosystem.

OpenTelemetry is vendor-neutral, open source, and is the second most active CNCF project. That is really significant. There are so many people contributing to and using this project. It is also one of the few instances in the entire world of software where we have had format consolidation. OpenCensus and OpenTracing both existed, they merged, and now we have OpenTelemetry. One fewer standard. And we now have OpenTelemetry co-evolving with Prometheus toward harmony between those standards.

[13:37] – ClickHouse for Observability: Columnar OLAP, Cardinality, and Write-Once-Read-Many

Josh: Very quickly, what is ClickHouse for those not familiar? It is a SQL-compatible OLAP database. OLAP means it is great for analytics and things where we are interested in the aggregate of many rows, not individual transactions like an OLTP database. It is columnar. This architecture is part of what makes it so fast at reading all of the data from one column, which is a very common operation in observability use cases.

Anytime there is an architectural trade-off in ClickHouse between speed and something else, nine times out of ten the choice is speed. There are tons of optimizations under the hood. It is petabyte-scale. We can go up to dozens of petabytes, and we are working on Project Antalya which brings ClickHouse to Iceberg and Parquet-formatted data lakes on S3-compatible object storage for theoretically infinite scalability.

ClickHouse eats cardinality for breakfast. In a system like Prometheus, high cardinality might be dozens or hundreds. In ClickHouse, high cardinality means millions. It is just in a completely different ballpark.

And it is very optimized for write-once-read-many workloads, which is essentially what we are doing with observability. We are ingesting events at a very high rate, recording them, and almost never need to update them. Once a trace is completed, it is complete. We ship it to our database, we want to analyze it, maybe derive some metrics from it, but we never need to update that record. At a certain point we want to delete it when we do not care about it anymore. This write-once, read-many architecture really lends itself to that type of workload.

The way it works is that each write creates a small sorted part, and in the background ClickHouse merges those parts and optimizes the order in which everything is written so that it can be read as quickly as possible. You get the ability to do small writes that are very fast, and in the background the data is merged so that you get more and more efficient reads as your data gets compacted into larger, ordered parts.
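The following toy model illustrates the general idea of small sorted writes being merged in the background into larger sorted chunks (ClickHouse calls these chunks parts). It is only a sketch of the concept, not ClickHouse's actual merge algorithm, and the function names are made up.

```python
import heapq


def write_part(rows):
    """Each insert yields one small chunk, sorted by the ordering key."""
    return sorted(rows)


def merge_parts(parts):
    """Background merge: combine sorted chunks into one larger sorted chunk."""
    # heapq.merge streams already-sorted inputs without re-sorting everything.
    return list(heapq.merge(*parts))


# Three small writes; rows are (timestamp, payload) tuples.
parts = [
    write_part([(3, "c"), (1, "a")]),
    write_part([(2, "b"), (5, "e")]),
    write_part([(4, "d")]),
]
merged = merge_parts(parts)

if __name__ == "__main__":
    print(merged)
```

A range read over timestamps now touches one contiguous ordered chunk instead of three, which is the read-side benefit described above.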

The benefits for observability workloads specifically are: it is time-friendly, right? If we are ordering data by the timestamp on the telemetry signal, that is usually the way we are going to be reading the data, so we are reading data aligned with the way it is written to disk, which makes everything very fast. We have easy cleanup because everything is partitioned by time. When something is past our retention policy, we can just delete an entire partition. We get automatic TTL, so we can automate that process of deleting old partitions or moving them to cheaper storage for archiving, perhaps for compliance reasons. You have extreme cardinality support, excellent compression even on semi-structured data like JSON blobs and map columns, and flexible schema with things like JSON and map columns.
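Here is a small sketch of why time-based partitioning makes cleanup cheap: once events are grouped by day, enforcing a retention policy means dropping whole groups rather than scanning individual rows. In ClickHouse this is handled declaratively by partitioning and TTL settings on the table; the Python below is only a hypothetical model of the behavior.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta, timezone


def partition_by_day(events):
    """Group (timestamp, payload) events into one partition per calendar day."""
    partitions = defaultdict(list)
    for ts, payload in events:
        partitions[ts.date()].append((ts, payload))
    return partitions


def drop_expired(partitions, today, retention_days):
    """Drop every partition older than the retention window in one step each."""
    cutoff = today - timedelta(days=retention_days)
    for day in [d for d in partitions if d < cutoff]:
        del partitions[day]  # no per-row work: the whole partition goes
    return partitions


EVENTS = [
    (datetime(2025, 6, 10, 12, 0, tzinfo=timezone.utc), "old span"),
    (datetime(2025, 6, 17, 9, 0, tzinfo=timezone.utc), "recent span"),
    (datetime(2025, 6, 18, 8, 0, tzinfo=timezone.utc), "new span"),
]

if __name__ == "__main__":
    kept = drop_expired(partition_by_day(EVENTS), date(2025, 6, 18), retention_days=3)
    print(sorted(kept))
```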

[18:06] – Integrations: OpenTelemetry Collector Exporter, Grafana, SigNoz, Coroot, and More

Josh: There are a bunch of integrations that make ClickHouse really great for observability. The two I am going to show in my demo are the OpenTelemetry Collector Exporter and the Grafana data source plugin.

The OpenTelemetry Collector ClickHouse Exporter is a collector extension that can export traces, metrics, and logs to ClickHouse. What that means is that any of the 90-plus things that can be received by the OpenTelemetry Collector can now be exported to ClickHouse in one of those three signal formats. For actually visualizing and working with that data, we have a Grafana data source plugin. There are actually two: one maintained by us at Altinity and one maintained by Grafana. There are reasons to use both.

You might also have something like the Kafka connector. This is maybe not a common use case, but at very large-scale organizations where all of your telemetry already lives in a single Kafka pipeline, you can use a sink connector to dump that data into ClickHouse and do analytics with it.

If you are at a much smaller scale and you want something that is drop-in and ready to go out of the box, Coroot, SigNoz, and ClickStack are all fully open-source and more-or-less complete observability platforms. They each include an agent, a backend, and an analysis UI. SigNoz has alerting. I believe Coroot and ClickStack have some alerting too. All three use ClickHouse under the hood. So if you are thinking, “Wow Josh, everything you are talking about today looks great but I do not want to build it all myself,” those are the three things you will want to check out.

Qryn is a really cool tool that essentially provides an Elasticsearch-compatible API on top of ClickHouse. If you are already using some kind of Elastic or OpenSearch-compatible tool and you would like the performance of ClickHouse, Qryn can provide a translation layer. And then as I mentioned, we are working on Project Antalya, which will integrate ClickHouse with Iceberg and Parquet for cheap storage and infinite scalability.

[20:13] – OpenTelemetry Collector Architecture and ClickHouse Exporter Configuration

Josh: I do not want to spend too much time here because I want to make sure we save time for the demo. The OpenTelemetry Collector configuration for sending data to ClickHouse is pretty straightforward. Common receivers you might use include the OTLP receiver to receive traces from the OpenTelemetry SDKs, the hostmetrics receiver to use the collector as a node agent, and the filelog receiver to gather logs from a Kubernetes cluster or from a non-Kubernetes host. Commonly you are going to use the batch processor. In my demo I am using the Kubernetes attributes processor, which adds metadata for the Kubernetes pods where the telemetry is coming from, doing that dynamically and intelligently. And then we are using the ClickHouse exporter.

In our demo I have two instances of the OpenTelemetry Collector. One is deployed as a DaemonSet to every node and is gathering cluster logs from that node. There is also another collector deployed as a Deployment, responsible for receiving all of the OTLP data coming from the DaemonSet collectors and acting as a sort of gateway so there are not too many connections going to the ClickHouse cluster.

Another thing the ClickHouse exporter does is automatically create the schema in ClickHouse for you. You can turn that off, but by default it creates the schema. For me it was nice to be able to control that and just have one deployment responsible for creating the schema, and then have multiple DaemonSets pointed at that central deployment gateway.

For ClickHouse itself, I am using the Altinity Kubernetes Operator for ClickHouse®. This is fully free and open source under Apache 2.0, just like ClickHouse itself. It is by far the easiest way to run ClickHouse on Kubernetes. There is a Helm chart to just Helm install the Altinity Operator, and then there is another Helm chart for creating the actual ClickHouse cluster, or it is a pretty simple custom resource if you just want to create it yourself: a couple dozen lines of YAML and you will get yourself a ClickHouse cluster. The Altinity Operator can also manage ClickHouse Keeper for you. Keeper is responsible for maintaining consistent state across your ClickHouse cluster and coordination. If you are doing high availability, you will definitely want Keeper with replication and some degree of sharding as you scale.

[25:13] – Demo 1: Sending OpenTelemetry Traces to ClickHouse and Visualizing in Grafana

Josh: Let us look at the demo showing how we get traces into ClickHouse with OpenTelemetry. For this demo I have deployed a couple of simple services: Grafana, the OpenTelemetry Collector as a DaemonSet, and the OpenTelemetry Collector again as a Deployment. This is a pretty simple umbrella Helm chart with just three subservices.

If we look at the values, the first OpenTelemetry Collector has the OTLP exporter configured pointing to our deployment instance of the collector, and it has all of the Kubernetes presets turned on for gathering stuff from nodes. That gives us a bunch of logs.

Then we have the new deployment instance of the OpenTelemetry Collector. This one has some of that Kubernetes stuff turned off. We still enable the Kubernetes attributes processor in case we get any telemetry signals not coming from one of those other collectors. We have our ClickHouse exporter configured with our endpoint and database name. A super secure username and password, of course. You always want to make sure that is not in plain text in your values like it is here. We have the create_schema flag enabled. We can give it names for the logs tables. These are the defaults, but I am just being explicit here. And then retry settings, all of which are mostly defaults.
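The gateway configuration described here might look roughly like the structure below, written as a Python dictionary so it can be checked programmatically. This is a sketch, not the configuration used in the demo: the endpoint, database, username, and table names are placeholders, and the exporter key names follow the ClickHouse exporter's documentation as best recalled, so verify them against your collector version. The `undefined_components` helper is hypothetical.

```python
import json

# Placeholder values throughout; the password is read from the environment
# rather than stored in plain text.
config = {
    "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
    "processors": {"k8sattributes": {}, "batch": {}},
    "exporters": {
        "clickhouse": {
            "endpoint": "tcp://clickhouse.example.svc:9000",
            "database": "otel",
            "username": "otel_writer",
            "password": "${env:CLICKHOUSE_PASSWORD}",
            "create_schema": True,
            "logs_table_name": "otel_logs",
            "traces_table_name": "otel_traces",
            "retry_on_failure": {
                "enabled": True,
                "initial_interval": "5s",
                "max_interval": "30s",
                "max_elapsed_time": "300s",
            },
        }
    },
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["k8sattributes", "batch"],
                "exporters": ["clickhouse"],
            },
            "logs": {
                "receivers": ["otlp"],
                "processors": ["k8sattributes", "batch"],
                "exporters": ["clickhouse"],
            },
        }
    },
}


def undefined_components(cfg):
    """List pipeline entries that reference a component never defined above."""
    missing = []
    for name, pipeline in cfg["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            for component in pipeline.get(kind, []):
                if component not in cfg.get(kind, {}):
                    missing.append((name, kind, component))
    return missing


if __name__ == "__main__":
    print(undefined_components(config))
    print(json.dumps(config, indent=2))
```

Checking that every pipeline entry refers to a defined component catches a common class of typos before the collector is deployed.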

We are also exposing an OTLP receiver on this deployment so it can receive OTLP from the DaemonSet collectors.

What we get from that is a basic dashboard. This is not the state of the art for what you could do with Grafana and this data, but just to show you that the data is there and give you some ideas. I am using the telemetrygen tool, and I piped a few artificial traces into this ClickHouse instance so that we would have something to look at. In the second demo later, Maciej is going to have much more interesting traces for you.

You can see I can create a graph for traces over time. If we edit this, we can see the underlying query. I am using the upstream, Grafana-maintained plugin for ClickHouse because it has built-in support for OpenTelemetry. You can just check a box saying “yes, I am using the ClickHouse OpenTelemetry exporter” and it already knows what all of the columns are and how to map the schema to what Grafana expects for traces. That is basically the main work when visualizing these traces in Grafana: getting the schema to match what Grafana expects.

And then we have this list of traces. One of the interesting things is we have the service name, the operation name, and if we look at one of the individual traces we can see the spans. We also get these attributes: resource attributes like k8s.pod.ip and service.name, and span attributes like service.instance and service.namespace. These are the things that actually define the entities you care about. If you want to navigate your telemetry based on services, or the team that owns a service, or the services that are downstream from another service, all of those types of queries begin with this metadata. And that is one of the ways where ClickHouse is not just efficient at this storage but also a relational database that lets us connect information together in a graph and navigate that graph. This becomes really powerful and we can do really interesting things with all of this metadata, create entities from it, and use those entities as a way to navigate our telemetry as an alternative to navigating purely by time.

This is not a complete observability solution quite yet. What I am showing here is really the collection, sampling, processing, and storage blocks from the observability pipeline diagram. We are not doing a lot on the analysis and UI side. I have not set up any alerts, though I could do that using Grafana Alert Manager. But if you want something you could just drop in and have work out of the box, as I mentioned, Coroot is a great choice. SigNoz is another great choice. ClickStack is a good option too.

And just to look at this Coroot screenshot quickly: this is combining OpenTelemetry with another awesome observability tool called eBPF. eBPF is not just for distributed tracing. It is part of the Linux kernel and is used for a ton of things, but since it allows us to observe other processes from inside the Linux kernel it is really powerful for observability and especially distributed tracing. This screenshot comes from just using eBPF without any OpenTelemetry at all. Coroot also speaks OpenTelemetry, and there are other tools that use eBPF agents and export that data as OpenTelemetry so you could use it with any OpenTelemetry-compatible backend. And that is one of the really cool things about OpenTelemetry: it is so interoperable as long as you are speaking the specification. You do not have to use the built-in implementations.

[31:51] – ClickHouse Architecture: How Inserts Create Parts

Maciej: So for those who do not know ClickHouse, it is superb and easy to start with. You do an apt-get install, you have SQL, it reads most formats, it compresses data excellently, and it is really fast. So how does ClickHouse actually process inserts?

ClickHouse offers a variety of methods for data insertion, from integration with messaging systems like Kafka to reading data from external sources like S3. It can even connect to MySQL and PostgreSQL, and it reads most formats like JSON and Parquet. So what is exactly happening inside ClickHouse?

Like in any other database, a query arrives at the database. ClickHouse does some parsing, analyzing, and planning. Data is sorted, indexes are built, and then the insert is executed, creating what we call parts, which are immutable segments of the table, and those are flushed to disk. The key thing to know here is that the data lives in memory while the insert is being processed.

So what we want to do is parallelize this process to execute it faster. ClickHouse has a setting max_insert_threads, which tells ClickHouse how many threads to use when performing the insert. This particular setting applies only to INSERT INTO SELECT. Plain inserts, such as loading a CSV or TSV file, are also parallelized, in that case by input_format_parallel_parsing, which is enabled by default.

In this example we use four insert threads. The SELECT will read the source table in parallel using the max_threads setting, which by default uses all available CPUs in the system. Every thread allocates memory to read and pack the columns in blocks. Those blocks are squashed into bigger ones, and then parts are created. If the table has partitioning, each partition needs to have separate parts, which later get built in memory and flushed to disk.

In the course of the operation, ClickHouse can create many parts across all of those threads. It is important to make each part as large as you can and to avoid creating too many of them, because the merging process that follows consumes I/O and other resources. Also keep in mind that an INSERT INTO SELECT is basically two separate operations, the insert and the select, and both need memory.
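A back-of-the-envelope model of why batch size matters: assuming, as a simplification, that each insert produces one part per partition it touches, the same 1,000 rows create very different numbers of parts depending on how they are batched. The function and data are invented for illustration, and real ClickHouse behavior also depends on block-size settings and thread counts.

```python
def parts_created(rows, batch_size, partition_key):
    """Count parts under the simplified rule: one part per partition per insert."""
    parts = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        parts += len({partition_key(row) for row in batch})
    return parts


def by_day(row):
    return row["day"]


# 1,000 invented rows spread evenly across two daily partitions.
ROWS = [{"day": f"2025-06-{18 + i % 2}", "value": i} for i in range(1000)]

if __name__ == "__main__":
    for size in (1, 100, 1000):
        print(f"batch_size={size}: {parts_created(ROWS, size, by_day)} parts")
```

Every extra part is future merge work, which is why batching inserts upstream (for example in the collector) tends to be cheaper than letting the database clean up afterwards.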

[36:04] – ClickHouse Architecture: How GROUP BY Aggregation Parallelizes

Maciej: Now we will talk about how ClickHouse processes queries with aggregates. If you have a GROUP BY, like in the example here where we calculate the average delay per carrier from a flights table, after ClickHouse parses the query it builds a memory structure, analyzes the query, applies some optimization, and then the planner tells the executor how to get the data.

As I mentioned, ClickHouse by default will use all available CPUs to execute the query. Each thread scans over the data parts and runs streaming functions that read the data in groups, which we call buckets. Those buckets have hash tables, which are eventually merged, and at the end the result is passed back to the user. There is no caching layer built into the reading of data. ClickHouse relies on the Linux page cache to make things faster. If you do I/O over the same block, it will read from memory if it is available. It is an efficient and transparent caching layer.

Each thread splits data into buckets and calculates partial aggregates, which are later merged into final aggregates. Those partial aggregates are accumulated in hash tables, and internally each one is called an aggregate state. For example, for COUNT, the partial aggregation state is simply an incrementing counter. For AVG, the partial aggregation state remembers the numerator and denominator (the running sum and the row count), and in the end those are merged into the final aggregate. It looks similar to MapReduce, but ClickHouse aggregation is actually more sophisticated. For example, the aggregation is dynamic: those hash tables can be split into multiple levels to allow merging them in parallel.
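The partial-aggregate idea for an average can be sketched as follows: each worker builds its own hash table of per-key states (sum and count), and the states are merged before the final division. The flight data is invented, and this is a conceptual model only; ClickHouse's real implementation, including two-level hash tables, is considerably more involved.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Invented (carrier, delay_minutes) rows standing in for a flights table.
FLIGHTS = [
    ("AA", 10), ("DL", 0), ("AA", 30), ("UA", 5),
    ("DL", 20), ("UA", 15), ("AA", 20), ("DL", 10),
]


def partial_aggregate(chunk):
    """One worker's hash table: carrier -> [sum of delays, row count]."""
    state = defaultdict(lambda: [0, 0])
    for carrier, delay in chunk:
        state[carrier][0] += delay
        state[carrier][1] += 1
    return state


def merge_states(states):
    """Merge per-worker states, then finalize each average."""
    merged = defaultdict(lambda: [0, 0])
    for state in states:
        for carrier, (total, count) in state.items():
            merged[carrier][0] += total
            merged[carrier][1] += count
    return {carrier: total / count for carrier, (total, count) in merged.items()}


def avg_delay_by_carrier(rows, threads=4):
    """Split rows across workers, aggregate partially, then merge."""
    chunks = [rows[i::threads] for i in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        states = list(pool.map(partial_aggregate, chunks))
    return merge_states(states)


if __name__ == "__main__":
    print(avg_delay_by_carrier(FLIGHTS))
```

Keeping the sum and count separate until the end is what makes the merge correct: averaging the per-worker averages would give the wrong answer whenever workers see different numbers of rows per key.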

This is all happening in memory. You can enable spilling to disk for sorting, but it will slow down the query, so it is generally better to avoid it.

[39:56] – Demo 2: Built-In ClickHouse Distributed Tracing Across a Cluster

Maciej: Now I will show you distributed tracing inside ClickHouse itself.

As you can see here, this is Grafana, and we are using the trace component to visualize traces. I executed a simple GROUP BY query over a ClickHouse cluster with two replicas and two shards. As you can see here in the trace details, we can see all the spans from inside ClickHouse. It is built in. ClickHouse saves the traces into the OpenTelemetry span log table, system.opentelemetry_span_log. What is important here is that by default, these built-in traces cannot trace distributed queries. This needs to be done manually for now, though as far as I know the ClickHouse team is working on this.

As we can see here, the query is being executed. We have all the information we need. It was executed on the first node. If we scroll down, we can see that another node executed part of it. Since it is a distributed table, ClickHouse will delegate some part of the query to another node. With two replicas, the default load balancing is random, so it will choose one of the nodes to execute the GROUP BY query. It will do the partial aggregates and then aggregate again on the node where the query was originally executed.

Josh: Just to clarify for everybody: these trace spans are coming directly out of ClickHouse itself. ClickHouse stores trace spans for its own operations. With other databases you might get an exit span from the service that called the database, and then you would just see something like “there is a database call and it took 8 seconds” without any other insight into what happened inside that database call. Here, what we are seeing is the actual communication between ClickHouse nodes. We would also see communication with Keeper in these traces. You can see everything that is happening behind the curtain of the database call, using distributed tracing that is built into ClickHouse and is OpenTelemetry compatible. I love it. I wish that more tools worked this way, and I think they can in the future if we just pay attention to open standards and build observability in from the beginning.

[42:52] – Demo 2 Continued: Project Antalya Swarm Cluster Tracing

Maciej: What is most interesting is that we can do this for Project Antalya as well. For those who do not know, Project Antalya extends ClickHouse storage onto Iceberg, so we can have really cheap storage on Iceberg with fast executing queries. Most importantly, we have what we call swarm clusters, which are scalable pools of self-registering, stateless ClickHouse servers. The ClickHouse node that runs the query, which we call the vector, delegates reads for Parquet files to the swarm, and the swarm nodes execute the query.

As we can see here, this is a simple query that reads from Bitcoin transaction data on a public blockchain stored on S3. I am delegating the work to the swarm cluster. We can see that there are four swarm nodes and each has done some work. If we scroll here, we can see that the second node has done some aggregation and so on. The same for the others. We have the price details and so on. It can be easily used to track your queries, especially if they are executing slowly.

[45:28] – Final Thoughts

Josh: So just to summarize, distributed tracing is awesome, especially for understanding applications. For us, ClickHouse is an application. For you, it might just be part of your infrastructure, but distributed tracing is awesome for understanding any application. You might want to consider ClickHouse as a massively scalable telemetry data store if you are doing DIY observability at your organization. And if you are not doing completely DIY observability, then you might want to consider some ClickHouse-based observability tools like the ones we shared, especially if cardinality and dimensionality are important to you. You do not really want to think about that stuff in advance. You want your database to just handle it so you can figure it out later.

And tools with built-in observability like ClickHouse has: that is just awesome. Let us see more of that. Way to go ClickHouse team.

[46:41] – Q&A

Q: How do we make sure not to create too many small parts?

Josh: Use the OpenTelemetry batch processor in the OpenTelemetry Collector to make larger batches. There are also some settings on the collector exporter for ClickHouse, and we have a knowledge base article on atomic inserts that we will make sure to include in the follow-up email.

Q: How do you handle traces with an app generating 20,000 requests per second? Can I enable traces only for errors?

Josh: Yes, the OpenTelemetry Collector has a bunch of tools built in for sampling, and you can also sample at the source. Sampling at the source is going to be the most efficient, but then you are doing what is called head sampling: you do not actually know if there was an error in the trace yet when you are emitting it. Or you can do sampling in the collector using a strategy called tail sampling, where you wait until the trace is completed to see if there was an error and then make a decision about whether to keep that trace or not. It does come with challenges: you need to record the ratios at which you are discarding spans so that you can recreate any dashboards or counts that might be derived from them. But certainly if you are dealing with 20,000 requests per second, that might be something to look into. That said, ClickHouse could certainly handle trace spans at that volume if you scale it correctly.
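
The tail-sampling decision Josh describes can be sketched in a few lines of Python. This is an illustration of the idea only, not the collector's tail sampling processor: the trace layout, the "status" field, and the "weight" field are assumptions made for the example.

```python
import random


def tail_sample(traces, keep_ratio, rng=None):
    """Keep every trace that contains an error; keep the rest at keep_ratio.

    Each kept trace gets a weight (1 / probability of being kept) so that
    counts can be re-estimated after sampling.
    """
    rng = rng or random.Random()
    kept = []
    for trace in traces:
        has_error = any(span["status"] == "error" for span in trace["spans"])
        if has_error:
            kept.append({**trace, "weight": 1.0})
        elif rng.random() < keep_ratio:
            kept.append({**trace, "weight": 1.0 / keep_ratio})
    return kept


# Hypothetical completed traces; only t1 contains an error span.
traces = [
    {"trace_id": "t1", "spans": [{"status": "ok"}, {"status": "error"}]},
    {"trace_id": "t2", "spans": [{"status": "ok"}]},
    {"trace_id": "t3", "spans": [{"status": "ok"}]},
]

errors_only = tail_sample(traces, keep_ratio=0.0)
print([t["trace_id"] for t in errors_only])  # ['t1']
```

Summing the weights of the kept traces gives an estimate of the original trace count, which is why the discard ratio has to be recorded alongside the data.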

Q: We are concerned about cardinality. Can you explain how ClickHouse handles it?

Maciej: What needs to be said is that ClickHouse indexes are sparse. That means it does not create a full bitmap index. It just saves the minimum and maximum values for each granule. So even for a high-cardinality column, the index stores only a min and max per granule rather than an entry per value. We generally do not recommend high-cardinality columns in the ORDER BY clause. But if you need to have them, just be aware that the index is held in memory.
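
As a rough illustration of why a sparse index stays small, the Python sketch below keeps one entry per granule of sorted keys and uses it to decide which granules a lookup has to read. It is a simplified model of the description above, not ClickHouse's actual on-disk index format, and the granule size of 4 is chosen only to keep the example readable (ClickHouse's default granule is 8192 rows).

```python
GRANULE_SIZE = 4  # illustrative; ClickHouse's default granule is 8192 rows


def build_sparse_index(sorted_keys, granule_size=GRANULE_SIZE):
    """Return one (min_key, max_key) pair per granule of sorted keys."""
    index = []
    for start in range(0, len(sorted_keys), granule_size):
        granule = sorted_keys[start:start + granule_size]
        index.append((granule[0], granule[-1]))
    return index


def granules_to_read(index, wanted):
    """Return the granule numbers whose key range could contain `wanted`."""
    return [i for i, (lo, hi) in enumerate(index) if lo <= wanted <= hi]


keys = list(range(100, 120))  # 20 distinct, already sorted keys
index = build_sparse_index(keys)

print(len(index))                    # 5 index entries for 20 distinct keys
print(granules_to_read(index, 110))  # [2]
```

The index grows with the number of granules, not with the number of distinct values, so a column with many unique values does not inflate it.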

Josh: And if you do want to use a high-cardinality dimension in a way that makes queries faster, you might make a materialized view with a subset of the dimensions that can do some pre-calculation for you.

Q: If I send metrics over the collector to ClickHouse with no changes or processing, is the default schema good?

Josh: The default schema is pretty good. I have heard from people using it in production just fine. I have also heard some people had scale issues with the way they were reading the data and so they made minor adjustments to the schema to suit their read case a little bit better. Unless you are hitting extreme scale, the out-of-the-box schema should be good enough. If you do want to refine it, ping us in Slack.

Q: Is there a ClickHouse schema that matches semantic conventions?

Josh: The semantic conventions mostly have to do with what goes in the resource attributes and span attributes columns, which are maps. The schema is very flexible, and you can also invent your own attributes. I would prefix those with something like app.key or a prefix specific to you. One of the minor changes to the default schema that can help performance is ordering the keys in that map schema so that the keys are always written in the same order. You do not have to always have the same keys. They just have to always be in the same order. That has a dramatic impact on performance and compression.
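
A minimal Python sketch of the key-ordering idea follows. The attribute names are illustrative, and where this normalization would run (in the application, in a collector processor, or at insert time) is left open because it depends on the pipeline.

```python
def normalize_attributes(attributes):
    """Return a copy of the attribute map with keys in sorted order."""
    return {key: attributes[key] for key in sorted(attributes)}


# Two spans with different key sets, written in arbitrary order.
span_a = {"service.name": "checkout", "app.tenant": "acme", "app.region": "us-east"}
span_b = {"service.name": "checkout", "app.tenant": "globex"}

print(list(normalize_attributes(span_a)))  # ['app.region', 'app.tenant', 'service.name']
print(list(normalize_attributes(span_b)))  # ['app.tenant', 'service.name']
```

The two spans do not share the same keys, but the keys they do share always appear in the same relative order.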

Josh: All right, thanks for coming everyone. This recording will go out later today. If you are in the New York City area, come join us tonight. We are hosting an event. Details are on our website. Thanks again for coming, and see you next time.

FAQ Section

Q: Why is ClickHouse a good storage engine for OpenTelemetry data?

A: ClickHouse is purpose-built for write-once-read-many workloads at massive scale, which maps perfectly onto observability data. Traces, logs, and metrics are written once and almost never updated. ClickHouse’s columnar storage, parallel query execution, extreme cardinality support, automatic TTL and partitioning by time, and excellent compression for semi-structured data like map columns all make it a natural fit. Unlike time-series databases that require careful cardinality management, ClickHouse handles millions of unique dimension values without performance degradation.

Q: What is the OpenTelemetry Collector ClickHouse exporter and what does it do?

A: The OpenTelemetry Collector ClickHouse exporter is an extension for the OpenTelemetry Collector that exports traces, metrics, and logs directly to ClickHouse. It uses SQL and by default automatically creates the necessary schema in ClickHouse, so you do not have to manage table definitions manually. Any of the 90-plus receiver types supported by the OpenTelemetry Collector can be directed to ClickHouse using this exporter, covering a wide range of data sources from OTLP receivers to file log receivers to host metrics receivers.

Q: How does ClickHouse insert data, and why should I batch my inserts?

A: When ClickHouse processes an insert, it parses the data, sorts it according to the table’s ORDER BY, builds indexes, and writes an immutable data segment called a part to disk. In the background, ClickHouse merges these parts into larger, more efficient ones. If you create too many small parts, the background merge process cannot keep up, leading to insert throttling and eventually insert refusals. To avoid this, use the OpenTelemetry Collector’s batch processor to accumulate larger batches before writing, or use ClickHouse’s async insert feature. The goal is to create as large a part per insert as practical.
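
The batching idea can be sketched in Python as follows. This is a conceptual illustration, not the collector's batch processor: the Batcher class, its parameters, and the flush callback are hypothetical, and the age limit is only checked when a row arrives, whereas a real implementation would also flush on a timer.

```python
import time


class Batcher:
    """Buffer rows and flush them as one batch when a size or age limit is hit."""

    def __init__(self, flush, max_rows=10_000, max_age_seconds=5.0, clock=time.monotonic):
        self._flush = flush            # callable that performs one insert
        self._max_rows = max_rows
        self._max_age = max_age_seconds
        self._clock = clock
        self._rows = []
        self._oldest = None            # arrival time of the first buffered row

    def add(self, row):
        if not self._rows:
            self._oldest = self._clock()
        self._rows.append(row)
        too_big = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._oldest >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self._rows:
            self._flush(self._rows)
            self._rows = []


# Collect batches in a list instead of inserting into a database.
batches = []
batcher = Batcher(batches.append, max_rows=3, max_age_seconds=60.0)
for row in range(7):
    batcher.add(row)
batcher.flush()  # flush the remainder on shutdown

print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Seven rows become three inserts instead of seven, so ClickHouse writes three parts instead of seven. With realistic limits in the thousands of rows, the reduction is correspondingly larger.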

Q: How does ClickHouse parallelize GROUP BY aggregations?

A: When executing a GROUP BY query, ClickHouse uses all available CPU threads by default. Each thread scans a subset of the data parts and computes partial aggregates using hash tables. For example, for a COUNT, each thread maintains its own running count. For AVG, each thread maintains a numerator and denominator. Once all threads complete their partial aggregates, ClickHouse merges them into a final result. This is similar to a MapReduce approach but more sophisticated: the hash tables can be split into multiple levels and merged in parallel, and ClickHouse relies on the Linux page cache rather than a database-level cache for repeated reads over the same data.
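
The partial-aggregate pattern for AVG can be sketched in Python as below. This is a structural illustration only: the function names are hypothetical, the multi-level hash tables are not modeled, and Python threads do not give the CPU parallelism that ClickHouse's native threads do.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_avg(rows):
    """Build one thread's hash table: key -> [sum, count]."""
    table = {}
    for key, value in rows:
        state = table.setdefault(key, [0.0, 0])
        state[0] += value
        state[1] += 1
    return table


def merge(tables):
    """Merge partial states, then finalize each AVG as sum / count."""
    merged = {}
    for table in tables:
        for key, (total, count) in table.items():
            state = merged.setdefault(key, [0.0, 0])
            state[0] += total
            state[1] += count
    return {key: total / count for key, (total, count) in merged.items()}


def parallel_avg(rows, threads=4):
    """Split rows across threads, aggregate each chunk, merge the results."""
    chunks = [rows[i::threads] for i in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return merge(pool.map(partial_avg, chunks))


rows = [("a", 1.0), ("b", 10.0), ("a", 3.0), ("b", 20.0), ("a", 5.0)]
result = parallel_avg(rows)
print(result)  # {'a': 3.0, 'b': 15.0}
```

Each thread keeps a sum and a count rather than an average, because averages of chunks cannot be combined correctly while sums and counts can.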

Q: Does ClickHouse have built-in distributed tracing for its own operations?

A: Yes. ClickHouse has built-in OpenTelemetry instrumentation that records trace spans for its own query execution and stores them in the system.opentelemetry_span_log table. This allows you to see the actual inter-node communication happening inside a distributed query, including which shards received which portions of the work and how the partial aggregates were collected and merged. By default, distributed queries require manual trace ID propagation to appear as a single connected trace, though the ClickHouse team is working to improve this. This instrumentation is what makes it possible to observe ClickHouse at the same depth that you would observe your own microservices.
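
Manual propagation means handing ClickHouse a trace context generated by the caller. A minimal Python sketch of building one in the W3C Trace Context `traceparent` format (version, trace ID, parent span ID, flags) is shown below. The function name is hypothetical, and how the value is passed to ClickHouse (for example, as an HTTP header) should be checked against the ClickHouse documentation for your version.

```python
import re
import secrets


def new_traceparent(sampled=True):
    """Build a W3C Trace Context `traceparent` header value."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


# Shape check for version 00 of the format.
TRACEPARENT_RE = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")

header = new_traceparent()
print(header)
print(bool(TRACEPARENT_RE.match(header)))  # True
```

Reusing the same trace ID for every request that belongs to one logical operation is what lets the spans recorded on different nodes be joined into a single trace.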

Q: What are the easiest ways to get started with ClickHouse for observability if I do not want to build everything from scratch?

A: Three fully open-source, complete observability platforms use ClickHouse under the hood and are ready to use out of the box: SigNoz, Coroot, and ClickStack. Each includes an agent, a backend powered by ClickHouse, and an analysis UI. If you already have a Prometheus or Elasticsearch-compatible toolchain, Qryn provides a compatibility layer so you can use ClickHouse as the backend without changing your existing frontends. For DIY setups, the OpenTelemetry Collector ClickHouse exporter combined with the Grafana ClickHouse data source plugin provides a flexible foundation you can customize to your needs.

© 2026 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.
