Migrating from Druid to Next-Gen OLAP on ClickHouse®: eBay’s Experience

In this webinar, hosted by Altinity’s Robert Hodges, eBay engineers Sudeep Kumar and Mohan Garadi describe how their monitoring and telemetry team migrated the OLAP backend of eBay’s Sherlock.IO observability platform from Druid to ClickHouse running on Kubernetes. Sherlock.IO handles enormous scale, on the order of eight million metrics per second and over a billion raw events per minute, and the team introduced a new signal type, Sherlock.IO events, that is as verbose as a log but can be aggregated at scale like a metric and is not limited by data cardinality.
Mohan walks through the previous Druid pipeline and why it became hard to sustain: a 400 percent traffic increase over four years, rising CapEx, many moving parts across five-plus Druid components, the need to run separate real-time ingestion for one-minute, fifteen-minute, and one-hour granularities, costly daily reindexing, and slow direct Kafka indexing that forced a Tranquility-based near-real-time approach. Sudeep then explains how a ClickHouse proof of concept impressed the team with high ingestion throughput, roughly five times storage savings versus an inverted-index store like Elasticsearch, and notable stability, which led them to adopt the open source Altinity Kubernetes Operator for ClickHouse and build a custom federated operator for cross-region clusters.
The team details their ingress and egress architecture, a ClickHouse discovery module that tracks pod health by IP, a single raw table with materialized views producing one-minute, fifteen-minute, one-hour, and daily aggregations, buffer tables for small inserts, and partition-drop retention in place of TTL. They expose EventQL, a custom JSON format, and native SQL, route queries through a dedicated query cluster that holds the distributed tables, and lean heavily on monitoring, tracing, and a DevOps CLI. The result was higher ingestion throughput, fewer moving parts, reduced storage, and far fewer incidents, followed by an extended Q&A.
September 23, 2020 | 10 am PST / 7 pm CEST
with Sudeep Kumar – MTS 2, eBay and Rober Hodges – CEO, Altinity
If you are interested in receiving the PowerPoint presentation, please email us at marketing@altinity.com.
Key Moments (Timestamps)
Key moments generated with AI assistance.
- 0:06 – Introduction and housekeeping
- 1:36 – Meet the eBay team and Sherlock.IO
- 5:52 – Why eBay looked beyond Druid
- 8:11 – The Druid pipeline and its scale
- 12:36 – Druid infrastructure footprint and challenges
- 19:47 – Discovering ClickHouse: PoC and first impressions
- 22:21 – Running ClickHouse on Kubernetes with a custom operator
- 27:15 – Overall architecture: ingress and egress
- 31:13 – The OLAP schema, buffer tables, and retention
- 35:30 – Query granularities and the query cluster
- 39:54 – Monitoring, tracing, and DevOps
- 46:37 – Q&A
Webinar Transcript
0:06 — Introduction and Housekeeping
Robert Hodges: Hello everybody, my name is Robert Hodges and I’m CEO of Altinity. It’s my pleasure today to introduce two of our friends from eBay, Sudeep Kumar and Mohan Garadi, who’ll be presenting on migrating from Druid to next-gen OLAP on ClickHouse.
Before we get into the webinar, I’d just like to draw your attention to a few things that may help you have a better experience. First of all, this webinar will be recorded, and it is being recorded now. The slides as well as the recording will be posted as links after the talk. Second, we anticipate plenty of time for questions. If you have questions as the talk is proceeding, go ahead and post them into the question and answer box, which should be available off the menu bar at the bottom of the screen. We may answer some as we go along, but there will be time at the end, and hopefully we’ll be able to get to every question. Finally, we run a short poll at the end of all our webinars, just to find out whether the content was at the right level and to understand future topics that might be of interest. It takes you all of 10 seconds to fill out and helps us make future webinars better. With that, it’s my pleasure to turn this over to Sudeep Kumar and Mohan Garadi. Gentlemen, take it away.
Sudeep Kumar: Hello everyone, just to make sure, is my audio clear, Robert?
Robert Hodges: It’s perfect.
Sudeep Kumar: Hi. So I am Sudeep, I am a senior engineer with eBay, and we also have Mohan today who will be co-presenting the webinar. We are part of the monitoring and telemetry team within eBay. So today what we intend to do is talk about our story of how we migrated our OLAP backend from Druid onto ClickHouse using Kubernetes.
1:36 — Meet the eBay Team and Sherlock.IO
Sudeep Kumar: Talking about eBay, it’s important that I introduce eBay first. eBay is a global e-commerce leader operating within 33 countries. We have around 180 million active buyers, we had around 27.1 billion gross merchandise volume in Q2 2020, and we have around 400 million-plus downloads of the eBay app around the world. Across the major markets, we are in the top five in three of the continents. Whatever we are presenting today, just to put it out there, is also published as a tech blog which came out yesterday, so all the content we have here is available on our tech blog at ebaytech.com.
Talking about our platform, our platform is called Sherlock.IO, and it is a platform for all the monitoring needs within eBay. So we monitor anything and everything, such as applications, networks, and devices. The platform is centralized in nature and handles all kinds of monitoring signals, whether it’s logs, metrics, traces, or events. To give an idea about the scale of the platform, we handle around eight million metrics per second, we have around one billion raw events per minute, and we handle low single-digit petabytes of daily log volumes every day, which is fairly large.
So it’s a good time to introduce Sherlock.IO events. Most developers are familiar with logs and metrics. Logs are generally anything the application creates, and they’re quite verbose in nature, while metrics are time series data. What we did is define something new, which we called Sherlock.IO events, that is somewhere between a metric and a log. The nature of the event is such that it is as verbose as a log, but you can do aggregations at scale just like you would for metrics, so it’s somewhere in between. They are immutable, and they represent some key signals in the system, which could mean one of your databases or one of your network devices has failed. The volumes of these events are often sporadic, because when something happens in your system you tend to see a lot of these events coming in, so you should be able to handle that kind of volume in your searches. They adhere to a predefined schema format, and they are typically not limited by the cardinality of data, which most metrics backend systems today are typically sensitive to. You can run out of memory in some of these more recent databases where you have to keep these series in memory to take advantage of things like delta encoding. So it’s not really limited by the cardinality of data, and this is what we wanted to define as the nature of the event. To look at a few examples of use cases, we have application traces, OLAP data, and generic events, where a generic event could be something where a customer comes to you and says, this is my event, this is how the schema looks, and this is the daily volume I would like to ingest into your platform.
As we go along, I’d encourage folks to put in your questions, and if I can get to them I’ll get to them as we’re speaking, otherwise towards the end I can take all these questions.
5:52 — Why eBay Looked Beyond Druid
Sudeep Kumar: Now it’s important to understand why we had to look at alternatives. We had a system created using Druid, and it really worked for us over three years and was quite stable. But as we moved along, we saw an increase in volume, and there was an increase in CapEx requirements, and at some point we had to justify this CapEx need for the Druid pipeline. There were just too many moving parts for DevOps to maintain. Druid has a lot of components which work fairly well, but you need to be really aware of all these moving parts, which our DevOps team may not be comfortable with, especially as we don’t just handle Druid, we also use a lot of open source technologies like Elasticsearch and Hadoop, so it’s hard to keep track of so many things.
More recently, all our workloads have been moving towards Kubernetes, and as we tried to move our deployment towards Kubernetes, we did face some availability challenges. This is specific to eBay, because within eBay we have our own platform called Tess.IO, which is essentially Kubernetes as a service, and it has its own challenges with respect to network availability and so forth. So we saw some challenges running these workloads. We also thought that if we could prove this backend for OLAP with ClickHouse, we probably could look at the same backend for future generic events use cases, and it could be a good fit, because we have requirements where we need to handle generic events when a customer comes and says this is the schema I want you to handle. So these are the reasons we looked at why we had to go for something else. With this, I’ll hand over to Mohan, who will talk about the pipeline we had before, and then we’ll introduce how we moved on to ClickHouse.
8:11 — The Druid Pipeline and Its Scale
Mohan Garadi: Thanks, Sudeep. So like Sudeep gave a key introduction, I hope everyone is audible. We had our own challenges. To begin with, our initial volume on the Druid pipeline was around 250 million events per minute, but as shown in the previous slide, over a period of four years the traffic volume grew around 400 percent.
To give an idea of what this pipeline looks like, there are around 6,000 applications at eBay, give or take. It’s a little more today, probably, but at the time of writing this blog there were around 6,000 applications. We call it an application pool, so a pool represents a set of machines, and each app could have a different set of machines, so this number would be somewhere around 200,000 machines overall. All these machines send events in a specific format. There were around 11 dimensions and two metrics in this use case. It’s a fixed schema, but values could be anything, so we are looking at a cardinality of around 20 million unique cardinals coming in. The volume is around, today, a little more than a billion per minute.
These applications push the data into ingress, this is our own ingress that’s managed on our Sherlock.IO backend. Then this ingress does a high-level aggregation, not the overall aggregation level like the analytical engines do, but just to reduce the amount of volume coming into the system. The data comes into Kafka, which is just a messaging system we have in place in case something should go wrong, so we should be able to go back and replay from the time we could have potentially lost the data. But eventually this started to cause a bottleneck as well, because going back in time for this kind of volume in case of data loss is very challenging. So we have to make sure we don’t lose the old data for the period we missed, and also the new data that’s coming in. So you have to make a choice, either say it’s okay to lose old data because what matters for monitoring is real time. Everything else below Kafka that you see is a Kubernetes deployment. Given that there are no guarantees made on how the deployments happen on Kubernetes, in terms of, let’s say, a pod goes dead for some reason, you just need to see how you want to bring it back, and can you actually get back into the same state you were earlier in terms of data? That’s a big challenge.
So this is just the architecture. Everything else coming out is egress, where we get the data out of Druid using the Druid brokers. On top of that we have our own egress layer that gets the data out in a certain format, because we should have control over what kind of backend store we use, so there is a separate layer on top of the Druid brokers to address the data, so we can control what format we want to enforce for the user to query. There are other tasks that run outside of the Druid workload, which is a batch reindexer that you see on the left side. The batch reindexer is basically for the higher temporal aggregations. Let’s say you want to do daily aggregation; we don’t want to do it in real time because of the resource and hardware cost that incurs, so we had to do it in batch. One day is not a real-time requirement, so it’s just a thing for people to use for querying over a long period of time.
12:36 — Druid Infrastructure Footprint and Challenges
Mohan Garadi: Moving on, the Druid infrastructure footprint looks like this. There are these six components. The top five are the Druid standalone components to run the Druid backend. Tranquility is just an ingest API, but it still costs pretty huge capacity. This is all the capacity shown for one region. To operate Druid at scale, we need two such regions, because we cannot operate Druid on a single ZooKeeper that spans across regions. The reason is that if something goes wrong with partitions going down, let’s say a pod running a partition went down, and if two replicas of the same partition are running in two different regions, that would cause a problem. So we had two standalone Druid deployments running in two regions, and for each region we had two replicas per partition.
Why we incurred this kind of huge hardware hit was basically a couple of things. One is we need to support different granularities. This is mostly for SREs and application developers to triage if there is an issue, so you need data at one-minute granularity. But as you grow towards a larger number of days in your query, one-minute granularity doesn’t hold up anymore, so you need to move to 15 minutes. If you’re querying historical data over 15 days or three weeks, even 15-minute granularity starts to slow down and you need one-hour granularity. We were doing real-time ingest for all three granularities so that all three are available for triaging and for availability when people query for a long time. The highest in terms of capacity consumers were the middle managers and historical nodes, and this is per region. With this in consideration, the daily reindexing I was talking about earlier was needed because, if you were to do another real-time ingest for one-day granularity, it would again incur at least 25 percent more hardware than what we are presenting here, which could be a huge challenge.
There were other challenges. For example, there will be special days like seasonal traffic, or some application is doing bad logging. Of course we have protection in place for the system not to take any bad logging, but it’s a feedback loop, so we still have to let the traffic into Druid for some time to calculate that there is a huge surge in some kind of cardinality. Let’s say somebody decided to put item IDs or seller IDs on their applications’ events, and that suddenly increases the cardinality. For that to be calculated, we have to go to Druid, calculate it for a certain period, see that this application is exceeding some threshold we have, and then it has to be a feedback loop. So that kind of bad logging, if it happens on multiple applications, would definitely take a hit on Druid’s real-time processing. So resources always have to be kept over-allocated so we don’t have problems like these affecting everybody else.
Second is too many moving parts. There are five different components we need to run to operate Druid, and if any of these go wrong, however Druid is advertised so that none of these causes a single point of failure, in reality, while operating, if more than one of these things go wrong in different collocations, that would still pose a problem, and we have seen that in our experience. Additional deep storage also needs to be configured for higher availability. If you want to bring down historical nodes for some reason, let’s say the Kubernetes platform is not available for some time and the nodes running historical went down and the disk was taken out, you don’t have any context of historical data, because there is no cross-regional talk between two different Druid clusters. So you need to configure a highly available deep store for data to be retrieved in case something goes wrong with historicals. And higher granular temporal aggregations need reindexing, like I explained earlier, where we needed at least 25 percent more hardware if we were to do real time for one-day granular indexing.
Kafka indexing options were very slow. There is a direct Kafka indexing plugin. What we were showing in the slide before was Kafka taking data directly from Druid. In reality, there is one component actually missing in that diagram. We tried direct Kafka indexing from Druid, and for this kind of volume, where there were around 250,000 events per second, it used to get very, very slow. Direct Kafka indexing has to do periodical commits as Druid reads data, so if Druid processing slows down for any reason, the Kafka commits slow down and the lag builds up. The lag was really bad, and over time it was not able to catch up, which caused a huge problem. So we said real-time ingest is the only one that works for us, meaning we read from Kafka with the Tranquility API in between and then send it to Druid as if it’s real-time data. However, the data itself is near real time, not real time, and could be delayed from each application for up to three minutes. With that in mind, we had to do real-time ingest. So these were some of the challenges we faced. It worked pretty well for the first couple of years, but in the last year, when the traffic was almost 400 percent more, we started to face these issues related to direct Kafka indexing and things like this.
19:47 — Discovering ClickHouse: PoC and First Impressions
Sudeep Kumar: Thanks, Mohan. As Mohan explained, there were multiple things we had to take care of to run Druid for OLAP use cases. At some point last year we looked at the different alternatives. Since we were comfortable with Hadoop and Elasticsearch, we did explore them, but things were not working out as we wanted. Towards the end of 2019 is when we stumbled upon ClickHouse. Initially we just started looking at ClickHouse as a proof of concept, and almost immediately we realized it has very impressive ingestion numbers. Especially as your batch sizes grow, it can really eat up any kind of volume.
We also did some comparison with respect to storage, because most of the questions on the legacy platform were about why we had to spend so many resources on storage. As we looked at this, what we immediately realized is that when you compare something with Elasticsearch, which stores an inverted index, the storage requirements are much higher. While it works very well to search sparse data in Elasticsearch, storing an inverted index on disk has its own storage capacity cost. When we compared that, we saw we had about 500 percent storage savings if you do it on ClickHouse versus something like Elasticsearch. We had these proofs of concept done, and this is just PoC code, and we incubated the setup over multiple weeks, and surprisingly we didn’t really encounter many issues as we kept it running, which meant the system was quite stable. We decided at that point to look at this very closely and invest time here.
Another thing we realized is that there is already a ClickHouse operator which Robert’s team has created, which worked very well for us because we were already running workloads on Kubernetes. So we thought maybe we can just leverage this open source ClickHouse operator directly on the Kubernetes platform, and things should be much more seamless. And the OLAP story did fit nicely with the ClickHouse backend, which was always an advantage.
22:21 — Running ClickHouse on Kubernetes with a Custom Operator
Sudeep Kumar: Now, talking about ClickHouse and Kubernetes, there were a few things we decided to do. One is to leverage the open source ClickHouse operator to spawn these ClickHouse clusters locally. At the same time, what we wanted is for our ingestion and query layers to be aware of the ClickHouse infrastructure directly, so we can work out of IPs. The reason we wanted to work out of IPs is because, if you look at eBay’s network and some other huge internet networks, you have issues with respect to DNS propagation delays and DNS consistency. We just didn’t want to work with DNS name resolution, but work out of IPs directly.
As part of that, we created a thin ClickHouse discovery module, which looks at the annotations of different pods. In the annotation of the pod you basically have information that this is a replica ID and this is a shard ID, based on which you can create a logical cluster topology. This library continues to observe the lifecycle of the pods, and you can tell whether a pod is ready or not ready, and this data is made available in real time to our ingestion and query layer, so they have a good healthy snapshot of the cluster. The open source operator aligned well with our existing deployment strategy, so we did leverage it.
One limitation we saw, and this is something very specific, is that we run all our distributed applications across multiple regions, so you have region one and region two with applications hosted in both. The reason is recipiency. While the operator worked well to create the local clusters within Kubernetes, it didn’t really take care of the cross-region ClickHouse cluster that we wanted. So we decided this is something where we need to put in some engineering effort, and we created our own custom operator. It handles two custom resources, referred to as a federated ClickHouse installation and a federated ClickHouse cluster, and we have controllers that look at all the instances of these custom resources and take actions based on them. The federated ClickHouse installation represents the different clusters deployed on our Kubernetes infrastructure. The way we designed it, you can have multiple ClickHouse clusters on our infrastructure, and that is one of the advantages Kubernetes provides, that you can easily spawn as many clusters as you want. We wanted to take advantage of that instead of having a monolithic single huge ClickHouse cluster where all the data comes in.
The other object is the federated ClickHouse cluster, which creates metadata for the ClickHouse cluster for an individual Kubernetes cluster in different regions. So it will say, for cluster one in this region these are the settings, and in this region these are the settings, and it will use the open source ClickHouse operator’s objects to spawn them within these local clusters. That’s the model we work with. We did have multiple iterations to really harden this and get it into a very stable working state, and on top of this we have a lot of monitoring so we know exactly what is happening in case of any state inconsistencies, but this has really turned out well for us. These controllers run on a federated control plane on Kubernetes, and they don’t run on the local Kubernetes clusters; that’s how our default deployment looks.
27:15 — Overall Architecture: Ingress and Egress
Sudeep Kumar: In terms of architecture, this is what our overall high-level architecture looks like. We have ingress and egress. Ingress means your ingestion layer, and egress means your query layer, where all the query traffic comes from. The ingress layer is made of multiple processors, switched like a pipeline. At the moment you get a signal in your ingestion layer, it goes through different pipeline stages. First you validate the signal, then you see whether it adheres to the schema, and once it adheres to the schema we do a lookup on the backend infrastructure to find where it has to be written, and we go ahead and write onto the infrastructure and give an acknowledgement back to the client. At the same time, we additionally push some of this data onto the Kafka backend.
As we talked about earlier, the ingestion layer and the query layer use this thin library called the ClickHouse discovery module, which looks at the cluster’s view of the ClickHouse infrastructure and gets all the real-time states of the pods to see whether a pod is healthy or not, and will only use those handles to write the data, so your availability is pretty good that way. On the query layer, we have different formats exposed, which we’ll talk about later, but the most popular one is visualizing the data through Grafana. This visualization uses our own custom data source on Grafana, which adheres to a particular API format, so we encourage users to come onto our platform and use this plugin to visualize the data. On the ingress layer we also have remote read and remote write of Prometheus implemented, so your Prometheus Alertmanager can directly plug into egress and you can potentially have an alerting layer as well. Users can of course also have API-based access, which is always access controlled, so you need to authenticate and authorize yourself before you can query anything out of our backend. The ClickHouse clusters themselves are deployed across two regions, region A and region B. Typically we have two replicas in each region, so we have four copies essentially across two regions, and that’s the model we’re going with.
31:13 — The OLAP Schema, Buffer Tables, and Retention
Sudeep Kumar: Sudeep gave a quick insight about what kind of footprint Druid had. If you compare that against ClickHouse, you will see an immediate benefit. This is the total infrastructure footprint we have for OLAP on ClickHouse across two regions, which was fairly impressive for us, and probably this is also a little over-provisioned, to be honest, but this is what we have currently in production, which seems to work very well. We have in total 68 vCPUs, a memory of 1.7 TB, and 46 pods that we run in total across two regions. Some we run for ZooKeeper and some for ClickHouse.
So how does this OLAP data look now? We have 14 fields within the OLAP data, each of these attributes listed in this table. We did a few things, like we take advantage of low cardinality whenever we can. The ORDER BY is also important, because generally whenever someone queries for data, it always has a pool, which typically represents the application information. So we always order by pool, then the timestamp, followed by other labels, and this model has helped us look up data much faster. Additionally, we use buffer tables, because sometimes in some of the ingest sources the batch sizes could come down and that would create just too many files on the backend, which we really don’t want. So if we see that a batch size is not meeting a particular threshold, we write into the buffer table directly. We want the system to be eventually consistent, so that’s fine; it will take its own time to process back into the backend, which works well for us.
We have different materialized views for 15 minutes, one hour, and daily data, so you have these different views created on this particular table. On top of it we have a retention policy configured. Retention is something I’d like to talk about a little, because initially when we started retention, we just thought we’d use time to live and maybe we’d be done with it. But that didn’t really give us a lot of control, and we did run into occasional issues where sometimes it was not honoring the TTL. So what we instead decided to do is not use TTL, and instead just look at the different partitions and drop the partitions if they exceed a certain size. That worked well, and that’s the model we are running with for retaining data, so different tables have different retention periods configured.
35:30 — Query Granularities and the Query Cluster
Mohan Garadi: For the query layer, we support three different formats. One is EventQL, as we call it, which is just a shortened version of PromQL. The reason for supporting this is it naturally fits our existing use case for metrics, because for metrics we are using Prometheus and PromQL, so instead of developers learning a brand new language, since they’re already comfortable writing PromQL queries, EventQL would look similar. The only thing we need to do here, at least in this version of our own PromQL implementation, is that for OLAP data that already has counts and other numerical columns aggregated, it’s okay, because you know which column you want to do the projection on for aggregation. But for the rest, let’s say for some other events the customer wants where they just want to read out the raw events, in case of raw events there is most likely no numerical column; all values are string values. So we fabricate an extra column in ClickHouse, saying there is a numerical value with a default value of 1 for every row. That value is really not used, it’s just to make it compatible with the PromQL-style query, so we always assume there is a numerical value associated even if it’s a dummy value in the ClickHouse table.
The second is the custom E1 format. We had an event store on Elasticsearch, as Sudeep was mentioning earlier, so this one was just to keep it backward compatible. It’s really a JSON payload that you send with what columns you’re looking for, what aggregations, and the time range, it’s as simple as that. Third is, of course, SQL, the native support provided by ClickHouse. For SQL, it’s really not integrated with our Grafana today, but probably we could support this in future. The support is there if somebody wants to query through an API where they just want to do a GET call to our backend through egress and provide a natural SQL query, and it just works like any other SQL query.
So based on what the user is looking for, let’s say in Druid we used to do one minute, 15 minute, and one hour. Here we really don’t have to do three different real-time ingests into different tables. What we really do is ingest into a single table as raw data as it comes in, and aggregations happen on materialized views on top of the concrete table. So for one-minute, 15-minute, one-hour, and one-day granularities, the aggregations happen in real time as data is inserted into the concrete table. So based on the user query, we optimize it. Let’s say the user is looking for six hours of data, or even one day of data, and if they don’t specify granularity, we can fall back to the least granular data, which would be one minute. If they want one week’s worth of data, we know we have a preset interval of time for which we need to fall back to the next granularity, so we move to 15-minute data. For three weeks we probably go to one-hour data, and anything more than a month or two months, the query will fall back to one-day granularity. But all this data is available until the current minute, where aggregations are happening, so we don’t have to do anything else apart from setting up the materialized views.
To give a gist of how the query layer is set up, per ClickHouse we don’t have the distributed tables created on every ClickHouse cluster. Like Sudeep mentioned, we could spawn even 50 individual ClickHouse clusters at some point, each with its own standalone installations in every region. But what we have is a concept of a query cluster. These query clusters are basically a standalone ClickHouse cluster that doesn’t do any kind of data ingest, but is just a cluster that holds all the distributed tables. So if we have, in the picture, one OLAP cluster and another generic events cluster that customers are sending data to, we have distributed tables created for each one of these. The cluster mapping happens based on what cluster you specify in the distributed table. For every OLAP query that comes in, we just hit the query cluster, and it naturally goes into the OLAP cluster pods to get the data from the concrete tables, combines the data, and returns it to the user. If any other generic events come in, it goes to the generic cluster. So it’s always mapped to the right cluster. As we onboard any new schemas, we figure out which cluster is least occupied, we create the concrete tables and buffer table definitions there, and all the distributed table definitions happen on the query cluster, so we should be able to query the data as soon as it starts coming in.
39:54 — Monitoring, Tracing, and DevOps
Sudeep Kumar: An important part of everything we built is monitoring, which I’m sure most of us are already doing, and these days the most popular one is Grafana for visualizations. A few things to touch upon: other than just doing infrastructure monitoring, we do extensive monitoring on both our ingestion and query layers. This includes things like how many requests per user are being fired and what the average duration is for a particular user. Based on this we can plan to block a particular user, or say maybe we need to add capacity if it’s really a valid use case. We use a lot of dogfooding, because we are the monitoring platform of choice within eBay, so we use our own platform to capture these metrics and create alerts on top of them. We do have parallel systems in place just in case our platform is in question, then we can always depend on them, but we always prefer the dogfooding operation. That’s the first place we go to check whether everything is working fine, and it gives us a good insight about how good your platform is. There are a lot of alerts we created with respect to availability. For example, if our ingestion is having less than 99 percent availability we get an alert, and on the query side, if we see queries where too many rows are being scanned and they’re timing out, we have alerts based on that. These give us a good sense of the availability of the platform. When you say availability, we don’t just look at the downtime, we look at all these aspects, and that’s a really good way to measure what the customers are seeing on the platform.
Other considerations: we also do tracing. For tracing we have our own tracing setup. If we briefly touched upon our injection layer, where we have different processors, on the right you’ll see things like whitelist, validate processor, sanitize, then validate the schema, then the ClickHouse write. We have all this information here, and if something fails we get a lot more insight about what’s really happening. At some point we would look at storing some of this back into ClickHouse if possible, but right now in the current state we use a separate tool.
As we created the new platform, we wanted to make sure life for our DevOps team really gets better, and that was something hugely missing in the previous pipeline. We realized that if you don’t enable DevOps to do their job well, there would be a lot of complaints and interactions, which we wanted to avoid. So we created a CLI with which we can look at the different ClickHouse clusters on the setup, what kind of disk utilization is happening, what kind of memory footprint. You can do all this with the CLI; you just have to log in and provide your credentials, and then you can query these clusters and get a good snapshot. This has also helped us a lot in terms of the ZooKeeper deployment. As we said, we want everything distributed across two different regions, so we have ZooKeeper across two regions. One region always participates in the leader election, while the other region is always in observer mode. So if one region’s availability is impacted, we have to go in and flip the other region to leader election mode. We are trying to automate a few of these things on ZooKeeper, however we are not yet where we want to be with respect to the spending story, but this is how we do the ZooKeeper deployment on our infrastructure.
To summarize, one immediate difference we saw is very high ingestion throughput, which is what we really wanted, and we were quite happy with it. We had a reduced processing layer, so we didn’t want too many moving parts, and that helped. To be honest, we were also quite comfortable with ZooKeeper, so if there is just a ZooKeeper dependency, we were very comfortable handling it. We also saw a reduced storage footprint, which definitely helped. The main important thing is it’s been very easy to manage in server migration, so we hardly had any incidents for the last six or seven months since we migrated. That’s been a good success story for us, fewer incidents for DevOps, and since availability was better, our customers, which is the SREs Mohan mentioned previously, were very happy with the new platform in the way they got their output back immediately. So this is where we are in terms of our story with ClickHouse on OLAP. At some point we would look at whether we can support generic events use cases on ClickHouse as well, and we already started putting in efforts around that.
This is a typical slide with every presentation, I guess, but eBay is hiring. We are hiring across all the different cities and countries, and remote work is definitely an option in the current times; in fact we encourage it right now. There are a lot of areas where we are hiring, some listed below, and you can always look at careers.ebayinc.com. Please also reach out to Amber, who is our engineering head for monitoring, and he will be happy to help. I’d like to thank everyone who has been on the call. Thank you so much for your time and patience. I think it’s a good time to pull up the Q&A.
46:37 — Q&A
Robert Hodges: Let me jump in for a second here. I want to thank you, Sudeep and Mohan, for a really awesome talk. I also want to thank you for all the feedback you’ve given on running ClickHouse on Kubernetes. We’ve had conversations with you over, I think it’s almost two years at this point, discussing what you’re doing, and it’s been great to see this develop. I’m going to kick off our poll right now. Please take a few seconds and fill it out, it’ll help us do better webinars in future. Sudeep and Mohan, I think what might work best is why don’t I just read the open questions out and you can jump in and answer them.
Sudeep Kumar: Sure.
Robert Hodges: First one: thanks, eBay team, this is a great presentation. I’ll say yes to that. The question is about other alternatives you considered as Druid replacements.
Sudeep Kumar: Sure. One thing we considered, we looked at Hadoop. Hadoop can really take a lot of this traffic and do a lot of things we wanted, but it’s not real time. You have to run a MapReduce job offline, which is not really real time, and that was really limiting for us, so although we looked at it a little bit, we didn’t want to go that route because it didn’t work for us. The other one we looked at was whether Elasticsearch works well, because it’s also quite popular within eBay and we use it for a lot of different things, and some of us are very comfortable with that backend. As we looked, we saw that the cost was not justified, because as I said we saw 500 percent improvement in storage cost having it on ClickHouse versus on Elasticsearch, and that’s our personal observation. With that we decided we’d be running into the same issues we had before, which we didn’t want. So those are the two we definitely explored, and then we just looked at ClickHouse, so I would say three.
Robert Hodges: Great, thank you. Next question: does ClickHouse have equivalents to Clarity and Pivot for Druid?
Sudeep Kumar: Clarity and Pivot for Druid, I’m sorry, Mohan, do you want to take that?
Mohan Garadi: I really haven’t explored that option.
Robert Hodges: What I’d say is, if the person who asked that wants to qualify it a little bit, we can dig more deeply into it. Meanwhile we’ll go to the next question, regarding clustered ClickHouse. What happens when you want to dynamically resize the cluster? Does ClickHouse manage that automatically for you, or did you implement it yourself?
Sudeep Kumar: That’s a good question. It actually happens immediately, in terms of latency, as long as your capacity can be met. The way we do it, we mentioned the two resource objects on the federated control plane that our controllers run on, one is the federated ClickHouse installation and the other is the federated ClickHouse cluster. The federated ClickHouse cluster has metadata about how the cluster has to be created, so it would say I need two shards in this region for this cluster and two shards for this region in this cluster. You can always increase that particular number, and then the controller would look at it: the spec says this is what I need, but the current state is not this, let me use the ClickHouse installation object, which the open source ClickHouse operator uses, and spawn those additional shards. There is not much delay as long as we have capacity. To be honest, we haven’t done a lot of resizing on our clusters, so I cannot completely vouch for what kind of stability we’d see there.
Robert Hodges: The next question is what database to use, and I think that was in reference to something else, so I’m going to pass over that one. Next question: can ClickHouse handle batch fetch, like tens of millions of records? Are there any limitations? Not sure ClickHouse is designed for this type of usage or not.
Robert Hodges: I’ll take that, because this is a general question. Usually people don’t do this, and it’d be interesting to understand your use case. What you typically do with ClickHouse is, for example, you’re processing aggregates, or in this OLAP case you’re drilling down on a particular service or transaction type to understand why something is not working. In these slicing and dicing cases, the drill-downs tend to be on relatively constrained datasets. That said, there’s nothing that prevents you from doing a batch fetch out of ClickHouse using a select. The one thing I would say is, depending on what problem you’re trying to solve, you might be better off pushing the data into something like S3 and using that as your exchange format, only because if you’re doing it at real scale, like the scale we’re talking here, it’s an enormous amount of data, and downloading it over a single connection is not the most efficient way to do it. So definitely an interesting question, and one that, if you’re interested, feel free to go deeper on.
Robert Hodges: Here’s a question: were you storing an equivalent amount of data in both Druid and ClickHouse? The equivalent with regard to binning was a bit confusing.
Sudeep Kumar: In terms of volume, yes, it was equivalent, it was the same data. The way we switched over is not like an immediate cutover. We ran both systems for a period of time, both Druid and ClickHouse, and we wanted to incubate the ClickHouse setup for about a month or so. Between this period we started seeing some more issues on Druid, at which point we said let’s just do the cutover even before that and see what happens. So we did a cutover within about two weeks of our data pipeline being ready. That’s how we went about it.
Robert Hodges: Great. Next question: if MySQL is your database, did you replicate the data from MySQL to ClickHouse? Was that relevant for your use case?
Sudeep Kumar: No, because we aren’t using MySQL in any capacity, at least for this.
Robert Hodges: Mohan, do you want to jump in on that?
Mohan Garadi: To answer that question, one thing that just flashed on my mind: there was a recent use case where we have a logs storage system, and there is metadata associated with it. To store the logs, there is some metadata we need to look up, like where the logs are stored on the file system, and that metadata itself is a huge-scale dataset that was using a MySQL database earlier. But they had issues with deduping the data into MySQL. It’s our own platform monitoring use case, by the way, so we had to migrate them out of MySQL into ClickHouse. That’s not fully migrated, but we did a PoC, it was pretty successful, and we use the ClickHouse native capability to dedupe the data, and it’s quite successful.
Robert Hodges: Just for anybody specifically interested in replication from MySQL to ClickHouse, there is a new engine in ClickHouse which can actually read the binlogs. It’s called, I believe, MaterializeMySQL. I haven’t personally used it, but it’s coming up. Here’s a question: you mentioned that ClickHouse is great at compressing data; are there any numbers on your previous Druid storage capacity versus current ClickHouse?
Sudeep Kumar: To be clear, when we said the storage cost, we were not really looking at Druid per se. We were looking specifically at Elasticsearch as another potential option, and that’s where we did the comparison, because that also looked very promising to us. One complaint we often had with respect to our platform is that, since it’s a huge platform, we have to always justify the cost. So we were very sure we didn’t want to go that route, because we saw 5x CapEx needs on Elasticsearch, and Elasticsearch in general recommends you use SSD nodes, which are fairly expensive.
Robert Hodges: There’s a clarification of a previous question about Clarity and Pivot. Clarity is monitoring, which is already covered. Pivot is a web-based query interface that allows you to issue arbitrary queries with facets based on introspection of the data. Was there any equivalent to that in the new system?
Mohan Garadi: Yes, basically you can explore Tabix. Tabix is the equivalent. One of the customers is actually exploring it. Like I was explaining earlier, we have EventQL, we have SQL support on the API, but these are essentially returning data in JSON or PromQL response format. So to answer the question, for interactive data exploration in real time, where you want to query multiple tables or join across tables and discover the schema, you can use Tabix. I haven’t personally used it, but I’ve seen how it works, and we are in the process of looking into how we can support it for our customers at eBay. Tabix actually looks great.
Sudeep Kumar: To give better clarity, what Mohan was speaking about is the future iterations for this platform, when we want to onboard generic events use cases. Some people within eBay are already using ClickHouse in their limited capacity, and since we are the centralized platform, we were looking at providing this at scale. As part of that there were questions around whether they can use Tabix and things around it. What we generally prefer is to come with a standardized way, come onto our platform and do the visualization. For SQL support, what we’ve done is we ask them, you can host your own workload with the open source ClickHouse data source plugin and set your authentication headers properly, and then we would entertain you, but of course the policies on rate limiting and other quota-based thresholds are always applied.
Robert Hodges: We’re getting close to the top of the hour, so I’m going to try to get three more questions in before we close. Let’s take the next one: how many types of components are there in ClickHouse, as opposed to Druid, which has five plus ZooKeeper? How does the performance compare to Druid? And did you just have a single cluster definition with a couple of shards?
Sudeep Kumar: No, right now as we speak we have like four or five clusters.
Robert Hodges: And how many ZooKeeper ensembles?
Sudeep Kumar: We have ZooKeeper per cluster. We typically don’t go with the model where we share ZooKeeper across clusters; that’s also our recommendation.
Robert Hodges: Great. And just performance compared to Druid, I think you’ve talked about this, but you were pretty impressed, it sounds like?
Sudeep Kumar: Yes, especially with the ingestion and the stability overall on our Kubernetes platform.
Robert Hodges: Here’s a great question: does ClickHouse provide ways to backfill older data?
Sudeep Kumar: For backfill, you have to run some kind of, there’s a copier tool that you can run to copy some of this data. But if you just create a new materialized view, I think it will just supply from that point in time over time.
Robert Hodges: There actually is the ability to backfill, but it depends on exactly what you mean by that, so if anybody wants to go deeper on that, feel free to contact us later on. I’m going to take one final question, which I think is really interesting: what’s the storage on each node? Because you’re in Kubernetes, and I think this is also relevant to another question, which was whether you experienced any performance problems with Kubernetes persistent volumes. Can you talk a little about the storage you’re using and how you manage that in Kubernetes?
Sudeep Kumar: Yes, we are using persistent volumes. In terms of the size, Mohan, do you remember?
Mohan Garadi: Yes, we have around 12 terabytes, basically for OLAP data, approximately 12 terabytes for three weeks’ worth of retention. Per node it’s around 580 to 600 GB right now, on local SSD.
Robert Hodges: Interesting. And did you have any issues with pod scheduling or things like that, because you obviously have to schedule storage?
Mohan Garadi: There are a few issues. Sometimes what happens is, say you lost a PVC, and the pod keeps coming up on the same PVC always, so at that point you have to get in and clear up the PVC and then spawn the pod again. There are a few issues like that, but it may be very specific to how eBay runs Kubernetes within our infrastructure, so I’m not very sure whether it’s a more general problem or not.
Robert Hodges: I thought it was interesting, because we do know that managing PVCs effectively, for example not losing storage, is definitely one of those areas where, if you’re operating at scale in Kubernetes, you do want to pay attention to that. In general we haven’t seen a lot of problems on our side, so it sounds like it wasn’t a big deal for you as well.
Mohan Garadi: It does happen occasionally, but it’s okay; at some point the replication does catch up, so it’s fine.
Robert Hodges: We’re four minutes over the hour, we have more questions, but I think we’re going to close now. If you have additional questions, feel free to post them out to info@altinity.com and we’ll get them relayed to Sudeep and Mohan. And Sudeep and Mohan, thank you so much for an absolutely awesome talk. It’s been a total pleasure hearing about this, and seeing the result of all the work you’ve done over the last year to set up this system on ClickHouse.
Sudeep Kumar: Thanks, Robert. Thanks for providing the opportunity to talk today, it’s a pleasure. I’d like to thank all the attendees as well.
Robert Hodges: It’s wonderful to host you today. I hope this was useful, and we look forward to seeing you at our next webinar. Thank you very much and have a great day.
FAQ
Why did eBay decide to move its OLAP backend off Druid? Druid worked well and stably for about three years, but as traffic grew roughly 400 percent over four years, CapEx rose and became hard to justify. Druid also had many moving parts across five-plus components that were difficult for DevOps to operate alongside other systems like Elasticsearch and Hadoop, and the team hit availability challenges moving workloads onto eBay’s Kubernetes platform. They also wanted a single backend that could later serve generic, customer-defined event schemas.
How does eBay run ClickHouse on Kubernetes? They use the open source Altinity Kubernetes Operator for ClickHouse to spawn clusters locally, plus a thin ClickHouse discovery module that reads pod annotations (shard and replica IDs) to build a live, IP-based cluster topology, avoiding DNS propagation issues. Because the operator handled local clusters but not cross-region coordination, eBay built a custom federated operator with two custom resources, a federated ClickHouse installation and a federated ClickHouse cluster, managed from a federated control plane.
How does the new platform handle different time granularities? Instead of running separate real-time ingestion for each granularity as in Druid, eBay ingests raw data into a single concrete table and uses materialized views to produce one-minute, fifteen-minute, one-hour, and one-day aggregations in real time as data arrives. The query layer then picks the appropriate granularity based on the query’s time range, falling back to coarser aggregations for longer ranges, with data current up to the latest minute.
How is data ingestion and querying structured? Ingestion flows through an ingress pipeline that validates the signal, checks the schema, looks up the target backend, and writes to ClickHouse, while also writing to Kafka. Small batches are written through buffer tables to avoid creating too many files. Queries go through an egress layer exposing EventQL (a PromQL-like language), a backward-compatible JSON format, and native SQL, and are routed through a dedicated query cluster that holds the distributed tables and maps each query to the right backend cluster.
How does eBay handle data retention in ClickHouse? They initially tried TTL but found it gave limited control and occasionally did not honor the configured expiry. Instead they monitor partitions and drop them once they exceed a certain size, configuring different retention periods per table. This partition-drop approach gave them more predictable control over storage than relying on TTL alone.
What were the main benefits eBay saw after migrating? The team reported very high ingestion throughput, a reduced and simpler processing layer with fewer moving parts, and a substantially smaller storage footprint, with their comparison showing about five times storage savings versus an inverted-index store like Elasticsearch. Operationally, the platform became much easier to manage, with hardly any incidents in the six to seven months after migration, and their SRE customers were happier with response times and availability.
© 2026 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.
ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc.