Aug 26, 2019
On August 17 I had the pleasure of presenting at Data Con LA 2019. My talk was "Data Warehouse and Kubernetes: Lessons from the ClickHouse Operator." It described what we learned from our work to enable ClickHouse to run easily on Kubernetes. This short article discusses key points from the talk as well as takeaways from the conference itself.
Data Warehouse On Kubernetes
ClickHouse runs well on Kubernetes. Our involvement with this environment began when we commenced implementation of the ClickHouse Operator, which sets up ClickHouse clusters in Kubernetes. We introduced the first version in April and have been working non-stop since then to make it stable for production use as well as add improvements requested by users. We now see a number of users who were using home-grown approaches switching to the ClickHouse Operator instead.
Along the way we’ve learned a number of basic but useful lessons about operating ClickHouse on Kubernetes.
- Kubernetes DNS breaks ClickHouse assumptions. Kubernetes DNS mappings can switch quickly due to pod restarts, which invalidates ClickHouse DNS caches in other pods. Pod DNS names also do not resolve on pod startup, which means ClickHouse nodes cannot determine their local IP at boot time. We worked around these by explicitly invalidating caches and populating /etc/hosts directly. Similar problems are likely to arise in other clustered databases, especially any that cache DNS lookups.
- Kubernetes performance overhead is minimal. Our internal tests in Amazon environments show ClickHouse response on Kubernetes is not markedly different from running directly on VMs.
- Error handling is tricky. Kubernetes processes cluster resource definitions asynchronously. Users can directly configure underlying resources like storage claims. There are a lot of things that can go wrong. For instance, storage provider semantics depend on information about cluster configuration that may not be visible to the operator, which makes it hard to anticipate errors. This is a problem for any data service on Kubernetes. We’re adding global profiles among other features to help ClickHouse users get defaults right and make misconfiguration less likely.
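To make the DNS lesson concrete, here is a minimal Python sketch of the two workarounds mentioned in the list: dropping a cached lookup after a pod restart, and rendering /etc/hosts-style entries so a node can resolve names at boot. It is an illustration under stated assumptions, not the operator's actual code. The class and function names, the pod name `chi-demo-0-0`, and the IP addresses are all hypothetical, and a plain dictionary stands in for cluster DNS.

```python
from typing import Callable, Dict, List, Optional


class CachingResolver:
    """Caches hostname -> IP lookups until they are explicitly invalidated."""

    def __init__(self, lookup: Callable[[str], str]) -> None:
        self._lookup = lookup            # injected resolver (hypothetical stand-in for DNS)
        self._cache: Dict[str, str] = {}

    def resolve(self, host: str) -> str:
        # Only hit the underlying resolver on a cache miss.
        if host not in self._cache:
            self._cache[host] = self._lookup(host)
        return self._cache[host]

    def invalidate(self, host: Optional[str] = None) -> None:
        # Drop one entry, or the whole cache when no host is given.
        if host is None:
            self._cache.clear()
        else:
            self._cache.pop(host, None)


def render_hosts_entries(mapping: Dict[str, str]) -> List[str]:
    """Render hostname -> IP pairs as /etc/hosts-style lines, sorted by hostname."""
    return [f"{ip}\t{host}" for host, ip in sorted(mapping.items())]


# Simulated cluster DNS: a dict we can mutate to mimic a pod restart.
dns = {"chi-demo-0-0": "10.0.0.5"}
resolver = CachingResolver(lambda h: dns[h])

first = resolver.resolve("chi-demo-0-0")   # populates the cache
dns["chi-demo-0-0"] = "10.0.0.9"           # pod restarts and gets a new IP
stale = resolver.resolve("chi-demo-0-0")   # cache still returns the old IP
resolver.invalidate("chi-demo-0-0")        # explicit invalidation
fresh = resolver.resolve("chi-demo-0-0")   # lookup now sees the new IP

hosts_lines = render_hosts_entries(dns)
```

The sketch shows why a cached lookup goes stale after a restart and why the fix has to be an explicit invalidation rather than waiting for the cache to expire on its own.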
Kubernetes applications for data tend to have complex resource definitions. The ClickHouse operator is a big step forward because it reduces the data warehouse definition to a single file with simple management. You can set up complex data warehouse configurations in minutes, something that was previously only possible with cloud services like Redshift. Kubernetes also allows you to map ClickHouse flexibly to underlying hardware, which paves the way for economical operation over time.
We still have work ahead to round out ClickHouse operator features and add layers above Kubernetes to implement simple policy-based control of security, availability, and migration. Our experience so far, though, has been very positive. It shows that data warehouses can run well on Kubernetes.
Data Con LA 2019 Conference
The Data Con LA 2019 conference was great: well attended and well organized. It took place on the University of Southern California campus, which is a great venue.
Most talks focused on analytics and data science. I liked that presenters zeroed in on specific business problems, showing how to solve them with data. Annie Flippo did a talk describing how geographic data enables extremely accurate marketing segmentation. Two engineers from GumGum talked about how they use time series data and forecasting to drive ad placement. John Cooper of FabFitFun discussed how his company created a user community to help understand their customers better and enable data-driven product decisions. In each case the selection committee for the conference picked good speakers who knew their subject well.
Aside from learning more about data-driven applications, the biggest takeaway for me was how many businesses are on Redshift. A lot of them are not especially happy because of high cost and difficulty scaling to larger data volumes. ClickHouse does not cover SQL as completely as Redshift but is fast and economical to operate. As our Kubernetes work showed, there is now a viable way to set up ClickHouse data warehouses just as easily as Redshift clusters. I would encourage community users who have experience with both to post about their experiences. We are happy to publish articles on the Altinity blog, as we did back in February.
Meanwhile I would like to thank Subash D’Souza as well as the sponsors, speakers, and volunteers who made Data Con LA an outstanding experience for all attendees. We are already planning to be a sponsor for the 2020 conference and hope to see many of our readers there.