Build a Low-Cost, High-Performance Analytic Platform with Kubernetes and Open Source

Recorded: July 27 @ 10:00 am PT
Presenter: Robert Hodges, CEO @Altinity

In this webinar, Altinity CEO Robert Hodges makes the case that businesses can build analytic platforms that outperform proprietary cloud databases like Snowflake and BigQuery for specific, well-defined use cases. He begins by walking through Snowflake’s architecture and strengths, including its general-purpose SQL capabilities, self-tuning columnar storage, and serverless operations, before cataloging the key weaknesses: data is not stored in your own account, costs spiral for 24×7 real-time workloads, and the platform is fully proprietary with no easy exit.

Robert then frames a specific design exercise: building a GDPR-compliant replacement for Google Analytics. This problem scope deliberately avoids Snowflake’s general-purpose strengths and plays directly to open-source alternatives. He introduces ClickHouse® as the ideal database for this use case, explaining its shared-nothing architecture, vectorized parallel query engine, columnar storage with deep compression, and support for tiered storage combining block storage hot tiers with S3 cold tiers.

With the database chosen, Robert adds the surrounding open-source stack: the Altinity Kubernetes Operator for ClickHouse to manage the database, ZooKeeper for cluster consensus, Prometheus and Grafana for observability, and CloudBeaver for SQL editing. He then demonstrates how to deploy the entire stack in minutes using Argo CD and GitOps, storing all configuration in GitHub and synchronizing it to a live Amazon EKS cluster. The demo shows full lifecycle management: creating applications, syncing them, modifying configuration in Git, and re-synchronizing to update the running stack.

The webinar closes with three practical areas every team must address before going to production: the build vs. buy decision (including the option to let Altinity.Cloud Anywhere manage ClickHouse inside your own Kubernetes), security hardening including TLS encryption, IP whitelisting, and Kubernetes Secrets, and the broader operational tasks of capacity planning, backup, monitoring, and automation. Robert also addresses audience questions on tiered S3 storage with Parquet, comparisons between Altinity Stable Builds and upstream ClickHouse builds, on-premises Kubernetes compatibility, and FIPS-compatible builds for FedRAMP and PCI DSS environments.

Here are the slides:

Build-Low-Cost-High-Performance-Analytics-with-K8s-and-Open-Source-2023-07-27

Key Moments (Timestamps)

0:04 – Introduction, housekeeping, and speaker overview
1:28 – Robert Hodges background and Altinity company overview
3:11 – Snowflake architecture overview and key strengths
7:32 – What Snowflake does not do well
8:42 – Defining the problem: GDPR-compliant Google Analytics replacement
11:38 – Evaluating open-source database options: OpenSearch, Presto, ClickHouse
13:15 – ClickHouse architecture overview for the analytic platform
14:43 – Full platform design: ClickHouse, ZooKeeper, Prometheus, Grafana, CloudBeaver
15:41 – Kubernetes fundamentals and resource mapping explained
18:02 – Adding the Altinity Kubernetes Operator for ClickHouse to the stack
20:36 – Argo CD introduced as the GitOps deployment solution
23:27 – Demo begins: Argo CD project and app structure walkthrough
26:00 – Creating and syncing the ClickHouse operator via Argo CD
28:07 – Full stack deployment script: all services created and synced
31:51 – Connecting to the running ClickHouse instance via browser
33:25 – Connecting with CloudBeaver SQL editor
33:59 – Grafana monitoring dashboards in action
34:42 – Live GitOps update: modifying replica count in Git and syncing
36:16 – Argo CD strengths: infrastructure as code, multi-environment mapping
38:22 – Argo CD weaknesses: Kubernetes expertise required, GitOps automation complexity
39:54 – Three production considerations: build vs. buy, security, and remaining tasks
41:19 – Altinity.Cloud Anywhere: managing ClickHouse in your own Kubernetes
43:34 – Security: TLS, IP whitelisting, and Kubernetes Secrets
45:24 – Remaining production tasks: automation, capacity planning, backup, monitoring
47:15 – Tips for building your own analytic platform
49:49 – Audience Q&A: S3 tiered storage and Parquet options
55:14 – Audience Q&A: Altinity Stable Builds vs. upstream ClickHouse builds
58:35 – Audience Q&A: on-premises Kubernetes compatibility
59:37 – Audience Q&A: FIPS-compatible builds for FedRAMP and PCI DSS

Webinar Transcript

[0:04] Introduction and Housekeeping

Robert: Hello, everyone, and welcome to our webinar. My name is Robert Hodges. I’ll be presenting. I’m CEO of Altinity. We’re going to be talking about building a low-cost, high-performance analytic platform with Kubernetes and open source, and actually building something that can replace proprietary services like Snowflake or BigQuery. If you came in hoping to find a solution to that question, we’re going to be answering that today.

Before we get into the webinar, I’d like to make a couple of announcements that will help you enjoy it. First, this is being recorded. You don’t have to take frantic notes. We will send the recording to everybody who signed up, usually within a few hours of the webinar completing, and at the latest by Friday morning. We will also send the slides.

Second, we have time for questions. If you have questions as you’re going along, you can post them in the chat or use the question-and-answer box. If it’s relevant to the talk, I may just dive in and address it right then and there. Otherwise, we have some time at the end.

[1:28] Speaker Introduction and Altinity Overview

Robert: Let me make some more detailed introductions. My name is Robert Hodges. My day job is CEO of Altinity, but I’ve been working on databases for close to 40 years now, and I’ve been working on Kubernetes since 2018. This technology, and particularly the topic we’re talking about today, is very near and dear to my heart.

I’m backed up by a great engineering team. We have 45 people in the company spread out over 17 or 18 countries. We’re heavily engineering-focused. Taken together, we have centuries of experience in databases and the applications built on them, particularly analytic applications. That’s the core business of Altinity.

We’re enterprise providers for ClickHouse®, which we’ll talk about during this talk. We built the first cloud service for ClickHouse in Amazon and also in GCP. Azure is on the way. It’s been up and running for several years now. We’re also the authors of the Altinity Stable Builds, which are builds of ClickHouse with long-tail maintenance and certified for production use. And finally, we wrote the Altinity Kubernetes Operator for ClickHouse, also known as the ClickHouse operator for short. We use it in our own cloud service, but it is also widely used throughout the world. There are tens of thousands of ClickHouse clusters that are managed by that operator.

[3:11] Snowflake Architecture Overview and Key Strengths

Robert: Let’s dive in and look at cloud analytic databases and explore some of the trade-offs.

The database I’m going to pick is Snowflake. It could be BigQuery, it could be Redshift, but Snowflake is a great one to start with. It was and is a pioneering database, really the first cloud analytics service that exploited the full capacity of object storage. In the architecture picture you can see that the base layer of storage is object storage. On Amazon that would be S3. All your data when you’re using Snowflake goes into object storage. When you actually want to run queries, you stand up what’s called a virtual warehouse. Those compute nodes pull parts of the data into cache and run your queries.

One of the great things about this architecture is that if you have a bunch of data that everybody looks at, you can set up different virtual warehouses that share the underlying data but are completely separate for compute purposes. It’s a very flexible architecture. The top layer has your cloud services, including access control, metadata management, a great query optimizer, full transaction security, and more. It’s a very capable service and a really outstanding product.

Let’s talk about some things that are genuinely excellent about it. It’s general-purpose: you can handle a wide number of applications without having to do anything really special. It has completely serverless operations. From the user’s point of view, there are VMs and hosts somewhere under there, but you have no direct connection with them. It just works. It can handle a very large number of tenants with completely different applications, which is part of being general-purpose. Widely varying applications will run safely on this architecture without interfering with each other.

Another big strength, which no doubt comes from the Oracle heritage, is a very complete implementation of SQL, including full ACID transactions and a sophisticated query optimizer that can handle complex queries with lots of joins. It has very good columnar storage, which is the standard for data warehouses. What’s really important here, and where a lot of the innovation is, is that it’s completely self-tuning. You don’t have to do anything to get high levels of compression and just the right partitioning. Snowflake does it automatically. Finally, it handles big table joins, so you can join lots of different tables and very large ones. It also has a nice web interface with built-in SQL editing and various management features, so you can just jump right in without installing any software.

These are great features. This is why Snowflake is among the most popular data warehouses on the internet.

[7:32] What Snowflake Does Not Do Well

Robert: However, there are some things Snowflake doesn’t do, and from the perspective of users these could be important.

For example, it doesn’t keep the data in your account. It’s in the cloud, but it’s owned by Snowflake. Another thing it doesn’t do is minimize cost when you’re running a SaaS business or an analytics system that runs 24×7, such as crypto monitoring or security management. Snowflake is not really designed for that. As a result, you can get huge bills, and it takes careful management to ensure they don’t go out of bounds.

It’s not designed to give a stable real-time response. If you need a response in 20 milliseconds because you’re filling a page, Snowflake is probably not the right database. It’s also completely proprietary, so if you want to go somewhere else you’re going to have a major migration on your hands. The dangers of vendor lock-in are real, and this is precisely the kind of situation where they become expensive.

[8:42] Defining the Problem: GDPR-Compliant Google Analytics Replacement

Robert: This opens up an opportunity to think about a different way of solving problems that does not involve Snowflake. Let’s dig into a specific problem as an exercise.

What we’re going to focus on is building a GDPR-compliant replacement for Google Analytics. Many of the people on this call have used Google Analytics. It’s one of the classic data warehouse use cases. It allows you to go in and see what’s going on on your websites and do slicing-and-dicing queries. To make a long story short, you’re going to need an analytic platform.

That platform consists of data storage and query, data pipelines and service integration, and data visualization and consumption. On the source side, you’ll have data feeding in fast, potentially millions or tens of millions of rows per second. On the consumption side, you’ll have facilities for visualizing data, user interfaces, APIs to get data out, ways to prepare extracts, and alerting. The core thing we’re going to focus on is building this basic analytic platform.

When we’re coming into this and thinking about whether to use open source, the first step is to scope the requirements. If you look back at that list of Snowflake strengths and weaknesses, many of Snowflake’s strengths, like being general-purpose, don’t really count here. We’re only solving one problem. So our platform can be very focused. One thing Snowflake does well that we’d like to have is SQL editing capabilities. But what we really need, to solve this effectively, is everything that Snowflake does not provide: keeping data in our own account and avoiding vendor lock-in. This problem is actually well suited to open-source solutions. If you want to bring your own cloud rather than rely on a third-party vendor’s infrastructure, open source gives you that option.

[11:38] Evaluating Open-Source Database Options

Robert: The next thing to do is pick a database. Here are some options that are all Apache 2.0 licensed, so they’re permissively licensed and you can use them any way you want. They include OpenSearch, ClickHouse, and Presto.

OpenSearch is based on Lucene. It’s great for document search and has full-text indexes, very efficient at sifting through large amounts of semi-structured data like JSON. It’s often used for log analytics and has nice integrations like Kibana and Logstash. But that’s not quite what we’re looking for here, because we’re trying to do slicing-and-dicing queries that need very rapid response.

Presto allows you to do federated queries across a bunch of different data sources. It was originally developed to scan very large amounts of data sitting on object storage or HDFS. That’s not quite what we’re looking for here because our data is reasonably structured and we can afford to put it in a proper format.

ClickHouse turns out to be the right solution. It’s designed for real-time analytics on structured data. When data fits nicely into tables, there’s a lot of flexibility about what goes in a table. And web analytics is actually the canonical use case for which ClickHouse was developed. So this is a good choice. For those who want a deeper introduction to the platform we’ll be deploying on, a beginner’s guide to what Kubernetes is makes a great starting point.

[13:15] ClickHouse Architecture Overview

Robert: For those of you who are not familiar with ClickHouse, let me quickly walk through the database architecture we’ll be working with.

You can think of it as MySQL on steroids. It’s an open-source database that speaks SQL and runs anywhere, but it’s designed to do analytics. It has server nodes that do very efficient vectorized and parallel query, support for sharding, replication, and distributed queries, meaning you can blast a query out and have a bunch of servers work on it simultaneously. Data is stored in columnar form with extremely efficient compression, in fact just as efficient as Snowflake, though you do need to tune it a little to get there. It can also read and write data efficiently to object storage. We sometimes call this tiered storage, where you keep your hot data in block storage and then have older data out on S3.
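As a rough illustration of the sharding and distributed query idea described above (not taken from the webinar, and not how ClickHouse is implemented internally), the toy model below spreads rows across three in-memory "shards", lets each shard count its own rows, and then merges the partial results. The function names and the byte-sum hash are assumptions made purely for this sketch.

```python
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard with a stable toy hash (stand-in for a real sharding key)."""
    return sum(key.encode()) % num_shards


def insert(shards: list, row: dict) -> None:
    """Route a row to exactly one shard based on its URL."""
    shards[shard_for(row["url"], len(shards))].append(row)


def local_count_by_url(rows: list) -> Counter:
    """The work each shard does on its own slice of the data."""
    return Counter(row["url"] for row in rows)


def distributed_count_by_url(shards: list) -> dict:
    """Send the query to every shard, then merge the partial counts."""
    total = Counter()
    for rows in shards:
        total += local_count_by_url(rows)
    return dict(total)


shards = [[] for _ in range(3)]
for url in ["/home", "/pricing", "/home", "/docs", "/home"]:
    insert(shards, {"url": url})

result = distributed_count_by_url(shards)
print(result)
```

In a real cluster the per-shard step runs in parallel on separate servers; the loop here only shows the shape of the scatter-and-merge pattern.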

A ClickHouse cluster also includes ZooKeeper. For replication to work between cluster nodes, ZooKeeper provides consensus. It keeps the list of all the table parts that need to be replicated between nodes.

[14:43] Full Platform Design: Stack Components

Robert: This brings us to laying out the platform design. We have ClickHouse as our analytic database, we’ve got ZooKeeper for consensus. We’d also like good monitoring to keep an eye on what the database is doing and help diagnose things. For that we’re going to use Prometheus, which is the standard monitoring solution on Kubernetes. We’re going to use Grafana to visualize the monitoring data. For Kubernetes and ClickHouse monitoring, this Prometheus and Grafana combination is the approach we recommend. And to make life easier for our developers we’re going to include CloudBeaver. CloudBeaver is a containerized version of DBeaver, a very popular SQL editing tool. That’s the basic design of the stack.

[15:41] Kubernetes Fundamentals and Resource Mapping

Robert: The next step is to think about how to turn this into reality. Let me introduce Kubernetes basics for those of you who are not familiar with it.

Kubernetes is a system for orchestrating container-based applications. You have a pool of machines, and if you have an application that can be defined as a set of one or more Docker containers containing the software, you can put it on Kubernetes and Kubernetes will run it for you.

Here’s a simple example. Let’s say we have a ClickHouse server, which runs great in containers, and it’s talking to some block storage. In Kubernetes we represent that as a series of resources. There’s a StatefulSet, which is a type of resource that manages pods together with their storage. There’s a Pod, which is the compute part, the processes that need to execute. There’s the PersistentVolumeClaim, which is a request to Kubernetes to allocate storage of a particular type. And then there’s the PersistentVolume, which represents that actual storage. Kubernetes will take these resources and spin them up in the physical infrastructure. In the end, somewhere in the Kubernetes cluster you get a ClickHouse server process running on a host, with a mount to Amazon EBS storage.
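To make the resource mapping concrete, here is a minimal sketch (not from the webinar) that builds a single-replica StatefulSet as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The resource name `clickhouse-demo`, the `gp3` storage class, and the 50Gi size are hypothetical values; in the stack described here the operator generates these resources for you.

```python
import json


def clickhouse_statefulset(name: str, storage: str, storage_class: str) -> dict:
    """Build a StatefulSet whose pod mounts one persistent data volume."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,
            "replicas": 1,
            "selector": {"matchLabels": labels},
            # Pod template: the compute part of the picture.
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "clickhouse",
                        "image": "clickhouse/clickhouse-server",
                        "volumeMounts": [
                            {"name": "data", "mountPath": "/var/lib/clickhouse"}
                        ],
                    }],
                },
            },
            # One PersistentVolumeClaim per pod; Kubernetes binds each
            # claim to a PersistentVolume of the requested class and size.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": storage_class,  # hypothetical class name
                    "resources": {"requests": {"storage": storage}},
                },
            }],
        },
    }


manifest = clickhouse_statefulset("clickhouse-demo", "50Gi", "gp3")
print(json.dumps(manifest, indent=2))
```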

That might seem trivial here, but real systems like the one we’re developing are complex. The nice thing about Kubernetes is it will manage a lot of containers with a lot of different resources and do so efficiently. For teams running on Amazon EKS, it’s also worth knowing how to handle managing EBS GP3 volumes in Kubernetes to get the right performance and cost profile from your storage.

[18:02] Adding the Altinity Kubernetes Operator for ClickHouse

Robert: When we go to Kubernetes, we have to think about the logical design. All of the elements in our stack are available as Docker containers. There’s also one extra thing we’re going to add, which is the Altinity Kubernetes Operator for ClickHouse (also known as the clickhouse-operator). To manage ClickHouse properly, we’re going to need an operator, which is a special kind of controller in Kubernetes. It creates new types of resources. In this particular case, we create a resource for ClickHouse databases.
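The custom resource the operator adds is called a ClickHouseInstallation. The sketch below, which is an illustration rather than the webinar's own configuration, builds a small one as a Python dictionary. The installation name `demo`, the cluster name `analytics`, and the shard and replica counts are assumed values, and the ZooKeeper settings that replication needs are left out for brevity.

```python
import json


def clickhouse_installation(name: str, cluster: str, shards: int, replicas: int) -> dict:
    """Build a minimal ClickHouseInstallation resource for the operator."""
    return {
        "apiVersion": "clickhouse.altinity.com/v1",
        "kind": "ClickHouseInstallation",
        "metadata": {"name": name},
        "spec": {
            "configuration": {
                "clusters": [{
                    "name": cluster,
                    # The operator expands this layout into the underlying
                    # StatefulSets, Pods, and volume claims.
                    "layout": {"shardsCount": shards, "replicasCount": replicas},
                }],
            },
        },
    }


chi = clickhouse_installation("demo", "analytics", shards=1, replicas=2)
print(json.dumps(chi, indent=2))
```

The point of the operator is visible in the size of this definition: a few lines of layout stand in for many lower-level Kubernetes resources.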

One big problem in Kubernetes is that everybody does things differently. ClickHouse itself needs a manifest. CloudBeaver can be installed through a manifest. Prometheus has Helm charts. The Altinity Kubernetes Operator has Helm charts. They all do this a little bit differently. So the first problem we face in standing up this stack is: how do we paper over all these differences and make this stuff work together?

And there’s one more thing we’d like to do. Since we’re building a cloud native app running on Kubernetes and based on containers, one of the really important things we want to do is define this infrastructure as code and store it in GitHub or GitLab. It should be stored in a source control system, and we want a way of taking the current definition of the system and turning it into something that runs on Kubernetes.

[20:36] Argo CD: The GitOps Solution

Robert: The answer for that is something called Argo CD. It’s not the only answer, but it’s become quite popular because it does this well.

What Argo CD does is, first, it knows how to read data from GitHub automatically. Second, it has the ability to support various kinds of application definitions in Kubernetes. These include defining an application through a manifest, using something called Kustomize, which is a Cloud Native Computing Foundation project that can take resource definitions and tweak them based on the environment you’re targeting. Argo CD also supports Helm charts, another very popular way of installing services on Kubernetes. It passes all of these through, looks at them, transforms them as needed, and then maps them to resource definitions that get pushed to the Kubernetes cluster you want to target.

Here’s a more detailed picture of how it works. A developer on the left defines the application in a GitHub project with application definitions. Argo CD runs inside Kubernetes itself. You tell it where the applications are, and it pulls them down through its repository service. An application controller then keeps track of the application defined in GitHub and decides whether to sync it to what’s running in Kubernetes. It has an API that gives you the ability to define applications, tell it where they are, synchronize them to Kubernetes, and remove them when you’re done.
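For readers who prefer a declarative view, an Argo CD application can itself be described as a Kubernetes resource. The following is a minimal sketch under stated assumptions: the repository URL, the `apps/clickhouse-operator` path, and the `analytics` namespace are hypothetical placeholders, and the `argocd` namespace and `default` project are the usual defaults rather than anything confirmed by the webinar.

```python
import json


def argocd_application(name: str, repo_url: str, path: str, namespace: str) -> dict:
    """Describe where an app lives in Git and where it should run."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            # Source: the Git repository and directory holding the definition.
            "source": {"repoURL": repo_url, "path": path, "targetRevision": "HEAD"},
            # Destination: the cluster and namespace to synchronize into.
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
        },
    }


app = argocd_application(
    "clickhouse-operator",
    "https://github.com/example/analytics-stack",  # hypothetical repository
    "apps/clickhouse-operator",                    # hypothetical path
    "analytics",                                   # hypothetical namespace
)
print(json.dumps(app, indent=2))
```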

[23:27] Demo: Argo CD Project Walkthrough

Robert: Let’s go ahead and try this. We’re going to go to the demo.

One thing to start with is that this demo is based on a project published on the Altinity GitHub called argocd-examples-clickhouse. All this code you can clone afterwards and repeat yourself. It’s fairly straightforward. There are instructions included.

Let’s look at a typical app. I’m going to pick the ClickHouse operator because I’m going to walk through the full lifecycle. The app uses a Helm chart. We have a definition of the Helm chart stored in the repository. If that Helm chart has values, we can stick them in a values.yaml file. I’m going to take the defaults. There’s also documentation here that includes a sample command to actually get this thing defined in Argo CD. Let’s go ahead and execute that command.

I have a namespace in Kubernetes where all this stuff is going to go. As we execute these commands, we can watch how things appear in Kubernetes.

[26:00] Demo: Creating and Syncing the Operator

Robert: We’ve just told Argo CD about a new application. Here’s where to find the GitHub project, here’s the directory it’s in within that project. Now we can ask Argo CD what applications it knows about.

There we go. We see one application, and it says “out of sync,” which means Argo CD knows about it but it also knows that this application has not been synchronized with the state in our target cluster. The destination server tells it where to point to actually set things up. Let’s go ahead and synchronize it.

You can see a bunch of stuff flowing out. Argo CD is applying that Helm chart to Kubernetes. If we go back to the namespace list, you can see the application is now up and running. All I had to do was issue just a couple of commands. And as a user at this point, an SRE for example, I don’t have to know how it was done. I don’t have to know it was a Helm chart. It’s just up and running.

Let’s go ahead and clean it up, since we want to show the full lifecycle. We’re going to do a delete and say yes to deleting all its resources. And if we look at the namespace, it’s gone. It’s that simple. This same approach works for every single service in our stack.
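The lifecycle just shown (create, list, sync, delete) can be sketched as a short list of `argocd` CLI calls. This is an approximation of the steps, not the exact commands from the demo: the repository URL, path, and namespace are hypothetical, and the script only prints the commands instead of running them.

```python
import shlex


def lifecycle_commands(app: str, repo: str, path: str, namespace: str,
                       dest_server: str = "https://kubernetes.default.svc") -> list:
    """Return the argocd CLI calls for one app, in lifecycle order."""
    return [
        # Register the application: where it lives in Git, where it should run.
        ["argocd", "app", "create", app, "--repo", repo, "--path", path,
         "--dest-server", dest_server, "--dest-namespace", namespace],
        # A newly created app is reported as out of sync.
        ["argocd", "app", "list"],
        # Apply the definition to the target cluster.
        ["argocd", "app", "sync", app],
        # Remove the app; the CLI asks for confirmation before deleting resources.
        ["argocd", "app", "delete", app],
    ]


commands = lifecycle_commands(
    "clickhouse-operator",
    "https://github.com/example/analytics-stack",  # hypothetical repository
    "apps/clickhouse-operator",                    # hypothetical path
    "analytics",                                   # hypothetical namespace
)
for cmd in commands:
    print(shlex.join(cmd))
```

To execute rather than print, each list could be passed to `subprocess.run(cmd, check=True)` on a machine where the `argocd` CLI is installed and logged in.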

[28:07] Demo: Full Stack Deployment Script

Robert: Rather than type all these commands individually, let’s just run a script. There are other ways to do this with Argo CD, but I’m doing it explicitly so you can see what’s happening. This command is going to go ahead and create the entire stack.

The script creates each of these applications and then syncs them. One of the reasons I’m doing it this way is that it allows me to ensure things come up in the right order. We’ll talk a little about dependency management in a moment. Let’s go ahead and let it run.

It’s giving us a chance to follow along. It’s created the applications and now it’s synchronizing them. If I go to the bigger display showing all containers coming up, the entire stack is now being built inside this namespace. I’ll give it a second or two.

It looks like we’ve got most of our services. Let me give this one more second to fully bake while we talk about some key issues.

[29:38] Dependency Management and Connecting to Services

Robert: When you’re developing these stacks, you need to pay attention to dependencies. This is something in Argo CD you have to think about carefully. Argo CD has features to help with this, but sometimes the simplest way is just to run a script that ensures things come up in the right order.

My services have the dependencies shown by the arrows in the diagram. The ClickHouse operator, for example, really needs to be up before ClickHouse. Prometheus is nice to have, though not strictly necessary first.
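One way to derive a safe start-up order from a dependency diagram is a topological sort. The sketch below uses Python's standard `graphlib` module. Only the operator-before-ClickHouse edge comes from the talk; the other edges are assumptions added to make the example complete.

```python
from graphlib import TopologicalSorter

# Map each app to the set of apps that must be running before it.
dependencies = {
    "clickhouse-operator": set(),
    "zookeeper": set(),
    "prometheus": set(),
    "clickhouse": {"clickhouse-operator", "zookeeper"},  # zookeeper edge is assumed
    "grafana": {"prometheus"},                           # assumed edge
    "cloudbeaver": {"clickhouse"},                       # assumed edge
}

# static_order() yields every app after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

A deployment script can then create and sync the apps in this order, which is the same effect the demo script achieves by listing them explicitly.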

Another thing to think about is exposing these services so you can work with them. You can forward ports, you can do VPC peering, you can use a VPN. These are all ways to connect your Kubernetes cluster to your local machine or application.

By now the stack should be fully baked. We can see ClickHouse there at the top and services for everything. Let me go ahead and forward the ports. I’m doing it in another window. For those of you who are Kubernetes experts, we now have a bunch of port forwards exposing services. Since this is running out in Amazon and my browser lives on my laptop, I also need to open SSH tunnels to those endpoints.

[31:51] Demo: Connecting to ClickHouse and CloudBeaver

Robert: Let’s go have a look at the applications and actually talk to them.

The first one is ClickHouse. I’m tunneled through and it’s showing up at port 8123. Running the query now: I can select the number of rows in the system query log. It’s only three. ClickHouse is fully set up and running.
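The same kind of check can be scripted against the ClickHouse HTTP interface once port 8123 is reachable locally. This is a minimal sketch that assumes the port forward and tunnel are already in place and that the default user needs no password; the demo stack defines its own root user, so real use would need credentials added.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(sql: str, host: str = "localhost", port: int = 8123) -> str:
    """Build a URL that passes the SQL text in the `query` parameter."""
    return f"http://{host}:{port}/?{urlencode({'query': sql})}"


def run_query(sql: str, host: str = "localhost", port: int = 8123) -> str:
    """Send a read-only query and return the raw text response."""
    with urlopen(query_url(sql, host, port), timeout=10) as response:
        return response.read().decode().strip()


# Only the URL is built here; run_query() needs a live server to call.
url = query_url("SELECT count() FROM system.query_log")
print(url)
```

Calling `run_query("SELECT count() FROM system.query_log")` against the tunneled endpoint would return the row count as plain text.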

I’ve got a much better SQL editor available with CloudBeaver. Let me go refresh it and log in. I’ve configured this previously. I’m going to open a connection to ClickHouse. I defined a root user in my stack, and I’ll use that account.

We’re in. Let me open a SQL editor for this. It’s alive. We’re now talking to ClickHouse using a full-featured SQL editor. It knows all the data types. If I had tables, it would show them. I can now begin development of schema on this system.

[33:59] Demo: Grafana Monitoring

Robert: Let’s have a look at some metrics. I stood up the monitoring in Grafana on a previous run, so you don’t have to see me load it continuously, but you can see some of the queries that I actually executed showing up in the monitoring. It’s bouncing back and forth between a couple of different hosts. The SELECT * FROM system.query_log I set up earlier is showing up on the dashboard.

[34:42] Demo: Live GitOps Update

Robert: By running one script, I was able to set up this entire stack. Moreover, if I want to change things, for example the amount of storage I’m using, I can just go to the appropriate place, make changes, and show you what it does.

If I wanted to add more replicas or more shards, I would simply go into the resource that defines the ClickHouse server and make that change. You can see that I’ve modified a file in Git. I could now say git add demo.yaml, git commit -m with a message, and then push it up to GitHub. I can then sync it and the value will be applied to my stack. This is GitOps in operation.
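The update flow described here can be sketched in two parts: change the replica count in the resource definition, then publish it through Git and re-sync. The example below is an illustration under assumptions: the sample resource, the cluster name `analytics`, the commit message, and the Argo CD app name `clickhouse` are hypothetical; only `demo.yaml` comes from the demo.

```python
import copy


def with_replicas(chi: dict, cluster: str, replicas: int) -> dict:
    """Return a copy of the resource with a new replica count for one cluster."""
    updated = copy.deepcopy(chi)
    for item in updated["spec"]["configuration"]["clusters"]:
        if item["name"] == cluster:
            item["layout"]["replicasCount"] = replicas
    return updated


def publish_commands(path: str, message: str, app: str) -> list:
    """Commands that push the change to Git and ask Argo CD to apply it."""
    return [
        ["git", "add", path],
        ["git", "commit", "-m", message],
        ["git", "push"],
        ["argocd", "app", "sync", app],
    ]


current = {
    "apiVersion": "clickhouse.altinity.com/v1",
    "kind": "ClickHouseInstallation",
    "metadata": {"name": "demo"},
    "spec": {"configuration": {"clusters": [
        {"name": "analytics", "layout": {"shardsCount": 1, "replicasCount": 1}},
    ]}},
}

desired = with_replicas(current, "analytics", 2)
steps = publish_commands("demo.yaml", "Scale to two replicas", "clickhouse")
print(desired["spec"]["configuration"]["clusters"][0]["layout"])
```

Because the change travels through Git, the commit history doubles as a record of every change made to the running stack.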

[36:16] Argo CD Strengths

Robert: Let’s look at some additional issues. Let’s talk about dependencies and the strengths and weaknesses of Argo CD.

Some of the really great things about Argo CD: first and foremost it enables infrastructure as code, so your configuration lives in Git. You don’t have a problem remembering what you did to deploy the system because that’s exactly where it came from. If you want to change it, you can just make a change and re-synchronize.

A very powerful capability is that you can map configurations to multiple environments. You could have a dev environment, a staging environment, and a prod environment. Or you can do blue and green deployments, where you have two production versions running in parallel, one that’s your current production version and another that’s maybe post-upgrade, then you switch over. Argo CD enables this in a big way.
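The idea of mapping one configuration to several environments can be sketched as a shared base plus per-environment overrides, which is roughly what tools like Kustomize overlays or Helm values files provide. Every value below (replica counts, sizes, the `gp3` class) is hypothetical and chosen only to show the merge.

```python
def merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` on `base` without modifying either."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


BASE = {"replicas": 1, "storage": {"size": "50Gi", "class": "gp3"}}

OVERRIDES = {
    "dev": {},
    "staging": {"replicas": 2},
    "prod": {"replicas": 3, "storage": {"size": "500Gi"}},
}

configs = {env: merge(BASE, override) for env, override in OVERRIDES.items()}
for env, config in configs.items():
    print(env, config)
```

Keeping the overrides small makes the differences between environments easy to review in a pull request.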

It’s also super adaptable. You can basically get just about anything to install as long as you can get it as a Helm chart, make your own manifest, or containerize it. A cool capability that’s implicit in the ones above is that you can evolve the components or exchange components easily. That’s really important, because the hardest part of building these systems is often not putting them up the first time but changing them later on.

[38:22] Argo CD Weaknesses

Robert: Some weaknesses or things to take into consideration: you really do need to understand Kubernetes to use Argo CD effectively. You sometimes have to connect the dots about why it’s doing things a certain way. If you understand Kubernetes it’s very straightforward.

Not everything is totally mature. Argo CD is a fairly new project. It’s very powerful and evolving nicely, but there are things like ApplicationSets, which can group apps together, that are still a work in progress.

The GitOps automation, when you fully enable it, is complex. You’re using webhooks and various kinds of notifications coming out of Argo CD. If you go in fully, it will take you a while to set up.

And finally, it doesn’t handle deployment outside of Kubernetes. For that you need Terraform, Ansible, or whatever your favorite form of automation is for things like setting up EKS or GKE.

But on balance, this is a really powerful tool to build stacks that allow you to have your own analytic processing.

[39:54] Three Production Considerations

Robert: Let’s talk about three final issues that are between you and getting a production analytics stack up and running.

One really important thing to think about hard is build versus buy. You’re building a stack and not using Snowflake, but it still makes sense to manage some parts. For example, in systems we see, if they’re running in the cloud and on Kubernetes in Amazon, they’ll almost always use EKS. We use it ourselves. The reason is it’s cheap. You pay a small amount for the management costs, but basically you’re just paying for the VMs that run on your cluster, which you’d have to do anyway. The same is true on GKE.

At the other end, you have your applications, which people almost never outsource because that’s your basic value. You control that. The part in the middle is where you get interesting choices. You can run ClickHouse yourself, or you can use Altinity.Cloud Anywhere. And one thing worth noting as you think about costs: if your workload is variable, you can significantly cut compute costs by scaling ClickHouse servers to zero on Kubernetes during off-peak periods.

[41:19] Altinity.Cloud Anywhere: Managing ClickHouse in Your Own Kubernetes

Robert: Altinity.Cloud is a cloud for ClickHouse, and it has an interesting property that this whole Kubernetes architecture enables: you can actually manage ClickHouse clusters in your own Kubernetes.

Here’s how it works. Most of our customers are actually running in our cloud account, just like Snowflake. We stand up for each of our customers one or more environments, which are dedicated Kubernetes clusters. We run ClickHouse there and take care of security, management, Kubernetes upgrades, and so on.

But what you can also do, because we run on Kubernetes, is use Altinity.Cloud Anywhere, which allows you to register your own Kubernetes clusters. You install something called the Altinity Cloud Connector, and at that point we can manage the ClickHouse clusters for you. More importantly, we have full visibility into them, so if you have problems we can often deal with them proactively before you even see them. For example, if you’re running out of storage, we can address that for you. This means this complex part of the stack is taken care of by somebody else, but you’re still meeting your other requirements: having data local, controlling costs, and avoiding vendor lock-in, because what we are managing is completely open source and you can disconnect from us at any time.

This model is sometimes called bring your own cloud for ClickHouse: you keep the data in your own infrastructure, but benefit from managed operations.

[43:34] Security: TLS, IP Whitelisting, and Kubernetes Secrets

Robert: The second issue is security. Protecting data in flight across Kubernetes connections is something that requires thought. The good news is there are standard ways to deal with this. The trick is just to make sure you’re using them effectively.

For example, making sure your client applications are always using TLS encryption with strong ciphers, doing things like IP whitelisting, and all the way to ensuring that you use standard Kubernetes features like Secrets to pass around sensitive data safely, so that passwords and credentials are not moving around in the environment as plain text. A practical guide to using Kubernetes Secrets with the Altinity operator covers exactly this pattern.
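A minimal sketch of the Secrets idea, assuming you want to store a SHA-256 hash of a ClickHouse password rather than the plain text. The secret name, namespace, and key name are hypothetical, and how the secret is referenced from the ClickHouse installation is left to the operator documentation. The manifest itself uses only the standard Kubernetes Secret fields.

```python
import base64
import hashlib
import json


def sha256_hex(password: str) -> str:
    """Hash a password in the hex form ClickHouse accepts for password_sha256_hex."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def build_secret(name: str, namespace: str, values: dict[str, str]) -> dict:
    """Build a Kubernetes Secret manifest; values under `data` must be base64-encoded."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in values.items()
        },
    }


# Hypothetical names; "example-only" stands in for a real password.
secret = build_secret(
    name="clickhouse-credentials",
    namespace="analytics",
    values={"password_sha256_hex": sha256_hex("example-only")},
)
print(json.dumps(secret, indent=2))
```

Note that base64 is an encoding, not encryption. The point of the Secret is that the credential is handled by Kubernetes access controls instead of sitting in a configuration file or a Git repository.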

These are things you need to do. Luckily most of the services have thought about this at some level. From our ClickHouse operator, for example, by correctly annotating things you can make sure that the load balancer doesn’t open up a public port. That’s a classic problem with EKS. This ensures that access is only locally routed. Another example: you can have the ClickHouse installation default to secure ports only and shut down anything else that’s not being used.
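The two checks described here can be sketched as a small audit over a Kubernetes Service manifest. This is an illustration, not the operator's own logic. It assumes the ClickHouse default secure ports (8443 for HTTPS, 9440 for secure native TCP) and the legacy AWS annotation for internal load balancers. The annotation that applies in your cluster depends on which load balancer controller you run, so treat that constant as an assumption to verify.

```python
SECURE_PORTS = {8443, 9440}  # ClickHouse defaults: HTTPS and secure native TCP
# Assumption: legacy AWS annotation; other controllers use different annotations.
INTERNAL_LB_ANNOTATION = "service.beta.kubernetes.io/aws-load-balancer-internal"


def audit_service(service: dict) -> list[str]:
    """Return a list of findings for a Service manifest; empty means no issues found."""
    findings = []
    spec = service.get("spec", {})
    annotations = service.get("metadata", {}).get("annotations", {})

    if spec.get("type") == "LoadBalancer" and annotations.get(INTERNAL_LB_ANNOTATION) != "true":
        findings.append("load balancer is not marked internal")

    for port in spec.get("ports", []):
        if port.get("port") not in SECURE_PORTS:
            findings.append(f"non-secure port exposed: {port.get('port')}")
    return findings


risky = {
    "metadata": {"name": "clickhouse-example"},
    "spec": {
        "type": "LoadBalancer",
        "ports": [{"name": "http", "port": 8123}, {"name": "tcp", "port": 9000}],
    },
}
hardened = {
    "metadata": {
        "name": "clickhouse-example",
        "annotations": {INTERNAL_LB_ANNOTATION: "true"},
    },
    "spec": {
        "type": "LoadBalancer",
        "ports": [{"name": "https", "port": 8443}, {"name": "secureclient", "port": 9440}],
    },
}

print(audit_service(risky))
print(audit_service(hardened))
```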

You’ll want to look for operators that have security features and a hardening guide to help you with this.

[45:24] Remaining Production Tasks and Deployment Tips

Robert: The final thing is all the other tasks required to deploy the analytics stack. Setting up the prototype is pretty easy, but what does your full stack look like? You could need Airflow, you could need Spark. These can be installed and managed the same way. Then getting the GitOps automation fully built out: for example, adding a webhook so that if you publish something to a production branch on GitHub, Argo CD will automatically wake up and apply the changes.
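One way to sketch the "publish to a production branch and have it applied" idea is an Argo CD Application that tracks that branch with an automated sync policy. In Argo CD a webhook shortens the time it takes to notice a new commit, while the automated sync policy is what applies the change without a manual sync. The repository URL, path, and names below are hypothetical.

```python
import json


def build_application(name: str, repo_url: str, path: str, branch: str, namespace: str) -> dict:
    """Build an Argo CD Application manifest that auto-syncs from a Git branch."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {"repoURL": repo_url, "path": path, "targetRevision": branch},
            "destination": {
                "server": "https://kubernetes.default.svc",  # the cluster Argo CD runs in
                "namespace": namespace,
            },
            # Apply changes automatically, remove deleted resources, and undo manual drift.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


app = build_application(
    name="clickhouse",
    repo_url="https://github.com/example-org/analytics-stack.git",  # hypothetical repo
    path="apps/clickhouse",
    branch="production",
    namespace="analytics",
)
print(json.dumps(app, indent=2))
```

Because JSON is valid YAML, the printed manifest can be saved to a file and applied with kubectl in the same way as a YAML manifest.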

Other standard database tasks include capacity planning, performance scaling, backup, and monitoring. These are all issues you have to get through before deploying. One of the great things about Kubernetes is it allows you to iterate through a lot of different combinations of VM types and storage types, and with Argo CD you can wipe the applications out, bring them back with a new configuration and new storage, and off you go.

[47:15] Tips for Building Your Own Analytics Platform

Robert: We’re approaching the end of the talk. Here are some tips for building your own analytics platform.

First, know that you can, and that you have the option to own your data. For specific use cases, you can build something better than Snowflake on balance, provided you focus on a specific problem. If you want to build a general service it’s going to be tough to beat Snowflake. It’s really good and can support a wide range of applications conveniently.

Keep the scope as small as possible. Don’t add requirements you don’t need. For example, on security there are ways, by protecting the overall environment, to make security within Kubernetes much simpler.

Kubernetes, if you’re going to build this kind of platform, is the stack to use. There really isn’t an alternative at this point that’s general-purpose, portable, supports scaling up and down, and has all the parts you need. We definitely strongly recommend it for this type of platform.

Argo CD is the key to getting GitOps to work: it maps definitions flexibly from GitHub to an actual application running in Kubernetes.

One thing to note with open source is you own the problem. The stack is yours. You can see what it’s doing and you have full control, but it also means you need to understand what you’re doing, or partner with people who do. That’s typically what we see: people will often outsource the complicated parts. For example, understanding the details of how ClickHouse works makes sense to outsource to someone else, but you can keep the rest yourself.

Just to get started: go look for the argocd-examples-clickhouse project, clone it, and read the directions. It’s a pretty fresh project. You may run into a few bumps, but feel free to log issues or contact us on Slack.

Here are the projects that went into this stack: obviously Argo CD. We have a number of relevant projects including these examples, the Kubernetes operator, and the Altinity Stable® Builds for ClickHouse®, which I was using for this demo. And then there are other components: the upstream ClickHouse, plus Prometheus, Grafana, and CloudBeaver. We recommend all of these wholeheartedly.

[51:00] Closing Remarks

Robert: That’s it. I hope this has been interesting. We have a few minutes left for questions, so if you’d like to ask about building your own stacks or anything you’ve seen in this talk, I’m happy to answer. Go ahead and post those into the chat or into the question and answer.

Is anybody here building a stack like this? Just curious.

[49:49] Q&A: S3 Tiered Storage and Parquet Options

Robert: Here’s a great question. What’s the best way to include S3 as a data store: Parquet, or ClickHouse on S3 format? The answer is yes, both are good, and that’s actually a really important question.

Let’s go back to the picture of the ClickHouse storage. What ClickHouse supports is something called tiered storage where you’ll keep your hot data in block storage and then you’ll have older data out on S3. You set up what are called storage policies, which you can apply when you build a table. For example, you have an events table and you say: use the hot-cold storage policy. When you define the table in SQL, you can explicitly tell ClickHouse at what point you want the data to move between tiers. These are called TTL moves. The good thing about that model is that your applications don’t have to make a distinction between whether the data is stored in block storage versus S3. You can just do a SELECT and it’ll happen. If it goes to S3, it could be a lot slower depending on how big a scan you’re doing, but that’s the only difference you’ll see.
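As a sketch of what such a table definition could look like, the helper below renders a ClickHouse CREATE TABLE statement with a storage policy and a TTL move. The table name, columns, policy name, volume name, and the 90-day threshold are illustrative assumptions. The policy and volume must already exist in the server's storage configuration.

```python
def events_table_ddl(table: str, policy: str, cold_volume: str, hot_days: int) -> str:
    """Render DDL for a MergeTree table that moves old data to a cold volume."""
    if hot_days <= 0:
        raise ValueError("hot_days must be positive")
    return (
        f"CREATE TABLE {table}\n"
        "(\n"
        "    event_time DateTime,\n"
        "    user_id UInt64,\n"
        "    url String\n"
        ")\n"
        "ENGINE = MergeTree\n"
        "PARTITION BY toYYYYMM(event_time)\n"
        "ORDER BY (user_id, event_time)\n"
        # TTL move: parts older than hot_days are relocated to the cold volume.
        f"TTL event_time + INTERVAL {hot_days} DAY TO VOLUME '{cold_volume}'\n"
        f"SETTINGS storage_policy = '{policy}'"
    )


ddl = events_table_ddl(table="events", policy="hot_cold", cold_volume="cold", hot_days=90)
print(ddl)
```

Queries against the table stay the same wherever the parts live, which is the transparency the talk describes.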

The other way to do this is to take the data out of block storage, archive it from ClickHouse, and put it into something like Parquet on S3. ClickHouse has what are called table functions that allow you to read that Parquet as if it were a table. There’s actually a Parquet table type too, though I haven’t used it very much.
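The archive approach can be sketched with the ClickHouse s3 table function, which can both write and read Parquet. The helpers below only render SQL text. The bucket URL, table name, and column name are hypothetical, and the sketch assumes S3 credentials are supplied by the environment rather than passed in the query.

```python
def archive_to_parquet_sql(table: str, s3_url: str, cutoff_date: str) -> str:
    """Render SQL that copies rows older than cutoff_date into a Parquet file on S3."""
    return (
        f"INSERT INTO FUNCTION s3('{s3_url}', 'Parquet')\n"
        f"SELECT * FROM {table}\n"
        f"WHERE event_time < toDateTime('{cutoff_date} 00:00:00')"
    )


def read_parquet_sql(s3_url_glob: str) -> str:
    """Render SQL that reads archived Parquet files as if they were a table."""
    return f"SELECT count() FROM s3('{s3_url_glob}', 'Parquet')"


archive_sql = archive_to_parquet_sql(
    table="events",
    s3_url="https://example-bucket.s3.amazonaws.com/archive/events_2022.parquet",
    cutoff_date="2023-01-01",
)
read_sql = read_parquet_sql("https://example-bucket.s3.amazonaws.com/archive/*.parquet")
print(archive_sql)
print(read_sql)
```

The INSERT only copies data. Removing the archived rows from the ClickHouse table, and later from S3, is a separate step you have to manage yourself.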

The distinction has a plus and a minus. The plus is that once you push it into archive storage, basically your data lake, other applications can read it without going through ClickHouse. You can light up pandas and go read these things, feed it into something like PyTorch to do ML on it. That’s an advantage. The disadvantage is that your analytic applications inside ClickHouse will most likely need to know they’re going to a different table because its properties will be different. There are also limitations on things like how to delete data out of Parquet. That doesn’t happen automatically, whereas if you delete data from a ClickHouse table, ClickHouse will take care of removing it in good time.

If you’re going to use tiered storage, you just set up the storage policies and include them in the definition of the ClickHouse cluster you’re building. You put it in GitHub and then sync it using Argo CD, just as we saw.
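A storage policy of this kind is defined in the ClickHouse server configuration. The sketch below generates a minimal XML fragment with one S3 disk and a two-volume policy. The disk name, policy name, volume names, and endpoint are hypothetical, and credentials are assumed to come from the environment. How the file is attached to the cluster definition in Git depends on the operator's configuration format and is not shown.

```python
import xml.etree.ElementTree as ET


def storage_config_xml(policy: str, s3_endpoint: str) -> str:
    """Render a ClickHouse storage configuration with a hot (local) and cold (S3) volume."""
    root = ET.Element("clickhouse")
    storage = ET.SubElement(root, "storage_configuration")

    # One S3-backed disk; the built-in "default" disk serves as the hot tier.
    disks = ET.SubElement(storage, "disks")
    s3_disk = ET.SubElement(disks, "s3_cold")
    ET.SubElement(s3_disk, "type").text = "s3"
    ET.SubElement(s3_disk, "endpoint").text = s3_endpoint
    ET.SubElement(s3_disk, "use_environment_credentials").text = "true"

    # Policy with two volumes, listed hot first and cold second.
    policies = ET.SubElement(storage, "policies")
    volumes = ET.SubElement(ET.SubElement(policies, policy), "volumes")
    ET.SubElement(ET.SubElement(volumes, "hot"), "disk").text = "default"
    ET.SubElement(ET.SubElement(volumes, "cold"), "disk").text = "s3_cold"

    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


config = storage_config_xml(
    policy="hot_cold",
    s3_endpoint="https://example-bucket.s3.amazonaws.com/clickhouse/",
)
print(config)
```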

[55:14] Q&A: Altinity Stable Builds vs. Upstream ClickHouse Builds

Robert: Somebody’s asking about Druid and Pinot. Druid, Pinot, and ClickHouse occupy a similar space. Most of what we’ve described you could do with Pinot as well. It’s a matter of taste and there are some trade-offs in the systems.

Another question: what’s the difference between the ClickHouse server built by clickhouse.com and Altinity? Which is better suited for the operator? The answer is we can support both of them. What’s the real difference?

The difference is what you get with the Altinity Stable® Builds for ClickHouse®. Let me pull up the docs. Basically there’s one table that explains it all.

We release on the LTS, or long-term support, releases. These come out twice a year. We use those as a basis for ClickHouse builds that are, one, certified for production use, meaning no nasty bugs that we know of, nothing that will corrupt your data. Second, they’re supported for three years. If you’re an Altinity customer and you want bug fixes, this is the way to get them. You may run the build for two years. We have customers that run it for actually far longer than that. The stable build is around, and if you run into a bug we will push fixes. We can also port small features back. If a new feature comes out in a monthly community build and it’s really valuable, we can actually port it back into the stable builds.

The part about them being production certified is more important than you might think. ClickHouse has a model where it seeks to evolve very quickly. What that always means in databases is that the head revisions are unstable. What we find is that you usually have to wait three to four months before an LTS build becomes suitable for production use. That’s basically been the long pole for us on testing. We monitor usage carefully to find out what the state is, and we do detailed release notes including all the things you have to look out for when you upgrade.

[58:35] Q&A: On-Premises Kubernetes Compatibility

Robert: There’s a question about whether this is compatible with a Kubernetes installation on premises, or whether the Altinity ClickHouse stack can be implemented on-premise VMs. Yes, absolutely.

This is one of the things I was thinking of demoing, but it seemed like a bridge too far: I could take this stack, point it at Minikube, and bring it up on a computer sitting on my desk. So the answer is absolutely yes. The thing you don’t get as easily in-house is auto-scaling, which is a really great feature of EKS and GKE. In the cloud you can basically set up a provisioner that spins up VMs whenever you need to add capacity. But other than that, everything should be the same and will work fine.

[59:37] Q&A: FIPS Builds for FedRAMP and PCI DSS

Robert: There’s a question about FIPS compliance. We do it. We have FIPS-compatible versions that are available for these builds. For those going into PCI DSS and FedRAMP environments, we support those builds for customers with those requirements. Check out the Altinity blog and look up FIPS. You’ll find those builds are available open source and we also support them for customers going into those compliance environments.

I’m at time, so in fact I have to go to a meeting to talk about FedRAMP. Thank you very much for your attendance. I hope this was useful. If you’d like to talk to us further, it’s easy to find us. Just come to altinity.com. You can find links to join our Slack channel, contact us, spin up trials, and if you’d like to talk further about this modern analytic stack based on GitOps, we’d love to talk to you. Thank you very much for attending and have a wonderful day.

FAQ

What is Argo CD and why is it useful for deploying ClickHouse on Kubernetes?

Argo CD is a GitOps continuous delivery tool for Kubernetes that reads application definitions from GitHub and automatically synchronizes them to live Kubernetes clusters. It is useful for ClickHouse deployments because it normalizes the different installation methods across stack components, including Helm charts, Kustomize manifests, and plain YAML, making it possible to deploy an entire analytics stack with a single script. Changes to configuration are made in Git and then synced, giving you a full history of your infrastructure and the ability to apply changes consistently across multiple environments.

How does ClickHouse tiered storage with S3 work?

ClickHouse tiered storage combines fast local block storage for hot, recently accessed data with cheap S3 object storage for cold, older data. You define storage policies in configuration, apply them when creating a table, and set TTL rules in the table’s SQL definition that specify when data should automatically move from hot to cold storage. Applications query the table transparently regardless of which tier the data resides on, though queries that hit S3 will run more slowly than those served from block storage.

What are Altinity Stable Builds for ClickHouse and why would I use them instead of upstream builds?

Altinity Stable Builds are releases of ClickHouse based on upstream LTS versions that have been rigorously tested, certified for production use, and supported for three years. Upstream ClickHouse releases are frequent and can take three to four months after an LTS release before they are stable enough for production. Altinity Stable Builds eliminate that uncertainty, come with detailed release notes covering known issues and upgrade considerations, and allow customers to receive backported bug fixes and selected new features during the support window.

What is Altinity.Cloud Anywhere and how does it fit into a self-managed Kubernetes deployment?

Altinity.Cloud Anywhere allows teams to connect their own Kubernetes clusters to the Altinity cloud management plane by installing a lightweight Altinity Cloud Connector. Once connected, Altinity can manage ClickHouse clusters running inside your infrastructure, monitor them proactively, and handle issues like low storage before they become problems, while you retain full ownership of the data, control over costs, and freedom to disconnect at any time. It is the best-of-both-worlds option between fully self-managed ClickHouse and running in Altinity’s cloud.

How do I handle security for a ClickHouse stack running on Kubernetes?

Key security measures include enabling TLS encryption with strong ciphers for all client and inter-server connections, using IP whitelisting or VPC-level controls to restrict who can reach ClickHouse ports, and using Kubernetes Secrets to pass credentials and certificates into ClickHouse pods rather than embedding them in configuration files or Git repositories. The Altinity Kubernetes Operator for ClickHouse supports annotations that prevent the load balancer from opening public ports, and the operator’s hardening guide provides step-by-step instructions for securing a cluster.

Can this stack run on premises or only on public cloud Kubernetes?

The entire stack, including ClickHouse, the Altinity Kubernetes Operator for ClickHouse, Argo CD, Prometheus, Grafana, and CloudBeaver, runs on any Kubernetes distribution including Minikube on a laptop or bare-metal on-premises clusters. The main feature you give up compared to managed cloud Kubernetes like EKS or GKE is auto-scaling, which in cloud environments can automatically provision additional VMs when capacity is needed. All other functionality works identically on premises.

© 2023 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.

