Snowflake, BigQuery, or ClickHouse®? Pro Tricks to Build Cost-Efficient Analytics for Any Business

Recorded: September 12 @ 07:00 am PDT
Presenter: Robert Hodges, CEO @Altinity
In this webinar, Altinity CEO Robert Hodges delivers a detailed breakdown of how the major cloud analytic databases, including Snowflake, BigQuery, and Amazon Redshift, actually price their services and what that means for your bill. By examining Snowflake’s public financials and reverse-engineering the underlying instance types used by both Snowflake and BigQuery, Robert reveals that compute markups can reach 5 to 10 times the actual cloud cost, while storage markups tend to be modest. He introduces three distinct cloud database pricing models: “buy the box” (Redshift-style attached block storage), the virtual data warehouse model (Snowflake), and the serverless on-demand query model (BigQuery), explaining the tradeoffs and best use cases for each.
Robert then pivots to practical alternatives. Using the example of building a GDPR-compliant replacement for Google Analytics, he walks through what he calls the modern analytics stack: a custom, cloud-native platform built on managed Kubernetes, open-source ClickHouse® as the query engine, the Altinity Kubernetes Operator for ClickHouse® to manage the database, Prometheus and Grafana for observability, and Argo CD for GitOps-based deployment. He demonstrates live how this entire stack can be spun up from a GitHub repository using Argo CD in just minutes on Amazon EKS, and how a database version upgrade can be executed by pushing a single commit to GitHub.
The webinar closes with six actionable tips for getting the best price from any cloud analytics service, including decoupling storage and compute, ensuring storage is billed on compressed rather than raw data, and negotiating prepay discounts on compute. Robert argues that for workloads requiring 24×7 real-time analytics, tenant-facing dashboards, or data sovereignty, a self-managed open-source stack built on ClickHouse and Kubernetes can substantially outperform the commercial alternatives in both cost and flexibility.
Key Moments (Timestamps)
Key moments generated with AI assistance.
- 0:06 – Introduction and speaker overview
- 1:19 – Robert Hodges background and Altinity company overview
- 3:11 – How analytic cloud database pricing models work
- 5:35 – Snowflake’s virtual data warehouse model and cost analysis
- 10:57 – BigQuery’s serverless on-demand pricing model explained
- 15:07 – The “buy the box” model and modernizing Redshift-style deployments
- 20:00 – How separation of storage and compute affects ClickHouse pricing
- 21:39 – Quick comparison of all three cloud database pricing models
- 23:36 – What Snowflake does well and where it falls short
- 25:51 – Six tips for getting a better price on cloud analytics
- 28:18 – When is cloud analytic database pricing actually a good deal?
- 30:28 – Picking a specific problem: building a GDPR-compliant Google Analytics replacement
- 33:00 – Introducing the modern analytics stack built on Kubernetes and open source
- 34:02 – Choosing a Kubernetes distribution, operator, observability stack, and GitOps tool
- 40:42 – Live demo: spinning up ClickHouse on Kubernetes with Argo CD
- 50:05 – Best practices for do-it-yourself modern analytics stacks
- 51:16 – How Altinity enables self-managed and cloud-managed ClickHouse deployments
Webinar Transcript
[00:06] — Introduction and Housekeeping
Robert: Hello, everybody. Welcome to our webinar on Snowflake, BigQuery, or ClickHouse®. We’ll be talking about pro tricks to build cost-efficient analytics for any business. My name is Robert Hodges and I’m CEO of Altinity. I’m joined here today by Kiara Tosselli, our head of marketing.
It is a pleasure to be able to talk about this topic. Before I dive into the introductions, let me tell you a couple of things about this webinar that will help you enjoy it more.
First, it’s being recorded. We will send out the link to the recording as well as these slides within 24 hours of the end of the webinar, hopefully sooner. You don’t have to take furious notes; they’re going to be available to you in your email and also on the Altinity website.
Second, we have time for questions. You can go ahead and post them in the question-and-answer box in the Zoom control panel. You can also put them in the webinar chat. We can answer them there. If they’re relevant to the slide I’m on, I’ll take them right then and there; otherwise we have time at the end to talk about them.
[01:19] — Speaker Introduction and Altinity Overview
Robert: Let’s dive in and do a little more on the introduction. Once again, my name is Robert Hodges. My day job is running Altinity, but I’ve been working on databases for, as it says here, 30-plus years. It’s actually 40 this year. I’ve worked on about 20 different database types. I’ve also worked extensively in virtualization. I was at VMware for a while, and that’s in fact where I began to learn how to use Kubernetes, which is something we’re going to talk about later in this presentation.
I’m backed up by an amazing engineering team. About two-thirds of our company of 45 people has an engineering background, and they are database geeks just like me, except that many of them have even deeper experience in databases and applications.
As a company, we are enterprise providers for ClickHouse, an outstanding real-time analytic database. It’s open source. Among our offerings, we have Altinity.Cloud, which is a managed version of ClickHouse that runs in the cloud. We also offer Altinity Stable® Builds for ClickHouse®, which are builds designed for enterprise users. They have three years of support, they’re certified for production use, and so on. Another thing we do, and some of you on this call may already be aware of this, is we’re the authors of the Altinity Kubernetes Operator for ClickHouse®, known as the ClickHouse operator for short. We’ve actually been working on getting ClickHouse to work cloud-native for almost five years at this point. We have deep experience there, and we’ll be sharing some of the things we’ve learned in that process today.
[03:11] — How Cloud Analytic Database Pricing Works
Robert: Let’s dive in. In this first part I’d like to talk a little about analytic databases and explain how the cost models work, particularly when you’re running them in the cloud.
Analytic databases in the cloud got started in 2013 with Amazon Redshift, and it was groundbreaking. Basically, what Amazon did was they took a data warehouse, an existing on-prem install, put it in the Amazon Cloud, and allowed you with a credit card to start that data warehouse in 20 minutes or less.
This was an amazing advance. It’s sort of like how we don’t fully appreciate how great COBOL was because what came before it was assembler. It’s easy to forget just how great Redshift was because prior to Redshift it could take up to six months to install a data warehouse. You had to make a deal with a vendor, get the hardware, get it racked, get everything configured, and by the time all of that worked through procurement and management it was six months.
So this was a huge advance, and within a few years there were several excellent data warehouses available in the cloud, including BigQuery, Snowflake, Redshift, and more recently data cloud services built around Snowflake.
To get to the cloud efficiency question, it’s useful to talk about how most developers, particularly early on, learn about cost efficiency. If you’re running BigQuery, it’s easy. You can run a query that costs you hundreds or even thousands of dollars. This is an example that appeared on Twitter earlier this year, and it’s not unique by any means.
This brings up a big question: 300 bucks for one SQL query? That seems like a lot. What is going on down there? That’s what we’re going to cover in this section: how these data warehouses work at a high level and how the pricing works.
[05:35] — Snowflake’s Virtual Data Warehouse Model
Robert: Let’s start with Snowflake. We’re not going to show you architecture yet. We’re actually just going to talk about the business. This turns out to be a surprisingly useful way to understand what it really costs to operate these database systems in the cloud, and from that you can infer the markup you’re being charged.
For Snowflake it’s easy. It’s a public company and we can just go look at their financial results. Their results for the period ending January 31st this year showed total revenue of 2 billion, of which the cost of revenue, sometimes called the cost of goods sold, was about 717 million. That’s an interesting number. In a profit-and-loss statement, that’s how much it costs Snowflake to deliver their service. So what that means is that the cost of running their cloud is hidden in there.
If you just do the math, that’s a little over a third of revenue. What that means is that if all of that were cloud cost and nothing more, they would be marking things up by roughly three times.
In fact, the actual delivery costs for a service like Snowflake are much broader. They don’t just include running VMs and allocating storage. They include paying SREs, hiring other services, and all kinds of things necessary to make the whole service work. So in fact the markup on their underlying cloud cost is about 5x on average. That’s a useful number to keep in mind as a benchmark: if something is greater than 5x, it’s bigger than the average Snowflake markup; if it’s less, it’s a cheaper offering.
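The first two steps of that reasoning can be checked with a few lines of arithmetic. This is a minimal sketch using only the revenue and cost-of-revenue figures quoted above; the function name is illustrative, and the 5x figure is the speaker's estimate, which this calculation does not derive.

```python
# Figures quoted in the talk, in US dollars.
REVENUE = 2_000_000_000
COST_OF_REVENUE = 717_000_000


def revenue_multiple(revenue: float, cost: float) -> float:
    """Return revenue divided by cost, i.e. the blended multiple over cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return revenue / cost


# Share of revenue that goes to delivering the service.
cost_share = COST_OF_REVENUE / REVENUE

# Multiple if every dollar of cost of revenue were raw cloud spend.
multiple = revenue_multiple(REVENUE, COST_OF_REVENUE)

print(f"Cost of revenue as a share of revenue: {cost_share:.1%}")
print(f"Blended multiple over cost of revenue: {multiple:.2f}x")
```

Because cost of revenue also covers staff and third-party services, the raw cloud spend is some smaller fraction of it, which is why the multiple over cloud cost alone comes out higher than this blended number.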
So now let’s look at the Snowflake costs and figure out what the deal looks like when you run their database.
Snowflake uses something called the virtual data warehouse. It’s a very innovative architecture. It stores data in S3 object storage, and they were, I believe, the first to operate this model at scale. When you want to work on your data and run SQL queries, you create what are called virtual data warehouses. You pay for these using credits, and a credit maps to a host somewhere that has pulled in data from object storage and is processing a SQL query on it.
We don’t necessarily know what those VMs are, but if you go look at the pricing list, on-demand pricing ranges from about two to four dollars an hour for one of these VMs.
Now, this would be the end of the cost analysis except that a little while ago there was a bug in Snowflake. If you issued a query that failed in the right way, it would print out the kind of VM they were actually running. It turns out that in at least some cases, the things I’ve marked as credits actually map to c5d.2xlarge instances, which are 8 vCPUs and cost about 38 cents an hour.
We know what the object storage costs because Snowflake is storing it, and we can see that Snowflake charges between $23 and $40 per terabyte per month. Interestingly, that’s not very different from Amazon on-demand S3 costs. So they don’t really mark up the object storage much at all. What they do mark up is the virtual data warehouse. We can see from this that if you’re going to offer cheap storage, you’ve got to make it up somewhere else. What Snowflake does is make it up in compute, which can be anywhere from 5 to 10x more expensive than the corresponding hosts.
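The compute markup range follows directly from the two prices mentioned. A small sketch, assuming one credit-hour corresponds to one c5d.2xlarge instance-hour as the leaked error message suggested; the constant and function names are illustrative.

```python
# Prices quoted in the talk, in US dollars per hour.
INSTANCE_COST_PER_HOUR = 0.38      # c5d.2xlarge on-demand
CREDIT_PRICE_RANGE = (2.00, 4.00)  # Snowflake on-demand price range


def markup(price: float, cost: float) -> float:
    """Return how many times the underlying cost the customer price is."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return price / cost


low_markup, high_markup = (
    markup(price, INSTANCE_COST_PER_HOUR) for price in CREDIT_PRICE_RANGE
)

print(f"Compute markup: {low_markup:.1f}x to {high_markup:.1f}x")
```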
That’s basically how the Snowflake virtual data warehouse model works.
[10:57] — BigQuery’s Serverless Query Model
Robert: Let’s have a look at another model that’s quite different: BigQuery.
BigQuery is the other really large cloud database analytics service, and it has a number of pricing models. I’m going to focus on one, which is called the serverless or on-demand query model. The way this works is you store your data on distributed storage. It’s actually not object storage; it’s a different proprietary system, at least up until recently. It’s charged at about object storage rates.
One important thing to notice is these prices are not for compressed storage. That’s a big deal we’ll talk about in a minute. The basic idea is you’ve got your data stored there, and when you run a query, BigQuery will allocate some number of compute nodes, whatever it thinks is the right number. You don’t know what that is, because what’s going to appear on your bill is a charge of $6.25 for every terabyte of data it scans in answering your query.
It’s pretty easy to see from this how somebody could run up a 300-dollar bill. What you would have to do is read enough terabytes of data out of that distributed storage. At $6.25 per terabyte, that’s going to be about 48 terabytes. If you have a large enough data set, this is not hard to do, and you can also see how this might expand into thousands of dollars if your data set were large enough and your query read enough of it.
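The on-demand charge is a straight multiplication, so it is easy to estimate before running a query. A minimal sketch using the $6.25 per terabyte rate quoted above; the function names are illustrative and free tiers or minimum charges are ignored.

```python
PRICE_PER_TB_SCANNED = 6.25  # US dollars, on-demand rate quoted in the talk


def on_demand_query_cost(tb_scanned: float) -> float:
    """Return the on-demand charge for a query that scans `tb_scanned` TB."""
    if tb_scanned < 0:
        raise ValueError("tb_scanned must be non-negative")
    return tb_scanned * PRICE_PER_TB_SCANNED


def tb_for_budget(budget_usd: float) -> float:
    """Return how many terabytes a query can scan for a given dollar amount."""
    return budget_usd / PRICE_PER_TB_SCANNED


print(f"TB scanned for a $300 query: {tb_for_budget(300):.0f}")
print(f"Cost of scanning 500 TB: ${on_demand_query_cost(500):,.2f}")
```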
So how does this compare with actual cloud resources? We could just pick a standard VM, the n2d-standard-32 with 32 vCPUs, and look at the cost of object storage, which is comparable except for the question of whether the data is compressed. The block storage on Google is pretty expensive, about 17 cents per gigabyte per month.
The problem is these numbers don’t tell you anything useful on their own. Depending on how your data is arranged, BigQuery could be anywhere from 10 times cheaper to 10 times more expensive. If you only run a query once, an always-on VM is costing you money even when it’s idle. On the other hand, if you run queries constantly 24×7, which is the case for many analytic applications, BigQuery could be way more expensive. It’s really hard to judge. You actually have to look at your workload and find out what your actual costs are.
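One way to look at your own workload is to find the monthly scan volume at which on-demand pricing and an always-on VM cost the same. The sketch below is a rough comparison only: the VM hourly rate and the 730-hour month are placeholder assumptions, not quoted prices, and it ignores storage costs and any performance difference between the two options.

```python
ON_DEMAND_PRICE_PER_TB = 6.25  # US dollars, rate quoted in the talk
HOURS_PER_MONTH = 730          # assumption: average hours in a month


def monthly_on_demand_cost(tb_scanned_per_month: float) -> float:
    """Monthly bill if every query is charged per terabyte scanned."""
    return tb_scanned_per_month * ON_DEMAND_PRICE_PER_TB


def monthly_vm_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly bill for a VM that runs all the time, busy or idle."""
    return hourly_rate * hours


def break_even_tb(hourly_rate: float) -> float:
    """Scan volume per month at which both options cost the same."""
    return monthly_vm_cost(hourly_rate) / ON_DEMAND_PRICE_PER_TB


# Hypothetical hourly rate for illustration; substitute your real VM price.
EXAMPLE_VM_RATE = 1.50

threshold = break_even_tb(EXAMPLE_VM_RATE)
print(f"Break-even scan volume: {threshold:.1f} TB per month")

for tb in (10, 1000):
    on_demand = monthly_on_demand_cost(tb)
    vm = monthly_vm_cost(EXAMPLE_VM_RATE)
    cheaper = "on-demand" if on_demand < vm else "always-on VM"
    print(f"{tb} TB/month: on-demand ${on_demand:,.2f} vs VM ${vm:,.2f} -> {cheaper}")
```

Below the threshold the idle VM is the waste; above it the per-terabyte charge dominates, which is the 24×7 case described above.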
I want to be totally clear: BigQuery also allows you to price by compute through what are called slots that you can reserve. That gives you a model closer to Snowflake’s. So you can play around with it.
[15:07] — The “Buy the Box” Model and Modernizing It with ClickHouse
Robert: Let’s talk about one final model. It’s pretty common in both of the models I’ve described to use object storage, or in the case of BigQuery something very like it. There is another model for operating databases in the cloud, and it’s the one that was pioneered by Redshift when they first opened up about 10 years ago. I call it the buy-the-box model.
If you go look at the pricing for Redshift, you can still see this model. Basically what you do is hire out a VM, for example a dc2.8xlarge with attached block storage (SSD), and they’ll charge you $4.80 an hour in Amazon US West 2. The actual underlying VM turns out to be an i3.8xlarge, which runs at about $2.50 an hour. The RAM is almost exactly the same, which is why it’s a pretty clear match. The markup on Redshift is actually not that high; it’s always had a lower markup than Snowflake, which really dings you on compute. The tradeoff is that Redshift’s approach gives you 66% less storage for the same spend.
Now, we can improve on this model significantly, and to see how it’s useful to introduce ClickHouse. ClickHouse is a popular open-source real-time analytic database. Here’s the basic architecture. You can have two types of storage, but the key thing is you’ll have ClickHouse servers connected to each other over a network. They can replicate: if you add data to a table on one host, it will automatically replicate to the other. The storage is columnar, like all the databases we’re dealing with today. That means data is stored in arrays that compress well. They’re also cheaper to query because you don’t have to read as much storage.
When ClickHouse was originally developed it could only store data on block storage, but it has since been improved to store data on S3-compatible object storage. It also includes either ClickHouse Keeper or ZooKeeper, a cluster used to maintain consensus about the data that needs to be replicated between servers.
Robert: So this is the database architecture. We can map it as a cloud service to what I call the modernized buy-the-box model. On the left you have your original Redshift architecture; on the right, a new version where instead of the old i3 instances we can use m6i instances. These are Intel-based with a clock speed at least a third faster than the old i3s, and they mount storage on EBS (Elastic Block Store), which is versatile. With the newest Amazon gp3 storage you can control the IOPS and throughput. If you dial it up to get a throughput of 1,000 megabytes per second, together with that nice VM, you’re going to run about $2.64 an hour.
But there’s another really important thing here: by separating the storage and compute, you can now easily adjust the relative amount you spend on each. That’s exactly what we’ve done for years at Altinity.Cloud when managing ClickHouse.
[20:00] — Effect of Storage and Compute Separation on ClickHouse Prices
Robert: By having separation of storage and compute, at any time you can change the CPUs and VMs you’re using and just remap them to the same storage. What that means is that depending on what you’re doing, you can scale your costs up and down simply by changing the VM types.
This graph shows the all-in, on-demand cost for Altinity.Cloud when it’s running ClickHouse. You can see that the m6i.12xlarge is going to cost about $6 an hour to operate, which is more expensive than Redshift. But what’s important is that ClickHouse is really fast, and m6i instances are faster than the Redshift equivalents. When we deploy, we would probably recommend smaller instance sizes, which means your costs are correspondingly a little lower, and you can of course adjust them. This illustrates the kind of benefits you get from this separation of storage and compute.
[21:39] — Quick Comparison of Cloud Database Pricing Models
Robert: Let me do a quick comparison. This wallet-size table shows the different ways these models compare.
The buy-the-box model gives you relatively cheap compute but more expensive storage, because you’re using block storage that is on average about five times more expensive than object storage.
Snowflake and BigQuery give you cheap storage but ding you in other ways. In the Snowflake model, compute is relatively expensive, up to 10x what you would pay if running it yourself. In the BigQuery on-demand query model, the cost can be extremely expensive if you have queries that scan a lot of data.
What this means for use cases: buy-the-box is really good for customer-facing analytics where you can’t control what customers are going to do or when, and having fixed boxes means you won’t get cost overruns. The virtual data warehouse model can be very expensive if you need to allocate a lot of compute for complex queries. BigQuery can be very expensive if you unexpectedly scan more data than you planned, which is difficult to control up front.
[23:36] — What Snowflake Does Well and Where It Falls Short
Robert: Given that the Snowflake markup is high, let’s zero in on what value you’re getting from Snowflake, because costly things are often costly because they’re delivering something good.
Snowflake is very good. It’s a general-purpose database. One of the best things about it is that you can put practically any analytic app on Snowflake and it’s going to work. It might be more expensive than other models, but if you’re operating across an entire company and you just want to pick one thing that will work for everything, Snowflake is probably it. It can handle vastly different use cases, it has a very good SQL implementation, great integration with tools, and people who’ve worked with BI in the past on things like Teradata or Vertica can convert to Snowflake without having to understand too much about what’s going on underneath.
What it doesn’t do: it doesn’t keep the data in your own account, so if you care about that it’s a problem. It doesn’t help minimize cost for compute-intensive 24×7 or real-time analytics. It’s not particularly good for tenant-facing analytics, like if you have thousands of tenants with dashboards. And finally, it’s completely proprietary, so if you want to get off it you’re talking months at the very least to port to a different implementation.
That said, a classic example of when Snowflake makes sense is when a company wants to make one choice that works across the entire enterprise.
[25:51] — Six Tips for Getting a Better Price on Cloud Analytics
Robert: If you’re going to use Snowflake or any other cloud service, what can you do to get a better price? Here are six tips based on a lot of time we spend thinking about pricing, since we’re offering a service ourselves.
First, look for decoupled storage and compute. You don’t want to buy into databases that can’t scale up or down. People tend to get fixated on object storage as the only way to scale compute, but in fact block storage and VMs scale very well and are super well suited for many analytic applications, particularly 24×7 real-time analytics.
Second, make sure that if you’re charged for storage, it’s for compressed storage. Snowflake compresses the storage, ClickHouse compresses the storage, BigQuery doesn’t necessarily do it. It’s called logical versus physical pricing. Since for columnar databases you can often get 90% compression, you really want to make sure you’re getting this.
Third, if you’re spending a lot of money per month, your vendor should be giving you a discount without you even having to ask. If you’re not getting it, ask for it.
Fourth, look for price breaks that align with the vendor’s discounts. Just as one example, unless you’re really big it’s kind of hard to get discounts on storage in the cloud, but compute is heavily discounted. You can get discounts on Amazon VMs of 50% or more simply by prepaying. Your vendor can get that discount. If you do a prepay, you can basically reduce that part of the cost, which could be a major savings of a third or more.
Fifth, buy on the cloud marketplace, for example the Amazon Marketplace, and apply that spend toward your existing commits. Even though it doesn’t lower the price directly, it lowers your overall cost because it allows you to meet your obligations to Amazon or Google so that you can get other types of discounts.
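Two of these tips, compressed-storage billing and compute prepay, are easy to quantify. The sketch below uses the 90% compression and 50% prepay discount mentioned above plus the $23 per terabyte storage price quoted earlier; the 100 TB data size and the split of the monthly bill between compute and storage are hypothetical numbers chosen for illustration.

```python
STORAGE_PRICE_PER_TB_MONTH = 23.0  # low end of the storage range quoted earlier


def billed_storage_tb(logical_tb: float, compression: float, bill_compressed: bool) -> float:
    """Return the terabytes that appear on the bill.

    `compression` is the fraction of size removed (0.9 means 90% smaller).
    """
    if not 0 <= compression < 1:
        raise ValueError("compression must be in [0, 1)")
    return logical_tb * (1 - compression) if bill_compressed else logical_tb


def bill_with_prepay(compute: float, storage: float, compute_discount: float) -> float:
    """Return the total bill when only the compute part is discounted."""
    return compute * (1 - compute_discount) + storage


# Tip two: logical versus physical storage billing for 100 TB of raw data.
logical_bill = billed_storage_tb(100, 0.90, False) * STORAGE_PRICE_PER_TB_MONTH
physical_bill = billed_storage_tb(100, 0.90, True) * STORAGE_PRICE_PER_TB_MONTH
print(f"Billed on raw size: ${logical_bill:,.2f}/month")
print(f"Billed on compressed size: ${physical_bill:,.2f}/month")

# Tip four: a hypothetical bill that is two-thirds compute, one-third storage.
compute, storage = 2000.0, 1000.0
before = compute + storage
after = bill_with_prepay(compute, storage, 0.50)
saving_fraction = (before - after) / before
print(f"Prepay saving on the whole bill: {saving_fraction:.0%}")
```

With that compute-heavy split, a 50% compute discount removes a third of the total bill, which matches the order of magnitude described above; a storage-heavy bill would see less.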
[28:18] — When Is Cloud Analytic Database Pricing a Good Deal?
Robert: When is it cost-efficient and good for your business? It really depends on your business, for many different reasons. I’d point to two things.
One: as your revenue grows, the cost of running analytics should grow no faster than your revenue. What you’re seeking as you build services is that your marginal cost of adding new users goes down over time. That’s what makes SaaS applications work. If you have an analytic database that gets more expensive per customer as you add more customers, that’s a problem. Do the math and figure out if that is the case.
Two: see whether vendor pricing is in line with the vendor’s costs. You may have noticed when I showed you the Snowflake P&L that their pricing didn’t actually cover all their costs. In Snowflake’s case, they’re doing a lot of investment so that will probably even out. But if somebody is charging essentially at cost or below it, that could be a big problem. At some point gravity will reassert itself and the prices will change.
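The per-customer check in the first point can be done with a few months of billing data. This is a minimal sketch; the customer counts and monthly costs are made-up numbers for illustration, and the function names are not from any library.

```python
def cost_per_customer(monthly_costs: list[float], customers: list[int]) -> list[float]:
    """Return analytics cost per customer for each period."""
    if len(monthly_costs) != len(customers):
        raise ValueError("both series must cover the same periods")
    return [cost / count for cost, count in zip(monthly_costs, customers)]


def unit_cost_is_falling(unit_costs: list[float]) -> bool:
    """True if every period is cheaper per customer than the one before."""
    return all(later < earlier for earlier, later in zip(unit_costs, unit_costs[1:]))


# Hypothetical data: customer counts at three points in time.
customers = [100, 200, 400]

# Scenario A: cost grows more slowly than the customer base.
healthy = cost_per_customer([5000.0, 8000.0, 12000.0], customers)

# Scenario B: cost grows faster than the customer base.
unhealthy = cost_per_customer([5000.0, 11000.0, 26000.0], customers)

print("Scenario A per-customer cost:", healthy, "falling:", unit_cost_is_falling(healthy))
print("Scenario B per-customer cost:", unhealthy, "falling:", unit_cost_is_falling(unhealthy))
```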
[30:28] — Picking a Specific Problem: Building a GDPR-Compliant Analytics Stack
Robert: Let’s think about what would happen if a cloud service wasn’t a good deal. What are your options for doing it differently? It helps to pick a specific problem.
Say you’re in Europe and you’ve decided your destiny is to build a GDPR-compliant replacement for Google Analytics. Lots of people are in fact doing this, including Altinity customers. This has a bunch of features, from analytic queries to data pipelines to visualization and consumption. We’re going to build the core platform using open source.
First, let’s do a reality check against Snowflake. Why don’t we use it? Snowflake has strengths that are nice in this case: it’s serverless, meaning you don’t have to worry too much about instances, and it has a UI with SQL editing, management, and lots of tools. Those things are great. But many of Snowflake’s strengths don’t really matter for this specific application. Having standards-compliant SQL is nice, but you’re not porting anything. And the Snowflake weaknesses vastly outweigh the benefits here: just being able to keep the data in your own cloud account is necessary to efficiently meet GDPR requirements. You’re also concerned about data lock-in. So Snowflake isn’t the right solution.
[32:02] — The Modern Analytics Stack Built on Kubernetes and Open Source
Robert: What we see a lot of people doing in the marketplace is building what we call the modern analytics stack, which is based on Kubernetes. The idea is you’re going to build a custom platform that delivers exactly the analytics you need for your particular application.
The key things we consistently see when this goes well: people use cloud-native technology, specifically Kubernetes, which is very good for distributed applications. They use open source. They use infrastructure as code, defining the whole stack as something that can be checked into GitLab or GitHub. They use GitOps to deploy it, basically taking that checked-in code and blasting it out to one or more environments. And finally they use Kubernetes and clouds as the runtime.
There’s a pattern to these stacks. You have GitOps and CI/CD and Kubernetes as the base. Within that you have layers: a management layer that covers things like operators, a storage layer which is your database engines, an orchestration layer for data pipelines and change data capture, and finally a consumption layer for BI tools, your own applications, and APIs.
[33:50] — Choosing the Components: Kubernetes, Database, Operator, Observability, GitOps
Robert: The first thing to do is choose a Kubernetes distribution. By far the best practice is to use managed Kubernetes. If you’ve got to run Kubernetes, let somebody else run it for you. You’ve heard that Kubernetes is complicated, and that’s true, but it’s mostly complicated to run. Actually building applications on it is not anywhere near as hard.
The managed Kubernetes available from the major public clouds is amazingly cheap. In a truly remarkable coincidence, they all charge 10 cents an hour. Amazon EKS, for example, works out to about $72 a month. You’d be crazy not to use this. We use it ourselves; we run hundreds of clusters on managed Kubernetes.
Second, pick the right open-source analytic database. There are many options, and they are less general-purpose than Snowflake or BigQuery. In our particular case, ClickHouse is absolutely the right choice because it does real-time analytics. Web analytics was actually the original use case for ClickHouse when it was developed 15 or so years ago.
Third, if you’re running databases in Kubernetes you want to use an operator to manage them. In the case of ClickHouse there’s a really good one, which is the Altinity Kubernetes Operator for ClickHouse®. The way operators work is they define what’s called a custom resource definition, which makes the database a new type of resource in Kubernetes. There’s a program called kubectl; if you’ve used Kubernetes you’re familiar with it. You just apply a piece of YAML, maybe 12 to 24 lines that defines the database in a custom format. Kubernetes will recognize it as a resource the Altinity Kubernetes Operator needs to manage, hand it over, and the operator will adjust reality. It will define new resources in Kubernetes that make that database turn into reality.
Fourth, choose your observability platform. One that’s particularly powerful in Kubernetes is Prometheus, which is virtually universal for monitoring. It’s a time-series database that will collect data out of things like the Altinity Kubernetes Operator, which will automatically export metrics. Then you use Grafana to build dashboards. Grafana is very powerful for operational dashboards; it allows you to zoom in, look at particular series, and change the time scale. It has great plugins for both ClickHouse and Prometheus. We maintain the Altinity Grafana Plugin for ClickHouse®; it has about 11 million downloads and is used in thousands of installations.
Fifth, pick a GitOps implementation. What you want to be able to do is check in the code that defines the stack and then apply it to Kubernetes clusters in different locations. One we’ve seen used very successfully, and use ourselves, is Argo CD. Argo CD will take projects located in GitHub, which can define the different services we’re running in Kubernetes, and then sync that code to Kubernetes, which causes your application to stand up and run.
[40:42] — Live Demo: Bringing Up ClickHouse on Kubernetes with Argo CD
Robert: Okay, let’s go out to Amazon. I’m now in this screen and let’s have a look at what’s going on in my namespace.
I have a namespace called CH and I’m looking for pods. Pods are running containers, which in turn are one or more running processes in Kubernetes. There’s nothing out there right now. What I’m going to do is bring up the entire stack using Argo CD.
I’ve got Argo CD installed. I’m in this demo project, which you’re free to use. Let’s look at how the application is put together. We’ll pick the ClickHouse app and look at the manifest. Here it is. This is the manifest that defines a ClickHouse cluster. It shows things like what version we want, how many shards, how many replicas, and so on. It’s all defined in about 49 lines of YAML. Very simple.
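The manifest shown in the demo is not reproduced in this transcript. As a rough idea of its shape, here is a sketch that builds a minimal ClickHouseInstallation resource as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The field names follow the operator's published examples as best recalled, so check them against the operator documentation; the resource name, cluster name, and image value are placeholders, and the ZooKeeper or ClickHouse Keeper settings that replication needs are left out.

```python
import json


def clickhouse_installation(name: str, cluster: str, shards: int,
                            replicas: int, image: str) -> dict:
    """Build a minimal ClickHouseInstallation custom resource."""
    return {
        "apiVersion": "clickhouse.altinity.com/v1",
        "kind": "ClickHouseInstallation",
        "metadata": {"name": name},
        "spec": {
            # Use the pod template defined below for every server pod.
            "defaults": {"templates": {"podTemplate": "server"}},
            "configuration": {
                "clusters": [
                    {
                        "name": cluster,
                        "layout": {"shardsCount": shards, "replicasCount": replicas},
                    }
                ],
            },
            "templates": {
                "podTemplates": [
                    {
                        "name": "server",
                        "spec": {
                            # The image tag controls the ClickHouse version.
                            "containers": [{"name": "clickhouse", "image": image}]
                        },
                    }
                ],
            },
        },
    }


# Placeholder values for illustration only.
manifest = clickhouse_installation(
    name="demo", cluster="analytics", shards=1, replicas=2,
    image="example.invalid/clickhouse-server:CHANGE-ME",
)
print(json.dumps(manifest, indent=2))
```

Keeping the version in a single image field is what makes the later upgrade a one-line change in Git.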
Looking at the apps in here, they have different ways of installing. Prometheus installs through Helm. ZooKeeper we’re doing through manifests. CloudBeaver we’re doing through manifests. Grafana you can install with an operator. So each of these is defined.
Now what we’re going to do is tell Argo CD to bring them up. Argo CD also has a handy command-line tool. Let’s list our apps. Nothing running right now. Let’s run a script that’s going to bring the whole thing up from soup to nuts. These are the commands: we tell Argo CD where each of these applications is located, give it a repo and a path within the repo, give it a destination Kubernetes cluster. That’s the app create. Then we do an app sync, which tells Argo CD to go ahead and actually install it.
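The demo script itself is not included in the transcript. The sketch below shows what the two steps might look like by assembling `argocd app create` and `argocd app sync` command lines in Python and printing them without running anything. The repository URL and paths are hypothetical, the namespace is assumed to be the lowercase form of the demo's namespace, and the destination server is the conventional in-cluster address, which may differ in your setup.

```python
import shlex


def app_create_cmd(name: str, repo_url: str, path: str,
                   dest_server: str, dest_namespace: str) -> list[str]:
    """Build the command that registers an application with Argo CD."""
    return [
        "argocd", "app", "create", name,
        "--repo", repo_url,            # Git repository holding the manifests
        "--path", path,                # directory within that repository
        "--dest-server", dest_server,  # target Kubernetes API server
        "--dest-namespace", dest_namespace,
    ]


def app_sync_cmd(name: str) -> list[str]:
    """Build the command that tells Argo CD to apply the application."""
    return ["argocd", "app", "sync", name]


# Hypothetical repository and layout, for illustration only.
REPO = "https://github.com/example/analytics-stack.git"
APPS = {"zookeeper": "apps/zookeeper", "clickhouse": "apps/clickhouse"}

commands = []
for app_name, app_path in APPS.items():
    commands.append(app_create_cmd(app_name, REPO, app_path,
                                   "https://kubernetes.default.svc", "ch"))
    commands.append(app_sync_cmd(app_name))

for command in commands:
    print(shlex.join(command))
```

To actually run them, each list could be passed to `subprocess.run(command, check=True)` on a machine where the `argocd` CLI is installed and logged in.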
Let’s run that script. I press enter and you can see these commands firing one after another. Now we’re going to see, as the sync operation proceeds, a bunch of stuff going on. We can already see out in the stack these things coming up. It will take a little bit of time; for example, this is a ClickHouse instance coming up.
Let’s see what Argo CD thinks is going on. This is Argo CD’s view of the world. It tells us that not only are the applications created, from ClickHouse down to ZooKeeper, but they are fully synced. “Progressing” means the sync process is still in progress. Some of them where it says “healthy” are already up and running. Grafana is still coming up, Prometheus is still coming up.
Let’s see how our pods are doing. We’ve got a ClickHouse server coming up here. Let’s go ahead and log into that. We’ll do what’s called a kubectl exec command, which means we can go to the pod and actually look around. Okay, we’re now in the pod, which is the running container. If we do a ClickHouse client, you can see that we are logged in and we’re using ClickHouse server version 22.3.
We can do SHOW DATABASES. There are our databases. We can run queries. This is a fully functioning ClickHouse server. By the time we get out, we probably have both of them up and running. There they are. The entire stack is now up and running.
Robert: One other thing I’d like to demo very briefly is that we can actually change it. Let me exit. I’m working out of Git here. Let’s do a git status. I’ve actually made a change. Let me diff it. What I’ve done is changed the image because this was a typo. When we set up the example it should be the new version, not the old one. I’d like to update the servers.
Instead of submitting this directly, I’m going to add that file to staging, commit it, and then push to GitHub.
Okay, it’s pushed. I just changed this in GitHub. Now I need to get Argo CD to sync this for me. Let me say argocd app list. There it is. Now I’ll say argocd app sync clickhouse.
What will happen is that once Argo CD picks up this change, we’ll see the synchronization occurring. What’s happening in the background is the operator will wake up and notice that the version has changed. It will actually terminate the pod, bring up a new pod with the new version, and that’s how the upgrade works. It takes a few minutes because it has to allocate background VMs. I’m not going to wait, but that’s basically how this works.
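The change pushed in the demo amounts to editing one image reference in the ClickHouse manifest. The sketch below models a trimmed `ClickHouseInstallation` resource as a Python dictionary and checks that an upgrade touches only that one field. The resource name, template name, and image tags are placeholders, not the values used in the demo; the field layout is assumed to follow the operator's `clickhouse.altinity.com/v1` resource format.

```python
import copy
import json

# Trimmed manifest; names and image tags are illustrative placeholders.
MANIFEST = {
    "apiVersion": "clickhouse.altinity.com/v1",
    "kind": "ClickHouseInstallation",
    "metadata": {"name": "demo"},
    "spec": {
        "defaults": {"templates": {"podTemplate": "clickhouse-pod"}},
        "configuration": {
            "clusters": [
                {"name": "cluster", "layout": {"shardsCount": 1, "replicasCount": 1}}
            ]
        },
        "templates": {
            "podTemplates": [
                {
                    "name": "clickhouse-pod",
                    "spec": {
                        "containers": [
                            {"name": "clickhouse", "image": "example/clickhouse-server:OLD"}
                        ]
                    },
                }
            ]
        },
    },
}


def with_image(manifest, image):
    """Return a copy of the manifest with every container image replaced."""
    updated = copy.deepcopy(manifest)
    for template in updated["spec"]["templates"]["podTemplates"]:
        for container in template["spec"]["containers"]:
            container["image"] = image
    return updated


def changed_paths(old, new, prefix=""):
    """List the paths whose values differ between two manifests."""
    if isinstance(old, dict) and isinstance(new, dict):
        paths = []
        for key in sorted(set(old) | set(new)):
            paths += changed_paths(old.get(key), new.get(key), f"{prefix}/{key}")
        return paths
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        paths = []
        for i, (a, b) in enumerate(zip(old, new)):
            paths += changed_paths(a, b, f"{prefix}/{i}")
        return paths
    return [] if old == new else [prefix]


upgraded = with_image(MANIFEST, "example/clickhouse-server:NEW")
print(changed_paths(MANIFEST, upgraded))
print(json.dumps(upgraded["spec"]["templates"], indent=2))
```

Committing that single-field change and then running `argocd app sync clickhouse` is what prompts the operator to replace the pod with one running the new image.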
What you see me doing is a database upgrade through GitHub. This is a really powerful way to develop apps, and it’s the reason why this custom approach works: a relatively small team can automate everything, store it in GitHub, and keep everything in sync while incrementally improving it without getting confused about what they’re doing.
[50:05] — Best Practices for Do-It-Yourself Modern Analytics Stacks
Robert: Let me go back to the slides. Here are some best practices we’ve seen consistently.
Build on managed Kubernetes. Use GitOps with GitHub; I used Argo CD in this case, but an even larger number of our customers are using Terraform for everything, including actually applying schema changes to ClickHouse.
One important point: by developing applications this way, you’re not precluded from using cloud services when appropriate. The stack I built didn’t include Kafka. Many of our customers use a cloud version of Kafka, like AWS MSK or Confluent. You don’t have to make the all-or-nothing choice; you can go back and forth.
Robert: Just a little bit about how we help enable this model of building modern analytics stacks that allow you to keep the data on-prem, meet these additional requirements, and operate cost-efficiently at large scale.
We have Altinity.Cloud, the managed platform for ClickHouse®. One of the cool features is that if you’re running the stack in Kubernetes, we can actually manage the ClickHouse part for you. We have customers who run everything on Kubernetes but leave the actual management of ClickHouse itself to us. We give them a cloud management plane that connects securely to the Kubernetes cluster. Setup is very simple: you just run a connector app that registers with Altinity.Cloud, and you’re now able to manage ClickHouse.
We also provide software. You’ve seen a few examples: the Altinity Kubernetes Operator for ClickHouse® is used in thousands of installations worldwide. Altinity Stable® Builds for ClickHouse® have a three-year maintenance window and are certified for production use. We have Altinity Backup for ClickHouse®, and we manage the Altinity Grafana Plugin for ClickHouse® which has 11 million downloads. We also have a Tableau Connector among others.
Finally, we have services. Getting these things to work if you’re not familiar with them is much easier when you have somebody who knows their way around. We know ClickHouse as well as anybody in the world, and we are surely among the best people at running it on Kubernetes. We can help you with proofs of concept, and we’re building blueprints like the one I just showed you to bring these stacks up more efficiently.
[53:16] — Blueprint, Closing Remarks, and Q&A
Robert: Here’s how to get started with the blueprint. You can just clone the examples I showed you and run them yourself. The blueprint is not fully complete; we’re still doing some work. One thing we really want is for the Grafana dashboards to come up automatically. We’re working on the code this week; it should be fixed in a few days. But try it out. It really shows the power of Argo CD.
With that, we’re done. I’d like to thank you for listening in and I’d be delighted to answer any questions about this. If you haven’t used Kubernetes before, it can seem a little intimidating, but it’s a very powerful way to build these stacks. This modern analytics stack is widely used, and people still use cloud services as well. I hope this talk has given you hints about the trade-offs, how to get the best deal for your cloud analytics service, and how to build it yourself for specific applications.
Thanks again, and feel free to contact us at any time through the website. We’d love to talk to you.
Robert: You’re most welcome. Great to see you again. I think we don’t have questions queuing up. I hope this was helpful. We’ll go ahead and close the webinar for today. Thank you very much and have a great day.
FAQ
What makes ClickHouse a cost-effective alternative to Snowflake for real-time analytics?
ClickHouse separates storage and compute, allowing teams to independently scale each resource and swap VM types without migrating data. It stores data in a columnar format with high compression rates, often achieving 90% compression, meaning you pay for far less storage than with uncompressed databases. For workloads running 24×7, this architecture consistently undercuts the Snowflake per-compute-credit model by a significant margin.
How does the Altinity Kubernetes Operator for ClickHouse simplify cluster management?
The Altinity Kubernetes Operator for ClickHouse® implements the Kubernetes operator pattern, allowing users to define an entire ClickHouse cluster, including shards, replicas, storage configuration, and version, in a compact YAML manifest of roughly 12 to 49 lines. The operator then translates that manifest into the necessary Kubernetes resources automatically. Upgrades, scaling, and configuration changes can be triggered simply by updating the manifest and syncing via GitOps tools like Argo CD.
What is the difference between Snowflake’s pricing model and BigQuery’s on-demand pricing?
Snowflake charges primarily through compute credits tied to virtual data warehouse size, with minimal markup on object storage. BigQuery’s on-demand model charges $6.25 per terabyte of data scanned per query, which means a single large query can generate hundreds or thousands of dollars in charges. Snowflake tends to be predictable for always-on workloads, while BigQuery’s on-demand model can be unpredictable if query data volumes are hard to control.
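As a worked example of the on-demand arithmetic, the sketch below multiplies the data scanned by the $6.25 per terabyte rate quoted above. The scan sizes and the daily query count are made-up inputs, and the calculation deliberately ignores free tiers, minimum billing increments, and any negotiated discounts.

```python
from decimal import Decimal

PRICE_PER_TB = Decimal("6.25")  # on-demand rate quoted in the text, USD per TB scanned


def query_cost(tb_scanned):
    """Cost in USD of one on-demand query that scans `tb_scanned` terabytes."""
    return PRICE_PER_TB * Decimal(str(tb_scanned))


def monthly_cost(tb_scanned, queries_per_day, days=30):
    """Cost of running the same query `queries_per_day` times a day for `days` days."""
    return query_cost(tb_scanned) * queries_per_day * days


# Hypothetical scan sizes, from a small lookup to a full scan of a large table.
for tb in (0.1, 1, 40, 200):
    print(f"{tb:>6} TB scanned -> ${query_cost(tb):.2f}")

# A dashboard query that scans 1 TB and runs 100 times a day (assumed workload).
print(f"monthly: ${monthly_cost(1, 100):.2f}")
```

Because cost scales linearly with bytes scanned, a repeated query of modest size can cost more over a month than a single very large one, which is why scan volume is the number to control under this model.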
What is a modern analytics stack built on Kubernetes and what components does it typically include?
A modern analytics stack built on Kubernetes uses managed Kubernetes (such as Amazon EKS) as the runtime, an open-source analytic database like ClickHouse as the query engine, the Altinity Kubernetes Operator for ClickHouse® for database lifecycle management, Prometheus and Grafana for observability, and Argo CD for GitOps-based deployment. Infrastructure is defined as code, checked into GitHub, and deployed automatically, allowing a small team to manage a full production-grade analytics platform efficiently.
How can you get discounts on cloud analytics pricing?
The most impactful tactic is to prepay for compute resources. On Amazon, prepaying can yield discounts of 50% or more on VM costs, and cloud vendors can pass those discounts on to customers. It also helps to ensure storage billing is based on compressed rather than raw data size, use cloud marketplaces to apply spend toward existing cloud commit obligations, and simply ask for a volume discount if you’re spending significant amounts monthly without one.
When should a business choose to build its own ClickHouse-based stack instead of using Snowflake or BigQuery?
Building your own stack makes strong sense when you need data sovereignty or GDPR compliance, when you’re running 24×7 real-time analytics where Snowflake or BigQuery compute costs would compound, when you need tenant-facing dashboards for potentially thousands of end customers, or when you want to avoid proprietary lock-in. The investment in a Kubernetes-based open-source stack pays off most clearly when the analytics workload is well defined and the team has or can acquire the necessary operational expertise.
© 2023 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.