Run ClickHouse® like a Cheapskate – 6 Ways to Save Money While Delivering Real-Time Analytics

Recorded: May 10 @ 10:00 am PT
Presenters: Robert Hodges & Altinity Engineering
In this cost-focused webinar, Altinity CEO Robert Hodges walks through six concrete strategies for reducing the operating cost of ClickHouse® analytics applications without sacrificing performance. He opens by cataloging the five principal cost drivers in analytic databases: proprietary licensing, storage, compute, networking, and human labor, then spends the session on practical techniques addressing the first four.
The first strategy is developing entirely on a laptop using 100 percent open-source software. Robert demonstrates three installation patterns including native packages, Docker, and Docker Compose for multi-service stacks, pointing out that the savings on licensing are literally 100 percent and that many large-scale production deployments started life exactly this way.
The second strategy is tuning applications to limit resource usage, covering schema design choices such as data type selection, codecs, and ORDER BY optimization, which can reduce storage by 80 percent or more. Robert explains how to use the ClickHouse system tables, particularly system.query_log, to analyze query resource consumption and shows two queries with radically different profiles to illustrate how rewriting queries can eliminate wasteful memory allocations.
The third strategy is using TTL rules to manage data growth. Robert covers both row-level TTL deletion and the more powerful TTL recompression feature, which progressively applies deeper ZSTD compression as data ages, yielding storage savings of around 50 percent at six months compared to fresh data while keeping the data accessible.
The fourth strategy is tiered storage, using ClickHouse storage configurations to organize disks into volumes and policies that automatically move aging data from fast local NVMe SSD to slower, cheaper block storage or S3 object storage based on TTL rules. Robert notes that S3-backed cold tiers are evolving quickly and represent the most significant long-term savings opportunity.
The fifth strategy is scaling compute down when it is not needed. Using Kubernetes and a provisioner like Karpenter, the VM type for a ClickHouse cluster can be changed with a single configuration update, or compute can be stopped entirely by setting replicas to zero, with block storage persisting untouched. Robert cites a real customer running at one million dollars per month, where a 50 percent compute reduction saves half a million dollars monthly.
The sixth strategy is simply using cheaper infrastructure: switching from gp2 to gp3 EBS storage for a quick 20 percent saving, committing to AWS Savings Plans for up to 54 percent off compute, benchmarking ARM-based Graviton instances which can be 15 to 20 percent cheaper at the same or better performance, and for stable workloads considering managed hosting providers like Hetzner, which can be up to 90 percent cheaper than major public cloud on-demand pricing.
The webinar closes with a summary table contrasting short-term versus long-term impact of each strategy, and an extended Q&A on build-versus-buy decisions for managed ClickHouse services, the role of open-source compatibility in vendor selection, and how to evaluate total cost of ownership when considering services with proprietary forks.
Key Moments (Timestamps)
Key moments generated with AI assistance.
- 0:00 – Introduction and housekeeping
- 1:12 – Robert Hodges background and Altinity overview
- 3:00 – ClickHouse overview: open-source, columnar, vectorized, Apache 2.0
- 5:29 – GraphCDN real-world ClickHouse application example
- 6:53 – Five principal operating cost categories in analytic databases
- 8:44 – Tip 1: Develop on your laptop with 100% open source
- 9:41 – Three laptop installation patterns: native packages, Docker, Docker Compose
- 12:00 – Docker Compose multi-service development walkthrough
- 14:37 – Savings from open-source development: 100% licensing cost reduction
- 15:31 – Tip 2: Tune applications to limit resource usage
- 16:10 – Schema design: data types, codecs, compression, ORDER BY
- 18:01 – Impact of schema tuning: 80% storage reduction example
- 18:01 – ClickHouse system tables as a performance tuning tool
- 19:01 – Analyzing queries with system.query_log: duration, memory, rows read
- 21:15 – Q&A: performance trade-offs of ZSTD vs. LZ4 compression
- 23:22 – Tip 3: Use TTLs to limit data growth and recompress aging data
- 24:49 – TTL row deletion: automatically drop rows after 12 months
- 24:49 – TTL recompression: progressively apply ZSTD at 1 and 6 months
- 26:53 – Tip 4: Use tiered storage to reduce storage cost over time
- 27:22 – Time-series access patterns and storage tier design
- 28:33 – Storage configurations: disks, volumes, policies explained
- 30:40 – Table definition with TTL MOVE between hot and cold volumes
- 31:27 – Adding S3 as the cold tier for maximum long-term savings
- 33:14 – Q&A: using system.query_log to determine optimal TTL intervals
- 34:22 – Tip 5: Scale compute down when not needed
- 34:41 – EBS block storage and VM resizing in the cloud
- 36:18 – Kubernetes and Karpenter: automatic VM provisioning on pod changes
- 37:41 – ClickHouse operator YAML: changing instance type with one field
- 39:01 – Patterns: cyclical scaling, tiered compute, and stopping clusters entirely
- 41:18 – Compute savings: up to 75% per resize, ~50% for real production systems
- 41:47 – Tip 6: Use cheaper infrastructure
- 42:27 – EBS gp3 vs. gp2: same performance, ~20% cheaper
- 43:08 – AWS Savings Plans: 27% with one-year commit, 54% with prepay
- 44:59 – Graviton ARM instances: 15–20% cheaper and often faster than Intel
- 46:30 – Managed hosting providers like Hetzner: up to 90% cheaper than public cloud
- 48:26 – Summary: short-term vs. long-term savings across all six tips
- 50:43 – Q&A: when to use a managed ClickHouse service vs. self-managed
- 53:50 – Q&A: importance of open-source compatibility in vendor selection
Webinar Transcript
[0:00] — Introduction and Housekeeping
Robert: Okay, everyone. I’d like to welcome you to “Run ClickHouse® Like a Cheapskate: Six Ways to Save Money While Delivering Real-Time Analytics.” My name is Robert Hodges and I’ll be your presenter today. I’m also backed up by Altinity engineering, who helped supply a lot of the knowledge that went into the presentation you’ll be hearing.
Before we get into this, just a few housekeeping items that will help you enjoy this webinar more. It is being recorded, and you probably heard the notice when you came in. We will post the recording as well as the slide links a short time after the webinar finishes. We do have time for questions. If you see things that interest you and you’d like to learn more, feel free to pop a question into the question-and-answer box on the Zoom bar at the bottom of your screen, or you can type it into the chat. If it’s something relevant to what I’m covering, I’ll just dive in and answer it right there. We’ll also have time at the end.
[1:12] — Speaker Introduction and Altinity Overview
Robert: Let me tell you a little more about my background. I’m a database geek. I’ve been working on databases for about 40 years. I started with pre-relational systems, worked through relational systems like Sybase, Oracle, and MySQL, and then I’ve landed in analytics systems. I love working with databases. It’s been the main thing I’ve done in my career. My current day job is Altinity CEO, so I don’t get to write as much code as I used to, but I do get to deal with a lot of really fun issues and see applications that people are building. We have a bunch of people who are database geeks like me, and together we have centuries of experience, much of it focused on analytic databases. We have the best support team on the planet for ClickHouse. These are folks who’ve run enormous clusters and have a great deal of experience running very large analytic applications at scale.
As a company, we’re an enterprise provider for ClickHouse. We have a couple of hundred customers. One of the main things that we offer is Altinity.Cloud, a cloud platform for running ClickHouse. The interesting thing about what we do at Altinity is we let you run ClickHouse anywhere. It doesn’t matter where you’re running: you could be in our cloud, you could be managed by our cloud but running in your own VPCs, or you could be self-managed, for example in appliances that you’re shipping. We support all of it, and we have software to help you make it work. We’re the authors of the Altinity Kubernetes Operator for ClickHouse® and a bunch of other open-source projects.
[3:00] — ClickHouse Overview
Robert: Let’s introduce ClickHouse. For people on this call who haven’t heard of it, it’s a real-time analytic database. You can think of it as being in some ways like MySQL. It understands SQL, and in fact the dialect has a number of similarities to MySQL. Like MySQL, it runs pretty much anywhere: bare metal, cloud. Like MySQL, it is open source, though ClickHouse has a freer license. It’s licensed under Apache 2.0, so you can use it for any business purpose anywhere you want.
Where it differs from MySQL, and is more like a true analytic database, is that it is a column store. The data are stored in columns instead of rows. This allows us to compress those columns very effectively, and it means that if you don’t refer to a column in a query, we don’t need to read it. This is a standard feature in virtually all analytic databases today. It has a shared-nothing architecture, so you have a bunch of ClickHouse servers communicating with each other over the network, each typically taking control of a certain fraction of the storage. It is also very good at parallel and vectorized execution. By parallel execution we mean it can run a bunch of threads simultaneously. Vectorized execution means that we treat data we’re reading out of the columns as vectors, which line up with machine caches and can even be pushed onto SIMD registers on Intel processors as well as ARM. It scales to many petabytes. There are many applications running 10 petabytes or more on ClickHouse, and these capabilities have turned ClickHouse into a core engine for low-latency analytics, from web analytics, which was really the canonical use case, all the way to real-time marketing and managing financial market data.
[5:29] — ClickHouse in a Real Application: GraphCDN
Robert: Let’s have a quick look at how ClickHouse fits into a real application, because it illustrates how analytic databases are used.
This is actually a picture from a customer of ours, GraphCDN. They did a blog article with us a couple of years back. Like many applications, ClickHouse and the analytic database sit in the middle between your sources of data and your consumers. In this case, GraphCDN reads performance telemetry from Fastly and puts it into Kafka event streams, which are then delivered to the analytic database, in this case ClickHouse. On the consumer side, they have a GraphQL API which serves it up in a JavaScript front end that allows people to see the data, manipulate it, and understand relationships. What GraphCDN is doing is providing metrics and performance monitoring for GraphQL APIs, giving people ways to quickly change caching across the internet to make those APIs extremely efficient. It can do this essentially in real time because of ClickHouse.
[6:53] — Five Principal Operating Cost Categories
Robert: With this in mind, we can start to think about what drives cost in this type of application, particularly when we look at the analytic database.
There are five things that make analytics expensive. One is proprietary licenses. It’s only recently that analytic databases have started to become available in open source. Storage is an obvious component because we’re dealing with large amounts of data. Compute is required to access that storage and deliver answers quickly. Networking can also be important, not in all cases, but if you’re running clusters that deliver data across distance or where cluster elements are in different data centers, networking costs can become a significant proportion of overall system cost. And then of course human labor: the effort required to set things up and build the system in the first place.
In this talk we’re mostly going to focus on the first four. Human labor is a completely different topic with a whole bunch of different aspects. We’ll come back in a later webinar and talk about that part. For today, these first four areas are where we’re really going to focus.
[8:44] — Tip 1: Develop on Your Laptop with 100% Open Source
Robert: Let’s look at how to make things cheaper. One of the best ways to save money, particularly early on, is that with ClickHouse you don’t have to go anywhere to do development. You don’t need a license. You don’t need to go to a cloud. Most applications, or a significant chunk of them, you can just develop on your laptop using 100 percent open source. This is a key way to begin building on ClickHouse while spending little or nothing.
[9:41] — Three Laptop Installation Patterns
Robert: There are basically three patterns for how you can install ClickHouse and use it on a laptop. First, you could just install the ClickHouse packages directly. If you’re running Ubuntu, my favorite Linux distribution for development, I would set up the repo, run a simple sudo apt install, and install both the ClickHouse server and client. Those go straight onto the laptop. I just start it, and I have a running ClickHouse right in front of me.
If I’m on a different Linux distribution, on macOS, or anywhere else, what I can do is run it in Docker. A container is a pre-packaged form of the server. I just run a docker run command, specify the name and the correct tag, it gets pulled from Docker Hub, and it’s up and running in about a minute with a decent network connection.
Finally, I can run a complete ClickHouse application with Docker Compose. This is actually the most interesting thing for development on a laptop, because typically applications that you’re developing have multiple services that interact together, and Docker Compose allows you to bring up multiple Docker containers which can connect to each other through a local network.
[12:00] — Docker Compose Multi-Service Development Walkthrough
Robert: Let me show you how you’d actually apply this. One of the things that’s really cool is that if you have an application, it’s very easy to containerize it. Here’s a totally simple example where I have a Dockerfile that builds a Docker image consisting of Ubuntu plus the curl program. Curl is the simplest possible program that lets you talk to ClickHouse. Three commands in my Dockerfile, and I just run docker build. That exact command will in a few minutes generate a Docker image available to run on my machine.
Next I create a Docker Compose file. I’m going to ask it to bring up a Docker container with ClickHouse in it, using the Altinity Stable Builds. These are builds that have three-year support and are also certified to be production-ready. You can also use the regular ClickHouse builds from clickhouse.com. In addition to the ClickHouse server, I’ll have the Ubuntu client, and I’ll refer to that image I just built. The final part is bringing it all up with docker compose up -d. You’ll see a bunch of messages, and if you then run docker ps to see which containers are running on your machine, you’ll see these two nice little containers. At that point you can connect to them. In this particular case I’m using docker exec, I go into the container, run a curl command, and demonstrate that I can connect to my ClickHouse server.
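The curl check at the end of that walkthrough can also be done from Python's standard library. This is a minimal sketch, assuming the ClickHouse container publishes its HTTP interface on localhost port 8123 (the default HTTP port). The base URL and the helper names are illustrative and do not come from the webinar.

```python
import urllib.parse
import urllib.request

# Assumption: the ClickHouse container publishes its HTTP port on localhost:8123.
BASE_URL = "http://localhost:8123/"


def query_url(sql: str, base_url: str = BASE_URL) -> str:
    """Build a URL that passes the SQL text in the HTTP interface's `query` parameter."""
    return base_url + "?" + urllib.parse.urlencode({"query": sql})


def run_query(sql: str, base_url: str = BASE_URL) -> str:
    """Send the query and return the response body as text (needs a running server)."""
    with urllib.request.urlopen(query_url(sql, base_url), timeout=10) as response:
        return response.read().decode("utf-8")
```

With the containers up, calling `run_query("SELECT 1")` plays the same role as the curl command: it confirms that the server accepts connections and answers a query.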
[14:37] — Savings from Open-Source Development
Robert: This is a really great development pattern. It’s using 100 percent open-source software. And speaking of open source, there are an enormous number of open-source components you can use. Most people who build things on ClickHouse and run them themselves tend to be 100 percent open source, or close to it. This illustrates the panoply of libraries, event streaming technology, ETL tools, and presentation tools available to you.
How much money can we save? If you’re developing on your laptop with 100 percent open source, chances are you’ve already got the laptop. The software is free. Your savings on licensing is 100 percent.
[15:31] — Tip 2: Tune Applications to Limit Resource Usage
Robert: Here’s another really wonderful way to develop cost-efficient applications: tune them to limit resource usage. I can’t stress how important this is. The same things that make your ClickHouse application fast are also the things that make it cheap. Because it runs fast, it needs fewer resources for a shorter period of time to get the same result. That’s equivalent to saying it costs less.
[16:10] — Schema Design: Data Types, Codecs, and ORDER BY
Robert: There are several main areas, but I’m going to touch on the key ones. The first thing you should always look at is having a good schema design. ClickHouse, unlike a database like Snowflake, does not magically create an efficient schema for you. You actually have to make choices. When you develop a ClickHouse schema, for example for sensor data, you want to optimize your data types to make them small, use codecs and compression. Codecs will transform your data and squeeze the air out before it’s handed to the compression algorithm, in this case ZSTD. You also want to consider using aliases for computed columns instead of storing them, and you want to pay a lot of attention to your ORDER BY clause, because the ordering of data has an enormous effect on both compression and your ability to find things quickly without a full table scan.
Just to give you an illustration of the difference this makes: for the same table, the unoptimized version ended up taking 4.5 bytes per row. Going all the way to ZSTD with codecs reduces it to about 1 byte per row. That’s an 80 percent reduction. And the unoptimized table already had decent out-of-the-box compression. We can make it way smaller just by having a better schema design.
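To make the schema advice concrete, here is a sketch of what an optimized sensor table could look like, held as a string so it can be pasted into any client. The table name, column names, and codec choices are assumptions for illustration, not the schema shown on the slide. The helper at the end only redoes the bytes-per-row arithmetic quoted above.

```python
# Hypothetical sensor table: names, types and codec choices are illustrative only.
OPTIMIZED_DDL = """
CREATE TABLE readings (
    sensor_id   UInt32 CODEC(DoubleDelta, ZSTD(1)),
    location    LowCardinality(String) CODEC(ZSTD(1)),
    time        DateTime CODEC(DoubleDelta, ZSTD(1)),
    date        ALIAS toDate(time),
    temperature Decimal(5, 2) CODEC(ZSTD(1))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (location, sensor_id, time)
"""


def reduction_percent(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (1 - after / before) * 100


# 4.5 bytes per row down to about 1 byte per row, the figures quoted in the talk.
print(f"{reduction_percent(4.5, 1.0):.0f}% smaller")
```

The arithmetic gives about 78 percent, which the talk rounds to 80. The sketch uses small types, a codec ahead of ZSTD, an ALIAS column instead of a stored date, and an ORDER BY that groups similar values together, which are the levers described above.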
[18:01] — ClickHouse System Tables as a Performance Tuning Tool
Robert: Another thing you want to do as you’re tuning your application is to fall in love with the ClickHouse system tables. They’re really great. They are the best system tables of any database I’ve ever used. They’re filled with practical information that enables you to see everything from how well you’re compressing your columns to how much time your queries are taking and what resources they’re using.
The query I’m looking at right here goes to system.query_log and collects information about recent queries: how many seconds they took to run, how much memory they used, how many rows they read, how many bytes they read. This basic information, once you look at it, makes it very easy to understand what’s going on in ClickHouse.
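A query along those lines might look like the sketch below. It is not the exact query from the slide. It assumes the standard `system.query_log` columns (`event_time`, `query_duration_ms`, `memory_usage`, `read_rows`, `read_bytes`, `query`), and the function name and the 80-character truncation are arbitrary choices.

```python
def recent_queries_sql(limit: int = 10) -> str:
    """Return SQL that lists the most recent finished queries and their resource use."""
    return f"""
SELECT
    event_time,
    query_duration_ms / 1000 AS seconds,
    formatReadableSize(memory_usage) AS memory,
    read_rows,
    formatReadableSize(read_bytes) AS bytes_read,
    substring(query, 1, 80) AS query_start
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT {int(limit)}
"""


print(recent_queries_sql(5))
```

Filtering on `type = 'QueryFinish'` keeps one row per completed query, so the duration and memory columns describe the whole execution.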
[19:01] — Analyzing Queries with system.query_log
Robert: In particular, looking at system.query_log will allow you to get to the third thing you need to do: analyze your queries and make them more efficient.
Here’s an example of two queries that do different things and have very different resource profiles. The first query completes in less than a second with a kilobyte of RAM. The other takes three and a half seconds but uses almost a million times as much RAM: 2.4 gigabytes. The difference is that the longer one has more complex aggregates. The uniqExact function creates a hash table as it aggregates, and grouping by more keys means more hash tables. As a result it uses vastly more memory and more compute.
By understanding this and making suitable adjustments, for example substituting uniqExact with alternative functions that use less memory and run faster, or using fewer group-by keys, we can reduce resource usage and as a result lower the cost. The savings here can be huge. A schema that is poorly designed is the first place we look when customers come to us. You can make a huge difference just by tuning the schema. Similarly, you can go from queries that take 20 seconds down to one second or less just by optimizing them, selecting less data, and writing them more effectively.
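The memory effect of exact distinct counts and extra group-by keys can be illustrated with a toy model. This is a simplification in plain Python, not how ClickHouse implements `uniqExact`: it keeps one set of distinct values per group and counts how many entries have to be held at once. The row layout and function names are made up for the example.

```python
from collections import defaultdict


def exact_distinct_state(rows, key_columns, value_column):
    """Group rows and keep one set of distinct values per group, as an exact count must."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        groups[key].add(row[value_column])
    return groups


def retained_entries(groups):
    """Total number of values held in memory across all per-group sets."""
    return sum(len(values) for values in groups.values())


# 10 days x 5 regions x 100 users = 5,000 synthetic rows.
rows = [
    {"day": d, "region": r, "user": u}
    for d in range(10)
    for r in range(5)
    for u in range(100)
]

for keys in ([], ["day"], ["day", "region"]):
    state = exact_distinct_state(rows, keys, "user")
    print(keys, len(state), "groups,", retained_entries(state), "entries held")
```

With no keys the model holds 100 entries, with one key 1,000, and with two keys 5,000, even though the input is the same. Approximate functions such as `uniq` or `uniqCombined` keep a much smaller state per group, which is why swapping them in, or dropping a group-by key, cuts memory.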
[21:15] — Q&A: Performance Trade-offs of Compression Codecs
Kiara: Robert, there’s a question about performance impacts at query time from using this kind of compression. Is it a trade-off?
Robert: Yes, clearly. One of the things to compare is LZ4 versus ZSTD. ZSTD gives better compression but is more expensive, particularly to compress in the first place. If you’re ingesting a lot of data, ZSTD will slow down your ingestion because data has to be compressed on write. You’ll also pay that price again whenever parts are merged, because they get read, reprocessed, and recompressed as a merged part.
So it’s always a question of: if you want snappy performance but don’t mind using a little more storage, go for LZ4. If you’re really looking for deeper compression, ZSTD is definitely the way to go. In particular, ZSTD is very good at compressing integers. In the example I showed, the LowCardinality string combined with ZSTD compression is actually a good combination, because ZSTD is better than LZ4 at compressing the output of LowCardinality encoding.
There’s no free lunch here, but in this particular case the compression win is definitely worth it. Thanks for that question.
[23:22] — Tip 3: Use TTLs to Limit Data Growth and Recompress Aging Data
Robert: Another great way to limit data growth in ClickHouse is to use TTLs. TTL stands for time to live.
Back when ClickHouse first added TTLs, they were a bit like MongoDB: basically row-level timeouts. You can see in the schema example highlighted in red that it says: take the time value in the row, and after 12 months, delete it. ClickHouse has a background process that fires at regular intervals, scans your table, finds rows that meet this condition, and drops them. At that point you’re no longer accumulating infinite storage as your time-series data keeps arriving.
[24:49] — TTL Recompression: Progressive Compression as Data Ages
Robert: There’s a really nice variation on this. Not only can you time out rows, you can recompress them with ZSTD after a month and then again at six months. TTLs have become a pretty rich feature. In this example, I’ve still got the TTL that deletes things at 12 months, but in fact what I’m doing is taking these rows and actually recompressing them with ZSTD level 1 after a month, and then with an even higher level of ZSTD at six months, halfway through their lifetime.
Let’s look at what the impact of that is on our data. Some test parts show exactly why you have to love the system tables. The oldest part, from August 2022, is 264,000 bytes compared to the newest data at over 600,000 bytes. We’ve compressed it by more than 50 percent. The difference between LZ4 and ZSTD is close to a 50 percent reduction.
This is a really great feature. The storage savings you get from using TTLs to limit data growth and also recompress are roughly 20 percent. The real trade-off with TTL delete is that you’re losing data, so you might not want to do this too aggressively, because you need to keep source data around. What you’d rather do is compress it heavily and then put a long TTL on it while still giving people access to it.
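A table definition with all three TTL rules might look like the sketch below. The table and column names are hypothetical, and the ZSTD level used at six months is an assumption, since the talk only says "an even higher level". The arithmetic at the end uses the two part sizes quoted above and treats 600,000 bytes as a lower bound for the newest part.

```python
# Hypothetical table; ZSTD(10) at six months is an assumed level.
TTL_DDL = """
CREATE TABLE readings (
    sensor_id   UInt32,
    time        DateTime,
    temperature Decimal(5, 2)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (sensor_id, time)
TTL time + INTERVAL 1 MONTH RECOMPRESS CODEC(ZSTD(1)),
    time + INTERVAL 6 MONTH RECOMPRESS CODEC(ZSTD(10)),
    time + INTERVAL 12 MONTH DELETE
"""

oldest_part_bytes = 264_000
newest_part_bytes = 600_000  # the talk says "over 600,000", so this is a lower bound

reduction = 1 - oldest_part_bytes / newest_part_bytes
print(f"oldest part is at least {reduction:.0%} smaller than the newest")
```

The comparison only makes sense if the parts hold similar amounts of data, which is the case in the test parts described in the talk.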
[26:53] — Tip 4: Use Tiered Storage to Reduce Storage Cost Over Time
Robert: We can improve on this, and that’s where we get to tiered storage. The recompression you can do with TTLs is a kind of tiering, but ClickHouse has a much more sophisticated way of handling tiers that will allow you to greatly decrease the cost of storage over time.
The basic idea here, which we see in many time-series use cases, is that recent data is the stuff you query the most. It’s the most valuable data. In markets, for example, 95 percent of queries might be on data from the last day, 4 percent on last month’s data, and 1 percent on last year’s data. This suggests that for that hot data, you want to put it on NVMe SSD with very high IOPS. As it ages, you’d like to put it on much cheaper storage, such as standard hard disk or conventional SSD, which has higher density but is of course much slower.
[28:33] — Storage Configurations: Disks, Volumes, and Policies
Robert: ClickHouse makes this pretty easy to set up. There are what are called storage configurations that organize devices into volumes and then into policies. This feature was added about three years ago and has become quite stable.
At the bottom level you define disks. These are mount points on your operating system, so you can have different types of storage mounted into the host. You can then organize them into volumes, which are collections of disks. And then you have a policy that ties the volumes together. For example, a policy called tier_ttl might have a fast volume and a slow volume.
Here’s how the configuration looks. You give each disk a name. The default disk is always present and comes from config.xml. Additional disks like data_2 and data_3 need explicit mount points. Once you have the disks defined, you build them into a policy. The policy lists the fast volume and the slow volume in order.
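As a sketch of that layout, the snippet below holds a storage configuration as a string and reads back the volume order of the policy. The mount paths and the assignment of `data_2` to the fast volume and `data_3` to the slow volume are assumptions. Only the disk names and the `tier_ttl` policy name come from the talk.

```python
import xml.etree.ElementTree as ET

# Assumed mount paths and disk-to-volume mapping; adjust to the actual host layout.
STORAGE_CONFIG = """
<clickhouse>
  <storage_configuration>
    <disks>
      <data_2><path>/data_2/clickhouse/</path></data_2>
      <data_3><path>/data_3/clickhouse/</path></data_3>
    </disks>
    <policies>
      <tier_ttl>
        <volumes>
          <fast><disk>data_2</disk></fast>
          <slow><disk>data_3</disk></slow>
        </volumes>
      </tier_ttl>
    </policies>
  </storage_configuration>
</clickhouse>
"""


def volumes_in_order(config_xml: str, policy: str):
    """Return (volume name, [disk names]) pairs in the order the policy lists them."""
    root = ET.fromstring(config_xml.strip())
    volumes = root.find(f"./storage_configuration/policies/{policy}/volumes")
    return [(v.tag, [d.text for d in v.findall("disk")]) for v in volumes]


print(volumes_in_order(STORAGE_CONFIG, "tier_ttl"))
```

A quick parse like this is a cheap way to confirm that the file is well-formed and that the volumes appear in the intended order before the server loads it.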
[30:40] — Table Definition with TTL MOVE Between Tiers
Robert: Here’s how you actually use this in a table definition. You’ve got a normal CREATE TABLE statement, and then you add a TTL clause. After one day, ClickHouse will move data to the slow volume. In the background it will find data that needs to move and then, after a year, delete it. This gives you a really nice division between hot and cold data, all automatic.
I’m showing an example here which is just using block storage for both tiers, but we can improve that further. Using object storage in your cold tier will get you way higher savings.
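The table definition described here can be sketched as follows. The table and column names are hypothetical, and the daily partitioning is an assumption chosen so that whole parts age past the one-day boundary together. The `slow` volume and `tier_ttl` policy names follow the storage configuration discussed earlier.

```python
def tiered_ttl_clause(column="time", move_after="1 DAY", volume="slow", delete_after="1 YEAR"):
    """Build a TTL clause that first moves data to a volume and later deletes it."""
    return (
        f"TTL {column} + INTERVAL {move_after} TO VOLUME '{volume}',\n"
        f"    {column} + INTERVAL {delete_after} DELETE"
    )


# Hypothetical table that uses the clause together with the tier_ttl storage policy.
TIERED_DDL = f"""
CREATE TABLE readings_tiered (
    sensor_id   UInt32,
    time        DateTime,
    temperature Decimal(5, 2)
)
ENGINE = MergeTree
PARTITION BY toDate(time)
ORDER BY (sensor_id, time)
{tiered_ttl_clause()}
SETTINGS storage_policy = 'tier_ttl'
"""

print(TIERED_DDL)
```

Swapping the cold volume from block storage to an S3-backed disk would change only the storage configuration, not this table definition.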
[31:27] — Adding S3 as the Cold Tier
Robert: There are some trade-offs with object storage support that you should be aware of. The support is evolving fairly quickly, and there are cases where you can run into issues. For example, there are known problems around zero-copy replication in S3 storage. These things are being stamped out very quickly, and long term this is a very viable way of organizing your data, with hot data on very fast local disk and S3 holding your long tail. If you’re not going too deeply into S3 support yet, you can still get roughly 30 percent storage savings from tiered block storage alone, and that 30 percent could be 30 percent of a very big number.
[33:14] — Q&A: Using system.query_log to Determine Optimal TTL Intervals
Robert: There’s a question coming up: is there a nice way of analyzing access patterns of data to determine optimal TTLs for moving between storage tiers?
The query log might be an option, and yes you can use it. There’s also the part_log, but that really tracks merges and activity related to parts. For access patterns you’re going to have to look in the query log and examine how much data you’re reading. That’s actually a great question. If you want to pose it on any of the Slack channels I’ll look into that and give you a full answer.
Here’s the key thing: like all things in ClickHouse, there’s nothing automatic about this. It’s really up to you to make the choices. What often determines TTL intervals is not so much query patterns but how much storage you have. If you have enough NVMe SSD to comfortably hold a day’s worth of data plus extra space for merges, you hold the day. It’s really driven by your available storage capacity.
[34:22] — Tip 5: Scale Compute Down When Not Needed
Robert: Here’s another great way to save on ClickHouse compute: scale your compute capacity down when it’s not needed. This is one of my favorite ways of saving money on ClickHouse. I use it constantly, literally every day.
[34:41] — EBS Block Storage and VM Resizing in the Cloud
Robert: One of the things that’s really cool about running in the cloud is EBS, Elastic Block Store. It’s not very romantic, but it’s storage that is attached to your VM. When your VM comes up, the cloud provider automatically hooks it up and gets it mounted. One of the cool features of the cloud is that you can resize your VMs. For example, if we’re using an m5.4xlarge, we can kill that VM, create a smaller one, in this case an m5.xlarge, and point it back to the same block storage. We’ll effectively reduce our compute cost by 75 percent. We’re still paying for the storage itself, of course, but the point is you can dramatically reduce the compute.
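The 75 percent figure follows from the instance sizes. The sketch below assumes that on-demand price within the m5 family scales in proportion to vCPU count, and it borrows the $548 per month on-demand figure for an m5.4xlarge that is quoted later in the talk. The helper name is illustrative.

```python
# vCPU counts for the two instance sizes mentioned in the talk.
VCPUS = {"m5.xlarge": 4, "m5.4xlarge": 16}

# On-demand monthly price for m5.4xlarge as quoted later in the talk.
M5_4XLARGE_MONTHLY_USD = 548.0


def resize_saving(old: str, new: str) -> float:
    """Fraction of compute cost saved, assuming price is proportional to vCPUs."""
    return 1 - VCPUS[new] / VCPUS[old]


saving = resize_saving("m5.4xlarge", "m5.xlarge")
smaller_monthly = M5_4XLARGE_MONTHLY_USD * (1 - saving)
print(f"saving {saving:.0%}, about ${smaller_monthly:.0f} per month for compute")
```

Storage cost is left out on purpose: the EBS volume is the same before and after the resize, so only the compute line changes.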
[36:18] — Kubernetes and Karpenter: Automatic VM Provisioning
Robert: As a practical matter, how do you do this? Well, one answer is Kubernetes. Kubernetes makes this optimization super easy.
Here’s a picture of what happens inside a Kubernetes cluster that also includes the Altinity Kubernetes Operator for ClickHouse. When we bring up a ClickHouse cluster, for each server we allocate a pod, which is a container running a process. The pod has one or more persistent volume claims, which are requests for storage, and those attach to persistent volumes. What we can do is make a very simple configuration change in Kubernetes that says: the VM I’m currently running on is not suitable anymore, please reschedule me on a different VM. The magic is that you have a provisioner, in this case Karpenter, that watches over Kubernetes. It watches pod changes and automatically brings up new VMs as needed to match pod requirements. As pods go away, it deallocates those VMs. It all happens in the background.
[37:41] — ClickHouse Operator YAML: Changing Instance Type with One Field
Robert: How do we actually make that change? Here’s an example of a ClickHouseInstallation resource using the Altinity Kubernetes Operator for ClickHouse. It’s pretty simple. We’re defining a one-node cluster and we have what’s called a pod template. All we have to do is change the name of the instance type. That tells the provisioner: I need to run on this VM type. If I was something else previously, I don’t need that anymore. Please make me a new m5.xlarge and let me run there. There’s also an availability zone setting, which is useful for spreading things nicely across availability zones.
To apply this, you use kubectl, the standard Kubernetes tool. Within a few minutes your pod will be running on the new VM.
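The one-field change can be sketched as a small manifest edit. This is an abbreviated, hypothetical ClickHouseInstallation built as a Python dictionary and serialized to JSON, which `kubectl apply -f` also accepts. The resource name, cluster name, and template name are made up. Selecting the VM type through a `nodeSelector` on the well-known `node.kubernetes.io/instance-type` label is an assumption about how the slide's pod template is written, and the container image, storage templates, and availability zone settings are omitted.

```python
import copy
import json

INSTANCE_TYPE_LABEL = "node.kubernetes.io/instance-type"

# Abbreviated, hypothetical manifest: image, storage and zone settings are omitted.
chi = {
    "apiVersion": "clickhouse.altinity.com/v1",
    "kind": "ClickHouseInstallation",
    "metadata": {"name": "demo"},
    "spec": {
        "defaults": {"templates": {"podTemplate": "clickhouse-pod"}},
        "configuration": {
            "clusters": [
                {"name": "ch", "layout": {"shardsCount": 1, "replicasCount": 1}}
            ]
        },
        "templates": {
            "podTemplates": [
                {
                    "name": "clickhouse-pod",
                    "spec": {"nodeSelector": {INSTANCE_TYPE_LABEL: "m5.4xlarge"}},
                }
            ]
        },
    },
}


def with_instance_type(manifest: dict, instance_type: str) -> dict:
    """Return a copy of the manifest whose pod templates request another VM type."""
    updated = copy.deepcopy(manifest)
    for template in updated["spec"]["templates"]["podTemplates"]:
        template["spec"]["nodeSelector"][INSTANCE_TYPE_LABEL] = instance_type
    return updated


resized = with_instance_type(chi, "m5.xlarge")
manifest_json = json.dumps(resized, indent=2)
print(manifest_json)
```

Writing `manifest_json` to a file and applying it with kubectl is the step described above. The provisioner then sees a pod that no longer fits its current node and brings up a VM of the requested type.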
[39:01] — Patterns for Using Compute Rescaling
Robert: How would you use this? There are a variety of patterns.
One thing you can do is reduce capacity whenever demand is lower. If you have cyclical usage patterns, you can scale the cluster up for big events that come up from time to time and scale it back down in between. Another really interesting pattern that we’ve looked at in large systems is tiered compute. You can actually have a cluster where not only the storage is tiered but the compute is tiered too. A large customer we worked with built a cluster where the hot data not only needed fast storage but also needed a lot of compute, because that’s where most queries go. Every month they spawn a new set of nodes with a lot of compute. As the data ages, they scale the compute down on those nodes, reducing the amount of compute required on older data.
Another way to use this is simply to turn off the compute completely. This is something we do automatically in our cloud. You can set what’s called an uptime schedule. After a period of non-use, just turn off the VM. The block storage persists; it doesn’t go away. There’s a Kubernetes setting where you set the number of replicas to zero, Kubernetes just makes the pods go away, and in turn your VMs are deallocated. Your storage remains the same, but at that point you’re paying zero for compute. This is particularly useful for development where you may use things for a couple of hours and then go work on something else for a couple of days. In the meantime, you’re not being charged for compute.
[41:18] — Savings from Compute Rescaling
Robert: How much can you save on this? Quite a lot. I showed the example of 75 percent, but in real production systems the savings on compute are probably around 50 percent, because you still need to keep resources available. But it’s pretty good, and in a large system it can have a substantial impact.
[41:47] — Tip 6: Use Cheaper Infrastructure
Robert: Let’s look at another area where we can make a big difference: just use cheaper resources. We’ve talked about optimizations around making the application more efficient. But in the end, you can also just make your infrastructure cheaper.
[42:27] — EBS gp3 Storage: 20% Cheaper Than gp2
Robert: Here’s a very simple one. Use cheaper storage types. In Amazon, if you use gp3 storage you get the same performance, or actually better in some cases, but it’s about 20 percent cheaper than gp2 block storage. That’s a quick win. When you’re using Kubernetes, selecting storage is very simple: you just give the name, in this case gp3, in your storage class configuration. This is something you should constantly be looking at, because this is one of the easiest ways to optimize storage cost.
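For readers who have not set this up before, a storage class for gp3 might look roughly like the following. This Python sketch assumes the AWS EBS CSI driver is installed in the cluster and prints the manifest as JSON for kubectl. The binding mode and expansion settings are illustrative choices, not settings taken from the webinar.

```python
import json

# A Kubernetes StorageClass that provisions gp3 volumes.
# Assumption: the cluster runs the AWS EBS CSI driver (ebs.csi.aws.com).
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "gp3"},
    "provisioner": "ebs.csi.aws.com",
    "parameters": {"type": "gp3"},
    # Delay volume creation until a pod is scheduled, so the volume is
    # created in the same availability zone as the pod.
    "volumeBindingMode": "WaitForFirstConsumer",
    "allowVolumeExpansion": True,
}

print(json.dumps(storage_class, indent=2))
```

A volume claim template would then refer to this class by name, which is the "just give the name" step described above.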
[43:08] — AWS Savings Plans: Up to 54% Off Compute
Robert: Another thing you can do in cloud systems is use AWS Savings Plans. This is not something you can necessarily apply as a developer since it’s a financial operation, but it’s so important to controlling costs that you should always look at it. Amazon has Savings Plans for compute. Here’s an example: an m5.4xlarge running in Oregon is $548 a month if you do nothing. If you set up a one-year Savings Plan, you’re committing to pay for this instance for the whole year. You don’t have to prepay anything, but you can now get it 27 percent cheaper. If you’re willing to prepay, you can roughly double that savings. If you pay upfront for a three-year term, Amazon will give you that VM at roughly $258 a month. These are huge savings. One thing to point out is that they only really apply to compute. Network costs and storage don’t have equivalent Savings Plans.
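The arithmetic behind those figures can be checked with a few lines of Python. The three input numbers come from the example above; the variable names and the 36-month total are additions for illustration.

```python
ON_DEMAND_MONTHLY = 548.0    # m5.4xlarge on demand, dollars per month
ONE_YEAR_DISCOUNT = 0.27     # one-year Savings Plan, no prepayment
THREE_YEAR_MONTHLY = 258.0   # three-year plan paid upfront, dollars per month

# Effective monthly price under the one-year plan.
one_year_monthly = ON_DEMAND_MONTHLY * (1 - ONE_YEAR_DISCOUNT)

# Discount implied by the three-year prepaid price.
three_year_discount = 1 - THREE_YEAR_MONTHLY / ON_DEMAND_MONTHLY

# Total saved over the 36-month term compared with on-demand pricing.
three_year_total_saved = (ON_DEMAND_MONTHLY - THREE_YEAR_MONTHLY) * 36

print(f"one-year plan: ${one_year_monthly:.2f} per month")
print(f"three-year discount: {three_year_discount:.1%}")
print(f"saved over three years: ${three_year_total_saved:,.0f}")
```

The implied three-year discount works out to a little under 53 percent, which is about double the 27 percent one-year figure.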
[44:59] — Graviton ARM Instances: Cheaper and Often Faster
Robert: Yet another way to save is to use more efficient instances. Graviton, which is ARM-based, is an Amazon VM type, and it’s really great. A comparison table based on our benchmarks shows an m6i (Intel-based) against the same general size running on Graviton and on a newer generation of Graviton.
For example, going from an m6i.large to an m6g.xlarge, normalized for the speed difference, comes out 20 percent less expensive. Now, the m6g is a bit slower than the m6i, but it costs a lot less, so on balance you get 20 percent savings. If you want cheaper and faster, you can go to an m7g. These are 15 percent cheaper and they’re faster. Graviton is definitely something you should look at. ARM in general is something anyone using a lot of compute should be considering.
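"Normalized for the speed difference" can be read as comparing cost per unit of work rather than cost per hour. The sketch below shows one way to do that in Python. The prices and relative speeds are hypothetical placeholders chosen to produce a 20 percent result; they are not the benchmark numbers from the comparison table.

```python
def cost_per_unit_of_work(hourly_price: float, relative_speed: float) -> float:
    """Hourly price divided by throughput relative to a baseline instance."""
    if relative_speed <= 0:
        raise ValueError("relative_speed must be positive")
    return hourly_price / relative_speed


# Hypothetical inputs: baseline instance at price 1.00 and speed 1.00,
# candidate instance that is 28% cheaper per hour but 10% slower.
baseline = cost_per_unit_of_work(hourly_price=1.00, relative_speed=1.00)
candidate = cost_per_unit_of_work(hourly_price=0.72, relative_speed=0.90)

normalized_savings = 1 - candidate / baseline
print(f"normalized savings: {normalized_savings:.0%}")
```

An instance that is both cheaper and faster wins on both terms of the ratio, which is why the newer generation is the more attractive option when it is available.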
[46:30] — Managed Hosting: Hetzner at Up to 90% Less Than Public Cloud
Robert: And then the final thing, if you’re looking for the cheapest place to run. We’re all focused on the major public clouds, but remember there are managed hosting providers who are way cheaper. We use Hetzner, as do many others. It’s kind of like hosting at the dollar store. The machines are great, but you don’t have the full panoply of services you get on Amazon, things like RDS or Elastic Block Store and all the other services. But they can be up to 90 percent cheaper than the on-demand prices you pay in the public cloud. This is a really powerful option if you have a stable workload and just need a lot of compute with attached NVMe SSD storage. Going to a place like Hetzner is a way to take a huge cut off the bill, and moreover that discount applies not just to compute but to any attached storage as well. We’ve been using them for years. It’s a really good service.
[48:26] — Summary: Six Tips and Their Long-Term Impact
Robert: With that, we’re at the end of the main discussion. Let’s have a kind of wallet-sized summary of different ways you can be a ClickHouse cheapskate.
One thing that’s really useful is to think about not just the short-term savings but also the long-term impact. For example, developing on a laptop cuts costs to zero for development. When you’re starting up, costs are nothing because your laptop is already there. But the actual long-term impact on your total cloud costs is likely pretty small. For example, our own development costs are about 10 percent of our overall cloud costs for large production systems.
On the other hand, tuning apps to limit resource usage can have a huge impact, as can using cheaper resources and tiered storage. Tiered storage, especially if you can get S3 to work for you, is where you can get huge cost savings over time. We’ll see more and more people using that as object storage becomes a better and more stable part of ClickHouse storage. Other things like using TTLs and scaling down compute capacity have more limited per-unit impact, but in the case of scaling down compute, if the system you’re looking at is running a million dollars a month, a 50 percent reduction takes you from a million to half a million. That’s a huge savings.
[50:43] — Q&A: Managed ClickHouse Service vs. Self-Managed
Robert: There’s a question: as a business, how should we decide whether a managed ClickHouse offering like Altinity.Cloud or DoubleCloud or ClickHouse Cloud is more or less suitable than rolling our own?
That’s a really great question, and there are a number of factors. If you want to get to market very quickly, and particularly if you’re integrating a lot of components, or you’re not sure whether ClickHouse is going to work for your use case, going to a cloud service is definitely the way to go.
But over time there are other things to take into consideration. One way to think about it is that cloud businesses are always going to have to make money. What they’re doing is charging you enough to cover not just infrastructure costs but also their own gross margins. Cloud services are looking for gross margins anywhere from 50 to 80 percent. What that means is that if the underlying infrastructure costs 20 dollars, in the extreme case you’re going to pay a hundred dollars for it. So ask yourself: is that cost difference a big deal in your overall cost of ownership, or is it something where the labor of doing the automation is more than you want to deal with and you want to get to market quickly? Then you can balance these two things.
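The relationship between gross margin and price in that example can be written down directly. Gross margin is the share of the price that is not infrastructure cost, so price equals cost divided by one minus the margin. The function name below is made up for illustration.

```python
def price_for_margin(infrastructure_cost: float, gross_margin: float) -> float:
    """Price a vendor must charge to earn the given gross margin on a cost."""
    if not 0 <= gross_margin < 1:
        raise ValueError("gross_margin must be in the range [0, 1)")
    return infrastructure_cost / (1 - gross_margin)


cost = 20.0  # underlying infrastructure cost in dollars, from the example above

for margin in (0.5, 0.8):
    price = price_for_margin(cost, margin)
    print(f"gross margin {margin:.0%}: price ${price:.2f}")
```

At an 80 percent margin the 20 dollars of infrastructure becomes a 100 dollar bill, matching the extreme case above, while at 50 percent it becomes 40 dollars.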
[53:50] — Q&A: Open-Source Compatibility in Vendor Selection
Robert: One really important point that makes a big difference for many of our customers is that the stuff we run is 100 percent open source. This is something you want to look at really carefully with cloud vendors. If you’re on a cloud vendor that is using a different version of ClickHouse that has diverged significantly, your ability to move and your flexibility over the long term become much lower. In effect, there will be ClickHouse there, but as ClickHouse offerings evolve, many of them are going to look like Snowflake: once you’re on them, it’s effectively proprietary software and you can’t move off. So you want to balance that. Are you working with a vendor that’s 100 percent open source versus somebody that has a highly customized cloud? It may have great performance, but at the same time you’re trading that convenience for lock-in, which will give you less control over costs in the future.
That’s something everybody has to answer for themselves. One more thing I’ll say about Altinity is we actually don’t have a strong opinion on where you run. We allow people to run in our own cloud, to have things managed by our cloud but running in their own environments, or to run it themselves. For us it’s just: run where you want, we’ll support you no matter where you are. We’re kind of at the extreme end of that formula.
Okay. There’s more information here on the Altinity YouTube channel and blog. Of course there’s also the ClickHouse documentation, which is wonderful. The more you learn about ClickHouse, the more ways you can save money by making it more efficient.
With that, you are all now officially certified ClickHouse Cheapskates. You’ve seen a number of things we’ve been working on for years and have helped customers do. If you’d like to hear more, please contact us. We run Altinity.Cloud, we do enterprise support for self-managed deployments, we have Altinity Stable® Builds for ClickHouse®, and we’d be delighted to talk with you. Thank you very much, have a wonderful day.
FAQ
What is the single most impactful way to reduce ClickHouse storage costs?
Schema design tuning has the highest potential return. By choosing appropriate small data types, applying column codecs such as Delta or T64 combined with ZSTD compression, and ordering data correctly in the ORDER BY clause, it is possible to reduce storage by 80 percent or more compared to an unoptimized table that already has default compression. This directly reduces cloud storage bills and also speeds up queries by reducing I/O.
How do ClickHouse TTL recompression rules work and what do they save?
TTL recompression rules progressively apply higher levels of ZSTD compression to aging data as part of ClickHouse’s background merge process. For example, a table can be defined to recompress data with ZSTD level 1 at one month and ZSTD level 6 at six months. This can reduce storage by roughly 50 percent compared to freshly inserted data, while keeping the data fully queryable, since applications do not need to change how they access it.
What is ClickHouse tiered storage and when should I use it?
Tiered storage uses ClickHouse storage configurations to define disks, volumes, and storage policies that automatically move data between fast local storage and slower, cheaper storage such as block storage HDDs or S3 object storage, based on TTL move rules. It is most valuable for time-series applications where most queries target recent data, allowing hot data to stay on fast NVMe SSD while large volumes of cold historical data live on cheaper storage at four to five times lower cost per gigabyte.
How can Kubernetes help reduce ClickHouse compute costs?
Kubernetes, combined with a VM autoscaler like Karpenter, makes it easy to change the VM type running ClickHouse by simply updating the instance type in the ClickHouseInstallation manifest and applying it with kubectl. Since ClickHouse data lives on attached block storage rather than the VM itself, the pod can be rescheduled on a smaller and cheaper VM with no data loss. In the most aggressive case, replicas can be set to zero to stop all compute entirely, with storage remaining intact, which is particularly valuable for development clusters and non-production environments.
Is Graviton worth considering for ClickHouse workloads?
Yes. Altinity benchmarks show that AWS Graviton3 ARM instances such as the m7g family are faster than Intel m6i equivalents and about 15 percent cheaper, making them the better choice for many ClickHouse workloads on AWS. Graviton2 instances such as the m6g are somewhat slower than the m6i, but after normalizing for the speed difference they still come out about 20 percent less expensive. ARM instances are worth evaluating for any compute-intensive ClickHouse deployment.
What are the key questions to ask when choosing between a managed ClickHouse service and self-managed?
The primary cost question is whether the managed service’s price, which has to cover a gross margin of 50 to 80 percent on top of raw infrastructure cost, is justified by the labor and time saved. The primary flexibility question is whether the vendor runs 100 percent open-source ClickHouse or a proprietary fork, since a diverged fork creates long-term lock-in that limits your ability to change vendors or optimize costs. For teams that need speed to market or lack ClickHouse expertise, a managed service is often the right starting point.
© 2023 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.