Snowflake, BigQuery, or ClickHouse? Pro Tricks to Build Cost-Efficient Analytics for Any Business

Recorded: September 12 @ 07:00 am PDT
Presenter: Robert Hodges, CEO @Altinity

Do you ever look at your bill for Snowflake or BigQuery and just sigh? This talk is for you. We’ll explain how pricing works for popular analytic databases and how to get the best deal. Then we’ll look at how to build an alternative using open-source ClickHouse data warehouses. As the pros say, open source may be free but it ain’t cheap! We’ll teach you the tricks to build your own ClickHouse analytic stack that’s less expensive and faster than Snowflake. Join us to become a wizard of cloud cost management.

Webinar transcript

Introduction to the Snowflake, BigQuery or ClickHouse? Webinar

Short introduction and housekeeping

0:00 Welcome to our webinar on Snowflake, BigQuery or ClickHouse. We’ll be talking about pro tricks to build cost-efficient analytics for any business. My name is Robert Hodges and I’m CEO of Altinity…

Let me tell you a couple of things about this webinar that will help you enjoy it more. First of all, it’s being recorded. We will send out the link to the recording as well as these slides within 24 hours of the end of the webinar, so you don’t have to take furious notes. They’re going to be available to you in your email and on the Altinity website. Second thing, you can go ahead and post questions in the Q&A box. You can also put them in the webinar chat.

1:19 Let’s dive in and do a little bit more on the introduction. My name is Robert Hodges and my day job is running Altinity. I’ve been working on databases for 40 years… I’ve worked on about 20 different database types. I’ve also worked extensively in virtualization. I was at VMware for a while and that’s in fact where I began to learn how to use Kubernetes, which is something we’re going to talk about later in this presentation. I’m backed up by an amazing engineering team. About two-thirds of our company of 45 people have an engineering background, and they are database geeks just like me, except that many of them have even deeper experience in databases and applications.

Altinity: Managed service for ClickHouse and support for ClickHouse

2:10 As a company, we are enterprise providers for ClickHouse, an outstanding real-time analytic database. It’s open source. Among our offerings, we have Altinity.Cloud, which is a cloud version of ClickHouse. We also do Altinity Stable builds, which are builds of ClickHouse designed for enterprise users. They have three years of support, they’re certified for production use, and so on.

Another thing that we do, and some of you on this call may already be aware of this, is that we’re the authors of the Altinity Kubernetes operator for ClickHouse, known as the clickhouse-operator for short. We’ve actually been working on getting ClickHouse to work cloud native for almost five years at this point, so that’s something where we have deep experience, and we’ll be sharing some of the things that we’ve learned in that process today.

How Analytic Databases in the Cloud Got Started

What Amazon Redshift 

3:11 For the first part, I’d like to jump in and talk a little bit about analytic databases and explain how the cost models work, particularly when you’re running them in the cloud.

So analytic databases in the cloud got started in 2013 with Amazon Redshift, and it was groundbreaking. Basically what Amazon did was they took a data warehouse, which was an existing on-prem data warehouse, put it in the Amazon cloud, and basically allowed you, with a credit card, to start this data warehouse in 20 minutes or less. This was an amazing advance. It’s sort of like the way that we don’t fully appreciate how great COBOL was, because what came before it was assembler. It’s easy to forget just how great Redshift was, because prior to Redshift it could take up to six months to install a data warehouse: you had to make a deal with a vendor, you had to go get the hardware, you had to get the hardware stuck in a rack, you had to get everything configured. And by the time all of that stuff worked through dealing with procurement and fighting with your managers, it was six months. So this was a huge advance.

Within a few years, there were several excellent data warehouses available in the cloud including BigQuery, Snowflake, Redshift and more recently data cloud services that operate Snowflake.

Analytic Databases and Cost Efficiency 

4:45 To get to the cloud efficiency question, it’s useful to talk about how most people, developers in particular, learn about cost efficiency. If you’re running BigQuery, you can run a query that costs you hundreds or even thousands of dollars. This is an example that appeared on Twitter earlier this year. It’s not unique by any means.

5:11 So this brings up a big question: what is going on down there? That’s what we’re going to cover in this section of the talk. We’re going to talk about how these data warehouses work at a high level and how their pricing works.

Snowflake Cost

How Cloud Businesses Work – Snowflake Pricing model

5:35 Let’s start with Snowflake. We’re not going to show you architecture yet. We’re actually just going to talk about the business. This turns out to be a surprisingly useful way to understand what it really costs to operate these database systems in the cloud and, from that, what you’re being charged.

Snowflake cost – based on their financials

5:47 For Snowflake, it’s easy [to figure out the markup]. It’s a public company and we can just go look at their financial results. For example, their results reported as of the 31st of January this year (2023) showed a total revenue of 2 billion dollars, of which the cost of revenue, sometimes called the cost of goods sold, was about 717 million. That’s an interesting number because that’s how much it costs Snowflake to deliver their service. What that means is the cost for running their cloud is hidden in there. If you just do the math, that’s a little over a third of revenue. What that means is that if all of that were cloud cost and nothing more, then when Snowflake charges you and collects revenue, they’re roughly marking things up by about three times.
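The arithmetic behind that estimate can be sketched in a few lines of Python. The figures are the rounded ones quoted above, and the variable names are only for illustration.

```python
# Rounded figures quoted in the talk (results as of 31 January 2023).
revenue = 2_000_000_000          # total revenue, USD
cost_of_revenue = 717_000_000    # cost of revenue (cost of goods sold), USD

# Fraction of every revenue dollar that goes into delivering the service.
cost_share = cost_of_revenue / revenue

# Upper-bound view: if all delivery cost were cloud cost, this is the markup.
implied_markup = revenue / cost_of_revenue

print(f"cost share of revenue: {cost_share:.1%}")
print(f"implied markup: {implied_markup:.2f}x")
```

The result lands a little under 3x, which is where the "about three times" figure comes from.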

Snowflake’s more probable markup

7:00 Now in fact the actual delivery costs for a service like Snowflake are much wider and don’t just include running VMs and allocating storage in places like Amazon. They include things like paying SREs and hiring other services, all kinds of things that are necessary to make the whole service work completely.

So in fact the markup on their actual cloud costs is probably about 5x … that’s a useful number to keep in mind because you can then use it as kind of a benchmark for whether something is really expensive. If it’s greater than 5x, it’s bigger than the average Snowflake markup, roughly. If it’s less than 5x, then it’s a cheaper offering.

Let’s now look at the Snowflake costs and figure out what they really look like in terms of the deal you’re getting.

Snowflake’s Virtual Data Warehouse Model and cost

Snowflake credits

8:11 Snowflake uses something called the virtual data warehouse. It’s a very innovative architecture that stores the data in S3 object storage. They were, I believe, the first ones really to operate this model at scale. Then what happens is when you want to work on your data and run SQL queries, you create what are called virtual data warehouses, and you can have a couple of them running together. You pay for this using credits, and you can think of a credit as mapping to a host somewhere that has pulled in data from object storage and is processing SQL queries on it. We don’t necessarily know what these VMs are, but if you go look at the pricing list, their on-demand pricing ranges from about two dollars to four dollars an hour for one of these VMs.

Estimating Snowflake’s compute

9:16 Now this would be the end of the cost analysis, except that a little while ago there was a bug in Snowflake: if you issued a query that failed in the right way, it would print out the VM that they were actually running. It turns out that at least in some cases these things, which I’ve marked as credits and which represent VMs, actually map to c5d.2xlarge instances (eight vCPUs), which currently cost about 38 cents an hour to run. That’s an interesting number.

Estimating Snowflake’s storage

9:55 We know what the object storage is because Snowflake is storing it, and we can see that Snowflake charges between 23 and 40 dollars per terabyte per month, which interestingly is not that different from the Amazon on-demand S3 costs. So they don’t really mark up the object storage very much at all.

What they do mark up is the virtual data warehouse. We can see from this that if you’re going to be offering cheap storage, you’ve got to make it up somewhere else, and that’s what Snowflake does. It makes it up in compute, which can be anywhere from 5 to 10x more expensive than the corresponding hosts. That basically gives you an idea of how the Snowflake virtual data warehouse model works.
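As a rough check on that 5 to 10x range, here is a minimal sketch that divides the quoted credit prices by the quoted instance price. It assumes one credit corresponds to one hour of one such instance, which is the mapping suggested above and not something Snowflake documents.

```python
# On-demand price per credit, from the range quoted in the talk (USD per hour).
credit_price_low, credit_price_high = 2.00, 4.00

# Approximate on-demand price of the c5d.2xlarge mentioned above (USD per hour).
instance_price = 0.38

# Assumption: one credit buys one hour of one such instance.
markup_low = credit_price_low / instance_price
markup_high = credit_price_high / instance_price

print(f"compute markup: {markup_low:.1f}x to {markup_high:.1f}x")
```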

Let’s have a look at another one which is quite different and that’s BigQuery.

BigQuery’s Serverless Query Model and Cost

BigQuery introduction

10:57 BigQuery is the other really large cloud database analytics service. BigQuery has a number of pricing models, but I’m just going to focus on one, what’s called the ‘serverless’ or ‘on-demand’ query model.

The way that this works is that when you’re using this model, you store the data on distributed storage. It’s actually not object storage (it’s a different proprietary system, at least up until recently) and it’s charged at about object storage rates, so you can see these prices per gigabyte. If you go compare them to the prices that you get for allocated object storage, it is pretty comparable. One important thing to notice is that these prices are not for compressed storage. That’s a big thing. We’ll talk about that in a minute.

BigQuery and how it charges you

11:58 The basic idea is you’ve got your data stored out here and when you run a query, BigQuery will allocate some number of compute nodes, whatever it thinks is the right number. You don’t know how many, because what it’s going to do, and what’s going to appear on your bill, is charge you $6.25 for every terabyte of data that it scans in answering your query.

It’s pretty easy to see from this how somebody could run up a 300 dollar bill. All you would have to do is read enough terabytes of data out of this distributed storage and you get a 300 dollar bill. At $6.25 per terabyte that’s about 50 terabytes, if you have a large enough data set. This is not hard to do, and you can also see how this might actually expand up into thousands of dollars if your data set were large enough.
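To make the scan-based charge concrete, here is a small sketch using only the $6.25 per terabyte rate quoted above. It ignores any free tier, minimum charge, or rounding that the real service may apply, and the function names are made up for illustration.

```python
PRICE_PER_TB_SCANNED = 6.25  # USD, the on-demand rate quoted in the talk

def query_cost(tb_scanned: float) -> float:
    """Cost in USD of one on-demand query that scans `tb_scanned` terabytes."""
    return tb_scanned * PRICE_PER_TB_SCANNED

def tb_for_bill(bill_usd: float) -> float:
    """Terabytes that must be scanned to produce a bill of `bill_usd`."""
    return bill_usd / PRICE_PER_TB_SCANNED

print(query_cost(1))      # a query that scans one terabyte
print(tb_for_bill(300))   # terabytes scanned behind a 300 dollar bill
print(query_cost(500))    # a scan over a large data set
```

At this rate a 300 dollar bill corresponds to 48 terabytes scanned, and a 500 terabyte scan already costs more than three thousand dollars.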

So how does this compare with actual cloud resources? 

13:11 Let’s just pick a standard VM: this is an n2d-standard-32, with 32 vCPUs. We can look at the cost of object storage, which as I mentioned is kind of comparable to what BigQuery is charging. You can also use cloud block storage, which is pretty expensive on Google: it’s about 17 cents per gigabyte per month. But the problem is these numbers don’t really tell you anything. Depending on how your data is arranged, BigQuery could be anywhere from 10 times cheaper to 10 times more expensive.

If you only run the query once with this n2d-standard-32, that thing is sitting there until you turn it off (i.e., you’re getting charged for it), and if you’re keeping most of the storage on block storage, that is more expensive and therefore you’re paying more than you would on BigQuery.

On the other hand, if you run constant queries (24×7, which is the case for many analytic applications), BigQuery could be way more expensive, so you could end up spending vastly more on it.

It’s really hard to judge, and you actually have to look at your workload and decide how you’re using it.
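One way to look at your own workload is to compare the two monthly bills directly. The sketch below uses the scan rate and block storage rate quoted above. The VM price, the data size and the terabytes scanned per query are hypothetical placeholders, so substitute your own numbers.

```python
PRICE_PER_TB_SCANNED = 6.25    # USD, on-demand rate quoted in the talk
BLOCK_STORAGE_GB_MONTH = 0.17  # USD, block storage rate quoted in the talk
HOURS_PER_MONTH = 730          # average number of hours in a month

def on_demand_monthly(queries_per_month: float, tb_per_query: float) -> float:
    """Monthly bill when every query is charged by terabytes scanned."""
    return queries_per_month * tb_per_query * PRICE_PER_TB_SCANNED

def always_on_monthly(vm_usd_per_hour: float, storage_gb: float) -> float:
    """Monthly bill for a VM that runs all month plus its block storage."""
    return vm_usd_per_hour * HOURS_PER_MONTH + storage_gb * BLOCK_STORAGE_GB_MONTH

# Hypothetical workload: 2 TB on block storage, each query scans 0.1 TB,
# and the VM costs 1.50 USD per hour (an assumed price, not a quoted one).
vm_bill = always_on_monthly(vm_usd_per_hour=1.50, storage_gb=2000)
break_even_queries = vm_bill / (0.1 * PRICE_PER_TB_SCANNED)

for queries in (100, 10_000):
    scan_bill = on_demand_monthly(queries, tb_per_query=0.1)
    cheaper = "on-demand" if scan_bill < vm_bill else "always-on VM"
    print(f"{queries:>6} queries/month: on-demand {scan_bill:,.0f} USD, "
          f"VM {vm_bill:,.0f} USD, cheaper: {cheaper}")

print(f"break-even at about {break_even_queries:,.0f} queries per month")
```

With these made-up inputs, an occasional workload favors on-demand scanning while a constant 24×7 query stream favors the always-on VM, which is the pattern described above.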

Redshift Cost model

15:05 Let’s talk about one final model…. There is another model for operating databases in the cloud, and it was actually pioneered by Redshift when they first opened up 10 years ago. I call it the Buy-the-Box model. If you go and look at the pricing and setup of Redshift, you can still see this same model 10 years later.

15:29 Basically what you do is you hire out a VM, for example a dc2.8xlarge. It has attached block storage, it’s SSD, and they’ll charge you $4.80 an hour in AWS. How does that compare to the actual costs? Luckily for us we can sort of figure out what this VM is. Unless things have changed, it’s probably an i3.8xlarge VM, which is one of the older instance types in Amazon, and it has local SSD.

Pricing for Redshift’s “Buy-the-Box” approach

16:10 Here’s an example of the pricing for that. It’s going to be about $2.50 an hour and, as you can see, it has almost exactly the same amount of RAM. That’s why it’s pretty clear that it’s either this VM or something very like it.

How does that then compare in terms of markup?

Well, Redshift in this case is more costly, and it gives you 66 percent less storage, so you can get a better deal running the VM yourself. It’s interesting that the markup on Redshift is not super bad. It’s always had a lower markup than Snowflake, which really dings you on compute. The reason is that Snowflake allows you to have this relatively cheap object storage.
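Putting the two quoted hourly prices and the storage difference together gives a rough picture of the Buy-the-Box markup. This is only a sketch: it assumes the Redshift node really is backed by the instance type guessed above.

```python
redshift_usd_per_hour = 4.80   # quoted price of the Redshift node
raw_vm_usd_per_hour = 2.50     # quoted price of the comparable raw VM
storage_reduction = 0.66       # Redshift node has 66 percent less local storage

# Markup on the box itself.
price_ratio = redshift_usd_per_hour / raw_vm_usd_per_hour

# Markup per unit of local storage, since you also get less of it.
storage_fraction = 1.0 - storage_reduction
price_ratio_per_storage = price_ratio / storage_fraction

print(f"price per hour: {price_ratio:.2f}x the raw VM")
print(f"price per unit of local storage: {price_ratio_per_storage:.1f}x the raw VM")
```

On hourly price alone the markup is a bit under 2x, which is why it compares well with the 5 to 10x compute markup estimated for Snowflake.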

Now we can improve on this model significantly. How to do that? 

Introduction into ClickHouse and Database Architecture

ClickHouse Database architecture

17:06 Let’s introduce ClickHouse. What is ClickHouse? Well, as I mentioned above, ClickHouse is a popular open source, real-time analytic database. Here’s the basic architecture. You can have two types of storage, but the key thing that you’re going to have is ClickHouse servers, which are connected to each other over a network. They can replicate if you define replicated tables, in which case if you add data to a table on one host, it will automatically replicate to the other.

ClickHouse storage

17:39 The storage is columnar, like all of the databases we’re dealing with today. That means the data is stored in arrays, and those are compressed. They’re also cheaper to query because you don’t have to read as much storage. When ClickHouse was originally developed, it could only store data in block storage, but it has since been improved to be able to store the data in S3-compatible object storage (that is to say, any object storage which has an S3 API).

ClickHouse Keeper and ZooKeeper

18:13 It also includes what’s called ClickHouse Keeper or ZooKeeper. This is a cluster which is used to maintain consensus about the data that needs to be replicated between servers. This is the database architecture.

ClickHouse and Modernized ‘Buy-the-Box’ Cloud service

ClickHouse architecture and cost 

18:33 We can map this as a cloud service to something that I call modernized Buy-the-Box. Here on the left, you have your original Redshift architecture. On the right, we have a new version of it where instead of using what turns out to be i3 instances underneath, we can use m6i instances, which are Intel based. They are very fast, at least a third faster than the old i3s, and the other thing is that the storage is mounted on EBS. EBS is like SAN-based storage. Elastic block storage is really versatile, and it turns out that with the newest version on Amazon (called gp3 storage) you can control the bandwidth and the throughput.

ClickHouse comparison to Redshift 

19:28 So if you dial it up to get a throughput of a thousand megabytes per second with this nice VM, which is roughly comparable to what you’re running on Redshift, it’s going to run you about $2.64 an hour. That’s still about the same markup, but there’s actually another thing that’s really important, which is that by separating the storage and compute, you can now easily adjust the relative amount you spend on compute or storage.

Effect Of Storage & Compute on ClickHouse Prices

20:00 Let me show you an example of this because this is exactly what we do in Altinity.Cloud when we manage ClickHouse. By having separation of storage and compute, at any time you can change the CPUs that you are using, the VMs that you’re using, and basically just remap them to the same storage. What that means is that depending on what you’re doing, you can then scale your costs up and down just by changing the VM types.

See the separation of storage and compute in Altinity.Cloud, a managed service for ClickHouse

20:38 This graph right here shows you the all-in on-demand cost for Altinity.Cloud when it’s running ClickHouse. You can see that if we use that m6i.12xlarge, it’s going to cost six dollars an hour to operate, which is more expensive than Redshift. But what’s important is that 1) ClickHouse is really fast, and 2) m6i instances are faster than Redshift’s. When we deploy, we would probably recommend that you use smaller instance sizes, which means your costs are correspondingly a little bit lower, and you can of course adjust them. But this illustrates the kind of benefits you get from this separation of storage and compute.

21:24 So these are three models for how you can host analytic databases in the cloud and the effect that they have on your costs.

Quick Comparison of Models (Snowflake vs ClickHouse vs BigQuery)

21:39 Let’s do a quick comparison. This wallet-size table shows the different ways that these models compare to each other. You have Buy-the-Box, which gives you relatively cheap compute but more expensive storage because you’re using block storage (that’s on average about five times more expensive than object storage).

Snowflake and BigQuery give you cheap storage but ding you in other ways. For example, in the Snowflake model, we looked at how compute is relatively expensive … up to 10 times what you would really pay for it yourself! BigQuery’s on-demand query model can be extremely expensive if you have queries that scan a lot of data. I want to be totally clear: BigQuery also allows you to price it by compute. These are things called slots that you can reserve, and then you have a model which looks closer to Snowflake’s.

You can play around with it, but what that means is that if you look at these models and what they’re most effective for, Buy-the-Box is really good for customer-facing analytics where you can’t control what the customers are going to do, or when they’re going to do it, or how often they’re going to do it. Buy-the-Box means that you won’t get cost overruns, whereas the virtual data warehouse model can be very expensive if you need to allocate compute to do complex queries.

23:09 BigQuery of course can be very expensive if you unexpectedly scan more data than you really wanted, and that’s something that’s difficult to control up front…

Let’s dig a little deeper and look at how you get the best deal for these cloud services. Given that Snowflake’s markup is so high, let’s zero in a little bit on what value you’re getting from Snowflake.

What Snowflake Does Well

Snowflake is a good general purpose database

23:47 Snowflake is a very good general purpose database. In fact, this is one of the best things about it: you can just put practically any analytic app on Snowflake, and it’s going to work. It might be more expensive than other models, but if you’re operating across an entire company and you just want to pick one thing that will work for everything, Snowflake is probably it.

Snowflake can handle many use cases

24:17 Some of the things on this slide illustrate that it can handle vastly different use cases. It has a very good SQL implementation and great integration with tools. One of the things about Snowflake that’s really good is that people who’ve worked with BI in the past, on things like Teradata or Vertica, can convert to Snowflake without having to understand too much about what’s going on underneath. They can build applications that work and get out on time.

What Snowflake doesn’t do is the following.

What Snowflake Does Not Do

Why you wouldn’t want to use Snowflake

24:54 Snowflake doesn’t keep the data in your account (if you care about that). It doesn’t help minimize the cost, especially for compute-intensive 24×7 or real-time analytics. It’s not a particularly good solution for tenant-facing applications. If you have thousands of tenants, and you have tenant-facing dashboards, Snowflake is not necessarily a good one for that. And then finally, it’s completely proprietary. If you want to get off it, you’re talking months at the very least to port it to a different implementation.

That said, a lot of people choose Snowflake. Again, a classic example is a company which wants to make one choice that works across the entire company, and Snowflake is a good choice [for that]. If you’re going to use Snowflake or any other cloud service, the question is what can you do to get a better price?

How Can You Get a Better Price on Cloud Analytics?

25:51 We [Altinity] spend a lot of time thinking about pricing because we’re offering a service ourselves, so we know some key things to look for. 

Decoupled storage and compute

26:01 Decoupled storage and compute. You don’t want to buy into databases that can’t scale up or down. You can do it two ways. People tend to get fixated on object storage and think that it is the only way to scale compute, but in fact block storage and VMs scale very well and turn out to be super well suited for many analytic applications, particularly 24×7 real-time analytics.

Another thing is to be really careful to make sure that if you’re charged for storage, you’re charged for the compressed size. Snowflake compresses the storage. ClickHouse compresses the storage. BigQuery doesn’t necessarily do it. It’s called logical versus physical, so you want to double check. With columnar databases, you can often get 90% compression, so you really want to make sure that you’re getting this.
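The difference between being billed on logical and physical size is easy to quantify. The sketch below uses the 23 dollars per terabyte per month figure quoted earlier and the 90% compression figure above. The 100 terabyte data set is a hypothetical example.

```python
USD_PER_TB_MONTH = 23.0   # low end of the storage price range quoted earlier

def physical_tb(logical_tb: float, compression: float) -> float:
    """Size on disk after compression; compression=0.9 means 90% smaller."""
    return logical_tb * (1.0 - compression)

logical = 100.0                        # hypothetical uncompressed data set, in TB
physical = physical_tb(logical, 0.9)   # 90% compression, as quoted for columnar data

billed_on_logical = logical * USD_PER_TB_MONTH
billed_on_physical = physical * USD_PER_TB_MONTH

print(f"billed on logical size:  {billed_on_logical:,.0f} USD per month")
print(f"billed on physical size: {billed_on_physical:,.0f} USD per month")
```

Same data, same rate per terabyte, but the bill differs by a factor of ten depending on which size the vendor meters.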

Ask for discounts

26:55 Another thing is if you’re spending a lot of money per month, you shouldn’t even have to ask for a discount. Your vendor should give it to you. If you’re not getting it, ask for that. 

Speaking of discounts, if you’re going to negotiate, look for price breaks that align with the vendor’s discounts. Unless you’re really big, it’s kind of hard to get discounts on storage in the cloud, but compute is heavily discounted. You can get discounts on Amazon VMs of 50 percent or more simply by prepaying. Your vendor can get that discount, and what that means is that if you do a prepay, you can reduce that part of the cost, and that could be a major savings (a third or more!)
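Here is a minimal sketch of how a compute-only discount turns into "a third or more" on the total bill. The split between compute and storage spend is a hypothetical example, and it assumes the vendor passes the full prepay discount through, which you would need to negotiate.

```python
def bill_after_prepay(compute_usd: float, other_usd: float,
                      compute_discount: float) -> float:
    """Monthly bill when only the compute part is discounted."""
    return other_usd + compute_usd * (1.0 - compute_discount)

# Hypothetical monthly bill: 6,000 USD of compute and 3,000 USD of storage.
compute_usd, other_usd = 6000.0, 3000.0
original_bill = compute_usd + other_usd

# 50 percent discount on compute, as quoted for prepaid Amazon VMs.
new_bill = bill_after_prepay(compute_usd, other_usd, compute_discount=0.5)
saving = (original_bill - new_bill) / original_bill

print(f"bill goes from {original_bill:,.0f} to {new_bill:,.0f} USD")
print(f"overall saving: {saving:.0%}")
```

The more compute-heavy your bill is, the closer the overall saving gets to the compute discount itself.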

Cloud Marketplace

Then finally, a classic option is to buy on the cloud marketplace (e.g. Amazon Marketplace) and apply that spend to your own commits. That’s an advantage of marketplaces: even though it doesn’t lower the price of the service itself, it lowers your overall cost because it allows you to meet your obligations to Amazon or Google or others so that you can get other types of discounts.

When Is Cloud Analytic Database Pricing a Good Deal?

28:18 When is cloud analytic database pricing a good deal? When is it cost efficient? It really depends on your business…. I would just point out two things to look out for in pricing for cloud services.

Costs in Cloud services – 1st thing to look for 

28:44 Two things that I think of when I look at pricing for cloud services. One is that as your revenue grows, the cost of running analytics should scale at or less than your revenue growth. What you’re seeking when you build services yourself is that your marginal cost of adding new users goes down over time. That’s what makes SaaS applications work. If you have an analytic database that gets more expensive per customer as you add more customers, that’s a problem. Do the math. Figure out if that is the case.
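"Do the math" can be as simple as tracking cost per customer over time. The sketch below is one hypothetical way to do it: the function names and the sample figures are invented for illustration, and cost per customer is used as a simple proxy for marginal cost.

```python
def cost_per_customer(history):
    """history is a list of (customers, monthly_analytics_cost_usd) snapshots."""
    return [cost / customers for customers, cost in history]

def cost_per_customer_is_falling(history) -> bool:
    """True if each snapshot is no more expensive per customer than the last."""
    per_customer = cost_per_customer(history)
    return all(later <= earlier
               for earlier, later in zip(per_customer, per_customer[1:]))

# Hypothetical snapshots taken as the customer base grows.
healthy = [(100, 5_000), (200, 8_000), (400, 12_000)]     # 50, 40, 30 USD each
unhealthy = [(100, 5_000), (200, 12_000), (400, 30_000)]  # 50, 60, 75 USD each

print(cost_per_customer(healthy), cost_per_customer_is_falling(healthy))
print(cost_per_customer(unhealthy), cost_per_customer_is_falling(unhealthy))
```

If the second pattern shows up in your own numbers, the analytic database is getting more expensive per customer as you grow, which is the problem case described above.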

Costs in Cloud services – 2nd thing to look for

29:28 Another thing to look at is to see whether vendor pricing is in line with revenue….In the case of Snowflake, they’re doing a lot of investment so that will probably equal out just fine but if somebody is giving you something that’s essentially at cost, that could be a big problem because they’re really charging you less than they can truly afford. At some point, gravity will reassert itself and the prices will change.

Let’s work through an example. Let’s think about what would happen if a cloud service wasn’t a good deal….

It helps to pick a specific problem. For example, let’s say you’re in Europe, and you’ve decided that your destiny is to build a GDPR-compliant replacement for Google Analytics. Lots of people are in fact doing this, including customers of Altinity. This has a bunch of features, everything from analytic queries to data pipelines to visualization and consumption. We’re going to build the core platform using open source.

So first of all, let’s do a reality check again.

Reality Check Against Snowflake

31:05 Why don’t we use Snowflake? Snowflake has strengths, which we showed. In this particular case, it’s nice that it’s serverless. In other words, you don’t have to worry too much about what’s going on down there with the instances. It’s all automated for you. It has a UI with SQL editing and management and lots of tools on top of it. Those things are great, but a lot of the strengths of Snowflake don’t really matter here, like having standards-compliant SQL. That’s nice, but you’re not porting this and you’re just building a specific application. That’s not a particularly big deal.

Snowflake’s weaknesses

31:30 In fact, the Snowflake weaknesses vastly outweigh the benefits. For example, take just being able to keep the data in your own cloud account. That is necessary to allow you to efficiently meet GDPR. You’re also paranoid. You don’t want data lock-in. Then of course there are other things in between. Snowflake may not be the right solution.

Alternative solution for analytics 

Kubernetes and open source technology

31:55 What we’re going to do is introduce an alternative. What we see a lot of people doing in the marketplace is building what we call the modern analytics stack which is based on Kubernetes. 

The idea is you’re going to build a custom platform which will deliver the exact analytics that you need for your particular application. The key thing we see is people use cloud native technology, specifically Kubernetes, which is very good for distributed applications. We see people using open source. We see them using infrastructure as code, meaning defining the whole stack as something that you can check in to GitLab or to GitHub, and using GitOps to deploy it: basically taking that checked-in code and then blasting it out to one or more environments. And then finally, of course, operating using Kubernetes and clouds as the runtime.

Basic design on the modern analytic stack

33:00 There’s a pattern to these. They’re kind of cookie cutter. This is a diagram that shows typical layering. You have GitOps, CI/CD (continuous integration, continuous delivery) and Kubernetes as the base assumptions. We’ll show you specifics on this. Then you have layers within it, so you have a management layer that covers things like operators. We’ll talk about those in a second.

You have storage, which is your database engines. Orchestration which is your data pipelines, data capture (ETL and reverse ETL) and then finally consumption: BI tools, your own applications, and APIs. These tend to be glommed in together.

33:43 I’ve sort of listed them in the order of their dependencies. So how do we actually start building this platform? What we want to do is fill in the layers so that we can get something that’s deployable.

Choose a Kubernetes Distribution

34:00 First thing to do is choose a Kubernetes distribution. The best practice is to go ahead and use managed Kubernetes. If you’ve got to run Kubernetes, let somebody else run it for you. You’ve heard that Kubernetes is complicated. That is true, but it’s mostly complicated to run. Actually building applications on it is not anywhere near as hard, and the thing about the managed Kubernetes that’s available from the major public clouds is that it’s amazingly cheap. In a truly amazing coincidence, they all charge 10 cents an hour. So for example, for Amazon EKS that works out to about 72 dollars a month. You’d really be kind of nuts to not do this.

34:46 We use this in our own cloud. We run hundreds of clusters and we use managed Kubernetes. That’s the first thing, you just pick one because what we’re going to describe here will work on any of it.

Choose the Right Analytic Database

34:57 Second is to pick the right open source analytic database. There are many options out there, and they are less general purpose than Snowflake or BigQuery. In our particular case ClickHouse is absolutely the right choice because it does real-time analytics (in fact, web analytics was the original use case for ClickHouse when it was developed 15 or so years ago).

Pick an Operator to Run the Database

35:27 The third thing is if you’re running databases in Kubernetes, you want to use an operator to manage them. In the case of ClickHouse, there’s a really good one, which is the Altinity operator, also known as the clickhouse-operator for short.

The way that operators work is they define what’s called a custom resource definition, which makes the database a new type of resource in Kubernetes that you can just apply. There’s a program called kubectl. If you’ve used Kubernetes, you’re very familiar with it. You just apply a piece of YAML that has a definition of the database in a custom format, but very compact (12 to 24 lines of YAML), and that gets fed into Kubernetes.

Kubernetes will recognize that this is a resource that the Altinity ClickHouse operator needs to manage. It’ll hand it over and then the operator will adjust reality. It will define new resources in Kubernetes which will make that database turn into reality. So that’s the third thing.
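To give a feel for how compact such a resource is, here is a hedged sketch that builds one as a Python dictionary and prints it as JSON. The `apiVersion`, `kind` and layout field names follow the operator’s published examples as best I know them, so check them against the operator version you run. The resource name, cluster name and ZooKeeper host are hypothetical.

```python
import json

def clickhouse_installation(name, cluster, shards, replicas, zookeeper_host):
    """Build a minimal ClickHouseInstallation resource as a plain dict."""
    return {
        "apiVersion": "clickhouse.altinity.com/v1",
        "kind": "ClickHouseInstallation",
        "metadata": {"name": name},
        "spec": {
            "configuration": {
                # Where replicated tables keep their coordination state.
                "zookeeper": {"nodes": [{"host": zookeeper_host, "port": 2181}]},
                # One cluster with the requested number of shards and replicas.
                "clusters": [
                    {
                        "name": cluster,
                        "layout": {"shardsCount": shards, "replicasCount": replicas},
                    }
                ],
            }
        },
    }

# Hypothetical names: "demo", "analytics" and "zookeeper.zoo" are placeholders.
manifest = clickhouse_installation(
    "demo", "analytics", shards=1, replicas=2, zookeeper_host="zookeeper.zoo"
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`, which accepts JSON as well as YAML. In a GitOps setup you would normally check in the equivalent YAML instead of generating it.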

Choose the Observability Platform

36:30 The fourth thing is you need to choose your observability platform. There are many of these available, but one that’s particularly powerful in Kubernetes is Prometheus. That’s virtually universal. It will collect your data out of things like the Altinity operator, which will automatically export metrics, and then you use Grafana to build dashboards. Grafana is very powerful for operational dashboards. It allows you to zoom in, look at particular series, and change the time scale. It’s super useful, and it has great plugins for ClickHouse. We maintain clickhouse-grafana. It has about 11 million downloads. It’s used in thousands of installations. There’s also a very good plugin, of course, for Prometheus.

Pick a Kubernetes GitOps Implementation

37:36 The final thing is to pick a GitOps implementation. What you want to be able to do is check in the code that defines the stack and be able to apply it to Kubernetes clusters in different locations. There are several ways to do this, but one that we’ve seen used very successfully, and that we use ourselves, is ArgoCD.

38:03 What ArgoCD does is it will take projects that are located in GitHub, and they can define the different services that we’re going to run in Kubernetes in a variety of ways. A simple one is you have a YAML file, which we call a manifest. There’s also a project in Kubernetes called Kustomize, which allows you to tailor the manifests depending on which environment they’re going to. Finally, there are Helm charts, which are kind of like apt install, or sort of like Debian packages or RPM packages, but for Kubernetes apps. ArgoCD can process all of them and map them to what are called apps out in Kubernetes. It will grab the code out of GitHub and then it will sync it to Kubernetes, which actually causes your application to stand up and run.

What does that look like? When it’s fully deployed, you store the project in GitHub and you have ArgoCD running in at least one of your clusters. Then you issue various sync commands; you can also build in automatic synchronization using GitHub Actions. Any change will then be applied and blasted out to whichever Kubernetes cluster you’re syncing with.
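The mapping from a Git project to an ArgoCD app can be sketched as an Application resource. The Python below builds one as a dictionary and prints it as JSON. The repository URL, path, and names are hypothetical placeholders. The optional `syncPolicy.automated` block is ArgoCD’s own built-in auto-sync setting, shown here as one possible alternative to the GitHub Actions approach mentioned in the talk, not as the setup used in the demo.

```python
import json


def argocd_application(name, repo_url, path, namespace, auto_sync=False):
    """Build an ArgoCD Application resource as a plain dict."""
    app = {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumes ArgoCD is installed in the conventional 'argocd' namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "targetRevision": "HEAD",
                "path": path,
            },
            "destination": {
                # In-cluster API server address used by ArgoCD.
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
        },
    }
    if auto_sync:
        # Built-in ArgoCD option: sync on every Git change without a manual step.
        app["spec"]["syncPolicy"] = {
            "automated": {"prune": True, "selfHeal": True}
        }
    return app


# Hypothetical repository and path, for illustration only.
app = argocd_application(
    name="clickhouse-stack",
    repo_url="https://github.com/example-org/analytics-stack.git",
    path="apps/clickhouse",
    namespace="clickhouse",
    auto_sync=True,
)
print(json.dumps(app, indent=2))
```

With `auto_sync=False` the app is only registered, and a manual sync from the ArgoCD UI or CLI is what pushes the change out to the cluster.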

This has been a whirlwind tour, but when you follow these steps, you can now put together a full stack. We don’t have all the layers here because we’re just building the base, but we’ve got Kubernetes of course.

39:52 I’m going to demo this in a second using EKS, which is the Amazon managed Kubernetes. We use GitHub to store our code. We use ArgoCD to implement GitOps and map it to Kubernetes clusters. Then we have the management and storage layers: we have our observability, we have our ClickHouse operator, and then we have ClickHouse with ZooKeeper. Then, just to make things fun, we add in CloudBeaver, which is a nice cloud-based SQL editor that is a containerized version of DBeaver.

That’s the stack. This may have sounded complicated, but when you actually come to do it, it’s pretty easy to apply. Let me just demonstrate, so let’s go ahead.

40:42 – 50:00 DEMO

Best Practices for Do-It-Yourself Modern Analytic Stacks

50:12 Build on managed Kubernetes. We talked about picking the GitOps implementation. I’m using ArgoCD in this case, but I would say an even larger number of our customers are using Terraform… One important point is that by developing the applications in this way, you’re not precluding using cloud services when it’s appropriate. The stack that I developed didn’t include Kafka. In fact, many of our customers build the reporting stack completely locally, but they might use a cloud version of Kafka. So you don’t have to make that choice, and you can go back and forth.

Altinity and the Modern Analytic Stack

51:16 Just a little bit about how we help enable this model of building these modern analytic stacks and allow you to keep the data on-prem to meet these additional requirements and operate it cost-efficiently at large scale.

We have the Altinity.Cloud platform for ClickHouse, and one of the cool features of it is that if you’re running the stack in Kubernetes, we can actually manage the ClickHouse part for you. We have customers that run everything on Kubernetes but leave the actual management of ClickHouse itself to us. We give them a cloud management plane. It is connected securely to the Kubernetes cluster. It’s very simple to set up the connection. You just run this connector app, which then registers with Altinity.Cloud.

52:12 Another thing we do is provide software. You’ve seen a few examples, like the Altinity Kubernetes operator. That’s used in thousands of installations worldwide. Altinity Stable Builds: these are builds with three-year maintenance that are certified for production use. We maintain clickhouse-backup.

We maintain the Grafana plugin, which has 11 million downloads, and a Tableau connector, among others.

53:00 Finally, we have services to help you get these things to work if you’re not familiar with them. It is very helpful to have somebody who knows their way around. We know ClickHouse as well as anybody in the world, and we’re the best people for running ClickHouse on Kubernetes. We’ve been doing it for years. We can help you with ClickHouse proofs of concept, and what we’re doing is building blueprints like the one I just showed you to help bring these stacks up more efficiently.

That brings us to the blueprint. Here’s how to get started with it. You can just clone the examples that I showed you and run them yourself. It’s not complete, I will tell you; we’re still doing some work… (see our blog article and the project for updates)

54:11 Conclusion