Making the Journey to FedRAMP: Cisco Umbrella, Altinity, and ClickHouse®-based Analytics

Recorded: June 20 @ 10:00 am PT
Presenters: Pauline Yeung, Data Engineer & SecDevOps @Cisco Umbrella, and Robert Hodges, CEO @Altinity

This webinar brings together Pauline Yeung, Software Engineer at Cisco, and Altinity CEO, Robert Hodges, to walk through the real-world journey of deploying ClickHouse®-based analytics in a FedRAMP Moderate environment. Cisco Umbrella runs one of the largest known ClickHouse deployments, with a production cluster of over 500 nodes storing 2.2 petabytes and 45 trillion rows of security telemetry from DNS, firewall, web gateway, and cloud access security services.

Pauline explains the full scope of FIPS 140-2 compliance across the stack: every component, including ClickHouse, CHProxy, clickhouse-backup, and ZooKeeper, must be compiled with a certified cryptographic library, all traffic must run over TLS, and even AWS services like the application load balancer must have FIPS explicitly enabled in GovCloud. Robert then walks through what Altinity built to support this: FIPS-compatible Altinity Stable Builds using the BoringSSL source code certified on June 29, 2022, a startup known-answer test that prevents a modified binary from running, a fips.xml configuration file that locks down ciphers and TLS versions, and the specific port decisions needed to restrict ClickHouse to FIPS-compliant listeners only.

The session concludes with four hard-won lessons: make everything FIPS-compliant from the start, switch from ZooKeeper to ClickHouse Keeper for long-term maintainability, automate everything with Ansible and Terraform, and treat rigorous automated testing as essential because crypto configuration failures are silent.

Key Moments (Timestamps)

Key moments generated with AI assistance.

  • 0:05 – Introduction and housekeeping
  • 1:16 – Speaker introductions: Pauline Yeung and Robert Hodges
  • 2:03 – What is FedRAMP? Levels, scope, and GovCloud
  • 3:55 – Cisco Umbrella overview and how it uses ClickHouse
  • 4:36 – Cisco Umbrella cloud architecture: SSE layers of defense
  • 5:21 – DNS, firewall, web gateway, CASB, and zero-trust layers explained
  • 7:48 – Cisco Umbrella ClickHouse cluster scale: 500+ nodes, 2.2 PB, 45T rows
  • 9:11 – Introduction to FIPS 140-2 and why it applies to every component
  • 10:16 – FedRAMP ClickHouse cluster: what must be FIPS-compiled
  • 12:46 – Jenkins CI pipeline inside GovCloud boundary; Artifactory and Datadog
  • 14:10 – FIPS compliance checklist slide summary
  • 14:54 – Audio interruption; Robert resumes
  • 15:02 – What are Altinity Stable Builds for ClickHouse?
  • 17:51 – Four steps to making ClickHouse FIPS-compatible
  • 20:22 – Introducing FIPS-compatible Altinity Stable Builds: 22.8, BoringSSL, KAT
  • 21:41 – How Altinity tests FIPS builds: ClickHouse tests, regression suite, code scans
  • 22:46 – How to get the builds: builds.altinity.cloud or build yourself
  • 23:31 – How to configure FIPS-compatible operation: fips.xml, port shutdown, startup verification
  • 24:53 – Know your ClickHouse ports: which are FIPS-green, which must be shut off
  • 26:44 – What does fips.xml look like: cipher list, TLS 1.2 only, verificationMode
  • 28:22 – ClickHouse clusters add complexity: distributed queries, part fetches, ZooKeeper
  • 30:09 – ZooKeeper vs. ClickHouse Keeper for FIPS compliance: pros and cons
  • 32:45 – FIPS-compatible ClickHouse Keeper: Raft library update, coming in Altinity Stable 23.3
  • 34:15 – Certificate chains: handling intermediate certificates in ClickHouse TLS config
  • 37:04 – Broader ClickHouse hardening: what comes after crypto
  • 37:55 – Cisco deployment: Terraform, Ansible playbooks, Jenkins in GovCloud
  • 39:09 – Cisco Ansible code: GPG key prefetch, repo installation, package install
  • 39:35 – Addressing security: redacting passwords, closing non-FIPS ports, EBS encryption
  • 41:03 – Conclusions: four key lessons learned from the FedRAMP journey
  • 44:51 – Q&A: Vitali Aksionov on the importance of testing and known-good configuration

Webinar Transcript

[0:05] — Introduction and Housekeeping

Robert: Welcome, everybody. Today’s webinar will be talking about making the journey to FedRAMP with Cisco Umbrella, Altinity, and ClickHouse®-based analytics. It is my pleasure to be presenting with Pauline Yeung from Cisco. My name is Robert Hodges.

Before we get into the bios and the meat of this presentation, let me give you a little bit of housekeeping. First, as you’ve probably just heard, this webinar is being recorded. For everyone who signed up, you will get a link to the recording along with a link to the slides at the end of the webinar, sent to you through email, probably within the next 24 hours. Second, we have plenty of time for questions. You can post them into the chat or the question-and-answer box. Just as something comes up, feel free to post it and we’ll get to it when we can. If it’s something relevant, we may answer it as we go along; otherwise we’ll catch all questions at the end.

[1:16] — Speaker Introductions

Robert: Pauline, would you like to go ahead and introduce yourself?

Pauline: Yes. This is Pauline. I’m a software engineer at Cisco. I’ve been doing data engineering DevSecOps for the past decade with Cisco.

Robert: Once again my name is Robert Hodges. I’ve been working on database systems for 40 years. Most recently, of course, ClickHouse, but systems going back to pre-relational databases. I’ve been involved in Kubernetes, security, and other topics as well. My day job is CEO at Altinity.

[2:03] — What Is FedRAMP?

Robert: Let’s jump in. Our talk today will be mostly technical, but I think it’s important to start with the obvious question: what is FedRAMP?

FedRAMP is a security compliance program that the U.S. government has created. It’s designed for systems that process government data. It is widely used. Many large vendors have experience with it, and the reason is that it’s a pretty standard requirement for doing business with the U.S. government, including intelligence agencies, the Department of Defense, and many other agencies. This is obviously a very large market.

There are multiple levels of compliance. We won’t go into full detail, but at the highest level of FedRAMP compliance there are literally hundreds of requirements to meet, for systems that process very sensitive data. For commercial applications, Moderate is a fairly common requirement with a more constrained set of requirements. It’s very common to run these systems in GovCloud. And then there’s Low, which we won’t talk about today. Most of the people we’ve had contact with don’t implement that level.

As you’ll see from this talk, FedRAMP-compliant systems commonly run in GovCloud or similar secure cloud environments. In the particular case we’re talking about today, we’ll be talking a fair bit about GovCloud itself.

[3:55] — Cisco Umbrella Overview

Robert: With that, let me turn this over to Pauline, who will introduce Cisco Umbrella, how it uses ClickHouse, and tell the story of how they’re building an analytics system that runs under FedRAMP.

Pauline: Thanks, Robert. Cisco Umbrella is working on qualifying for FedRAMP Moderate for its security services. Cisco Umbrella logs all its security service activities to ClickHouse, and those logs are used for security reports, activity search, and threat intelligence.

[4:36] — Cisco Umbrella Cloud Architecture

Pauline: So what is Cisco Umbrella? Umbrella runs at the edge of the cloud. It protects customers when they access cloud services from their devices. These cloud services include things like Office 365, Salesforce, your corporate cloud, and any other cloud services. The identities that access those cloud services can come from headquarters, branch offices, mobile roaming devices, or network devices.

[5:21] — Cisco Umbrella Secure Service Edge (SSE)

Pauline: Cisco Umbrella provides multiple layers of defense based on customer-configured policy as well as threat intelligence collected by Cisco Talos.

The first layer is DNS security, used to block domains, detect DNS tunneling, and provide DNS DDoS protection. If anything makes it through, the second layer is Firewall as a Service, which provides intrusion protection and can block tunnels, protocols, IPs, ports, and user or group identities.

If bad actors make it through that, they reach the Secure Web Gateway and Remote Browser Isolation, which can block websites, prevent browser-based exploits, and prevent drive-by downloads. Beyond that is the Cloud Access Security Broker, which provides malware protection, non-web traffic inspection, and data loss prevention. One example: preventing data leaks by blocking emails that contain credit card information from a corporate network.

And if an identity makes it through all of those security services, it still has to go through Zero Trust Network Access, which requires the identity to prove its trustworthiness before being granted access.

You might wonder: with all these layers, will my requests be slowed down? The answer is no, because Cisco is a networking company and we do network peering with all the major cloud service providers, which means we do not incur additional latency.

[7:48] — Cisco Umbrella ClickHouse Scale

Pauline: All these security services log all their security activities to ClickHouse. The largest volume of activity logs comes from DNS, and at peak this reaches almost 800 megabytes per second. We therefore have a very large ClickHouse cluster to store all these logs. All the rest of the logs combined add up to about 120 megabytes per second.

Our large ClickHouse cluster was created back in 2019. As you can see it was running version 19.4. It provides one year of data retention and has since grown to over 500 nodes, storing 2.2 petabytes of data with 45 trillion rows. The smaller cluster was created in 2020 with one month of retention, 27 nodes, 56 terabytes, and 240 billion rows. We also store one month of data in S3 for offline processing to feed our threat intelligence system.
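
A quick back-of-the-envelope check puts those figures in perspective. This is only a sketch: it assumes decimal units (1 PB = 10^15 bytes) and treats the quoted sizes as stored bytes, neither of which is stated in the talk.

```python
# Rough per-row and per-node figures from the numbers quoted in the talk.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 PB = 10**15 bytes).
PB = 10**15
TB = 10**12

def bytes_per_row(total_bytes: float, rows: float) -> float:
    """Average stored bytes per row."""
    return total_bytes / rows

large_bpr = bytes_per_row(2.2 * PB, 45e12)   # large cluster: 2.2 PB, 45 trillion rows
small_bpr = bytes_per_row(56 * TB, 240e9)    # small cluster: 56 TB, 240 billion rows
large_per_node_tb = (2.2 * PB / 500) / TB    # spread across roughly 500 nodes

print(f"large cluster: ~{large_bpr:.0f} bytes/row, ~{large_per_node_tb:.1f} TB/node")
print(f"small cluster: ~{small_bpr:.0f} bytes/row")
```

Under those assumptions the large cluster averages about 49 bytes per row and about 4.4 TB per node, while the smaller cluster averages about 233 bytes per row.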

As I mentioned, the ClickHouse logs are used for security reports and activity search, which requires intensive aggregation. As a result we cannot serve more than about 80 requests per second.

[9:11] — Introduction to FIPS 140-2

Pauline: Now we have to go to FIPS. FIPS 140-2 is the Federal Information Processing Standard. It is used in the US as well as Canada and Japan. You will see the word FIPS repeated over and over in the next few slides because that is the main thing we are trying to achieve compliance with. The certification process itself is very long, and if the software that achieves compliance changes you have to be recertified.

[10:16] — FedRAMP ClickHouse Cluster: FIPS Compliance Requirements

Pauline: Since all components have to be FIPS-compliant, in our GovCloud ClickHouse cluster all communication between components must go over TLS. That means all software has to be compiled with a FIPS-certified cryptographic module.

In our GovCloud account, FIPS is not enabled by default. You have to make a special request to AWS Support to enable it. When you create the Application Load Balancer, you have to use a special tag to enable FIPS, because even an ALB stood up in GovCloud is not FIPS-enabled by default. CHProxy has to be compiled with the BoringSSL library, which is FIPS-compliant. The ClickHouse software itself has to be compiled with BoringSSL, as does clickhouse-backup. clickhouse-backup also has to point to an S3 FIPS endpoint, which you achieve by setting the AWS_USE_FIPS_ENDPOINT environment variable. And ZooKeeper requires the FIPS-compliant Bouncy Castle JAR.

[12:46] — Jenkins CI Pipeline Inside the GovCloud Boundary

Pauline: To build, deploy, and configure all components, we use Jenkins. CHProxy is built in Jenkins inside the boundary. Anything you do inside the boundary cannot be moved out of the boundary, so we host it in JFrog Artifactory, which is allowed in GovCloud. At deploy time we pull CHProxy from JFrog Artifactory. We use Datadog as a monitoring system because Datadog is allowed in GovCloud, so we had to migrate from our Graphite and Grafana system. Since Altinity is building the ClickHouse packages for us, we are allowed to pull those into Jenkins for deployment and configuration. The infrastructure-as-code we use is Terraform and Ansible, hosted in GitHub outside the boundary.

[14:54] — Robert Resumes: What Are Altinity Stable Builds for ClickHouse?

Robert: Sorry about the audio disruption there. Let’s dive back in.

FIPS is pervasive throughout this system, so what we did was produce a FIPS-compatible build of ClickHouse. But before I explain that, I need to explain what kind of builds we do. We create Altinity Stable® Builds for ClickHouse®, which are open-source builds of ClickHouse designed for enterprise users in production systems.

They are based on the long-term support releases of ClickHouse. If you use ClickHouse you know that about every six months there is an LTS release, typically in March and August. The community supports it for up to a year and it’s meant to be used and continue to receive patches for months after initial release. In our fork of ClickHouse on GitHub, we copy that in and add selected bug fixes and features. Our customers may need specific bug fixes that haven’t made it into the official LTS builds, so we put those in. We also back-port certain features so people can benefit from new capabilities without having to do a major upgrade. And of course we do patches for CVEs.

We vet them very thoroughly for production use. When you use an Altinity Stable Build you’ll find we’re always a bit behind the leading edge of ClickHouse development, and there’s a reason for that. It’s the same reason that Red Hat Enterprise Linux isn’t identical to the latest off-the-presses Linux builds: you need to let things stabilize, make sure they’re really ready to go, do upgrade testing, and ensure they won’t crash in production. We offer three years of support, and in fact longer in some cases. They are 100 percent open source. There’s no holdback, no open-core model. It’s just another build of ClickHouse with different support. And we strive for full compatibility with upstream ClickHouse so you can use these builds interchangeably.

[17:51] — Four Steps to Making ClickHouse FIPS-Compatible

Robert: Now, when you have an application like Cisco Umbrella’s, the question is how do you take the existing application and make it FIPS-compatible? There are basically four steps.

First, as you saw in Pauline’s slides, pretty much everything in this whole system uses BoringSSL. ClickHouse also switched to BoringSSL a couple of years ago. So the first step is to switch away from the BoringSSL included in standard ClickHouse and go to a version of BoringSSL that has actually been verified for FIPS 140-2. Several versions have been verified. We take the latest one, grab the source code, and compile and link against that version so we have crypto that is in every respect compatible with the verified code.

Second, we make sure ClickHouse itself properly supports FIPS. This requires some modest API changes, a different version number, and what’s called a FIPS self-check, also known as a Known Answer Test (KAT). When the FIPS-compatible software starts up, it runs a test to make sure the crypto modules have not been modified or tampered with. If they have been changed, the server simply stops. So you get a completely different binary with FIPS support and slightly different behavior.
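
To make the idea of a Known Answer Test concrete, here is a minimal Python sketch. It is not the BoringSSL self-check, which covers the integrity of the crypto module and many algorithms. It only shows the principle: run a primitive on a fixed input, compare against a published answer, and refuse to start on a mismatch. The function names are hypothetical.

```python
import hashlib
import sys

# Published SHA-256 test vector for the three-byte message "abc".
KNOWN_INPUT = b"abc"
KNOWN_DIGEST = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"

def known_answer_test() -> bool:
    """Return True if the hash implementation reproduces the known answer."""
    return hashlib.sha256(KNOWN_INPUT).hexdigest() == KNOWN_DIGEST

def start_server() -> str:
    """Hypothetical startup hook: refuse to run if the self-check fails."""
    if not known_answer_test():
        sys.exit("crypto self-check failed; refusing to start")
    return "started"
```

The design point is that the check runs before any listener opens, so a tampered library never serves traffic.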

Third, we provide configuration information. A large part of FIPS operation is simply configuring your software correctly. We’ll show examples of that.

Fourth, and this is one of the biggest things about security in general: you test the daylights out of it. Crypto setup is very persnickety, and if you don’t set things up perfectly they simply don’t work, and they don’t give you very good error messages. So we test exhaustively to make sure that not only the software but also the configuration works.

[20:22] — Introducing FIPS-Compatible Altinity Stable Builds

Robert: The FIPS-compatible Altinity Stable Builds for ClickHouse started in January 2023, based on ClickHouse version 22.8. The upcoming Altinity Stable 23.3 will also support them. We maintain them in a separate branch. They are identical to mainline ClickHouse except for some very small changes: the self-check KAT, the software version, and some extensions. The main place we made changes was to ClickHouse Keeper, which we’ll discuss in a minute.

We use the BoringSSL source code that was certified on June 29th of 2022. An important part of being FIPS-compliant is that BoringSSL has a specific build procedure you have to follow exactly. We make changes to the ClickHouse build procedure accordingly. The specific crypto behavior is verified using an Altinity test suite that covers both single-server operation and cluster operation, with thousands and thousands of test cases.

[21:41] — How Altinity Tests FIPS Builds

Robert: When we release these builds, we run all available ClickHouse tests, which are abundant: a very large number of unit and integration tests from the ClickHouse repo. We have our own regression tests in a separate GitHub repo that anybody can see. We do code scans using Snyk and Scout, which are very good at scanning containers. They test your packages and implicitly verify that nothing problematic has been included. We make sure these are clean before release.

There are specific tests for crypto behavior: one that tests the crypto between applications and a single ClickHouse server, and another that tests the crypto functions in ClickHouse clusters. You can look at them and see exactly what we do. We add to them constantly.

[22:46] — How to Get the Builds

Robert: To get these builds, go to builds.altinity.cloud. There’s a section on the build page with a channel for picking up RPMs or Debian packages. You can also build it yourself if you want. The build process uses CMake, though be prepared to wait a couple of hours.

[23:31] — How to Configure FIPS-Compatible Operation

Robert: A really important part of running FIPS, as I mentioned, is configuration. Our documentation for FIPS-compatible Altinity Stable Builds consists of three parts. First, shut off the non-FIPS ports. Not every listener on the ClickHouse server supports FIPS crypto, and we’ll see an example in a minute. Second, add a specific configuration file called fips.xml that sets TLS versions and specifies allowed ciphers. Third, start the server and verify that it works by looking for the FIPS mode message in the log.

Here’s an example of what that startup verification looks like. You grep for “FIPS mode” in the ClickHouse log. You’ll see a message saying “Starting in FIPS mode, KAT test result: 1.” That means ClickHouse came up, checked the FIPS libraries, confirmed they haven’t been tampered with, and the server continues running. If they have been changed, the server crashes immediately. This is a feature: you know right away if something is wrong.
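
The grep step can also be automated. The sketch below scans log lines for the startup message quoted above and reports whether the KAT result is 1. The message text comes from the talk; the surrounding log format and the helper names are assumptions.

```python
import re
from typing import Iterable, Optional

# Message text as quoted in the talk; the log format around it is an assumption.
FIPS_LINE = re.compile(r"Starting in FIPS mode, KAT test result: (\d+)")

def kat_result(log_lines: Iterable[str]) -> Optional[int]:
    """Return the KAT result from the first FIPS startup line, or None if absent."""
    for line in log_lines:
        match = FIPS_LINE.search(line)
        if match:
            return int(match.group(1))
    return None

def fips_startup_ok(log_lines: Iterable[str]) -> bool:
    """True only if the FIPS startup message is present and reports result 1."""
    return kat_result(log_lines) == 1
```

To use it, open the server log (for example /var/log/clickhouse-server/clickhouse-server.log, which is an assumed path) and pass the file object to fips_startup_ok. A missing message is treated as a failure, which catches the case where a non-FIPS binary was deployed by mistake.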

[24:53] — Know Your ClickHouse Ports

Robert: Turning off ports is important because ClickHouse offers a wide variety of network protocols, which means you can have 10 or 15 ports potentially open on a ClickHouse server. As you look at the server, it’s critical to ensure you’re only using ports that have FIPS-supported crypto.

The green ports are the ones you want. For the native TCP protocol, port 9440 is the secure port you want open. For HTTPS connections, port 8443. For inter-server replication, port 9010. For ClickHouse Keeper, there are two ports: one for client communication with the Keeper ensemble, and one for the Raft consensus protocol that Keeper nodes use to stay synchronized.

The red and gray ports cover things like the MySQL compatibility protocol, PostgreSQL protocol, Prometheus metrics, and others. These are not guaranteed to support FIPS crypto at this time, so you shut them off. The goal is to ensure green ports are on and all others are off.
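
A small audit script can confirm that only the green ports are listening. This is a sketch under stated assumptions: the three secure ports come from the talk, the non-FIPS candidates are common ClickHouse default port numbers, and the Keeper ports are left to the caller because they depend on the deployment.

```python
import socket
from typing import Iterable, List, Set

# Secure ports named in the talk. Keeper ports are deployment-specific, so
# they are passed in by the caller rather than hard-coded here.
FIPS_PORTS: Set[int] = {9440, 8443, 9010}

# Assumed candidates to probe: common ClickHouse defaults for plain HTTP,
# plain native TCP, MySQL, PostgreSQL, and plain inter-server traffic.
NON_FIPS_CANDIDATES: Set[int] = {8123, 9000, 9004, 9005, 9009}

def is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0

def unexpected_ports(open_ports: Iterable[int], allowed: Set[int]) -> List[int]:
    """Ports that are listening but not on the allow-list, sorted."""
    return sorted(set(open_ports) - allowed)

def audit(host: str, extra_allowed: Iterable[int] = ()) -> List[int]:
    """Probe known ports on host and report any open port outside the allow-list."""
    allowed = FIPS_PORTS | set(extra_allowed)
    candidates = allowed | NON_FIPS_CANDIDATES
    open_ports = [p for p in candidates if is_open(host, p)]
    return unexpected_ports(open_ports, allowed)
```

Calling audit with a host name returns an empty list when nothing outside the allow-list answers. Run it from a remote machine to see what the network can reach, since a port bound only to localhost will not show up from outside.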

[26:44] — What Does fips.xml Look Like?

Robert: Here’s the fips.xml file. The first thing you see is remove="true" for the HTTP port. That shuts it down even if the default config.xml turns it on. This is the standard removal trick for ClickHouse XML configuration.

Within the openSSL section, which controls both server and client communications, the items critical for FIPS operation include a cipher list showing the accepted ciphers under FIPS. If a client cannot speak these ciphers, the connection is refused. We also turn off all protocol versions other than TLS 1.2, which is the only protocol version supported by FIPS. The verificationMode in this particular case is set to relaxed, meaning clients can connect without presenting a certificate. You can easily change this to strict, in which case clients must also present a certificate signed by an authority that ClickHouse recognizes.
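
The points above can be turned into a simple pre-deployment check. The sketch below parses a configuration fragment and flags the settings the talk calls out. The embedded XML is illustrative only and is not the published fips.xml: the element names follow ClickHouse's openSSL settings, but the cipher list is a placeholder and the use of requireTLSv1_2 to express "TLS 1.2 only" is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import List

# Illustrative fragment only, NOT the published fips.xml.
# The cipher list is a placeholder; requireTLSv1_2 is an assumed setting.
SAMPLE = """
<clickhouse>
    <http_port remove="true"/>
    <openSSL>
        <server>
            <cipherList>ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384</cipherList>
            <requireTLSv1_2>true</requireTLSv1_2>
            <verificationMode>relaxed</verificationMode>
        </server>
    </openSSL>
</clickhouse>
"""

def check_fips_config(xml_text: str) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    root = ET.fromstring(xml_text)
    problems = []
    http = root.find("http_port")
    if http is None or http.get("remove") != "true":
        problems.append('http_port is not removed with remove="true"')
    server = root.find("openSSL/server")
    if server is None:
        return problems + ["missing openSSL/server section"]
    if not (server.findtext("cipherList") or "").strip():
        problems.append("no cipherList set")
    if (server.findtext("requireTLSv1_2") or "").strip() != "true":
        problems.append("TLS 1.2 is not required")
    if (server.findtext("verificationMode") or "").strip() not in {"relaxed", "strict"}:
        problems.append("verificationMode is not relaxed or strict")
    return problems
```

A check like this only confirms that the file says what you intended. It does not replace starting the server and testing real connections.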

[28:22] — ClickHouse Clusters Add Complexity

Robert: So that’s the server configuration. Now let’s talk about something more complicated: enabling a FIPS-compatible ClickHouse cluster. Clusters add complexity and increase the attack surface.

In a real system like Cisco’s, you have multiple ClickHouse servers. Applications connect to them, of course, but within the ClickHouse servers you also have distributed queries: one server sends sub-queries to other servers for sharded data. You have part fetches between servers to collect parts for replicated tables. To support replication, ClickHouse contacts ZooKeeper for leader election during merges and to coordinate which servers have which parts. And ZooKeeper nodes themselves communicate via ZooKeeper Atomic Broadcast to stay in sync.

For the distributed query and replicated part fetch connections, this is actually fairly straightforward. If you set up the SSL config and enable ports correctly, the inter-server cluster security for replicated part fetches and distributed queries will automatically use FIPS-compliant crypto.

The more interesting challenge is the centralized coordination part.

[30:09] — ZooKeeper vs. ClickHouse Keeper for FIPS

Robert: A big question that comes up, and one we asked ourselves in implementing this, is whether to use ZooKeeper or ClickHouse Keeper, which is a built-in replacement for ZooKeeper.

You can actually use both. As part of this project we did work on ZooKeeper to enable it to use the FIPS-compliant Bouncy Castle libraries. It works, passes our tests, and works in the Cisco environments. But there are downsides.

The first is that it’s hard to maintain. For internal reasons, ZooKeeper has used Bouncy Castle libraries to implement many of its security tests, and the library version it uses is incompatible with the FIPS-approved Bouncy Castle libraries. This makes it very difficult to run tests. If you try to run the ZooKeeper test suite with the FIPS-compatible libraries, the build kind of breaks. So if you have to change something in ZooKeeper, it’s difficult to verify that you got it right.

The second concern is more general: for ClickHouse, ClickHouse Keeper is the future. For new projects there are still some cases where ZooKeeper is more reliable, but Keeper is improving quickly, and we want to be on a path that’s maintainable going forward.

[32:45] — FIPS-Compatible ClickHouse Keeper Is on the Way

Robert: So what we’ve done is take the extra step of ensuring ClickHouse Keeper can be fully FIPS-compliant. You can compile it with the FIPS-verified BoringSSL. The same build that generates the ClickHouse server also generates Keeper, which can run inside the server. That part was pretty straightforward.

But we also had to update the NuRaft library, which is the library that implements the Raft protocol to keep Keeper nodes coordinated. The biggest change was having it use the same SSL context used for configuring crypto on the ClickHouse server. We made those changes, and we’re now testing exhaustively, because if you don’t get this perfect it doesn’t work and it doesn’t tell you much about why not.

FIPS-compatible Keeper is coming as part of Altinity Stable 23.3, the next LTS release. That release includes a FIPS version, and you will now be able to run Keeper in a FIPS-compatible way.

[34:15] — Certificate Chains in ClickHouse TLS Configuration

Robert: There are a couple of other short topics that come up with FedRAMP. One is certificates, and the other is thinking about broader ClickHouse hardening.

One issue we ran into in the Cisco environment is that they don’t use a simple root certificate plus server certificate. They have a certificate chain with intermediate certificates. This is fairly common. In ClickHouse, certificate chain behavior is not well tested, which is not to say it doesn’t work, but it takes effort to set it up correctly.

One common way to handle this: if you have a root certificate with intermediate certificates, you can glom them together in a single CRT file, which we call root_chain.crt. You keep your server.crt, which is your ClickHouse server certificate, in a separate file alongside the private key. In your ClickHouse configuration, the caConfig tag, which tells it where to find your root certificate and any certificates in the chain, points to that combined file at the full path.

That works in many cases. However, there are cases where clients don’t have access to the other certificates in the chain, either because they don’t have them loaded or because they’re using a different code path. In that case, the opposite approach works: keep the root certificate as root.crt, and glom the intermediate certificates together with your server certificate into server.crt. That combined file then gets downloaded to the client and the client can verify the chain properly.
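
Both layouts come down to concatenating PEM files in a deliberate order. Here is a minimal sketch, with hypothetical input file names (intermediate.crt, leaf.crt); only root_chain.crt, root.crt, and server.crt are named in the talk.

```python
from pathlib import Path
from typing import Iterable

def bundle_pem(sources: Iterable[str], target: str) -> None:
    """Concatenate PEM certificate files into target, in the order given."""
    parts = []
    for src in sources:
        text = Path(src).read_text().strip()
        if "-----BEGIN CERTIFICATE-----" not in text:
            raise ValueError(f"{src} does not look like a PEM certificate")
        parts.append(text)
    Path(target).write_text("\n".join(parts) + "\n")

# Option 1: the CA bundle holds root plus intermediates; server.crt is the leaf only.
#   bundle_pem(["root.crt", "intermediate.crt"], "root_chain.crt")
# Option 2: root.crt stays alone; server.crt holds the leaf followed by intermediates.
#   bundle_pem(["leaf.crt", "intermediate.crt"], "server.crt")
```

Order matters in the second layout: the server's own certificate goes first, followed by the intermediates that sign it.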

These are examples of the ins and outs you’ll encounter with certificate configuration. For more detailed guidance on locking down ClickHouse networking and certificate chains, the Altinity blog covers these topics in depth.

[37:04] — Broader ClickHouse Hardening

Robert: Once you have all of this set up, you’ve basically covered the crypto part of protecting ClickHouse network connectivity. You’ve got good certificates, your connections are encrypted, and you’ve done away with unnecessary open ports. But there are a number of other things you need to cover to operate safely in a FedRAMP environment and meet compliance. These include user hardening, access control, storage encryption, and more. For that, I’ll turn it back to Pauline to talk about some of the automation she’s put together to set up an operational system.

[37:55] — Cisco’s Deployment and Configuration

Pauline: Cisco uses Terraform to deploy the ClickHouse cluster on EC2 instances. These instances run Ubuntu 20.04 and are FIPS-enabled and hardened. We use three different Ansible playbooks for deployment: first we deploy and configure ZooKeeper and bring it up, then we configure ClickHouse and bring it up, and finally we configure and bring up CHProxy.

Here’s a small example of our Ansible code from the pre-configuration playbook. We prefetch the GPG key from the GitHub repo and store it in a file so that at deploy time we can install the key to the keychain without having to reach outside the boundary.

[39:09] — Ansible: Repo and Package Installation

Pauline: We then edit the repository location into our repository list and install the packages. Those three steps, the key fetch, the repo add, and the package install, are the Ansible equivalent of the command-line procedure shown in the Altinity installation documentation.

[39:35] — Addressing Security in the Deployment

Pauline: To address the security concerns Robert raised, we did quite a few things. We added filters to redact the ClickHouse password when running Ansible from Jenkins, because the password can appear in the clear during the configuration step when we template the configuration file and insert the password.
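
A redaction filter of this kind can be sketched in a few lines of Python. This is not Cisco's filter, which the talk does not show. The patterns below are illustrative guesses at how a password might appear in templated configuration or command output.

```python
import re

# Illustrative patterns: an XML <password> element, and key=value or key: value forms.
_PATTERNS = [
    (re.compile(r"(<password>)(.*?)(</password>)", re.IGNORECASE), r"\1********\3"),
    (re.compile(r"(password\s*[=:]\s*)(\S+)", re.IGNORECASE), r"\1********"),
]

def redact(line: str) -> str:
    """Mask anything that looks like a password value in a log line."""
    for pattern, replacement in _PATTERNS:
        line = pattern.sub(replacement, line)
    return line
```

Applied to each line of build output before it is written to the Jenkins console, a filter like this keeps the secret out of logs while leaving the rest of the line readable for debugging.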

We close all non-FIPS ports but still allow port 9000 locally so the Datadog agent can access metrics and send them to the Datadog server. We only send error logs to Datadog for centralized logging, because regular query logs may contain customer-sensitive information. And we encrypt our EBS volumes so we have encryption at rest.

[41:03] — Conclusions: Four Key Lessons Learned

Robert: We’re at the conclusion. We’re still learning on this journey, but there are some things that are very clear so far.

First: make everything FIPS-compliant from the start. This was something where we’re very grateful for guidance from Cisco, who has done this kind of thing many times. Getting ClickHouse to be one of the cars on this train was a big part of making progress.

Second: ZooKeeper is really not a long-term solution for FIPS-compliant ClickHouse. The reason is maintenance. ZooKeeper works great, and about half or more of our customers still use it because they’ve been running it for a long time. But it’s not a long-term solution, and this is yet another good reason to switch to ClickHouse Keeper. We can maintain it and it’s more consistent with the rest of ClickHouse in terms of setup.

Third: having automation in place, as Pauline already had with Ansible, made a huge difference. You can modify your playbooks and deploy quickly, efficiently, and consistently. That kind of repeatability is essential in compliance work.

Fourth: the QA people are gods on this project. You have to test everything. It’s not just about preventing attackers from finding hidden doorways. The more practical issue is that crypto setup is complex and delicate. As one of our developers put it, there are many more ways to configure it incorrectly than correctly. The connections will simply fail, often without a useful error message, or the application just won’t work. The best way we found to ensure we had the configuration right was to read it out of a working regression test. Once we had it in a test we could document what the test did, and those documents became our configuration guidelines. Documentation and configuration guidelines are absolutely essential to success.

Going forward, you’ll see presentations like this one, but we’re also adding considerably to our documentation headed toward providing baselines for operation in FedRAMP and CIS environments. CIS is another big compliance standard popular in Europe.

[44:51] — Q&A: Vitali Aksionov on Testing and Configuration

Robert: Vitali, you and your team were deeply involved in getting this to work. Do you have any thoughts to share?

Vitali: I think you’ve made the main points, Robert. Anything to do with security is tricky. Setting up SSL certificates and configuration is not straightforward, so even debugging while we were developing the tests was very productive, and an adventure, because many things didn’t work. There are so many ways to mess up, which is why having known working configurations and configuration guidelines is critical to get this right.

Robert: We use a testing scheme called TestFlows, developed by Vitali. It steps through tests in a very detailed, step-by-step manner. At each step it says: I’m doing this, and I expect that. You get very long logs, but you know the exact steps down to the smallest detail of what you’re changing and what you expect. You can then reverse-engineer those logs into your configuration guides. That’s been enormously helpful on this project.
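
The "I'm doing this, and I expect that" style can be illustrated with a standard-library sketch. This is not the TestFlows API. It is a hypothetical stand-in showing how recording each action and its expectation yields a log that reads like a configuration procedure.

```python
from contextlib import contextmanager
from typing import Iterator, List

LOG: List[str] = []

@contextmanager
def step(action: str, expect: str) -> Iterator[None]:
    """Record what is being done and what is expected, then the outcome."""
    LOG.append(f"STEP   {action}")
    LOG.append(f"EXPECT {expect}")
    try:
        yield
    except AssertionError:
        LOG.append("FAIL")
        raise
    LOG.append("OK")

def run_example() -> List[str]:
    """Run a toy two-step test and return the log it produced."""
    LOG.clear()
    config = {}
    with step("set https_port to 8443", "config holds the secure port"):
        config["https_port"] = 8443
        assert config["https_port"] == 8443
    with step("remove http_port", "plain HTTP listener is absent"):
        config.pop("http_port", None)
        assert "http_port" not in config
    return list(LOG)
```

Reading the STEP and EXPECT lines of a passing run top to bottom gives the ordered list of changes and checks that a configuration guide needs.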

Robert: If there are no further questions, I think we can close. Thanks again, Pauline. This has been a pleasure presenting with you and summarizing what we’ve learned so far. When we get further along and have even more to share, we should do a follow-up, perhaps at a conference.

Pauline: Thank you, Robert and Vitali, for all your help. Altinity helped make our FedRAMP journey smoother and faster, and filled in a few potholes.

Robert: Thank you very much. Thanks everybody for attending. Have a great day.

FAQ

What is FedRAMP and why does it require FIPS 140-2 compliance?

FedRAMP is a U.S. government security compliance program that standardizes the security assessment and authorization of cloud services used by federal agencies. FedRAMP Moderate and High environments require FIPS 140-2 compliance, which in turn mandates certified cryptographic libraries for all cryptographic operations, approved TLS versions and cipher suites, and protection against binary tampering. Any software that handles data in a FedRAMP environment must be compiled with a FIPS-certified cryptographic module, which for ClickHouse means building against the specific version of BoringSSL that has passed FIPS 140-2 Level 1 certification.

What are FIPS-compatible Altinity Stable Builds for ClickHouse and how are they different from standard builds?

FIPS-compatible Altinity Stable Builds are identical to regular Altinity Stable Builds except for three changes: they are compiled against the FIPS-certified BoringSSL source code rather than the standard BoringSSL version included in upstream ClickHouse, they include a Known Answer Test (KAT) that runs at server startup to verify the cryptographic libraries have not been tampered with, and they include a modified version of ClickHouse Keeper that uses the same FIPS-compliant SSL context as the rest of the server. If the KAT fails, the server stops immediately rather than running with potentially compromised crypto.
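
To make the idea of a startup Known Answer Test concrete, here is a minimal Python sketch. It is not the BoringSSL self-test or the code in the Altinity builds; it only illustrates the principle using the standard SHA-256 test vector for the input "abc". The `start_server` function is a hypothetical startup hook.

```python
import hashlib
import sys

# Known input and its published SHA-256 digest (the standard "abc" test vector).
KAT_INPUT = b"abc"
KAT_EXPECTED = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"


def known_answer_test(hash_fn=hashlib.sha256) -> bool:
    """Return True when the hash implementation reproduces the known digest."""
    return hash_fn(KAT_INPUT).hexdigest() == KAT_EXPECTED


def start_server() -> None:
    """Hypothetical startup hook: refuse to run if the self-test fails."""
    if not known_answer_test():
        sys.exit("KAT failed: refusing to start")
    print("KAT passed: starting server")
```

The design point is that the check runs before anything else and the only failure mode is to stop, which mirrors the behavior described above.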

How do you configure ClickHouse for FIPS-compatible operation?

Configuration consists of three steps. First, add a fips.xml configuration file that removes all non-FIPS listener ports, sets the OpenSSL cipher list to FIPS 140-2 approved ciphers only, and restricts the TLS protocol version to TLS 1.2. Second, verify the FIPS startup log message confirming the KAT passed with result code 1. Third, check that only the FIPS-approved ports remain open: port 9440 for native TCP, port 8443 for HTTPS, port 9010 for inter-server replication, and the relevant ClickHouse Keeper ports. All MySQL compatibility, PostgreSQL compatibility, and other non-FIPS ports must be removed.
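
The third step, checking the open ports, lends itself to a small script. The sketch below is an illustration, not an Altinity tool. It assumes you already have a list of listening port numbers from the host, and it takes the Keeper ports as a parameter because they depend on the deployment. The function name `audit_ports` and the Keeper port 9281 in the usage note are hypothetical choices for this example.

```python
# Ports named in the answer above. Keeper ports are deployment-specific,
# so none are assumed here; the caller passes them in.
FIPS_PORTS = {
    9440: "native TCP",
    8443: "HTTPS",
    9010: "inter-server replication",
}


def audit_ports(listening, keeper_ports=()):
    """Split listening ports into (allowed, unexpected) sorted lists."""
    allowed = set(FIPS_PORTS) | set(keeper_ports)
    found = set(listening)
    return sorted(found & allowed), sorted(found - allowed)
```

For example, `audit_ports([8443, 9440, 9000, 9004], keeper_ports=[9281])` returns `([8443, 9440], [9000, 9004])`, flagging the default non-TLS native port and the MySQL compatibility port as listeners that still need to be removed.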

Why is ZooKeeper not a good long-term option for FIPS-compliant ClickHouse clusters?

While it is possible to configure ZooKeeper with FIPS-compliant Bouncy Castle libraries, ZooKeeper internally uses an older version of Bouncy Castle for its own test infrastructure that is incompatible with the FIPS-approved version. This makes it effectively impossible to run ZooKeeper’s own test suite with the FIPS libraries, which means any changes to ZooKeeper are very difficult to verify. ClickHouse Keeper, being built into ClickHouse and compiled with the same FIPS-certified BoringSSL, is maintainable and consistent. Altinity has added FIPS-compatible Keeper support starting with the Altinity Stable 23.3 release.

How did Cisco deploy FIPS-compatible ClickHouse in AWS GovCloud?

Cisco used Terraform to provision EC2 instances running FIPS-enabled and hardened Ubuntu 20.04, and three Ansible playbooks to sequentially configure and bring up ZooKeeper, ClickHouse, and CHProxy. All component binaries were built inside the GovCloud boundary using Jenkins, with artifacts stored in JFrog Artifactory. The AWS account itself required a special support request to enable FIPS, the Application Load Balancer required a specific FIPS-enabling tag, the S3 endpoint was set to FIPS via the AWS_USE_FIPS_ENDPOINT environment variable, and EBS volumes were encrypted. Monitoring migrated from Grafana to Datadog, which is approved for use in GovCloud.
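
Because a missing FIPS setting tends to go unnoticed, a preflight check in the deployment automation can help. The following is a minimal sketch, not Cisco's actual tooling. It covers only the AWS_USE_FIPS_ENDPOINT variable mentioned above and assumes the value "true" is what enables it. The function names are hypothetical.

```python
import os


def fips_endpoint_enabled(env=None) -> bool:
    """True when AWS_USE_FIPS_ENDPOINT is set to 'true' (case-insensitive)."""
    env = os.environ if env is None else env
    return env.get("AWS_USE_FIPS_ENDPOINT", "").strip().lower() == "true"


def preflight(env=None) -> list:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if not fips_endpoint_enabled(env):
        problems.append("AWS_USE_FIPS_ENDPOINT is not set to true")
    return problems
```

A playbook or CI job could call `preflight()` before any step that talks to S3 and abort if the returned list is not empty.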

What is the most important lesson learned from this FedRAMP journey?

Testing is essential and cannot be shortcut. Crypto configuration has far more ways to fail than to succeed, and failures are often silent: applications simply do not connect, with little or no useful error output. The most effective approach Altinity used was to develop a comprehensive regression test suite using TestFlows, where each test explicitly declares what configuration it is applying and what behavior it expects. Once a working configuration was captured in a test, it could be directly translated into documentation and configuration guidelines. Without this test-driven approach, verifying correct FIPS operation across distributed ClickHouse clusters would have been extremely difficult.
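
The "declare what you apply and what you expect" style can be sketched in plain Python. This is not the TestFlows API; it is a stand-in that uses only the standard library to show how a step log falls out of the test structure. The `negotiate` argument is a hypothetical probe that returns True when a connection with the given TLS version succeeds.

```python
from contextlib import contextmanager


@contextmanager
def step(log, kind, description):
    """Record a Given/When/Then step and whether it passed."""
    try:
        yield
    except AssertionError:
        log.append(f"{kind} {description}: FAIL")
        raise
    log.append(f"{kind} {description}: OK")


def check_only_tls12_accepted(negotiate, log):
    """Run the scenario, appending one line per step to `log`."""
    with step(log, "Given", "server configured with TLS 1.2 only"):
        pass  # the configuration under test would be applied here
    with step(log, "When", "client connects with TLS 1.2"):
        assert negotiate("TLSv1.2")
    with step(log, "Then", "client using TLS 1.1 is rejected"):
        assert not negotiate("TLSv1.1")
    return log
```

Each line in the resulting log names one action and its outcome, which is the property that lets a passing log be turned into a configuration guide.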


© 2023 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.
