Delivering Insight on GraphQL APIs with ClickHouse at Stellate (GraphCDN)

Scaling public API responses is a major headache for SaaS application developers. Stellate (formerly GraphCDN) has an innovative solution for GraphQL, a popular species of web service API. They offer a flexible content delivery network (CDN) that can deliver API responses to clients up to 200x faster while reducing load on backend services, all without code changes. This is the most productive kind of scaling for developers.

CDN management, however powerful, would be incomplete without understanding what the CDN is actually doing. Stellate analytics are the secret sauce that enables API developers to analyze the effect of policy changes and iterate rapidly to achieve optimal performance. They are an example of embedded analytics, a key feature in cutting-edge SaaS applications.

This article digs into Stellate analytics and the journey the Stellate team followed to implement them using ClickHouse and Altinity.Cloud. We believe it provides useful lessons on how to use ClickHouse effectively in SaaS applications as well as the benefits embedded analytics offer to SaaS users.

Stellate Analytics, Explained

Let’s start with what users see when they go into the analytics tab on Stellate. Here’s a picture of the main screen, which shows aggregated API responses and cache hit rates as a time series.

Actual performance numbers may vary significantly between different API calls. Stellate therefore allows developers to drill down into detailed API responses as well. This encourages a Pareto-based approach: developers fix hotspots in order of their impact on performance, so that the first few fixes deliver the greatest overall improvement.
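The Pareto idea can be sketched in a few lines of Python. The operation names, request counts, and latencies below are invented for illustration, and treating "impact" as request count times mean latency is an assumption, not Stellate's actual scoring.

```python
from dataclasses import dataclass


@dataclass
class OperationStats:
    name: str               # hypothetical GraphQL operation name
    requests: int           # requests seen in the reporting window
    mean_latency_ms: float  # mean response time for the operation

    @property
    def total_time_ms(self) -> float:
        # Assumed measure of impact: total time spent on this operation.
        return self.requests * self.mean_latency_ms


def hotspots(stats, share=0.8):
    """Return the fewest operations that cover `share` of total time."""
    ranked = sorted(stats, key=lambda s: s.total_time_ms, reverse=True)
    total = sum(s.total_time_ms for s in ranked)
    picked, covered = [], 0.0
    for s in ranked:
        if covered >= share * total:
            break
        picked.append(s)
        covered += s.total_time_ms
    return picked


sample = [
    OperationStats("GetProduct", 50_000, 120.0),
    OperationStats("SearchCatalog", 8_000, 900.0),
    OperationStats("GetCart", 20_000, 40.0),
    OperationStats("ListReviews", 2_000, 150.0),
]

for op in hotspots(sample):
    print(op.name, op.total_time_ms)
```

With this made-up data, two of the four operations account for more than 80% of the total time, so those are the ones worth examining first.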

Stellate measurements help with more than just CDN operation. Developers can use the measurements to identify calls that need further performance work as well as check for performance regressions in new releases. Stellate analytics also track different response types including errors. Developers can drill down to individual responses to identify and diagnose problems. 

In short, analytics offer a powerful tool for developers to introspect GraphQL API behavior and improve user experience. The resulting insights add significantly to the overall value of Stellate.

The Journey to ClickHouse

Stellate uses the Fastly CDN to distribute query results from source API services to clients spread across the globe. Fastly collects metrics automatically and stores them in an analytic database for display to developers, who can use them to guide optimization of their APIs.  

The following picture illustrates the Stellate architecture related to analytic collection, display, and use.

The Stellate team arrived at ClickHouse in a couple of steps. The initial implementation used Amazon Timestream, a managed database for time series data. Timestream is easy to set up, but the Stellate team quickly discovered that it was not fast enough to handle analytics on CDNs, which generate enormous numbers of events per day.

The team began to search for an alternative solution that would scale more efficiently but still enable a full product launch by June 2021. They quickly found the famous Cloudflare blog article on ClickHouse by Alex Bocharev. It provided compelling metrics on ClickHouse performance as well as key implementation details such as data ingest via Kafka. 

After confirming the Cloudflare account with other ClickHouse users, the Stellate team began migration to ClickHouse in earnest. They chose Altinity.Cloud, which offers managed ClickHouse on Amazon Web Services (AWS). Altinity offers baked-in enterprise support for all accounts. Support engineers have extensive experience in operating ClickHouse, and are available via shared Slack channels. 

The entire migration to hosted ClickHouse took almost exactly two months. Stellate started its first ClickHouse cluster in Altinity.Cloud on 16 April 2021, and the service launched publicly on 17 June 2021. During that interval the Stellate and Altinity teams worked closely together. To illustrate the depth of teamwork, one Slack thread related to materialized view implementation generated 89 replies over the course of about a day.

The final deployment architecture for Stellate analytics is shown below.

The implementation architecture takes advantage of several important ClickHouse capabilities. 

  • Kafka Table Engine: ClickHouse tables can read data directly from Kafka topics. This eliminates the need for an extra pipeline component to transfer data from Kafka to ClickHouse.
  • Compression: The main event table applies encodings and compression to reduce its storage size by 90%. In addition, Stellate uses TTL expressions to delete table data automatically after a set period of time, so that storage does not grow without bound.
  • Materialized views: Stellate uses materialized views to precompute aggregates as data arrives. This reduces response time for common queries to milliseconds.
  • Replication: Tables are replicated across two servers to ensure high availability. Doing so also increases read performance by spreading load.
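To make the materialized view idea concrete, here is a conceptual model in plain Python. It is not ClickHouse code and not Stellate's schema: the class, the field names, and the per-minute bucketing are all assumptions chosen for illustration. What it shows is that aggregation work is done once, when a batch is inserted, so later queries read small summaries instead of scanning raw events.

```python
from collections import defaultdict
from datetime import datetime


class MinuteRollup:
    """Hypothetical stand-in for a summary table fed by a materialized view."""

    def __init__(self):
        # (minute, operation) -> [request_count, cache_hits, total_latency_ms]
        self.buckets = defaultdict(lambda: [0, 0, 0.0])

    def insert(self, events):
        # Aggregate each inserted batch into its per-minute buckets.
        for ts, operation, cache_hit, latency_ms in events:
            minute = ts.replace(second=0, microsecond=0)
            bucket = self.buckets[(minute, operation)]
            bucket[0] += 1
            bucket[1] += 1 if cache_hit else 0
            bucket[2] += latency_ms

    def hit_rate(self, operation):
        # Queries touch only the summary buckets, never the raw events.
        requests = hits = 0
        for (_minute, op), (count, cache_hits, _total) in self.buckets.items():
            if op == operation:
                requests += count
                hits += cache_hits
        return hits / requests if requests else 0.0


rollup = MinuteRollup()
rollup.insert([
    (datetime(2021, 6, 17, 12, 0, 5), "GetProduct", True, 14.0),
    (datetime(2021, 6, 17, 12, 0, 40), "GetProduct", False, 950.0),
    (datetime(2021, 6, 17, 12, 1, 2), "GetProduct", True, 16.0),
])
rollup.insert([(datetime(2021, 6, 17, 12, 1, 30), "GetProduct", True, 15.0)])

print(len(rollup.buckets), rollup.hit_rate("GetProduct"))
```

In ClickHouse itself the same effect is declared in SQL rather than coded by hand; the sketch only mirrors the insert-time aggregation that makes millisecond queries possible.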

Altinity.Cloud hides administration from Stellate developers, who were able to focus on application features rather than housekeeping tasks like setting up and caring for ZooKeeper. This has the knock-on effect of focusing ClickHouse support cases on improving the Stellate application, thereby sparing valuable developer time.

Benefits for Users

Because Stellate users are able to analyze which specific GraphQL operations are not cached, they can improve their cache rules and thus overall cache hit rate. Customers see cache hit rates all the way up to 99% for certain use cases. End user performance boosts are correspondingly large. As a best case example, a 3-second uncached response can drop to 15 milliseconds via CDN. That’s a gain of 200x.
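The arithmetic behind these figures is easy to check. The sketch below reproduces the 200x gain and, as an added illustration, estimates the average latency at a given cache hit rate. The blended figure assumes every cached response takes 15 ms and every uncached one takes 3 seconds, which is a simplification and not a number reported by Stellate.

```python
def speedup(uncached_ms, cached_ms):
    """How many times faster a cached response is than an uncached one."""
    return uncached_ms / cached_ms


def expected_latency_ms(hit_rate, cached_ms, uncached_ms):
    """Average latency if a fraction `hit_rate` of requests hit the cache."""
    return hit_rate * cached_ms + (1.0 - hit_rate) * uncached_ms


# Best case from the text: 3 seconds uncached versus 15 ms via the CDN.
print(speedup(3000, 15))

# Illustrative blend at a 99% hit rate (assumed uniform latencies).
print(round(expected_latency_ms(0.99, 15, 3000), 2))
```

Under those assumptions the average at a 99% hit rate comes to about 45 ms, which shows why the remaining cache misses still dominate the mean and why finding uncached operations matters.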

The embedded GraphQL analytics are the key to unlocking the benefits of the service, including infrastructure cost savings of up to 80%, origin traffic reduction of up to 99% as mentioned above, and of course major performance improvements for both cached and uncached requests.

Lessons Learned

Stellate’s use of ClickHouse is a model for any SaaS application that adds value through embedded analytics.  Here are three important takeaways. 

  1. ClickHouse fully lived up to its reputation for speed and economy.
  2. Altinity.Cloud simplified development by abstracting away details of operating ClickHouse. It was simple to spin up new clusters and get started. 
  3. Altinity support was critical to meet Stellate’s deployment deadline.  Support engineers were constantly available on Slack and acted like members of the dev team rather than a separate organization. They helped Stellate engineers quickly work through Kafka integration, replication choices, and schema optimization, to name just a few issues.

At Altinity we’re delighted to see another company validate the power of ClickHouse in benefiting SaaS users. We enjoy working with the Stellate team and look forward to helping them make their offerings even better through high-performance analytics.  If you have questions, please contact Altinity at info@altinity.com or Stellate at stellate.co.  We both look forward to hearing from you. 

Note: Altinity would like to thank Max Stoiber, Tim Suchanek and the Stellate team for their help in preparing this article. Congratulations on your new release!