Tales of Restaurants and Food Trucks: Merging Real-time Analytics with Data Lakes

If you want to impress me with your car,
it better be a food truck.
– Reddit User
Not so long ago, if you wanted to eat great food prepared by talented chefs, you went to a restaurant. But what if you are at the park or stuck at work? What if your friends all want something different? That’s where food trucks come in. Food trucks with great tacos, dumplings, wood-fired pizzas, burgers, barbecue, whatever. You no longer need to travel to a fixed place that serves one thing. Food trucks bring fast, fresh, delicious food to you and your friends, in any combination you want.
The distinction between centralized restaurants and mobile food trucks that go anywhere and in any combination got us thinking. We’ve been working on applying real-time databases to data lakes. Food trucks provide an illuminating metaphor for integration that is far greater than the sum of its parts.
Analytic databases are centralized. That’s a problem.
Besides protecting data, the most important characteristic of databases is speed. The latest great leap forward in the speed department is databases that offer real-time response on datasets fed by high-capacity event streams. Even better, many of them are open source and freely usable for any business purpose. ClickHouse®, Druid®, and DuckDB™ have enabled a flood of creative applications to optimize website traffic, protect against security attacks, and measure the performance of cloud systems.
The key to their high performance is a “closed model” that delivers query and storage in a tightly coupled design. The reward is cleaner data and 100x boosts in performance over text data formats like JSON or CSV. Running the database as a cloud service that serves many users at once also contributes to efficiency and ease of use. The architecture is like a restaurant. Data and users come to the database, not the other way around.
This centralized design is hitting hard limits. Data volumes are approaching 100 petabytes for observability data sets. Machine learning and data science applications need the same source records. Loading copies into cloud services like Snowflake® adds cost, complexity, and lock-in. Companies increasingly want to hold a single copy of data in their own cloud accounts and make it available to all applications at once. Centralized database services look like a roadblock.
Data lakes solve the problem of globally accessible data
Feeding applications from a single copy of data has been a dream for decades. In 2011 James Dixon coined the term “data lake” to describe this vision. But early efforts like Hadoop did not live up to expectations. Many users ran into roadblocks around data cleaning, schema management, and operating Hadoop itself. By 2014 authorities like Mike Stonebraker were criticizing data lakes as data swamps and promoting the advantages of traditional data warehouses.
So why are data lakes in favor again? Three advances changed the balance. Each one pulls storage management features previously embedded in databases into public interfaces that applications can use directly. Together they enable shared storage of data for analytics.
S3 object storage
Over a decade ago analytic database design began to converge on shared storage with decoupled compute to enable scalable query processing. Shared storage access was embedded in the database engine, which meant applications needed to go through the data warehouse to use it. Amazon S3 provided an alternative: a globally accessible file service that has evolved to be cheap, scalable, and fast as well. Efficient client bindings enable any application to read and write object storage, making it a favorite for AI and data science. Every public cloud now offers capable object storage with APIs that are similar or even identical to S3.
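To make this concrete, here is a minimal sketch, assuming a ClickHouse client and a hypothetical bucket, of the kind of direct access object storage enables: an ad-hoc query over raw files in S3 with no loading step. The bucket, path, and column names are illustrative, and credentials would normally come from IAM roles or named collections rather than the query text.
```sql
-- Hedged sketch: ad-hoc query over objects in S3 using ClickHouse's s3() table
-- function. Bucket, path, and columns are hypothetical; the format here is a
-- compressed text format, read directly from object storage.
SELECT
    status,
    count() AS requests
FROM s3(
    'https://example-bucket.s3.amazonaws.com/weblogs/2025/*.json.gz',
    'JSONEachRow'
)
GROUP BY status
ORDER BY requests DESC;
```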
Parquet file format
Analytic databases similarly converged on columnar file formats with compression, indexing, and built-in metadata. Like storage, most database file formats are accessible only from inside the database. ML, data science, and other applications instead turned to open file formats like Parquet and ORC, which have similar capabilities but are designed for direct access by applications. Over time, Parquet has become the dominant file format. It is supported by a wide range of libraries and tools, such as Apache Arrow, making it possible to read and write it from any client language.
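As a hedged sketch of what built-in metadata buys you, the queries below let the engine infer a Parquet file’s schema directly and then read only the columns a query touches. The S3 paths and column names are hypothetical; any Arrow-capable client could do the same from Python, Java, or Go.
```sql
-- Hedged sketch: Parquet files carry their own schema, so a query engine can
-- infer structure straight from the files. Paths and columns are hypothetical.
DESCRIBE TABLE s3(
    'https://example-bucket.s3.amazonaws.com/events/2025/01/*.parquet',
    'Parquet'
);

-- Columnar layout means a query that touches two columns reads only those
-- column chunks instead of whole rows.
SELECT toStartOfHour(event_time) AS hour, count() AS events
FROM s3('https://example-bucket.s3.amazonaws.com/events/2025/01/*.parquet', 'Parquet')
GROUP BY hour
ORDER BY hour;
```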
Open table formats and catalogs
Finally, SQL databases use tables to group files together and track changes in an organized manner. Open table formats like Iceberg, Delta Lake, and Hudi evolved to do exactly this in data lakes. They provide metadata that enables multiple applications to access files safely, make changes to schema, locate indexes, and so on. SQL databases further organize tables into “databases” or “schemas.” Catalogs perform the same function for open table formats: they are centralized repositories that list available tables and the location of their metadata. Of the available formats, Iceberg has the most user traction, followed by Delta Lake. The fact that Databricks brought the Iceberg designers in-house makes it increasingly likely that the Iceberg and Delta Lake formats will converge over time. They may even be hidden inside catalogs like Unity.
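Below is a minimal sketch, assuming ClickHouse’s iceberg() table function and a hypothetical warehouse path, of querying an Iceberg table in place: the engine reads the table’s metadata to locate the current snapshot’s data files, which is what lets many applications share one table safely. Catalog integration (for example, a REST catalog) varies by engine and version.
```sql
-- Hedged sketch: query an Iceberg table directly where it lives. The engine
-- resolves the table's Iceberg metadata to find the current data files.
-- Warehouse path and column names are hypothetical.
SELECT user_id, count() AS sessions
FROM iceberg('https://example-bucket.s3.amazonaws.com/warehouse/web.sessions/')
GROUP BY user_id
ORDER BY sessions DESC
LIMIT 10;
```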
Data lakes need real-time query engines that are less like restaurants and more like food trucks
The advances in data lakes make it possible to store and query shared data. They don’t make it fast. Existing real-time data warehouses are of course building features to fill the gap. But in many cases there’s a fundamental mismatch between their restaurant design and the actual needs of data lake users. To go fast, users need query engines that are like food trucks. They should operate anywhere and provide very flexible menus of features. Let’s explore why this is so.
Bringing query to the data
Mobility is an important property for data lake query engines, because there’s abundant evidence that data lakes will be widely distributed. Distribution happens in two ways.
Geographic distribution – Regulations like GDPR prevent data from leaving specific regions, to the extent that this is a standard term in many EU customer contracts. In addition, cloud egress can cost up to $100 per terabyte. Even with discounts, moving a petabyte of source data over the public Internet can run up to $50K.
Distribution across accounts – Most users simply don’t want to load large amounts of data into managed services for expense and lock-in reasons. In our experience, a large fraction of such users choose to host in their own cloud accounts or even local data centers.
The obvious conclusion: query engines need to go where the data lives. A lot like food trucks delivering food to you at work.
Creatively mixing real-time and data lake capabilities
Real-time databases have a set of core functions that include ingesting data, compaction (aka merging), transformation, and query. There are also native data structures like indexes, materialized views, and tiered storage, not to mention highly optimized query pipelines that use them. We can build faster and more capable systems if we mix and match these features with data lakes. It’s like combining food from different food trucks and adding your own as well to make a tasty lunch. Let’s look at a few examples.
Lightweight data marts – Imagine a system where log messages from software services drop into Parquet files on Iceberg. Real-time databases offer extremely space-efficient ways to pre-compute aggregates using algorithms like HyperLogLog to estimate distinct occurrences. Using native materialized views plus direct query against the data lake, users can search for outliers and then query source records in Parquet for fault diagnosis. Such data marts store a tiny fraction of the source data, which makes them easy to manage. They don’t touch source data, which makes it possible to deploy them quickly without disturbing other consumers.
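Here is a rough sketch of such a data mart, assuming ClickHouse and hypothetical table, path, and column names: HyperLogLog sketches are rolled up from the lake into a tiny aggregate table for outlier hunting, while detailed fault diagnosis drops back to the Parquet source rows in the lake.
```sql
-- Hedged sketch of a lightweight data mart: pre-compute HyperLogLog sketches
-- from log data in the lake, keep only the aggregates locally.
CREATE TABLE error_hll
(
    hour       DateTime,
    service    LowCardinality(String),
    uniq_hosts AggregateFunction(uniq, String)
)
ENGINE = AggregatingMergeTree
ORDER BY (service, hour);

-- Periodically (or via a refreshable materialized view) roll up new lake data.
INSERT INTO error_hll
SELECT toStartOfHour(event_time), service, uniqState(host)
FROM iceberg('https://example-bucket.s3.amazonaws.com/warehouse/ops.logs/')
WHERE level = 'ERROR' AND event_time >= now() - INTERVAL 1 HOUR
GROUP BY toStartOfHour(event_time), service;

-- Tiny, fast query for spotting outliers; drill into the Parquet source rows
-- in the lake only when something looks wrong.
SELECT hour, service, uniqMerge(uniq_hosts) AS hosts_with_errors
FROM error_hll
GROUP BY hour, service
ORDER BY hosts_with_errors DESC
LIMIT 20;
```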
Real-time response on new events – Real-time databases offer optimized ingest and tiered storage that archives data automatically to cheaper devices. Imagine a system that holds security event data in native format inside the real-time store and archives it to the data lake after a short period of time. Users see both hot and cold data from a single table in real-time thanks to tiered storage. That enables fast reactions to incoming events while also generating a clean, well-organized data exhaust in Parquet that enables ML and data science applications.
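A minimal sketch of the hot/cold pattern, assuming ClickHouse table TTLs and a storage policy named 'hot_cold' defined in server configuration: recent events stay on fast local disks, older parts move automatically to an S3-backed volume, and queries see a single table. Exporting a clean Parquet exhaust to the data lake would be a separate scheduled step, along the lines of the transformation example below.
```sql
-- Hedged sketch of hot/cold tiering: recent security events stay on fast local
-- disks, older parts move to an S3-backed volume, and queries span both tiers
-- transparently. The 'hot_cold' policy, retention window, and columns are
-- hypothetical.
CREATE TABLE security_events
(
    event_time DateTime,
    source_ip  IPv4,
    action     LowCardinality(String),
    detail     String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (action, event_time)
TTL event_time + INTERVAL 7 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'hot_cold';
```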
Efficient batch transformation – Real-time data warehouses can function as generalized transformation engines. We envision services that traverse the data lake to project, downsample, recompress, or sort data into more efficient forms. These are highly optimized capabilities with proven usefulness. They can help data lakes become more efficient for all consumers, not just real-time applications.
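The following is a hedged sketch of one such transformation pass, assuming ClickHouse’s s3() table function and hypothetical paths, columns, and compression choice: raw Parquet is read from the lake, downsampled to hourly grain, sorted, and written back as compact, recompressed Parquet that every consumer benefits from.
```sql
-- Hedged sketch of a batch transformation pass: read raw Parquet from the lake,
-- downsample to hourly grain, sort, and write back smaller, better-organized
-- Parquet. Paths, columns, and the ZSTD choice are illustrative assumptions.
INSERT INTO FUNCTION s3(
    'https://example-bucket.s3.amazonaws.com/curated/metrics_hourly/data.parquet',
    'Parquet'
)
SELECT
    toStartOfHour(ts) AS hour,
    metric,
    avg(value)        AS avg_value,
    max(value)        AS max_value
FROM s3('https://example-bucket.s3.amazonaws.com/raw/metrics/*.parquet', 'Parquet')
GROUP BY hour, metric
ORDER BY metric, hour
SETTINGS output_format_parquet_compression_method = 'zstd';
```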
Following the food truck metaphor leads us to look beyond single applications and to seek synergies that raise performance for all users. Databases have a rich repertoire of features that are potentially available for this effort.
Building the food truck model to enable real-time data lakes
It’s clear now that real-time databases have the potential to make data lakes faster and more efficient. Let’s focus on three capabilities necessary to turn today’s open source building blocks into food trucks that enable real-time response.
First class support for the entire data lake stack
Making Parquet on object storage fast is permission to play. So is support for Iceberg and other table formats. The same is true for catalogs. As we’ve shown in the Altinity Blog, basic improvements like adding a simple cache for Parquet metadata deliver order-of-magnitude speed-ups in queries. Adding secondary indexes and other data structures similarly speeds up access. Using catalogs to detect new data, coordinate transactions, and handle schema changes is another area of work that promises great benefits.
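As a small illustration of metadata-driven pruning, the query below asks the engine to use Parquet row-group min/max statistics to skip data that cannot match a selective filter. The setting name reflects recent ClickHouse releases and should be treated as an assumption, as should the paths and columns.
```sql
-- Hedged sketch: let the engine prune Parquet row groups using their min/max
-- statistics before reading any column data. Setting name is an assumption
-- based on recent ClickHouse versions; paths and columns are hypothetical.
SELECT count()
FROM s3('https://example-bucket.s3.amazonaws.com/events/2025/*.parquet', 'Parquet')
WHERE event_time >= '2025-01-15 00:00:00'
  AND event_time <  '2025-01-16 00:00:00'
SETTINGS input_format_parquet_filter_push_down = 1;
```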
There are obvious challenges. That said, management of data on S3 is well understood, Parquet is a capable file format, and specifications like Iceberg make it possible to implement SQL table models, especially as efforts like Unity Catalog develop further. What’s required is hard and plodding work, which will take time.
The bigger problem is breaking out of the comfortable assumptions of monolithic databases. Case in point: The Parquet “standard” is not just the Apache spec but what client libraries actually write. The data might be badly sorted. It might contain different columns and data types across files in combinations that are difficult to handle. It might contain data structures that trigger bugs. First class data lake integration will require innovation in database code and testing to guarantee stable, correct behavior at reasonable cost. Products that stick with cumbersome testing mechanisms will founder on this reef.
Scalable, pop-up services for core real-time analytic functions
Data warehouses query, ingest, transform, and compact data. In the best database implementations, such as ClickHouse and DuckDB, there’s a single binary that handles all four functions, which contributes enormously to ease of use. The food truck model encourages real-time databases to keep the single binary but split functions into separate roles that can be located anywhere, scaled independently, and mixed with each other to taste. This goes beyond just separating compute and storage. Splitting compute functions keeps them from contending with each other in ways that make management and scaling hard.
Real-time services must also imitate food trucks in another critical way: convenience. Food trucks drive up, park, and deliver food. That means keeping things simple. As far as possible, services should operate on a serverless model. Developers should be able to stand up real-time query on data lakes without complex distributed state management. The vision of Iceberg, Delta Lake and catalogs built on them was to move these capabilities into a central location. We should use this wherever possible to simplify applications.
Automated delivery of real-time query capabilities to any location
We’ve emphasized that data lakes can be located in any region and in any account, which means the query engine must be able to run anywhere. To make this work at scale we need to automate not just the setup of real-time databases themselves but also the pools of compute, storage, and networking resources on which they operate.
This is where advances in cloud-native computing come in. Kubernetes provides a platform for container-based applications that can run in practically any location. The last mile for making the food truck model work is to integrate with popular deployment automation tools like Terraform, Helm, and Argo CD. These tools work together to set up Kubernetes clusters with auto-scaling and secure networking, configure query engines, and connect them to data lakes. It’s what makes mobile, serverless execution actually feasible. We need ambitious design ceilings, too. Existing data warehouse deployments already run to thousands of query engines in extreme cases. This will be the case with data lakes as well.
That last point is particularly true as multi-tenant SaaS services adopt data lakes to meet the needs of their customers. We fully get the irony that distributed data lakes can end up powering new centralized monoliths. Flexible real-time query is exactly the open source building block that developers need to make such design decisions at higher levels in the stack.
Is your database a restaurant or a food truck?
Today’s analytic databases deliver outstanding performance across a wide range of applications. They have been successful thanks to tight integration of storage and compute in monolithic services. The rise of large pools of data in data lakes is challenging that model.
It is now time to merge cheap, shared data lake storage with the capabilities of real-time databases. To do this, real-time databases need to operate less like restaurants and more like food trucks with mobile, flexible capabilities. In the best cases, open source projects will offer both models. They will also work synergistically to make *all* data lake consumers more efficient, not just applications that traditionally depend on databases.
There is a clear opportunity to build a new generation of real-time analytic databases that are faster, cheaper, and more scalable. We hope this article sparks ideas from others. What’s your vision of real-time data lakes? We look forward to the conversation!
Acknowledgements
Thanks to Alexander Zaitsev, Alkin Tezuysal, Arnaud Adant, Arthur Passos, Doug Tidwell, Josh Lee, Minh Duc Nguyen, Ondrej Smola, and Pranav Aurora for review of this article. Your suggestions and corrections are much appreciated.
ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc.