July 30, 2019
A recent blog post from Gartner caught our attention at Altinity. The title is The Future of Database Management Systems is Cloud! and it makes the not-so-sensational claim that public cloud is now the default platform for managing data. The article is based on research authored by Donald Feinberg, Merv Adrian, and Adam Ronthal, industry veterans with long experience in the DBMS market.
What’s interesting is that the article makes two further claims that deserve very careful scrutiny.
- New DBMS innovation is now only in the cloud or at least cloud first. If you are not in the cloud you’ll miss out.
- Pricing models that avoid capital expense in favor of operational expense are driving the move. In other words, this is a long-term trend driven by basic economics so we can expect it to continue and perhaps even intensify over time.
These claims are potentially misleading. If you are designing systems to manage data, taking them at face value could lead to serious strategic mistakes. You may miss out on competitive technology and also limit the profitability of your business.
We’ll restrict the argument to the publicly accessible summary of Gartner’s research, so that you can read it yourself. Now let’s look at the facts.
Open source software is fundamental to innovation in data management
We were a little hurt that Gartner’s market share ranking does not mention ClickHouse, which is a great data warehouse. But we don’t feel so bad, since many other great open source technologies were also left out. Time series databases like InfluxDB and TimescaleDB are missing. Spark is missing. MySQL and PostgreSQL are missing. This latter omission is notable as both databases are linchpins of Amazon RDS, one of the most successful public cloud data services.
More surprisingly, AI technologies are not called out. Machine learning and deep learning represent the biggest advance in data analysis of the past decade. Beyond close coupling of AI pipelines with databases, training and execution of models are starting to integrate directly into the DBMS itself. Any current enumeration of AI toolkits would need to include Scikit-Learn, TensorFlow, Torch, Keras, and many other open source frameworks. Projects like Apache Arrow show promise for new ways to integrate them with DBMSs without inefficient copying from storage to execution pipelines. This is a space to watch very carefully, especially as much of the innovation is occurring in open source.
Finally, we cannot overlook the emerging role of Kubernetes in data management. It confers much of the high utilization and ease of management that public clouds offer today. Our own experience building the ClickHouse Kubernetes Operator, as well as the experience of our customers, shows that Kubernetes is a viable environment for large-scale analytic applications. Kubernetes runs equally well in cloud and bare metal environments, enabling users to run portable open source projects like ClickHouse easily in both. Kubernetes is also open source.
We do not discount the outstanding innovations of public cloud services like Amazon RDS and Amazon Redshift. Both have been game-changers in lowering the cost of entry and easing administration overhead. Similarly, services like Google BigQuery can operate on an enormous scale by efficiently marshalling resources of the cloud. All of these innovations are worthy of emulation. They also make existing cloud data services great choices for many business problems.
That said, if you are making choices about future systems you must watch open source carefully. Over the past two decades many of the most disruptive data management technologies have emerged from open source projects. An active venture capital industry ensures that the best projects quickly turn into enterprise products. The rapid evolution of analytics, AI, and Kubernetes through collaborative open source projects strongly suggests this trend will continue.
In summary, data management professionals who take their eyes off open source technology risk being badly surprised. It’s key to much of the innovation in the field.
Public cloud economics are a poor fit for many data management use cases
It is indisputable that public cloud services work brilliantly for a lot of businesses. Low up-front costs, vendor-managed system administration, and economies of scale make public clouds a no-brainer for many purposes, not just data management. Cloud revenue growth numbers amply prove the appeal.
But does it follow that public cloud is right for every use case? The answer is emphatically no. What if you have a business with the following characteristics?
- Large quantities of data
- High and constant resource utilization
- High cost sensitivity
This profile describes most large SaaS vendors as well as social media companies like Facebook. If the cloud were universally good for these businesses, we would expect to see most of them running there. Yet the actual record is very mixed. Companies like Lyft and Pinterest are major users of public cloud services. But other vendors are not or have moved away from the public cloud as they grew larger.
Salesforce has a small percentage of operations on AWS but largely uses data centers it manages directly. Facebook has a long history of building and operating its own data centers from scratch. Dropbox originally hosted on AWS but largely moved file storage away from Amazon to its own data centers. Along the way it claimed cost savings of $74.6 million.
This last figure gets to the point. Cloud services are expensive. Why does the stock market love the Amazon and Azure clouds? Simple: they have excellent gross margins, which is the money left over after delivering the service to customers. Recent Microsoft earnings reports indicate MS Azure gross margins are at least 50%. Amazon does not break out AWS gross margins but shows consistent operating margins (i.e. with other expenses like sales rolled in) of 25% or more. Let’s therefore assume that AWS gross margins are also at least 50%.
What that 50% means for users is simple. On average, if you spend $100M a year on public cloud, as Lyft does, $50M of it goes onto the cloud provider's balance sheet. If you run the same services in your own data centers, that $50M stays on yours. We can check this math against the Dropbox numbers: they show 2016 savings of around 43%, so we're in the ballpark. For businesses that are big, fully utilize resources, and are cost-sensitive, the economic incentives are obvious and grow larger over time.
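The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope illustration only: the $100M spend and 50% margin come from the discussion above, while the ~$175M of implied Dropbox cloud spend is an assumption chosen to match the reported ~43% savings rate, not a published figure.

```python
# Back-of-the-envelope check on the cloud margin argument.
# All figures are illustrative assumptions, not vendor-published numbers.

cloud_spend = 100_000_000   # annual public cloud bill at Lyft-like scale
gross_margin = 0.50         # assumed provider gross margin (Azure: at least 50%)

provider_profit = cloud_spend * gross_margin
print(f"Retained by provider: ${provider_profit / 1e6:.0f}M per year")
# With a 50% gross margin, $50M of a $100M bill is provider profit
# rather than payment for the underlying hardware and operations.

# Cross-check against the Dropbox example: reported savings of $74.6M
# against an assumed ~$175M of equivalent cloud spend implies ~43%.
dropbox_savings_m = 74.6
implied_spend_m = 175.0     # assumed to back out the ~43% figure
print(f"Implied savings rate: {dropbox_savings_m / implied_spend_m:.0%}")
```

The savings rate lands below the 50% gross margin, which is what you would expect: running your own data centers has real costs too, so captured savings are somewhat smaller than the provider's margin.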
The incentives also play out in a more subtle way. Rather than moving off Amazon or Azure entirely, you can simply avoid the most expensive services, which include managed data services like Amazon RDS. Running a db.m5.12xlarge instance on RDS MySQL can be 80% more costly than a plain m5.12xlarge instance in the same region (estimate based on a standard 3-year term in us-west-2). You could instead run open source MySQL on basic compute and storage. That lowers costs and preserves the freedom to move elsewhere in the future. Interestingly enough, that's exactly what companies like Slack appear to be doing.
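To make the managed-service premium concrete, here is a minimal sketch. The hourly rates below are placeholders chosen to reproduce the ~80% premium cited above, not actual AWS list prices; check the AWS pricing pages for current figures before drawing conclusions for your own workload.

```python
# Sketch of the managed-service premium for a database workload.
# Hourly rates are assumed placeholders, not real AWS prices.

ec2_hourly = 2.30    # assumed: plain m5.12xlarge compute instance
rds_hourly = 4.14    # assumed: RDS MySQL on db.m5.12xlarge

premium = (rds_hourly - ec2_hourly) / ec2_hourly
print(f"Managed-service premium: {premium:.0%}")

# Projected over a 3-year term the absolute difference adds up quickly.
hours_3yr = 24 * 365 * 3
extra_cost = (rds_hourly - ec2_hourly) * hours_3yr
print(f"Extra cost over 3 years, per instance: ${extra_cost:,.0f}")
```

Multiply that per-instance difference across a fleet of database servers and the case for running plain open source MySQL on basic compute becomes easy to see.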
In summary, the incentives to operate in the cloud fade or even reverse as business revenue increases. Regardless of business model, the cost of IT resources at scale tends to revert to the mean, and the premium you pay over that mean corresponds to the cloud provider's gross margin. Even if you stay in the cloud, there are declining incentives to use cloud data services. These are basic economics that affect any data-driven business.
Edge computing is creating new use cases outside the cloud
Longer term, the explosive growth of data from IoT devices will push data management outside public clouds. By one estimate, a single autonomous test car generates data at 3,000 times the rate of Twitter. For reasons that include network bandwidth constraints, security, storage limits, and the need to respond in real time, much of this data will be cleaned, analyzed, and used locally. Only a small fraction will ever reach the cloud.
It’s common in cloud data management to speak of data gravity as a reason for applications moving to the cloud. Edge computing and IoT create a new kind of data gravity outside public clouds. At Altinity we envision a future that may include data in hundreds of millions of platforms ranging from automobiles to medical systems to agricultural equipment. In many cases such local data will reach volumes previously only seen in centralized data centers.
We therefore expect that capabilities like high-speed messaging, streaming queries, and data warehouses with efficient compression will appear in edge environments. Some of these will be the same products and platforms used in public clouds. It's one of the reasons we believe that portability is still a major consideration for data management technology. But we also expect new innovation focused on processing data rapidly in remote environments. Some of that innovation is already visible from initiatives like the UC Berkeley RISE Lab, which includes secure, real-time AI. Many others are working on this problem.
Conclusion: Think beyond the cloud
At Altinity we fully agree with Gartner that the cloud is important for data management. It should be a consideration in every new deployment decision, especially in cases where speed and flexibility trump cost.
At the same time, system designers must seek out new open source data management projects like ClickHouse, which can confer disruptive advantages to early adopters. Designers as well as business leaders also need to understand that cloud economic incentives change significantly as business grows. Finally, edge computing and IoT will fuel a new wave of technology for data management. Many innovations will be applicable not just at the edge but across all data-driven businesses.
As engineers we often talk about designing systems for scale. Scalable data management enables scalable business. To reach the goal you must think beyond the cloud.