This final article completes our tour of array capabilities. We’ll survey functions for array map and reduce operations, demonstrating behavior and commenting on performance. This is an opportunity to dig further into lambdas, which are critical for using arrays effectively.
ClickHouse arrays combine neatly with GROUP BY aggregation. We show how arrays track sequences and offer a couple of ways to do funnel analysis.
ClickHouse contributors regularly add analytic features that go beyond standard SQL. This design approach is common in successful open source projects and reflects a bias toward solving real-world problems creatively. Arrays are a great example.
Mutable data is generally unwelcome in OLAP databases. ClickHouse is no exception to the rule. Like some other OLAP products, ClickHouse did not even support updates originally. Later on, updates were added, but like many other things they were added in a “ClickHouse way.”
Even now, ClickHouse updates are asynchronous, which makes them difficult to use in interactive applications. Still, in many use cases users need to apply modifications to existing data and expect to see the effect immediately. Can ClickHouse do that? Sure it can.
A common use case in time series applications is to get the measurement value at a given point in time. For example, given a stream of measurements, one often needs to query the measurement as of the current time, as of the same time yesterday, and so on. Financial market data analysis and all sorts of monitoring applications are typical examples.
Databases have different ways to achieve this task and ClickHouse is not an exception here. In fact, ClickHouse offers at least 5 different approaches. In this article, we will review and compare them.
Jan 1, 2020
Cost-efficiency and performance are critical for big data analytics. For this reason a recent blog post from the ScyllaDB team caught our attention. They collected over 500 billion data points and were able to query them at a scan rate of 1B rows/sec. The test rig was a beefy and expensive packet.com cluster: 83 n2.xlarge.x86 instances, each with 28 cores and 384 GB of RAM. This is a nice demo of ScyllaDB cluster management. But looking at the numbers we realized it’s not very impressive as an example of efficient analytics. We can prove that using ClickHouse.
July 10, 2019
Modern analytical databases would not exist without efficient data compression. Storage gets cheaper and more performant, but data sizes typically grow even faster. Moore’s Law for big data outpaces its hardware counterpart. We have already written on this blog about ClickHouse compression (https://altinity.com/blog/2017/11/21/compression-in-clickhouse) and the Low Cardinality data type wrapper (https://altinity.com/blog/2019/3/27/low-cardinality). In this article we will describe and test the most advanced ClickHouse encodings, which especially shine for time series data. We are proud that some of those encodings have been contributed to ClickHouse by Altinity.
This article presents an early preview of new encoding functionality for ClickHouse release 19.11. As of the time of writing, release 19.11 is not yet available. To test the new encodings, you can build ClickHouse from source or install a testing build. We expect ClickHouse release 19.11 to be publicly available in a few weeks.
May 23, 2019
ClickHouse offers incredible flexibility to solve almost any business problem in multiple ways. Schema design plays a major role in this. For our recent benchmarking using the Time Series Benchmark Suite (TSBS) we replicated the TimescaleDB schema in order to have fair comparisons. In that design every metric is stored in a separate column. This is the best design for ClickHouse from a performance perspective, as it takes full advantage of columnar storage and type specialization.
Sometimes, however, the schema is not known in advance, or time series data from multiple device types needs to be stored in the same table. Having a separate column per metric may not be very convenient, so a different approach is required. In this article we discuss multiple ways to design a schema for time series, and do some benchmarking to validate each approach.
In our previous articles we demonstrated that ClickHouse, a general purpose analytics DB, can easily compete with specialized DBMSs for time series data: TimescaleDB and InfluxDB. There were, however, certain queries, pretty typical for time series, where ClickHouse seemed at first glance to be at a disadvantage. The most notable example is returning the latest measurement for a particular device. We will take this query and demonstrate how advanced ClickHouse features, namely materialized views and self-aggregating tables, can dramatically improve performance.
Dec 4, 2018
Our previous take on time series benchmarks attracted a lot of interest, so we decided to dig into more detail. We conducted 3 different ClickHouse scalability tests using the same TSBS dataset and benchmarking infrastructure. In this article we present the results, which turned out to be quite interesting.
Nov 15, 2018
Once upon a time we spotted TSBS (https://github.com/timescale/tsbs), the Time Series Benchmark Suite, started by InfluxDB engineers and polished to perfection by the TimescaleDB team. The suite allows apples-to-apples comparisons when testing different databases: it is a framework to generate test data, load it into different databases, run test queries, and collect statistics for analysis. We could not resist adding ClickHouse to the list of supported databases. It turned out that ClickHouse, despite being a general purpose analytical DBMS, stands up very well against proven time series databases. Those benchmarks highlighted strengths and weaknesses of different technologies. Interested? Let’s dig into details.