We continue benchmarking ClickHouse. In this article, we discuss a benchmark against Amazon RedShift.
In our previous test we benchmarked ClickHouse database comparing query performance of denormalized and normalized schemas using NYC taxi trips dataset. Here we continue to use the same benchmark approach in order to have comparable results.
First, we compare the performance of ClickHouse at Amazon EC2 instances against private server used in the previous benchmark. Then we try the same benchmarks Amazon RedShift.
Amazon Hardware Benchmark
In order to measure the speed of the AWS virtual server, we configured r4.xlarge EC2 instance (Intel Xeon E5-2686 v4 (Broadwell), 4 vCPU, 30.5GB RAM, EBS storage)
After that we installed and configured ClickHouse there, run our test queries and compared results.
*1,2,3,4,5 are test queries
As you can see AWS virtual server is 3 times slower than our dedicated server (Intel i7-6700 Quad-Core Skylake, 32GB RAM, 2x4TB SATA). With that in hand, we could compare the results of RedShift separately.
Using AWS account we started a RedShift cluster based on ds2.xlarge profile with only one node at the beginning. ds2.xlarge is somewhat comparable to r4.xlarge, though we could not find the accurate specs. What is good about cloud platform is that you can add as many nodes as you want afterward.
To get RedShift working to receive data one has to configure special IAM role and setup security group for accessing S3 (another AWS service) so the date can get in.
Data Load into RedShift
The next step in our benchmark is to prepare DB schema and populate the data. We have used the same schema as in our ClickHouse benchmarks changing column data types when required.
In order to import the data into already prepared database schema we used updated script from our previous test that imports trip data. This script executes RedShift COPY statement for every TSV source file that should be preliminary placed to S3 storage.
A different approach was used in order to import weather and taxi zones data. We used CSV files for source data and then a couple of custom scripts that uses ODBC library in order to load data into RedShift.
RedShift performance Benchmark
Here we used the same test queries with dictionaries as we did for the previous test for ClickHouse and original PostreSQL queries with table joins for RedShift. After executing our tests at a single node server we also scaled the cluster up to 3 nodes and re-ran the tests again.
Here are the results:
Tests confirm that ClickHouse is usually significantly faster (3-5 times) than RedShift when running on comparable Amazon instances. Even when scaling up to 3 nodes RedShift is still behind. In one test case, however, ClickHouse did run slower.
We were surprised and looked closer. For that query RedShift used efficient column predicate to filter out the data, while ClickHouse had to run dictionary lookup for every row. ClickHouse dictionaries are very fast but at 1.3 billion rows their contribution is visible.
With that said we decided to optimize the ClickHouse query. The original query used string comparison between a value from external dictionary and a string constant:
SELECT count() FROM yellow_tripdata_staging WHERE dictGetString('taxi_zones', 'zone', toUInt64(pickup_location_id)) = 'Midtown East'
The idea of optimization is to pre-cache dictionary keys or IDs (it can be done once), then filter ids in subselect and finally apply efficient ‘in’ filter to the main query. Here is how optimized query looks like:
SELECT count() FROM yellow_tripdata WHERE pickup_location_id IN ( SELECT pickup_location_id FROM pickup_location_ids WHERE dictGetString( 'taxi_zones', 'zone', toUInt64(pickup_location_id) ) = 'Midtown East' )
The result showed that ClickHouse performing much faster with that optimization than it was before. The query time dropped down from 8 to 1.5 second.
Our benchmarks show that dedicated servers are still significantly faster than Amazon instances for analytic DBMS workloads. ClickHouse is significantly faster than RedShift in most scenarios, in some cases ClickHouse queries require optimizations.
The test is not scientific, it is certainly possible to make more efficient RedShift installation and perform some optimizations on that end, but the same is true for ClickHouse. Both databases run with default settings.
P.S. For those who wish to try existing RedShift dataset at ClickHouse you might be interested in the following article: http://www.3manuek.com/redshiftclickhouse. It explains how to migrate data from RedShift to ClickHouse that turns out to be pretty easy.
P.P.S. We added one more benchmark against RedShift with the different dataset: https://www.altinity.com/blog/2017/7/3/clickhouse-vs-redshift-2