What’s Up with Parquet Performance in ClickHouse?

We speak with the lip, and we dream in the soul,
Of some better and fairer day

– Friedrich Schiller

Apache Parquet is a popular open source, cross-platform data format. It is, of course, supported in ClickHouse. It is often used as a storage format in the Hadoop ecosystem, but there are other use cases as well. For example, it provides a convenient way to move data from Snowflake to ClickHouse.

Over the last year there have been numerous enhancements to Parquet compatibility and performance in ClickHouse from Altinity, ClickHouse Inc., and other committers. When Alexey Milovidov presented Parquet performance improvements during the ClickHouse 23.4 release webinar, I immediately gave it a try in Altinity.Cloud. Querying Parquet files on S3 with the same efficiency as MergeTree tables sounded very tempting. The first test runs were very encouraging, so I continued testing.

Why Parquet?

Some users may wonder why Parquet is interesting for ClickHouse at all. ClickHouse is famous for its real-time analytics performance, while Parquet emerged from the heavyweight Hadoop ecosystem. Is there anything in common?

Actually, there is quite a lot. Parquet is a columnar data format, similar to ClickHouse MergeTree. Both are designed for high performance interactive processing. And both work well for petabyte scale data volumes.

There is an important difference, though: unlike MergeTree, Parquet is a cross-platform format. Parquet data can be produced and consumed by totally different systems. For example, ETL processes may generate Parquet data and place it on S3. Then databases like ClickHouse or Oracle, ML frameworks, and BI tools can start working with this data instantly! No migration is needed.

This openness, combined with the built-in performance characteristics of Parquet, makes it unique. There is a vast amount of data in Parquet format already. Imagine if one day ClickHouse could work with Parquet data as efficiently as with MergeTree tables. It would make ClickHouse adoption much wider, opening yet another big door to Big Data.

Let’s see how far ClickHouse is from this dream in May 2023.

Preparing Parquet Files

In order to run performance comparison tests, we need to prepare data in Parquet format first. We will use the “ontime” airlines dataset, which contains 200M rows, and the Altinity.Cloud demo server. Use demo/demo credentials if you want to run these examples yourself.

The dataset already exists on the demo server, so exporting it to Parquet files in an S3 bucket can be done with a simple INSERT statement:

INSERT INTO FUNCTION s3('https://s3.us-east-1.amazonaws.com/altinity-clickhouse-data/airline/data/ontime_parquet/{_partition_id}.parquet', /* credentials */ 'Parquet')
PARTITION BY Year
SELECT * FROM ontime

The size of the files in the bucket is 4.6GB. This is the first surprise! The size of the source MergeTree table is 13.3GB, almost 3 times more. Parquet applies data type specific encodings automatically to every column, and compression on top of that. The same can be done in ClickHouse as well, but it requires the user to define encodings manually for every column; the defaults are much less efficient.
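
As an illustration (a sketch only, not the schema used in this test), per-column codecs in ClickHouse are declared like this; the column list is shortened and the codec choices are just plausible examples:

-- Hypothetical table definition showing explicit per-column codecs
CREATE TABLE ontime_codecs
(
    Year UInt16 CODEC(Delta, ZSTD),           -- small, slowly changing values
    FlightDate Date CODEC(DoubleDelta, ZSTD), -- mostly monotonic dates
    DepDelay Int32 CODEC(T64, ZSTD),          -- integers with a narrow value range
    Origin FixedString(5) CODEC(ZSTD)         -- plain compression for short strings
    -- ... remaining columns
)
ENGINE = MergeTree
ORDER BY (Year, FlightDate)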

If you are interested in Parquet format internals, please refer to the very informative ClickHouse article on this topic. See also the Appendix for per-column statistics.

Running Queries

As a reminder, all SELECT queries can be run on the Altinity.Cloud demo server. Use demo/demo as credentials.

We will use the first 3 test queries from the benchmark. The MergeTree table is stored on an EBS disk. It would probably be interesting to compare MergeTree on S3 as well, but the goal here is to compare “native” MergeTree with “native” Parquet.

Since data from block storage is usually cached in the OS page cache, we will also add ‘SETTINGS min_bytes_to_use_direct_io=1’ to every MergeTree table query. That makes sure data is always loaded over the network: from EBS for the MergeTree table, and from S3 for Parquet.

Before running benchmark queries, let’s test how fast ClickHouse can read the full table. It can be done with this query:

SELECT count() FROM ontime WHERE NOT ignore(*) SETTINGS min_bytes_to_use_direct_io=1

It took 42s, not bad.

In order to query Parquet files, we will use the S3 table function.

SELECT count() FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet') WHERE NOT ignore(*) 

16.6s! Reading from Parquet is 2.5 times faster than from the MergeTree table!!! It seems too good to be true, but the results are on the screen. So let’s start testing queries.

Query 1: This is a full-scan of two columns.

SELECT avg(c1)
FROM (
    SELECT Year, Month, count(*) AS c1
    FROM ontime
    GROUP BY Year, Month)
SETTINGS min_bytes_to_use_direct_io=1

It takes just 80ms; ClickHouse is very fast.

SELECT avg(c1)
FROM (
    SELECT Year, Month, count(*) AS c1
    FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet')
    GROUP BY Year, Month)

Ouch! Our high hopes just came crashing back to Earth. The query over Parquet takes anywhere from 800 to 1600 ms. Usually the first run is slower and the following ones are faster, but it may slow down again later as well. Even the fastest 800ms is still 10 times slower than MergeTree, though it is still a great result given that the data is in files on S3.

Query 2: This query includes a filter on the partitioning column.

SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC
SETTINGS min_bytes_to_use_direct_io=1

23ms. Outstanding!

SELECT DayOfWeek, count(*) AS c
FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet')
WHERE Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC

600ms. This is still under a second, but 25 times slower than MergeTree!

It is worth repeating that S3 performance varies between runs. The S3 variance may be up to 100%, while EBS performance is more consistent. This is probably because EBS uses a dedicated network interface, while S3 data is transferred over the shared network inside the AWS us-east-1 region.
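
Since the Parquet files are named after the partitioning column (one file per Year), a possible hand optimization, which we did not benchmark here, is to push the Year filter into the glob pattern so that only the needed files are read:

SELECT DayOfWeek, count(*) AS c
FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/200{0..8}.parquet')
WHERE Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC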

Query 3: This query includes a filter on a non-partitioning column:

SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC
SETTINGS min_bytes_to_use_direct_io=1

80ms. ClickHouse does not slow down.

SELECT DayOfWeek, count(*) AS c
FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet')
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC

2000ms. This is where Parquet becomes significantly slower. 

In general, we can see that Parquet query performance is excellent for remote data of this size. One can quickly explore a 200M row dataset, but it is still an order of magnitude slower than working with a MergeTree table.

For usability’s sake, we can even hide the S3 function call behind a view, as follows:

CREATE VIEW ontime_parquet AS
SELECT * FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet')

Now, we can run queries against the ‘ontime_parquet’ view without caring where the data actually resides.
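
For example, Query 2 can be rewritten against the view like this:

SELECT DayOfWeek, count(*) AS c
FROM ontime_parquet
WHERE Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC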

More Tests

There are a few other experiments that we can conduct with this Parquet dataset.

Maximum S3 Throughput

We can measure the maximum throughput of the S3 table function if we remove any parsing on the ClickHouse side. This can be done with the RawBLOB format:

SELECT count() FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet', 'RawBLOB') WHERE NOT ignore(*)

The result is 4.9s for 4.95GB of data, so ClickHouse can load from S3 as fast as 1GB/s, which is outstanding throughput! The server network is 12 gigabit, so it is close to network saturation. If we add ‘SETTINGS max_threads=32’ in order to raise the number of threads to the maximum, we can even load it in 4 seconds. There are instances with 15, 25, 50, 75 and even 100 gigabit network interfaces that can be tested. I tried an instance with a 15 gigabit network and got 3.16 seconds for the query above, which is 1.5GB/s. The ClickHouse team did great work boosting S3 performance!
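
For reference, the variant with the increased thread count only differs in the SETTINGS clause:

SELECT count() FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet', 'RawBLOB') WHERE NOT ignore(*)
SETTINGS max_threads=32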

Maximum EBS Throughput

As we showed above, scanning the MergeTree table fully takes 42s. The table size is 13.3GB, so we can calculate the average retrieval speed as 316MB/s. Unlike Parquet, we cannot isolate network speed from data processing speed. The demo server has a stacked volume configuration with 3 old gp2 volumes, 250MB/s each. It gives us up to 750MB/s throughput for queries touching a lot of data, but may drop down to 250MB/s for smaller ones. In Altinity.Cloud we can easily bump the storage to 1000MB/s per volume using gp3 EBS volumes even without stacking. That brings the full scan down to 29s, which is still almost twice as long as Parquet’s 16.6s.

Unfortunately, there is no easy way to measure the maximum EBS throughput from the ClickHouse side, but we can do the following. Remember that we bypassed the OS page cache with direct I/O. If we enable it back, we get the speed of processing MergeTree data that is already in RAM:

SELECT count() FROM ontime WHERE NOT ignore(*)

Running this query for the first time fills the page cache, and the second run gives us 6.8s. So we may conclude that loading from EBS adds 29 – 6.8 = 22.2s. That gives an estimated throughput of 13.3GB / 22.2s = 600MB/s, which is lower than expected, so there may be some other overhead that we missed.

S3 Virtual Columns

The S3 table function supports virtual columns, like _file and _path. What if we select those only? 

SELECT DISTINCT _file, _path
FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet')
ORDER BY _file

This query runs in 2000ms. This is very slow, given that ClickHouse should not need to parse the Parquet files at all; apparently it does. An optimization is needed here.

Partition Granularity

Let’s test how ClickHouse works with Parquet files having more granular partitioning. We will partition by Year and Month this time:

INSERT INTO FUNCTION s3('https://s3.us-east-1.amazonaws.com/altinity-clickhouse-data/airline/data/ontime_parquet2/{_partition_id}.parquet', /* credentials */ 'Parquet') 
PARTITION BY toYYYYMM(FlightDate)
SELECT * FROM ontime

Now, there are 408 Parquet files compared to 35 in the previous test. Does it make any difference? Apparently, it does. Here are the query times after several retries:

Query 1: 5.5s
Query 2: 5.5s
Query 3: 10s

All queries became at least 5 times slower! This happens because ClickHouse has to make more calls to S3, and establishing a connection is expensive compared to file access. This is also the reason why MergeTree does not perform very well on S3 and requires a local cache for fast query times.
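
As a reference point, the queries against the more granular layout only differ in the bucket prefix. Here is Query 1, assuming the ontime_parquet2 prefix used in the export above:

SELECT avg(c1)
FROM (
    SELECT Year, Month, count(*) AS c1
    FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet2/*.parquet')
    GROUP BY Year, Month)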

For comparison, if we partition the MergeTree table the same way, so that it has 408 partitions, the query results are the following:

Query 1: 142ms
Query 2: 38ms
Query 3: 130ms

It is slightly slower, but not that much. You can check it by running the test queries against the ‘ontime_2’ table.

Scalability

All the tests above were executed on a pretty powerful 32 vCPU m6g AWS instance. What if we run the same queries on a smaller machine instead? The performance should degrade, but how much? This is very easy to test in Altinity.Cloud since scaling the instance up or down takes just a couple of minutes.

Here is what we get on a 4 vCPU instance:

Query 1: 13s
Query 2: 9s
Query 3: 40s

Note that the results were very inconsistent between retries, especially for the longest query, Query 3. On average, the performance degradation was 17x.

For comparison, here are the ClickHouse MergeTree results for the 4 vCPU machine. The degradation is close to linear:

Query 1: 740ms
Query 2: 17ms – even faster than 32 vCPU!
Query 3: 500ms

So Parquet query performance degrades non-linearly and in an unpredictable way. On smaller machines, queries over MergeTree still run in under a second, but Parquet becomes very sluggish.

Summary of ‘ontime’ Tests

Here is a table summarizing the results of the different tests on 32 vCPU instances.

Test | MergeTree on EBS | Parquet on S3
Compressed data size | 13.3GB | 4.6GB
Full table scan | 42s (29s on a better volume) | 16.6s
Full table scan w/cache | 6.8s | n/a
Maximum throughput | 600MB/s* | 1000MB/s
Full table parsing | 6.8s | 11.7s
Query 1 | 80ms | 800ms
Query 2 | 23ms | 600ms
Query 3 | 80ms | 2000ms
408 vs 35 partitions | x1.8 slower | x5-7 slower
Scale down 32->4 vCPUs | x5.5 slower | x17 slower

* Note that this is what we got from a single volume. Multiple volumes can give better throughput.

SSB Benchmark over Parquet Files

For the next experiment, I uploaded the Star Schema Benchmark (SSB) dataset to S3. We use this dataset quite regularly for tests, e.g. when testing new Graviton instances on AWS recently. The dataset contains 600M rows.

The procedure to upload to the S3 bucket is similar to what we did with the ‘ontime’ table. The data has been partitioned with the same expression as the MergeTree table: toYYYYMM(LO_ORDERDATE).

INSERT INTO FUNCTION s3('https://altinity-clickhouse-data.s3.amazonaws.com/ssb/data/lineorder_wide_parquet/{_partition_id}.parquet', /* credentials */ 'Parquet') 
PARTITION BY toYYYYMM(LO_ORDERDATE)
SELECT * FROM lineorder_wide

The load resulted in 80 Parquet files. You can check the table as follows:

SELECT count() FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/ssb/data/lineorder_wide_parquet/*.parquet', 'Parquet')

The size of the files in the bucket is 59.8GB. The source MergeTree table, at 68.6GB, is about 15% bigger.
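
As a side note, here is a quick sketch of how such MergeTree table sizes can be checked from system.parts (assuming the lineorder_wide table used above):

-- Compressed on-disk size of all active parts of the table
SELECT formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'lineorder_wide' AND active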

Once the data was uploaded, I ran test queries from the benchmark. Right away I discovered a bug: the LO_ORDERDATE column of Date type was incorrectly stored as UInt16 in Parquet during the INSERT, so queries could not run. An explicit type conversion is needed. For example, here is the Q1.1 query from the benchmark. Note the toDate conversion:

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/ssb/data/lineorder_wide_parquet/*.parquet', 'Parquet')
WHERE toYear(toDate(LO_ORDERDATE)) = 1993 
  AND LO_DISCOUNT >= 1 AND LO_DISCOUNT <= 3 AND LO_QUANTITY < 25

The query runs in 4 seconds on a demo server.

In order to make it easy to execute the benchmark script without re-writing the queries, I’ve created a view with the correct type mapping. 

CREATE VIEW lineorder_wide_parquet AS SELECT * FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/ssb/data/lineorder_wide_parquet/*.parquet', 'Parquet', 'LO_ORDERKEY UInt32, LO_LINENUMBER UInt8, LO_CUSTKEY UInt32, LO_PARTKEY UInt32, LO_SUPPKEY UInt32, LO_ORDERDATE Date, LO_ORDERPRIORITY LowCardinality(String), LO_SHIPPRIORITY UInt8, LO_QUANTITY UInt8, LO_EXTENDEDPRICE UInt32, LO_ORDTOTALPRICE UInt32, LO_DISCOUNT UInt8, LO_REVENUE UInt32, LO_SUPPLYCOST UInt32, LO_TAX UInt8, LO_COMMITDATE Date, LO_SHIPMODE LowCardinality(String), C_CUSTKEY UInt32, C_NAME String, C_ADDRESS String, C_CITY LowCardinality(String), C_NATION LowCardinality(String), C_REGION Enum8(\'ASIA\' = 0, \'AMERICA\' = 1, \'AFRICA\' = 2, \'EUROPE\' = 3, \'MIDDLE EAST\' = 4), C_PHONE String, C_MKTSEGMENT LowCardinality(String), S_SUPPKEY UInt32, S_NAME LowCardinality(String), S_ADDRESS LowCardinality(String), S_CITY LowCardinality(String), S_NATION LowCardinality(String), S_REGION Enum8(\'ASIA\' = 0, \'AMERICA\' = 1, \'AFRICA\' = 2, \'EUROPE\' = 3, \'MIDDLE EAST\' = 4), S_PHONE LowCardinality(String), P_PARTKEY UInt32, P_NAME LowCardinality(String), P_MFGR Enum8(\'MFGR#2\' = 0, \'MFGR#4\' = 1, \'MFGR#5\' = 2, \'MFGR#3\' = 3, \'MFGR#1\' = 4), P_CATEGORY String, P_BRAND LowCardinality(String), P_COLOR LowCardinality(String), P_TYPE LowCardinality(String), P_SIZE UInt8, P_CONTAINER LowCardinality(String)')
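
With the view in place, Q1.1 no longer needs the explicit toDate conversion, since the view maps LO_ORDERDATE to Date:

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_wide_parquet
WHERE toYear(LO_ORDERDATE) = 1993
  AND LO_DISCOUNT >= 1 AND LO_DISCOUNT <= 3 AND LO_QUANTITY < 25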

After that, I ran the benchmark script using both the original MergeTree table, and a view on Parquet files in S3.

To keep it short, here are the total query times on a 32 vCPU m7g.8xlarge instance:

Total for all queries using EBS storage: 0.824s
Total for all queries using Parquet at S3: 117s

So with more complex queries including multiple columns and various filter conditions, the performance of Parquet on S3 is more than 100 times slower compared to a MergeTree table on network block storage.

Conclusion

ClickHouse 23.4 can read Parquet files very quickly. It can also run SQL queries directly on Parquet data in S3. This can already be used for data exploration, when waiting several seconds for query results is acceptable.

However, running queries on Parquet data is still 10 to 100 times slower compared to using MergeTree tables, and the performance is sometimes unpredictable. For fast analytics, data needs to live natively in ClickHouse.

The ClickHouse team is focused on Parquet performance, and we can expect more news on that in the upcoming releases. For example, the recently added ParquetMetadata format is a significant step toward understanding Parquet internals. In the future, ClickHouse may start using Parquet statistics and indices, which should dramatically improve query performance. We are watching the progress here closely, and maybe our dreams will come true. Stay tuned!


Appendix: ClickHouse vs Parquet Compression

ParquetMetadata allows us to compare compression between ClickHouse and Parquet. Feel free to run the scary query below and compare the compressed size of every column between ClickHouse MergeTree and Parquet.

SELECT
    column,
    type,
    parquet_compressed,
    ch_compressed,
    round(ch_compressed / parquet_compressed, 2) AS delta_pct
FROM
(
    SELECT
        column,
        sum(num_rows),
        sum(column_uncompressed_size) AS parquet_uncompressed,
        sum(column_compressed_size) AS parquet_compressed
    FROM
    (
        WITH arrayJoin(columns) AS c
        SELECT
            _file,
            num_rows,
            metadata_size,
            total_uncompressed_size,
            total_compressed_size,
            tupleElement(c, 'name') AS column,
            tupleElement(c, 'total_uncompressed_size') AS column_uncompressed_size,
            tupleElement(c, 'total_compressed_size') AS column_compressed_size
        FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet', 'ParquetMetadata')
    ) t
    GROUP BY column
) a
LEFT JOIN
(
    SELECT
        name AS column,
        type,
        data_compressed_bytes AS ch_compressed,
        data_uncompressed_bytes AS ch_uncompressed
    FROM system.columns
    WHERE table = 'ontime'
) b
USING (column)
ORDER BY delta_pct
FORMAT PrettyCompactNoEscapesMonoBlock

Results are in the table below. The delta_pct column shows how many times larger the compressed column is in ClickHouse compared to Parquet. It is evident that ClickHouse has something to learn from Parquet.

column | type | parquet_compressed | ch_compressed | delta_pct
FirstDepTime | String | 4659136 | 8962255 | 1.92
Div1WheelsOn | String | 2347236 | 5019645 | 2.14
Div1WheelsOff | String | 1950235 | 4421328 | 2.27
DivActualElapsedTime | String | 1644314 | 4108295 | 2.5
DivArrDelay | String | 1586525 | 4052818 | 2.55
Div1TailNum | String | 1793081 | 4581368 | 2.56
TotalAddGTime | String | 2711426 | 7047210 | 2.6
LongestAddGTime | String | 2701512 | 7041212 | 2.61
FlightNum | String | 128726188 | 353694196 | 2.75
CarrierDelay | Int32 | 38827764 | 111261897 | 2.87
LateAircraftDelay | Int32 | 38227037 | 110532507 | 2.89
NASDelay | Int32 | 40200903 | 125924889 | 3.13
WeatherDelay | Int32 | 7071239 | 22318526 | 3.16
CRSArrTime | Int32 | 144559830 | 458671729 | 3.17
Div1TotalGTime | String | 1291019 | 4113600 | 3.19
Div1Airport | String | 1334579 | 4279867 | 3.21
DistanceGroup | UInt8 | 32295467 | 104475032 | 3.23
Div1LongestGTime | String | 1247101 | 4072141 | 3.27
CancellationCode | FixedString(1) | 4249913 | 15041288 | 3.54
DepDelayMinutes | Int32 | 166242215 | 615813324 | 3.7
ArrDelayMinutes | Int32 | 178001377 | 670450130 | 3.77
Diverted | UInt8 | 1634810 | 6371679 | 3.9
FlightDate | Date | 1310723 | 5151038 | 3.93
DepTime | Int32 | 283131081 | 1135190651 | 4.01
WheelsOn | Int32 | 231922061 | 933103787 | 4.02
ArrTime | Int32 | 284400437 | 1145333849 | 4.03
WheelsOff | Int32 | 231619648 | 933539410 | 4.03
DivDistance | String | 932555 | 3792120 | 4.07
DestState | FixedString(2) | 49564667 | 203938854 | 4.11
OriginState | FixedString(2) | 48201906 | 197964351 | 4.11
Cancelled | UInt8 | 6082800 | 25631903 | 4.21
DepartureDelayGroups | String | 82068355 | 369514506 | 4.5
DepDelay | Int32 | 226562925 | 1027815573 | 4.54
CRSDepTime | Int32 | 112391395 | 515141687 | 4.58
OriginStateFips | String | 48305067 | 224607413 | 4.65
DestStateFips | String | 49668165 | 232237950 | 4.68
ArrDelay | Int32 | 238869892 | 1135345478 | 4.75
ActualElapsedTime | Int32 | 238738200 | 1149249078 | 4.81
TailNum | String | 199636109 | 976886384 | 4.89
DayofMonth | UInt8 | 723696 | 3541499 | 4.89
DivReachedDest | String | 752048 | 3679074 | 4.89
AirTime | Int32 | 189460922 | 926363463 | 4.89
TaxiOut | Int32 | 162625228 | 886428450 | 5.45
TaxiIn | Int32 | 142997702 | 821869047 | 5.75
DivAirportLandings | String | 786489 | 4529149 | 5.76
CRSElapsedTime | Int32 | 109125905 | 652783518 | 5.98
Distance | Int32 | 83948223 | 543776769 | 6.48
DayOfWeek | UInt8 | 516282 | 3634654 | 7.04
Div1AirportID | Int32 | 1312761 | 9567701 | 7.29
Div1AirportSeqID | Int32 | 1322341 | 9727748 | 7.36
Origin | FixedString(5) | 59033966 | 453990038 | 7.69
Dest | FixedString(5) | 61041717 | 469878803 | 7.7
ArrivalDelayGroups | Int32 | 96330478 | 787660135 | 8.18
OriginCityMarketID | Int32 | 61025579 | 544446661 | 8.92
DestCityMarketID | Int32 | 62782337 | 561425242 | 8.94
OriginAirportSeqID | Int32 | 58942261 | 529929952 | 8.99
DestAirportSeqID | Int32 | 60901555 | 549309770 | 9.02
OriginAirportID | Int32 | 58949313 | 563946879 | 9.57
DestAirportID | Int32 | 60955994 | 586675839 | 9.62
DepDel15 | Int32 | 25491464 | 273747032 | 10.74
DestCityName | String | 62431246 | 709430841 | 11.36
OriginCityName | String | 60464429 | 688214345 | 11.38
ArrDel15 | Int32 | 26443396 | 304918200 | 11.53
SecurityDelay | Int32 | 720296 | 8317100 | 11.55
OriginWac | Int32 | 48320067 | 629760834 | 13.03
DestWac | Int32 | 49683165 | 647759966 | 13.04
DestStateName | String | 50004856 | 663126514 | 13.26
OriginStateName | String | 48641205 | 645929000 | 13.28
DepTimeBlk | String | 73005404 | 1089363559 | 14.92
ArrTimeBlk | String | 74077944 | 1120128454 | 15.12
Month | UInt8 | 115684 | 1851583 | 16.01
Div2WheelsOn | String | 90690 | 1836815 | 20.25
Quarter | UInt8 | 85466 | 1816679 | 21.26
Div2Airport | String | 85696 | 1831433 | 21.37
Div2TotalGTime | String | 84321 | 1827171 | 21.67
Div2LongestGTime | String | 84055 | 1827091 | 21.74
Div2TailNum | String | 72555 | 1812944 | 24.99
Div2WheelsOff | String | 71051 | 1810818 | 25.49
Div3WheelsOn | String | 61113 | 1798422 | 29.43
Div3Airport | String | 61081 | 1798388 | 29.44
Div3TotalGTime | String | 61023 | 1798328 | 29.47
Div3LongestGTime | String | 61023 | 1798328 | 29.47
Div3WheelsOff | String | 60828 | 1798090 | 29.56
Div3TailNum | String | 60832 | 1798094 | 29.56
Div4TailNum | String | 60810 | 1798072 | 29.57
Div5TailNum | String | 60810 | 1798072 | 29.57
Div5LongestGTime | String | 60810 | 1798072 | 29.57
Div4WheelsOff | String | 60810 | 1798072 | 29.57
Div5Airport | String | 60810 | 1798072 | 29.57
Div5WheelsOff | String | 60810 | 1798072 | 29.57
Div5TotalGTime | String | 60810 | 1798072 | 29.57
Div4TotalGTime | String | 60810 | 1798072 | 29.57
Div4Airport | String | 60810 | 1798072 | 29.57
Div4WheelsOn | String | 60810 | 1798072 | 29.57
Div5WheelsOn | String | 60810 | 1798072 | 29.57
Div4LongestGTime | String | 60810 | 1798072 | 29.57
Year | UInt16 | 70202 | 3603599 | 51.33
Carrier | FixedString(2) | 66818 | 3607460 | 53.99
Div2AirportID | Int32 | 104658 | 7204623 | 68.84
Div2AirportSeqID | Int32 | 104709 | 7207940 | 68.84
AirlineID | Int32 | 89216 | 7255957 | 81.33
Div3AirportID | Int32 | 84463 | 7177138 | 84.97
Div3AirportSeqID | Int32 | 84463 | 7177170 | 84.97
Div4AirportSeqID | Int32 | 84290 | 7176876 | 85.15
Div4AirportID | Int32 | 84290 | 7176876 | 85.15
Div5AirportSeqID | Int32 | 84290 | 7176876 | 85.15
Div5AirportID | Int32 | 84290 | 7176876 | 85.15
Flights | Int32 | 84290 | 7249951 | 86.01
UniqueCarrier | FixedString(7) | 87150 | 12099204 | 138.83
