What’s Up with Parquet Performance in ClickHouse?

We speak with the lip, and we dream in the soul,
Of some better and fairer day

– Friedrich Schiller

Apache Parquet is a popular open source, cross-platform data format. It is, of course, supported in ClickHouse. It is often used as a storage format in the Hadoop ecosystem, but there are other use cases as well. For example, it provides a convenient way to move data from Snowflake to ClickHouse.

Over the last year there have been numerous enhancements to Parquet compatibility and performance in ClickHouse from Altinity, ClickHouse Inc., and other committers. When Alexey Milovidov presented Parquet performance improvements during the ClickHouse 23.4 release webinar, I immediately gave it a try in Altinity.Cloud. Querying Parquet files on S3 with the same efficiency as MergeTree tables sounded very tempting. The first test runs were very encouraging, so I continued testing.

Why Parquet?

Some users may wonder why Parquet is interesting for ClickHouse at all. ClickHouse is famous for its real-time analytics performance, while Parquet emerged from the heavyweight Hadoop ecosystem. Is there anything in common?

Actually, there is quite a lot. Parquet is a columnar data format, similar to ClickHouse MergeTree. Both are designed for high performance interactive processing. And both work well for petabyte scale data volumes.

There is an important difference, though: unlike MergeTree, Parquet is a cross-platform format. Parquet data can be produced and consumed by totally different systems. For example, ETL processes may generate Parquet data and place it on S3. Then databases like ClickHouse or Oracle, ML frameworks, and BI tools can start working with this data instantly! No migration is needed.

This openness, combined with the built-in performance characteristics of Parquet, makes it unique. There is a vast amount of data in Parquet format already. Imagine if one day ClickHouse could work with Parquet data as efficiently as with MergeTree tables. It would make ClickHouse adoption much wider, opening yet another big door to Big Data.

Let’s see how far ClickHouse is from this dream in May 2023.

Preparing Parquet Files

In order to run performance comparison tests, we need to prepare data in Parquet format first. We will use the “ontime” airlines dataset, which contains 200M rows, and the Altinity.Cloud demo server. Use demo/demo credentials if you want to run these examples yourself.

The dataset already exists on the demo server, so exporting it to Parquet files in an S3 bucket can be done with a simple INSERT statement:

INSERT INTO FUNCTION s3('https://s3.us-east-1.amazonaws.com/altinity-clickhouse-data/airline/data/ontime_parquet/{_partition_id}.parquet', /* credentials */ 'Parquet')
PARTITION BY Year
SELECT * FROM ontime

The size of the files in the bucket is 4.6GB. This is the first surprise! The size of the source MergeTree table is 13.3GB, almost 3 times more. Parquet applies data type specific encodings automatically to every column, and compression on top of that. The same can be done in ClickHouse as well, but it requires the user to define encodings manually for every column; the defaults are much less efficient.
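
As an illustration (a sketch only, not the schema used in this test), per-column codecs in ClickHouse are declared like this; the column list is shortened and the codec choices are just plausible examples:

-- Hypothetical table definition showing explicit per-column codecs
CREATE TABLE ontime_codecs
(
    Year UInt16 CODEC(Delta, ZSTD),           -- small, slowly changing values
    FlightDate Date CODEC(DoubleDelta, ZSTD), -- mostly monotonic dates
    DepDelay Int32 CODEC(T64, ZSTD),          -- integers with a narrow value range
    Origin FixedString(5) CODEC(ZSTD)         -- plain compression for short strings
    -- ... remaining columns
)
ENGINE = MergeTree
ORDER BY (Year, FlightDate)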

If you are interested in Parquet format internals, please refer to the very informative ClickHouse article on this topic. See also the Appendix for per-column statistics.

Running Queries

As a reminder, all SELECT queries can be run on the Altinity.Cloud demo server. Use demo/demo as credentials.

We will use the first 3 test queries from the benchmark. The MergeTree table is stored on an EBS disk. It would probably be interesting to compare MergeTree on S3 as well, but the goal here is to compare “native” MergeTree with “native” Parquet.

Since data from block storage is usually cached in the OS page cache, we will also add ‘SETTINGS min_bytes_to_use_direct_io=1’ to every MergeTree table query. That makes sure data is always loaded over the network: from EBS for the MergeTree table, and from S3 for Parquet.

Before running benchmark queries, let’s test how fast ClickHouse can read the full table. It can be done with this query:

SELECT count() FROM ontime WHERE NOT ignore(*) SETTINGS min_bytes_to_use_direct_io=1

It took 42s, not bad.

In order to query Parquet files, we will use the S3 table function.

SELECT count() FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet') WHERE NOT ignore(*) 

16.6s! Reading from Parquet is 2.5 times faster than from the MergeTree table!!! It seems too good to be true, but the results are on the screen. So let’s start testing queries.

Query 1: This is a full-scan of two columns.

SELECT avg(c1)
FROM (
    SELECT Year, Month, count(*) AS c1
    FROM ontime
    GROUP BY Year, Month)
SETTINGS min_bytes_to_use_direct_io=1

It takes just 80ms; ClickHouse is very fast.

SELECT avg(c1)
FROM (
    SELECT Year, Month, count(*) AS c1
    FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet')
    GROUP BY Year, Month)

Ouch! Our high hopes just came crashing back to Earth. The query over Parquet takes anywhere from 800 to 1600 ms. Usually the first run is slower and the following ones are faster, but it may slow down again later as well. Even the fastest 800ms is still 10 times slower than MergeTree, though it is still a great result given that the data is in files on S3.

Query 2: This query includes a filter on the partitioning column.

SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC
SETTINGS min_bytes_to_use_direct_io=1

23ms. Outstanding!

SELECT DayOfWeek, count(*) AS c
FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet')
WHERE Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC

600ms. This is still under a second, but 25 times slower than MergeTree!

It is worth repeating that S3 performance varies between runs. The S3 variance may be up to 100%, while EBS performance is more consistent. This is probably because EBS uses a dedicated network interface, while S3 data is transferred over the shared network inside the AWS us-east-1 region.
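
Since the Parquet files are named after the partitioning column (one file per Year), a possible hand optimization, which we did not benchmark here, is to push the Year filter into the glob pattern so that only the needed files are read:

SELECT DayOfWeek, count(*) AS c
FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/200{0..8}.parquet')
WHERE Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC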

Query 3: This query includes a filter on a non-partitioning column:

SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC
SETTINGS min_bytes_to_use_direct_io=1

80ms. ClickHouse does not slow down.

SELECT DayOfWeek, count(*) AS c
FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet')
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC

2000ms. This is where Parquet becomes significantly slower. 

In general, we can see that Parquet query performance is excellent for remote data of this size. One can quickly explore a 200M row dataset, but it is still an order of magnitude slower than working with a MergeTree table.

For usability’s sake, we can even hide the S3 function call behind a view, as follows:

CREATE VIEW ontime_parquet AS
SELECT * FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet')

Now, we can run queries against the ‘ontime_parquet’ view without caring where the data actually resides.
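
For example, Query 2 can be rewritten against the view like this:

SELECT DayOfWeek, count(*) AS c
FROM ontime_parquet
WHERE Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC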

More Tests

There are a few other experiments that we can conduct with this Parquet dataset.

Maximum S3 Throughput

We can measure the maximum throughput of the S3 table function if we remove any parsing on the ClickHouse side. This can be done with the RawBLOB format:

SELECT count() FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet', 'RawBLOB') WHERE NOT ignore(*)

The result is 4.9s for 4.95GB of data, so ClickHouse can load from S3 as fast as 1GB/s, which is outstanding throughput! The server network is 12 gigabit, so it is close to network saturation. If we add ‘SETTINGS max_threads=32’ in order to raise the number of threads to the maximum, we can even load it in 4 seconds. There are instances with 15, 25, 50, 75 and even 100 gigabit network interfaces that can be tested. I tried an instance with a 15 gigabit network and got 3.16 seconds for the query above, which is 1.5GB/s. The ClickHouse team did great work boosting S3 performance!
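
For reference, the variant with the increased thread count only differs in the SETTINGS clause:

SELECT count() FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet', 'RawBLOB') WHERE NOT ignore(*)
SETTINGS max_threads=32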

Maximum EBS Throughput

As we showed above, scanning the MergeTree table fully takes 42s. The table size is 13.3GB, so we can calculate the average retrieval speed as 316MB/s. Unlike Parquet, we cannot isolate network speed from data processing speed. The demo server has a stacked volume configuration with 3 old gp2 volumes, 250MB/s each. It gives us up to 750MB/s throughput for queries touching a lot of data, but may drop down to 250MB/s for smaller ones. In Altinity.Cloud we can easily bump the storage to 1000MB/s per volume using gp3 EBS volumes even without stacking. That brings the full scan down to 29s, which is still almost twice as long as Parquet’s 16.6s.

Unfortunately, there is no easy way to measure the maximum EBS throughput from the ClickHouse side, but we can do the following. Remember that we bypassed the OS page cache with direct I/O. If we enable it back, we get the speed of processing MergeTree data that is already in RAM:

SELECT count() FROM ontime WHERE NOT ignore(*)

Running this query for the first time fills the page cache, and the second run gives us 6.8s. So we may conclude that loading from EBS adds 29 – 6.8 = 22.2s. That gives an estimated throughput of 13.3GB / 22.2s = 600MB/s, which is lower than expected, so there may be some other overhead that we missed.

S3 Virtual Columns

The S3 table function supports virtual columns, like _file and _path. What if we select those only? 

SELECT DISTINCT _file, _path
FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet')
ORDER BY _file

This query runs in 2000ms. This is very slow, given that ClickHouse should not need to parse the Parquet files at all; apparently it does. An optimization is needed here.

Partition Granularity

Let’s test how ClickHouse works with Parquet files having more granular partitioning. We will partition by Year and Month this time:

INSERT INTO FUNCTION s3('https://s3.us-east-1.amazonaws.com/altinity-clickhouse-data/airline/data/ontime_parquet2/{_partition_id}.parquet', /* credentials */ 'Parquet') 
PARTITION BY toYYYYMM(FlightDate)
SELECT * FROM ontime

Now, there are 408 Parquet files compared to 35 in the previous test. Does it make any difference? Apparently, it does. Here are the query times after several retries:

Query 1: 5.5s
Query 2: 5.5s
Query 3: 10s

All queries became at least 5 times slower! This happens because ClickHouse has to make more calls to S3, and establishing a connection is expensive compared to file access. This is also the reason why MergeTree does not perform very well on S3 and requires a local cache for fast query times.
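
As a reference point, the queries against the more granular layout only differ in the bucket prefix. Here is Query 1, assuming the ontime_parquet2 prefix used in the export above:

SELECT avg(c1)
FROM (
    SELECT Year, Month, count(*) AS c1
    FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet2/*.parquet')
    GROUP BY Year, Month)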

For comparison, if we partition the MergeTree table the same way, so that it has 408 partitions, the query results are the following:

Query 1: 142ms
Query 2: 38ms
Query 3: 130ms

It is slightly slower, but not that much. You can check it by running the test queries against the ‘ontime_2’ table.

Scalability

All the tests above were executed on a pretty powerful 32 vCPU m6g AWS instance. What if we run the same queries on a smaller machine instead? The performance should degrade, but how much? This is very easy to test in Altinity.Cloud since scaling the instance up or down takes just a couple of minutes.

Here is what we get on a 4 vCPU instance:

Query 1: 13s
Query 2: 9s
Query 3: 40s

Note that the results were very inconsistent between retries, especially for the longest query, Query 3. On average, the performance degradation was 17x.

For comparison, here are the ClickHouse MergeTree results for the 4 vCPU machine. The degradation is close to linear:

Query 1: 740ms
Query 2: 17ms – even faster than 32 vCPU!
Query 3: 500ms

So Parquet query performance degrades non-linearly and in an unpredictable way. On smaller machines, queries over MergeTree still run in under a second, but Parquet becomes very sluggish.

Summary of ‘ontime’ Tests

Here is a table summarizing the results of the different tests on 32 vCPU instances.

Test | MergeTree on EBS | Parquet on S3
Compressed data size | 13.3GB | 4.6GB
Full table scan | 42s (29s on a better volume) | 16.6s
Full table scan w/cache | 6.8s | n/a
Maximum throughput | 600MB/s* | 1000MB/s
Full table parsing | 6.8s | 11.7s
Query 1 | 80ms | 800ms
Query 2 | 23ms | 600ms
Query 3 | 80ms | 2000ms
408 vs 35 partitions | x1.8 slower | x5-7 slower
Scale down 32->4 vCPUs | x5.5 slower | x17 slower

* Note that this is what we got from a single volume. Multiple volumes can give better throughput.

SSB Benchmark over Parquet Files

For the next experiment, I uploaded the Star Schema Benchmark (SSB) dataset to S3. We use this dataset quite regularly for tests, e.g. when testing new Graviton instances on AWS recently. The dataset contains 600M rows.

The procedure to upload to the S3 bucket is similar to what we did with the ‘ontime’ table. The data has been partitioned with the same expression as the MergeTree table: toYYYYMM(LO_ORDERDATE).

INSERT INTO FUNCTION s3('https://altinity-clickhouse-data.s3.amazonaws.com/ssb/data/lineorder_wide_parquet/{_partition_id}.parquet', /* credentials */ 'Parquet') 
PARTITION BY toYYYYMM(LO_ORDERDATE)
SELECT * FROM lineorder_wide

The load resulted in 80 Parquet files. You can check the table as follows:

SELECT count() FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/ssb/data/lineorder_wide_parquet/*.parquet', 'Parquet')

The size of the files in the bucket is 59.8GB. The source MergeTree table, at 68.6GB, is about 15% bigger.
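
As a side note, here is a quick sketch of how such MergeTree table sizes can be checked from system.parts (assuming the lineorder_wide table used above):

-- Compressed on-disk size of all active parts of the table
SELECT formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'lineorder_wide' AND active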

Once the data was uploaded, I ran test queries from the benchmark. Right away I discovered a bug: the LO_ORDERDATE column of Date type was incorrectly stored as UInt16 in Parquet during the INSERT, so queries could not run. An explicit type conversion is needed. For example, here is the Q1.1 query from the benchmark. Note the toDate conversion:

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/ssb/data/lineorder_wide_parquet/*.parquet', 'Parquet')
WHERE toYear(toDate(LO_ORDERDATE)) = 1993 
  AND LO_DISCOUNT >= 1 AND LO_DISCOUNT <= 3 AND LO_QUANTITY < 25

The query runs in 4 seconds on a demo server.

In order to make it easy to execute the benchmark script without re-writing the queries, I’ve created a view with the correct type mapping. 

CREATE VIEW lineorder_wide_parquet AS SELECT * FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/ssb/data/lineorder_wide_parquet/*.parquet', 'Parquet', 'LO_ORDERKEY UInt32, LO_LINENUMBER UInt8, LO_CUSTKEY UInt32, LO_PARTKEY UInt32, LO_SUPPKEY UInt32, LO_ORDERDATE Date, LO_ORDERPRIORITY LowCardinality(String), LO_SHIPPRIORITY UInt8, LO_QUANTITY UInt8, LO_EXTENDEDPRICE UInt32, LO_ORDTOTALPRICE UInt32, LO_DISCOUNT UInt8, LO_REVENUE UInt32, LO_SUPPLYCOST UInt32, LO_TAX UInt8, LO_COMMITDATE Date, LO_SHIPMODE LowCardinality(String), C_CUSTKEY UInt32, C_NAME String, C_ADDRESS String, C_CITY LowCardinality(String), C_NATION LowCardinality(String), C_REGION Enum8(\'ASIA\' = 0, \'AMERICA\' = 1, \'AFRICA\' = 2, \'EUROPE\' = 3, \'MIDDLE EAST\' = 4), C_PHONE String, C_MKTSEGMENT LowCardinality(String), S_SUPPKEY UInt32, S_NAME LowCardinality(String), S_ADDRESS LowCardinality(String), S_CITY LowCardinality(String), S_NATION LowCardinality(String), S_REGION Enum8(\'ASIA\' = 0, \'AMERICA\' = 1, \'AFRICA\' = 2, \'EUROPE\' = 3, \'MIDDLE EAST\' = 4), S_PHONE LowCardinality(String), P_PARTKEY UInt32, P_NAME LowCardinality(String), P_MFGR Enum8(\'MFGR#2\' = 0, \'MFGR#4\' = 1, \'MFGR#5\' = 2, \'MFGR#3\' = 3, \'MFGR#1\' = 4), P_CATEGORY String, P_BRAND LowCardinality(String), P_COLOR LowCardinality(String), P_TYPE LowCardinality(String), P_SIZE UInt8, P_CONTAINER LowCardinality(String)')
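
With the view in place, Q1.1 no longer needs the explicit toDate conversion, since the view maps LO_ORDERDATE to Date:

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_wide_parquet
WHERE toYear(LO_ORDERDATE) = 1993
  AND LO_DISCOUNT >= 1 AND LO_DISCOUNT <= 3 AND LO_QUANTITY < 25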

After that, I ran the benchmark script using both the original MergeTree table, and a view on Parquet files in S3.

To keep it short, here are the total query times on a 32 vCPU m7g.8xlarge instance:

Total for all queries using EBS storage: 0.824s
Total for all queries using Parquet at S3: 117s

So with more complex queries including multiple columns and various filter conditions, the performance of Parquet on S3 is more than 100 times slower compared to a MergeTree table on network block storage.

Conclusion

ClickHouse 23.4 can read Parquet files very quickly. It can also run SQL queries directly on Parquet data in S3. This can already be used for data exploration, when waiting several seconds for query results is acceptable.

However, running queries on Parquet data is still 10 to 100 times slower compared to using MergeTree tables, and the performance is sometimes unpredictable. For fast analytics, data needs to live natively in ClickHouse.

The ClickHouse team is focused on Parquet performance, and we can expect more news on that in the upcoming releases. For example, the recently added ParquetMetadata format is a significant step toward understanding Parquet internals. In the future, ClickHouse may start using Parquet statistics and indices, which should dramatically improve query performance. We are watching the progress here closely, and maybe our dreams will come true. Stay tuned!


Appendix: ClickHouse vs Parquet Compression

ParquetMetadata allows us to compare compression between ClickHouse and Parquet. Feel free to run the scary query below and compare the compressed size of every column between ClickHouse MergeTree and Parquet.

SELECT
    column,
    type,
    parquet_compressed,
    ch_compressed,
    round(ch_compressed / parquet_compressed, 2) AS delta_pct
FROM
(
    SELECT
        column,
        sum(num_rows),
        sum(column_uncompressed_size) AS parquet_uncompressed,
        sum(column_compressed_size) AS parquet_compressed
    FROM
    (
        WITH arrayJoin(columns) AS c
        SELECT
            _file,
            num_rows,
            metadata_size,
            total_uncompressed_size,
            total_compressed_size,
            tupleElement(c, 'name') AS column,
            tupleElement(c, 'total_uncompressed_size') AS column_uncompressed_size,
            tupleElement(c, 'total_compressed_size') AS column_compressed_size
        FROM s3('https://altinity-clickhouse-data.s3.amazonaws.com/airline/data/ontime_parquet/*.parquet', 'ParquetMetadata')
    ) t
    GROUP BY column
) a
LEFT JOIN
(
    SELECT
        name AS column,
        type,
        data_compressed_bytes AS ch_compressed,
        data_uncompressed_bytes AS ch_uncompressed
    FROM system.columns
    WHERE table = 'ontime'
) b
USING (column)
ORDER BY delta_pct
FORMAT PrettyCompactNoEscapesMonoBlock

Results are in the table below. The delta_pct column shows how many times larger the compressed column is in ClickHouse compared to Parquet. It is evident that ClickHouse has something to learn from Parquet.

column | type | parquet_compressed | ch_compressed | delta_pct
FirstDepTime | String | 4659136 | 8962255 | 1.92
Div1WheelsOn | String | 2347236 | 5019645 | 2.14
Div1WheelsOff | String | 1950235 | 4421328 | 2.27
DivActualElapsedTime | String | 1644314 | 4108295 | 2.5
DivArrDelay | String | 1586525 | 4052818 | 2.55
Div1TailNum | String | 1793081 | 4581368 | 2.56
TotalAddGTime | String | 2711426 | 7047210 | 2.6
LongestAddGTime | String | 2701512 | 7041212 | 2.61
FlightNum | String | 128726188 | 353694196 | 2.75
CarrierDelay | Int32 | 38827764 | 111261897 | 2.87
LateAircraftDelay | Int32 | 38227037 | 110532507 | 2.89
NASDelay | Int32 | 40200903 | 125924889 | 3.13
WeatherDelay | Int32 | 7071239 | 22318526 | 3.16
CRSArrTime | Int32 | 144559830 | 458671729 | 3.17
Div1TotalGTime | String | 1291019 | 4113600 | 3.19
Div1Airport | String | 1334579 | 4279867 | 3.21
DistanceGroup | UInt8 | 32295467 | 104475032 | 3.23
Div1LongestGTime | String | 1247101 | 4072141 | 3.27
CancellationCode | FixedString(1) | 4249913 | 15041288 | 3.54
DepDelayMinutes | Int32 | 166242215 | 615813324 | 3.7
ArrDelayMinutes | Int32 | 178001377 | 670450130 | 3.77
Diverted | UInt8 | 1634810 | 6371679 | 3.9
FlightDate | Date | 1310723 | 5151038 | 3.93
DepTime | Int32 | 283131081 | 1135190651 | 4.01
WheelsOn | Int32 | 231922061 | 933103787 | 4.02
ArrTime | Int32 | 284400437 | 1145333849 | 4.03
WheelsOff | Int32 | 231619648 | 933539410 | 4.03
DivDistance | String | 932555 | 3792120 | 4.07
DestState | FixedString(2) | 49564667 | 203938854 | 4.11
OriginState | FixedString(2) | 48201906 | 197964351 | 4.11
Cancelled | UInt8 | 6082800 | 25631903 | 4.21
DepartureDelayGroups | String | 82068355 | 369514506 | 4.5
DepDelay | Int32 | 226562925 | 1027815573 | 4.54
CRSDepTime | Int32 | 112391395 | 515141687 | 4.58
OriginStateFips | String | 48305067 | 224607413 | 4.65
DestStateFips | String | 49668165 | 232237950 | 4.68
ArrDelay | Int32 | 238869892 | 1135345478 | 4.75
ActualElapsedTime | Int32 | 238738200 | 1149249078 | 4.81
TailNum | String | 199636109 | 976886384 | 4.89
DayofMonth | UInt8 | 723696 | 3541499 | 4.89
DivReachedDest | String | 752048 | 3679074 | 4.89
AirTime | Int32 | 189460922 | 926363463 | 4.89
TaxiOut | Int32 | 162625228 | 886428450 | 5.45
TaxiIn | Int32 | 142997702 | 821869047 | 5.75
DivAirportLandings | String | 786489 | 4529149 | 5.76
CRSElapsedTime | Int32 | 109125905 | 652783518 | 5.98
Distance | Int32 | 83948223 | 543776769 | 6.48
DayOfWeek | UInt8 | 516282 | 3634654 | 7.04
Div1AirportID | Int32 | 1312761 | 9567701 | 7.29
Div1AirportSeqID | Int32 | 1322341 | 9727748 | 7.36
Origin | FixedString(5) | 59033966 | 453990038 | 7.69
Dest | FixedString(5) | 61041717 | 469878803 | 7.7
ArrivalDelayGroups | Int32 | 96330478 | 787660135 | 8.18
OriginCityMarketID | Int32 | 61025579 | 544446661 | 8.92
DestCityMarketID | Int32 | 62782337 | 561425242 | 8.94
OriginAirportSeqID | Int32 | 58942261 | 529929952 | 8.99
DestAirportSeqID | Int32 | 60901555 | 549309770 | 9.02
OriginAirportID | Int32 | 58949313 | 563946879 | 9.57
DestAirportID | Int32 | 60955994 | 586675839 | 9.62
DepDel15 | Int32 | 25491464 | 273747032 | 10.74
DestCityName | String | 62431246 | 709430841 | 11.36
OriginCityName | String | 60464429 | 688214345 | 11.38
ArrDel15 | Int32 | 26443396 | 304918200 | 11.53
SecurityDelay | Int32 | 720296 | 8317100 | 11.55
OriginWac | Int32 | 48320067 | 629760834 | 13.03
DestWac | Int32 | 49683165 | 647759966 | 13.04
DestStateName | String | 50004856 | 663126514 | 13.26
OriginStateName | String | 48641205 | 645929000 | 13.28
DepTimeBlk | String | 73005404 | 1089363559 | 14.92
ArrTimeBlk | String | 74077944 | 1120128454 | 15.12
Month | UInt8 | 115684 | 1851583 | 16.01
Div2WheelsOn | String | 90690 | 1836815 | 20.25
Quarter | UInt8 | 85466 | 1816679 | 21.26
Div2Airport | String | 85696 | 1831433 | 21.37
Div2TotalGTime | String | 84321 | 1827171 | 21.67
Div2LongestGTime | String | 84055 | 1827091 | 21.74
Div2TailNum | String | 72555 | 1812944 | 24.99
Div2WheelsOff | String | 71051 | 1810818 | 25.49
Div3WheelsOn | String | 61113 | 1798422 | 29.43
Div3Airport | String | 61081 | 1798388 | 29.44
Div3TotalGTime | String | 61023 | 1798328 | 29.47
Div3LongestGTime | String | 61023 | 1798328 | 29.47
Div3WheelsOff | String | 60828 | 1798090 | 29.56
Div3TailNum | String | 60832 | 1798094 | 29.56
Div4TailNum | String | 60810 | 1798072 | 29.57
Div5TailNum | String | 60810 | 1798072 | 29.57
Div5LongestGTime | String | 60810 | 1798072 | 29.57
Div4WheelsOff | String | 60810 | 1798072 | 29.57
Div5Airport | String | 60810 | 1798072 | 29.57
Div5WheelsOff | String | 60810 | 1798072 | 29.57
Div5TotalGTime | String | 60810 | 1798072 | 29.57
Div4TotalGTime | String | 60810 | 1798072 | 29.57
Div4Airport | String | 60810 | 1798072 | 29.57
Div4WheelsOn | String | 60810 | 1798072 | 29.57
Div5WheelsOn | String | 60810 | 1798072 | 29.57
Div4LongestGTime | String | 60810 | 1798072 | 29.57
Year | UInt16 | 70202 | 3603599 | 51.33
Carrier | FixedString(2) | 66818 | 3607460 | 53.99
Div2AirportID | Int32 | 104658 | 7204623 | 68.84
Div2AirportSeqID | Int32 | 104709 | 7207940 | 68.84
AirlineID | Int32 | 89216 | 7255957 | 81.33
Div3AirportID | Int32 | 84463 | 7177138 | 84.97
Div3AirportSeqID | Int32 | 84463 | 7177170 | 84.97
Div4AirportSeqID | Int32 | 84290 | 7176876 | 85.15
Div4AirportID | Int32 | 84290 | 7176876 | 85.15
Div5AirportSeqID | Int32 | 84290 | 7176876 | 85.15
Div5AirportID | Int32 | 84290 | 7176876 | 85.15
Flights | Int32 | 84290 | 7249951 | 86.01
UniqueCarrier | FixedString(7) | 87150 | 12099204 | 138.83
