
Do data types still matter? How column-based compression can make data types irrelevant

Working with databases for over twenty years typically generates a few habits you hardly get rid of. One of the first things I always do when optimizing performance is to take a look at the data types in use.

It may not seem like a big difference whether a single number takes up 1, 2, 4 or 8 bytes of storage, but if you want to store billions of numbers, the factor of eight between one byte and eight bytes has a big impact on the space you ultimately need, and therefore a direct impact on your database cost and on query speed. But what if the data is compressed before it’s stored on disk?

Lately, I had a short discussion with a potential customer.
They were asking whether choosing the smallest data type is still important when working with ClickHouse®.

Their reasoning was that ClickHouse, like other column-based DBMS, compresses individual columns rather than rows, so the compression algorithm should be able to compress the numbers equally well, no matter whether we choose the smallest possible type or simply use 64-bit integers everywhere.

Without a second thought I instantly said yes, it still matters, since even in a column-based DBMS the data is not compressed all the time.
But still… the question stayed in my mind, and I wanted to evaluate in more detail whether data types still matter in a column-based context.

To give away the TL;DR: Yes, data types still matter, but not if you’re only focusing on disk usage.

What are data types?

Before diving into the details and benchmarking results, let’s start with a brief recap of what data types are.

Basically, data types describe how data is represented, in our case within a DBMS.

If you want to store, for example, the number 57, you have multiple options for storing this number in ClickHouse.

You can use an unsigned integer type like UInt8 or UInt64; other possibilities are signed integers like Int16 or Int32.

Perhaps less obvious, but also possible, are the String type or one of the Decimal types.

The primary difference between all these types is how large the numbers can grow and how much space a single number takes up.

UInt8, for example, means you can store unsigned integers from 0 up to 2⁸ − 1 (255), i.e. 2⁸ = 256 distinct values. Each number takes up 8 bits (hence the name UInt8), or 1 byte of space.

Int32, on the other hand, allows you to store signed numbers between −2³¹ and 2³¹ − 1, but each number you store takes up 32 bits, or 4 bytes, of space.

So if you store our example number 57 in an Int32 column, you are actually storing the 57 in one byte plus additional zeros to fill up the other 3 bytes.

If you store the number 57 in a String column, the individual digits are converted into their ASCII representation and the total length of the string is stored alongside them: we store the information “string of length 2” plus two bytes representing the digit 5 and the digit 7.

That makes a total of at least 3 bytes (assuming the length is stored in one byte) to store the number 57 in a String field.
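To make this concrete, here is a quick sketch (not from the original post) using ClickHouse's byteSize() function, which reports the in-memory size of a value and therefore the fixed width of each integer type:

-- sketch: the in-memory size of the value 57 in different integer types
SELECT
    byteSize(toUInt8(57))  AS uint8_bytes,   -- 1 byte
    byteSize(toInt32(57))  AS int32_bytes,   -- 4 bytes
    byteSize(toUInt64(57)) AS uint64_bytes;  -- 8 bytes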

Regardless of whether we add extra zeros or account for the number’s length, using larger data types than necessary results in redundant data. This is where compression comes into play.

How does compression help with storing data?

As we’ve seen above, selecting the right data type can have a big impact on the amount of data we need to store on disk, even for a single number.

But what if we don’t want to only store single numbers, but thousands, millions or even trillions of numbers?

If we store them directly, uncompressed, the impact of the data type is exactly the same as for a single number.

If one number takes up 3 bytes then one million numbers take up 3 million bytes, and if a single number takes up only 1 byte then a million numbers take up only 1 million bytes.

That’s where compression kicks in.

I won’t go into detail about how compression works, as that could fill several books (many of which have already been written), but I want to explain the general idea.

Compression tries to remove redundancy from the data.

For example, if you have a list of numbers 5,5,5,5,5,5,5,5,8,8,8,9,1 you could write it in a compressed way as 8x5,3x8,9,1.

As we can see in this simple example, the long run of fives compresses quite well. The original list took 13 characters (not counting commas), the new one only 8.

But what about different data types in this regard?

When choosing a non-optimal data type, you’re basically increasing the space every single value takes up.

But as seen before, this extra space is mostly filled with zeros, so if you have millions of values you’re adding many millions of zeros to your data. In other words, you are adding a lot of redundant data, and redundancy is exactly what compression removes.

To go back to our example: let’s say each number in our list takes up 3 bytes. Then the list would look as follows:

005,005,005,005,005,005,005,005,008,008,008,009,001 which would result in 39 characters to store.

But the compressed list would look like this: 8x005,3x008,009,001 or 16 characters.

So even though the raw data takes up 3 times as much space because we chose a larger data type, the compressed data only takes up 2 times as much.

Don’t explain basics to me, I want to see real data

As described above, choosing bigger data types than needed mainly adds redundant data to the uncompressed blocks, but when writing data to disk, the redundancy should be eliminated again thanks to compression.

To test this, we use a simple test setup.

We’ll store 100 million rows containing random values between 0 and 100, using different data types.

Then we’ll look at the amount of space these take up on disk, and later on we’ll do some simple queries to see if different data types have an impact on performance.

Test setup

First we define our test table, using ClickHouse’s default compression and a lot of different data types to store the numbers 0 to 100 in:

CREATE TABLE dtypes.data_default (
  id UInt32,
  u8 UInt8,
  u16 UInt16,
  u32 UInt32,
  u64 UInt64,
  u256 UInt256,
  udec3 Decimal(3,0),
  udec30 Decimal(30,0),
  ulstr LowCardinality(String),
  ustr String,

  i8 Int8,
  i16 Int16,
  i32 Int32,
  i64 Int64,
  i256 Int256,

  idec3 Decimal(3,0),
  idec30 Decimal(30,0),
  ilstr LowCardinality(String),
  istr String
) ENGINE = MergeTree
ORDER BY id;

To check if there is a difference when storing signed values, I also added the same columns again but prefixed with an i.

To get genuinely negative values while keeping the same data distribution, we’ll subtract 50, resulting in values from -50 to 49.

Now we’ll fill the table with 100 million rows of random numbers between 0 and 100. We explicitly want random values so that the dataset itself compresses poorly, because we only want to see whether the redundancy added by the data type gets compressed away.

INSERT INTO dtypes.data_default
SELECT number, rand() % 100 AS u, u, u, u, u, u, u, 
  toString(u), toString(u), 
  rand() % 100 - 50 AS i, i, i, i, i, 
  i, i, toString(i), toString(i)
FROM numbers(100000000);

After the insert, I manually ran an OPTIMIZE TABLE data_default FINAL to make sure the data ends up in as few parts as possible, so we see the strongest effect of the compression.
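Spelled out with the database prefix, plus an optional check against system.parts (a small sketch, not part of the original setup) to confirm how many active parts remain:

OPTIMIZE TABLE dtypes.data_default FINAL;

-- sketch: verify that the table now consists of as few parts as possible
SELECT count() AS active_parts
FROM system.parts
WHERE database = 'dtypes' AND table = 'data_default' AND active;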

First result

So how much space does each data type now take up on disk?

Fortunately, ClickHouse collects a lot of statistics in its system tables, so we can easily get the data we want by running a simple query:

SELECT
    name,
    substring(type, 1, 50) AS type,
    formatReadableSize(data_compressed_bytes) AS comp,
    formatReadableSize(data_uncompressed_bytes) AS uncomp,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = 'dtypes' AND table = 'data_default'
And this is the result:

name    type                    comp        uncomp      ratio
id      UInt32                  383.12 MiB  381.47 MiB  1
u8      UInt8                   95.78 MiB   95.37 MiB   1
u16     UInt16                  146.84 MiB  190.73 MiB  1.3
u32     UInt32                  279.67 MiB  381.47 MiB  1.36
u64     UInt64                  285.15 MiB  762.94 MiB  2.68
u256    UInt256                 380.01 MiB  2.98 GiB    8.03
udec3   Decimal(3, 0)           279.67 MiB  381.47 MiB  1.36
udec30  Decimal(30, 0)          286.53 MiB  1.49 GiB    5.33
ulstr   LowCardinality(String)  95.89 MiB   95.56 MiB   1
ustr    String                  182.86 MiB  276.57 MiB  1.51
i8      Int8                    95.78 MiB   95.37 MiB   1
i16     Int16                   155.06 MiB  190.73 MiB  1.23
i32     Int32                   281.05 MiB  381.47 MiB  1.36
i64     Int64                   284.83 MiB  762.94 MiB  2.68
i256    Int256                  379.24 MiB  2.98 GiB    8.05
idec3   Decimal(3, 0)           281.05 MiB  381.47 MiB  1.36
idec30  Decimal(30, 0)          285.77 MiB  1.49 GiB    5.34
ilstr   LowCardinality(String)  95.89 MiB   95.56 MiB   1
istr    String                  189.39 MiB  315.67 MiB  1.67

As you can see, the results match our basic assumption: the bigger the data type, the better the compression ratio becomes. Simply put, even if a bigger data type in theory takes up twice the space of a smaller one, that doesn’t mean the data still takes up twice the space on disk after compression.

But we also see that the smallest possible data type (UInt8) takes up around 96 MiB while the biggest one (UInt256) takes 380 MiB, still roughly a factor of 4 in disk usage, which does still matter a lot.

Compression algorithms and codecs

Since disk I/O is the most important factor for speed in ClickHouse, and disk space is an important factor for cost, we could finish here and conclude that data types still matter, and all would be fine.

But ClickHouse offers a wide variety of options for defining columns, including special compression algorithms and codecs that optimize the compression ratio.

The default algorithm is LZ4, which is fast and has a good compression ratio overall, but using ZSTD can achieve better compression at the cost of a bit more CPU.

There are also special codecs, like T64, which can be applied to make data more compressible.

To change a codec in ClickHouse, you just change the definition of the column you want to apply it to:

CREATE TABLE dtypes.data_t64_zstd (
    id UInt32,
    u8 UInt8  CODEC(T64, ZSTD),
    u16 UInt16  CODEC(T64, ZSTD),
…
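The codec of an existing column can also be changed later via ALTER TABLE, and system.columns shows which codec each column currently uses; a small sketch (the column chosen here is just an example):

-- existing parts keep their old codec until they are rewritten (e.g. by a merge)
ALTER TABLE dtypes.data_default MODIFY COLUMN u16 UInt16 CODEC(T64, ZSTD);

-- check the codec assigned to each column
SELECT name, compression_codec
FROM system.columns
WHERE database = 'dtypes' AND table = 'data_default';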

The T64 codec

I want to give a short description of the T64 codec, as it can have a strong impact on the compression ratio in the case of using bigger data types than needed.

It works by taking blocks of 64 values, writing them into a matrix, transposing that matrix, and handing the transposed matrix to the compression algorithm.

For better understanding, I will demonstrate the logic on 5 values between 0 and 100, stored with a width of 5 digits:

00015
00004
00091
00057
00081

Looking at these numbers, they are hard to compress, but after applying a T64-style transposition, they become a lot easier to compress:

00000
00000
00000
10958
54171

So leading zeros in the value representation turn into whole rows of zeros, resulting in better compression ratios. Unfortunately, T64 cannot be applied to all data types.

Data types and disk space consumption

So finally, let’s see how much data really ends up on disk, depending on the data types, codecs and compression algorithms:
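The comparison is based on the same system.columns statistics as before; a query along these lines pulls the compressed sizes across all the test tables (a sketch, assuming the additional tables such as data_t64_zstd live in the same dtypes database):

SELECT
    table, name,
    formatReadableSize(data_compressed_bytes) AS comp
FROM system.columns
WHERE database = 'dtypes' AND name IN ('u8', 'u32', 'u64', 'u256', 'ustr')
ORDER BY name, table;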

We can see quite the interesting results.

When using the default compression algorithm, the amount of disk space (and therefore cost and I/O performance) varies quite a lot, going from 96 MiB for UInt8 up to 380 MiB for UInt256.

Using ZSTD, on the other hand, leads to much smaller variance in sizes: the minimum is still around 80 MiB for UInt8, but the maximum is only 118 MiB for UInt32 (or Decimal(3, 0)), and even storing the values as strings only results in 111 MiB for the unsigned and 119 MiB for the signed ones.

Looking at the results with the T64 codec, the effect is exactly what we hoped for: the sizes only range from 83.8 MiB to 85.5 MiB, with nearly no difference between LZ4 and ZSTD.

No disk consumption difference… That means data types really don’t matter anymore?

Whoah! Not so fast with your conclusions there!

Even with the results we saw regarding disk space usage, we’re not at the end yet.

Of course, I/O performance is the main bottleneck most of the time when talking about the general performance of ClickHouse (or any other DBMS), but there are more things to consider:

Speed of compression algorithm

If all algorithms were equally fast and compressed equally well, there would only be one algorithm to use, right?

In real-world scenarios, the compression algorithm can make a big difference in speed, particularly when inserting new data. This can have a massive impact on how many inserts your application can handle on a given infrastructure.

Memory usage

Even if different data types end up taking an equal amount of space on disk thanks to compression, once you use the data in queries you are no longer working with compressed data: ClickHouse has to decompress it before feeding it into the actual calculation logic.

We can already see this in a very simple example.

Let’s say we need to group by our values within different data types, and want to get the average value of the id column:

SELECT u8 AS grp, avg(id) 
FROM dtypes.data_default 
GROUP BY grp 
ORDER BY grp

Grouping as well as sorting has to be done on uncompressed data, of course, which means the memory usage of this simple query differs quite a lot depending on the data type.
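To compare the memory footprint across types, the peak memory usage of recent queries can be read from the query log (a sketch; it assumes query logging is enabled, which it is by default):

SELECT
    formatReadableSize(memory_usage) AS mem,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%GROUP BY grp%'
ORDER BY event_time DESC
LIMIT 5;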

(For simplicity, I’ll omit the signed results from now on, as they are largely redundant with the unsigned ones.)

So the data type has a direct impact on the memory usage of queries, which can make a big difference in how many queries you can run in parallel, or whether a query finishes at all.

Query speed

When dealing with large datasets, query speed is mostly limited by I/O performance. But when the data is already in memory, for example thanks to caching, the I/O impact isn’t as critical anymore.

Instead other things can take over.

It can matter how fast you can copy data within memory itself, or whether some hardware acceleration can be used.

Most CPU cores have no problem handling 32-bit integer values, but handling 256-bit integers may need special logic in the code. Some very fast sorting algorithms only work for small data types, while others work best only for integer types, and so on.

And of course, the decompression and transformation of the data also takes some time.

I ran the query from the previous section via clickhouse-benchmark to measure the speed for the different types and algorithms:
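An invocation along these lines reproduces the measurement (a sketch; the iteration count is arbitrary and connection options are omitted):

echo "SELECT u8 AS grp, avg(id) FROM dtypes.data_default GROUP BY grp ORDER BY grp" | \
  clickhouse-benchmark -i 20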

The absolute numbers are not that relevant here; only the relation between the types is interesting. As we can see, data types can have a big impact on query speed.

Conclusion

So in the end, the tests and benchmarks confirmed my initial gut feeling.

Even with such extreme compression as ClickHouse can provide, data types do still matter quite a lot.

So it’s still a valuable investment of your time to check your data and choose the smallest data type needed, especially if you want to go big.

Of course, if disk space is the only thing that matters to you, the case might be different. Especially if you want to store long-term archives or a lot of uncleaned or raw data, you might simply use Strings and ZSTD compression.

That way, you can also spot when someone accidentally sent incorrect values (like free-form user input instead of numbers).

But once you really want to work with your data in the fastest possible way, you should always invest the time to choose the right types.

Your future self will thank you 🙂.

Originally published by Stefan while he was at Double.Cloud.


ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc.

