On 9 November 2022 Altinity reported on a potentially severe data corruption bug related to aggregate function data that affected multiple versions of ClickHouse. The bug is now fixed, thanks to a pull request from Raúl Marín of Tinybird with reviews from many ClickHouse developers. The Altinity QA team has also verified fixes in affected LTS versions. 

Read on to find out more and ensure you are no longer affected by the bug. 

What Happened?

As we reported earlier, there was a bug in data serialization and deserialization functions for common aggregate functions for String data. 

Aggregates work by storing intermediate results (aka “state”) in an aggregate function value that is then merged to retrieve the final result. Affected versions changed the aggegate functions so that they remove a null at the end of the string when generating intermediate results, then add it again when merging. However, data written by older versions of ClickHouse has the extra null already. When such older data are merged, affected ClickHouse servers add a superfluous null to string values.

The extra null can cause various problems. Most importantly it means that the same string values from old and new ClickHouse versions no longer compare correctly, which can break applications.

Affected functions include but are not limited to the following function types, among others. They are used when defining columns that contain aggregate intermediate results. 

  • AggregateFunction(argMax, String)
  • AggregateFunction(argMin, String)
  • AggregateFunction(max, String)
  • AggregateFunction(min, String)
  • AggregateFunction(any, String)

If you have written such values to AggregatingMergeTree or its variants using older ClickHouse versions, your data may be affected by the bug. Using materialized views to write data into AggregatingMergeTree triggers the problem as do INSERT…SEleCT commands. If you use either of these you definitely want to upgrade. 

For more details on the bug please see the Appendix.

Which ClickHouse Versions Are Affected and Which Are Fixed

The PR has been applied to recent ClickHouse releases starting with 22.3. The following table shows affected releases and the first version that has the patch. 

VersionAffected buildsFirst fixed buildRecommended Action
<= 22.2NONE
22.3 LTS22.3.13 – 22.3.1422.3.15Upgrade to fixed build
22.7>= 22.7.6Will not be fixedUpgrade to 22.8.10 or later
22.8 LTS22.8.6 – 22.8.9 22.8.10Upgrade to fixed build
22.9<= 22.9.522.9.6Upgrade to fixed build
22.10<= 22.10.322.10.4Upgrade to fixed build
22.1122.11.122.11.2Upgrade to fixed build
22.12NONE

In general versions of ClickHouse built and released after November 24, 2022 have the fix. 

Please consult the table in the previous section to see if you are running a ClickHouse community build that has the bug. Use the sample query provided in the Appendix if you are still not sure. 

  1. If the build you are running is affected, follow the “Recommended Action” to upgrade. 
  2. If the build you are running is prior to the first affected version, ensure that your next upgrade jumps to the first fixed version. 

The fix restores the older serialization format and will correctly retrieve states written by both older versions, and versions affected by this bug.

There is a corner case the fix does not handle. If the string value actually had a trailing null as part of the data and was written by a version with the bug, that null will be removed when merging. This should be very rare. If you have Altinity support and suspect you have run into this case, please contact us for help. 

The ClickHouse codebase now includes tests for this bug. We do not expect to see it in future versions. 

Altinity Stable Builds are not affected by the bug. You may safely use and upgrade between any Altinity Stable Version. 

We plan to post our first Altinity Stable Build version 22.8 in January 2023. It will likewise be based on code containing the fix. 

Altinity CI/CD has additional tests for the presence of the trailing null problem across a wide range of aggregate functions, so we likewise do not expect regressions in the future.  

Final Notes

Altinity would like to thank Raúl Marín of Tinybird as well as ClickHouse committers for their hard work to fix this bug and ensure our description of the remedy is precise. It was a community effort. Our appreciation to everyone involved.

Appendix – Trailing Null Bug Details

Read this if you want to understand the exact problem that occurred. It includes a query that shows how ClickHouse servers store aggregate data. It also has another query that demonstrates the differences between reading old and new data, showing how they return different results. You can use the latter query to see if your server is affected by the bug.

The first example query shows the result of running argMaxState on the string ‘Hi!’ and storing the intermediate representation in hex characters on a ClickHouse server that does not have the bug. (Version 22.10.4.23.) This simulates writing the result of ArgMaxState on a String value and turning it to hex. It returns 34 hex characters as we see below. (Affected servers return 34 – 2 = 32 hex characters because the null is removed.)

SELECT
    'Hi!' AS val,
    hex(argMaxState('Hi!', number)) AS hexMaxState,
    length(hexMaxState) AS lengthHexMaxState
FROM numbers(1) AS number

┌─val─┬─hexMaxState────────────────────────┬─lengthHexMaxState─┐
│ Hi! │ 0400000048692100010000000000000000 │                34 │
└─────┴────────────────────────────────────┴───────────────────┘

Let’s now take that good argMaxState value and use it to simulate merging the aggregate on the same server that does not have the bug. We’ll call the specialized ClickHouse finalizeAggregation function to complete the merge from the hex data. Observe that we get ‘Hi!’ back as the answer and the character length is 3. 

SELECT
    '0400000048692100010000000000000000' AS hexMaxStateGood,
    finalizeAggregation(CAST(unhex(hexMaxStateGood), 'AggregateFunction(argMax, String, UInt64)')) AS out,
    hex(out) AS hexOut,
    length(out) AS lenOut

┌─hexMaxStateGood────────────────────┬─out─┬─hexOut─┬─lenOut─┐
│ 0400000048692100010000000000000000 │ Hi! │ 486921 │      3 │
└────────────────────────────────────┴─────┴────────┴────────┘

Now let’s run the same simulation on a server that does have the bug. (Version  22.10.1.1877.) The ‘Hi!’ value comes back with an extra null and the character length is 4.  Check the hex output and length to see the problem. This value is not equal to the previous representation of ‘Hi!’, which can cause applications to fail.

SELECT
    '0400000048692100010000000000000000' AS hexMaxStateGood,
    finalizeAggregation(CAST(unhex(hexMaxStateGood), 'AggregateFunction(argMax, String, UInt64)')) AS out,
    hex(out) AS hexOut,
    length(out) AS lenOut

┌─hexMaxStateGood────────────────────┬─out─┬─hexOut───┬─lenOut─┐
│ 0400000048692100010000000000000000 │ Hi! │ 48692100 │      4 │
└────────────────────────────────────┴─────┴──────────┴────────┘

In summary, if you see the above result, your server is affected. Note that any data this server writes will be OK, since it uses the newer storage format. However, if you wrote String data with argMaxState or similar functions on an older ClickHouse version and then upgraded, it is likely that you’ll see the problem with trailing nulls when you select data back using the corresponding merge function. 

To reiterate, the problem arises when an affected server reads back (i.e., merges) aggregate data written by earlier ClickHouse versions that used the original aggregate state format. The data that return are no longer properly comparable with newer string values.

Share

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.