Boosting Performance and Flexibility of ClickHouse Key-Value Pair Extraction

Key-value pairs are widely used to organize data, providing a versatile approach to associating values with unique keys or identifiers. Databases, configuration files, communication protocols, and application logs are a few of the practical usages of key-value pairs.

Take, for instance, the spring boot properties file. It can be organized in a fairly simple key-value pair format.

url=localhost:5432
username=foo
password=bar

Another example is logfmt. It is a logging format convention that organizes log entries as key-value pairs. Below you can find a log sample generated by an HTTP application that uses logfmt.

baseUrl="/" hostname=localhost protocol=http operation=get

Last but not least, the FIX format is a great example of a practical application of key-value pairs. It is a standardized messaging protocol for streamlined communication in the financial industry. 

8=FIX.4.49=14835=D34=108049=TESTBUY152=20180920-18:14:19.50856=TESTSELL111=63673064027889863415=USD21=238=700040=154=155=MSFT60=20180920-18:14:19.49210=092

Parsing key-value pairs format is a crucial task for many applications, whether for generating metrics & insights, processing protocol-specific messages, or converting to different data formats. Only three key-value pair formats have been introduced above, and one can already find substantial differences in how each represents what are a key, a value, and a delimiter between pairs.

There are many standard formats like JSON, CSV, and others that can be easily parsed, but there are also many situations where the data is not in a structured format or does not conform to a standard schema. Not only that, even the idea of having specific functions for each and every format is a challenge.

While there may be software libraries available that can extract key-value pairs from these formats, using them can introduce downsides like increased latency, compatibility issues, and the need to transfer data to different applications, which can add complexity and potentially impact performance. This can complicate the extraction process and increase the risk of errors or data loss.

Obviously, we need a different solution. Since this is ClickHouse, we want it to be very fast, too. The remainder of the article discusses a new ClickHouse feature to address this need.

extractKeyValuePairs Function

The brand new extractKeyValuePairs and extractKeyValuePairsWithEscaping functions implemented by Altinity in #43606 provide a simple and efficient way to extract key-value pairs from structured and unstructured data. They can be used in a variety of scenarios to simplify data analysis and improve productivity. Here is a quick example.

SELECT extractKeyValuePairs('url=localhost:5432\n          
username=foo\n          
password=bar', '=', '\n')

┌─extractKeyValuePairs('url=localhost:5432\n          username=foo\n          password=bar', '=', '\n')─┐
│ {'url':'localhost:5432','          username':'foo','          password':'bar'}                        │
└───────────────────────────────────────────────────────────────────────────────────────────────────────┘

Note: Both extractKeyValuePairs and extractKeyValuePairsWithEscaping provide very similar functionality, so for the sake of simplicity, the extractKeyValuePairs function will be used to address both where applicable.

This function implements a couple of best-effort parsing strategies and relies on delimiters. For that reason, it is able to gracefully ignore most types of noise and does not abort extraction because of invalid formatting.

The extractKeyValuePairs API offered by ClickHouse allows the user to specify the delimiter characters that compose a key-value pair.

  • data – String to extract key-value pairs from. String or FixedString.
  • key_value_delimiter – Character to be used as delimiter between the key and the value. Defaults to “:”. String or FixedString.
  • pair_delimiters – Set of characters to be used as delimiters between pairs. Defaults to “ ”, “,” and “;”. String or FixedString.
  • quoting_character – Character to be used as the quoting character. Defaults to “\””. String or FixedString.

Usage

The function has default values, so the simplest example does not require any additional arguments aside from the data itself.

SELECT extractKeyValuePairs('name:neymar, age:31 team:psg,nationality:brazil') AS kv

┌─kv───────────────────────────────────────────────────────────────┐
│ {'name':'neymar','age':'31','team':'psg','nationality':'brazil'} │
└──────────────────────────────────────────────────────────────────┘

In case the key-value delimiter is not a colon or there are multiple characters that delimit the pairs, those can be specified via arguments.

SELECT extractKeyValuePairs('name=neymar age=31 | nationality=brazil favorite_movie=\'batman\'', '=', '| ', '\'') AS kv

┌─kv────────────────────────────────────────────────────────────────────────────┐
│ {'name':'neymar','age':'31','nationality':'brazil','favorite_movie':'batman'} │
└───────────────────────────────────────────────────────────────────────────────┘

As mentioned previously, a very famous & common logging format is logfmt. The below example illustrates the extraction of pairs in a simple logfmt line.

SELECT extractKeyValuePairs('baseUrl="/" hostname=localhost protocol=http operation=get', '=')

┌─extractKeyValuePairs('baseUrl="/" hostname=localhost protocol=http operation=get', '=')─┐
│ {'baseUrl':'/','hostname':'localhost','protocol':'http','operation':'get'}              │
└─────────────────────────────────────────────────────────────────────────────────────────┘

A more complex case is parsing FIX format messages. FIX format uses the equal sign as a key-value delimiter and 0x01 as well as | as pair delimiters. For more information about FIX, check out the FIX Wikipedia page: https://en.wikipedia.org/wiki/Financial_Information_eXchange.

The following FIX message has been extracted from https://www.fixsim.com/sample-fix-messages.

8=FIX.4.4^9=148^35=D^34=1080^49=TESTBUY1^52=20180920-18:14:19.508^56=TESTSELL1^11=636730640278898634^15=USD^21=2^38=7000^40=1^54=1^55=MSFT^60=20180920-18:14:19.492^10=092^

Note the pair delimiter 0x01 is a non-printable character and has been replaced with "^" to better illustrate it.

The below example demonstrates how the extractKeyValuePairs function can be used to parse FIX messages. The message has been encoded in hexadecimal to support the 0x01 pair delimiter.

WITH
    '383d4649582e342e3401393d3134380133353d440133343d313038300134393d54455354425559310135323d32303138303932302d31383a31343a31392e3530380135363d5445535453454c4c310131313d3633363733303634303237383839383633340131353d5553440132313d320133383d373030300134303d310135343d310135353d4d5346540136303d32303138303932302d31383a31343a31392e3439320131303d30393201' AS hex,
    unhex(hex) AS raw
SELECT extractKeyValuePairs(raw, '=', '|')

┌─extractKeyValuePairs(raw, '=', '|')──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ {'8':'FIX.4.4','9':'148','35':'D','34':'1080','49':'TESTBUY1','52':'20180920-18:14:19.508','56':'TESTSELL1','11':'636730640278898634','15':'USD','21':'2','38':'7000','40':'1','54':'1','55':'MSFT','60':'20180920-18:14:19.492','10':'092'} │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Noisy Input

This function implements a couple of best-effort parsing strategies to handle noise and does not require the input to be strictly formatted in a key-value pair format. This is a powerful feature for dealing with imperfect data sources.

The following example contains a noisy input with a loose key-value pair in the middle of it. It illustrates the extractKeyValuePairs capabilities to ignore noisy characters.

SELECT extractKeyValuePairs('!#1=ad,\'8y company_name:Altinity :- \'a""')

┌─extractKeyValuePairs('!#1=ad,\'8y company_name:Altinity :- \'a""')─┐
│ {'company_name':'Altinity'}                                        │
└────────────────────────────────────────────────────────────────────┘

Escaping Support

If the input contains escape sequences, the withEscaping version of the function can be used. It has support for the following escape sequences: \x, \N, \a, \b, \e, \f, \n, \r, \t, \v and \0. Non-standard escape sequences are returned as it is (including the backslash) unless they are one of the following: \\, ‘, “, backtick, /, = or ASCII control characters (c <= 31). In a nutshell, it implements the same escape sequence support as the rest of ClickHouse.

Consider the following example: 'favorite_quote:"\\"premature optimization is the root of all evil\r\n\\""'. The value is quoted but also contains quoting characters in it. If the version without escaping support is used, the value will be cut at the first quoting character the algorithm finds.

SELECT extractKeyValuePairs('favorite_quote:"\\"premature optimization is the root of all evil\r\n\\""')

┌─extractKeyValuePairs('favorite_quote:"\\"premature optimization is the root of all evil\r\n\\""')─┐
│ {'favorite_quote':'\\'}                                                                           │
└───────────────────────────────────────────────────────────────────────────────────────────────────┘

Therefore, it is necessary to use the version with escaping support.

SELECT extractKeyValuePairsWithEscaping('favorite_quote:"\\"premature optimization is the root of all evil\r\n\\""')

┌─extractKeyValuePairsWithEscaping('favorite_quote:"\\"premature optimization is the root of all evil\r\n\\""')─┐
│ {'favorite_quote':'"premature optimization is the root of all evil\r\n"'}                                     │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Corner Cases and Caveats

The extractKeyValuePairs function relies on well-defined delimiters to perform the extraction, which means these should be picked carefully, with pair delimiters being probably one of the most notorious.

One of the obvious issues that might come up is when a pair delimiter appears in places where the value should be present.

Consider the following example: "name=neymar jr, avg_speed= 5km/h". In this case, the comma is the obvious pair delimiter. The whitespace character is a bit controversial, though. It is part of the keys and values, but it is also present right after the comma, acting as a pair delimiter.

In this case, the user must consider the trade-offs and decide whether the whitespace character should be treated as a pair delimiter or not. Below you can find the output of both cases.

Whitespace as a pair delimiter.

SELECT extractKeyValuePairs('name=neymar jr, avg_speed= 5km/h', '=', ', ') AS kv

┌─kv───────────────────────────────┐
│ {'name':'neymar','avg_speed':''} │
└──────────────────────────────────┘

Whitespace not set as a pair delimiter.

SELECT extractKeyValuePairs(' name=neymar jr, avg_speed= 5km/h', '=', ',') AS kv

┌─kv──────────────────────────────────────────┐
│ {' name':'neymar jr',' avg_speed':' 5km/h'} │
└─────────────────────────────────────────────┘

In the former, the name is missing the “Jr” part and avg_speed has an empty value. That is because a pair delimiter (whitespace in this case) was found in the middle of the values, which means end of value. On the other hand, the latter is not missing any data, but it includes leading and trailing whitespaces that might be unwanted.

Another important thing to keep in mind is that the maximum number of delimiters that can be used is 16, which includes key-value delimiter, pair delimiters, quoting character, and possibly the escaping character. All of these delimiters are a single character, except for pair delimiters, so that is the one to pay attention to.

In practice, it shouldn’t present a problem for the majority of cases since there is room for 13 pair delimiters, a limit that is unlikely to be reached. This hard limit is imposed due to Intel SSE4.2/2 optimizations. For more information, check out the APPENDIX A / Feature design & implementation section.

Benchmarks and Performance Analysis

ClickHouse is synonymous with speed, and it delivers as expected. To assess the performance of the extractKeyValuePairs function, a benchmark was conducted, pitting it against a practical regex approach employed on production systems for parsing & extracting key-value pairs from FIX format messages.

Environment

The benchmark dataset for this evaluation was sourced from the historical market data of agriculture, specifically the Corn Futures on July 15, 2013. To access the data, please visit https://www.cmegroup.com/market-data/datamine-historical-data/#marketDepth.

The test environment consisted of a ClickHouse table named extract_vs_regex, which was created using the following schema:

SHOW CREATE TABLE extract_vs_regex

┌─statement──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE default.extract_vs_regex
(
    `data` String,
    `extractKV` Map(String, String) ALIAS extractKeyValuePairs(data, '=', ' ,{}', '"'),
    `regex` Map(String, String) ALIAS CAST(arrayMap(s -> (splitByChar('=', s)[1], substring(s, length(splitByChar('=', s)[1]) + 2)), extractAll(substring(data, 1, 4000), '\\b\\w+=(?:"[^"]*?"|[^",} ]*)')), 'Map(String, String)')
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS index_granularity = 8192 │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The table contains three columns: data, which stores the FIX format messages, and two other computed columns: extractKV and regex. The extractKV column is an alias to extractKeyValuePairs(data, '=', ' ,{}', '"'), while regex is an alias to a set of auxiliary functions, and the regex itself CAST(arrayMap(s -> (splitByChar('=', s)[1], substring(s, length(splitByChar('=', s)[1]) + 2)), extractAll(substring(data, 1, 4000), '\\b\\w+=(?:"[^"]*?"|[^",} ]*)')), 'Map(String, String)').

The table was populated with the following command: INSERT INTO extract_vs_regex select * FROM file('test.fix', 'LineAsString').

It resulted in a table with 12082 rows.

SELECT count(data)
FROM extract_vs_regex

┌─count(data)─┐
│       12082 │
└─────────────┘

Below you can find a single row of the test data for better visualization of what is being parsed.

SELECT data
FROM extract_vs_regex
LIMIT 1

┌─data───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 1128=99=31835=d49=CME34=122852=20141214043048951864=1865=120866=2014121555=SP600-501010460=7207=SPI15=USD980=M870=7871=117872=S&P 600 Diversified Telecommunication Services (Industry)871=118872=SPUSA-600-USDUF--P-US-S--501010871=119872=XF871=122872=15871=123872=143011000871=124872=212011000871=27872=210=216 │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Results

The clickhouse-benchmark tool was used to assess the performance. Below, you can find both results.

extractKeyValuePairs

arthur@arthur:~$ clickhouse-benchmark -q "select extractKV from extract_vs_regex" -i 100
Loaded 1 queries.

Queries executed: 100.

localhost:9000, queries 100, QPS: 150.426, RPS: 1,817,441.514, MiB/s: 558.129, result RPS: 1,817,441.514, result MiB/s: 556.033.

0.000%     0.006 sec.
10.000%    0.006 sec.
20.000%    0.006 sec.
30.000%    0.006 sec.
40.000%    0.006 sec.
50.000%    0.006 sec.
60.000%    0.006 sec.
70.000%    0.007 sec.
80.000%    0.007 sec.
90.000%    0.008 sec.
95.000%    0.008 sec.
99.000%    0.011 sec.
99.900%    0.016 sec.
99.990%    0.016 sec.

Regex

arthur@arthur:~$ clickhouse-benchmark -q "select regex from extract_vs_regex" -i 100
Loaded 1 queries.

Queries executed: 70.

localhost:9000, queries 70, QPS: 74.777, RPS: 903454.695, MiB/s: 277.447, result RPS: 903454.695, result MiB/s: 271.345.

0.000%		0.012 sec.	
10.000%		0.012 sec.	
20.000%		0.012 sec.	
30.000%		0.012 sec.	
40.000%		0.012 sec.	
50.000%		0.013 sec.	
60.000%		0.013 sec.	
70.000%		0.013 sec.	
80.000%		0.014 sec.	
90.000%		0.015 sec.	
95.000%		0.018 sec.	
99.000%		0.020 sec.	
99.900%		0.022 sec.	
99.990%		0.022 sec.	

Queries executed: 100.

localhost:9000, queries 100, QPS: 74.506, RPS: 900180.378, MiB/s: 276.442, result RPS: 900180.378, result MiB/s: 270.361.

0.000%		0.012 sec.	
10.000%		0.012 sec.	
20.000%		0.012 sec.	
30.000%		0.012 sec.	
40.000%		0.012 sec.	
50.000%		0.013 sec.	
60.000%		0.013 sec.	
70.000%		0.013 sec.	
80.000%		0.014 sec.	
90.000%		0.015 sec.	
95.000%		0.018 sec.	
99.000%		0.022 sec.	
99.900%		0.022 sec.	
99.990%		0.022 sec.

Analysis

Based on the benchmark results, it is evident that the extractKeyValuePairs function significantly outperforms the regex approach in terms of both query and result processing speed.

For the extractKeyValuePairs approach, the benchmark results show a query rate of 150.426 queries per second (QPS), a result processing speed of 1,817,441.514 rows per second (RPS), and a throughput of 558.129 MiB/s. The latency measurements indicate that the majority of queries were processed within 0.006 to 0.008 seconds, with 99.990% of queries completed within 0.016 seconds.

In contrast, the regex approach demonstrates lower performance. The benchmark results for the regex approach show a query rate of 74.777 QPS, a result processing speed of 903,454.695 RPS, and a throughput of 277.447 MiB/s. The latency measurements reveal that most queries were processed within 0.012 to 0.015 seconds, with 99.990% of queries completed within 0.022 seconds.

Overall, the extractKeyValuePairs approach outperforms the regex approach by approximately double the query rate, result processing speed, and throughput. Moreover, the extractKeyValuePairs approach demonstrates consistently lower latencies across the benchmark measurements.

These results highlight the optimization and efficiency gains achieved by utilizing the extractKeyValuePairs function in ClickHouse compared to the traditional regex approach for parsing FIX format messages.

Wrap Up

There are a lot of applications out there that rely on key-value pairs for a variety of reasons, including configuration files, protocols for data transfer, and logging. This presents a challenge when it comes to data analysis, as the formats might vary, and it becomes difficult to normalize and extract data from it.

The new functions discussed in this article present an alternative & reliable approach to extracting key-value pairs from different formats. Its API is very flexible and allows several combinations of arguments to address various formats. Even though the API is flexible, a few corner cases must be taken into consideration while using it. Special characters being delimiters and valid part of keys and values is one of the cases discussed in this article.

The feature implementation tries to keep a balance between performance and customizability by using advanced string search algorithms, keeping memory allocations to a minimum, and using views when possible.

As demonstrated in the benchmarks earlier, the complex & highly optimized implementation pays off. The function presents a fairly simple & customizable API, while keeping a high bar on the performance.

This function is a great choice for anyone interested in extracting key-value pairs in ClickHouse, especially if the data contains noise (i.e. loose non key-value pair characters) or is using a not very well-known key-value pair format.

APPENDIX A / Feature Design & Implementation Details

This feature has been implemented as a state machine. Below you can find a visual representation of the state machine implemented.

The state machine for the version with escaping support is a bit different. It includes a few more state transitions to handle escaping properly.

By implementing it as a state machine, the code benefits from a clear separation between stages, enhancing its overall robustness and flexibility. The state machine design enables the code to discard invalid characters and reset to a known state if necessary. This capability proves particularly useful when dealing with complex or malformed input data, as the state machine can easily recover from errors and resume parsing from an intermediate stage.

Most state transitions are given by finding specific characters/delimiters, which means the implementation relies on string search algorithms to find those and decide what the next state is. The string search algorithms used are a key part of this implementation because it has a direct and meaningful impact on the performance.

Alongside the extractKeyValuePairs implementation, two string search algorithms were shipped: find_first_symbols and find_first_not_symbols. These two algorithms leverage Intel SSE4.2/2 capabilities if available on the system, which allows vectorized comparisons and can significantly speed up the processing time.

Below is a graph comparing the performance of the find_first_symbols implementation with std::find_first_of extracted from this benchmark. Needle sizes: 2, 4, 8, and 16 bytes.

The final important aspect of the implementation is how the elements (i.e. keys and values) are kept in memory and written into the output column. Ultimately, the input data needs to be processed & transferred to the output column. Transferring the data to the output column means at least one copy will take place. 

The challenge on this one is to avoid allocating memory & copying characters that will be later discarded because the code has reached an invalid state (e.g., found a pair delimiter in the middle of a key). Handling escape sequences spices it up a bit because the data needs to be modified and can’t be written as it is, mandating temporary values. Plus, a temporary object’s memory can’t be re-used by the output column because it relies on a contiguous chunk of memory.

To address these problems and keep the code sane for both implementations (with and without escaping support), two StringWriter strategies were implemented: NoEscapingStateHandler::StringWriter and InlineEscapingStateHandler::StringWriter.

Both of these implementations write directly into the output column and do not allocate memory for temporary objects. The main difference between these two is that the NoEscaping version relies on std::string_view and has a “lazy” write strategy, while the InlineEscaping version continuously writes into the output column.

Because the NoEscaping version implements a lazy write strategy, it is able to only allocate memory when it is certain the key-value pair is valid. The InlineEscaping version does not benefit from it, but upon discarding pairs, it won’t deallocate memory, which possibly saves some time for the next valid pairs.

Thank you for reading!

About the Author

Arthur Passos is a software engineer at Altinity. His interests include C++ programming, networking, and security. His recent work includes implementing the extractKeyValuePairs function, improvements to reverse DNS lookups, and in-progress changes to attach partition logic.

Share