ClickHouse HTTP protocol advanced features
3 Oct 2018
The ClickHouse HTTP protocol is simple and reliable: it is the basis for the official JDBC and ODBC drivers and for many 3rd party drivers and integrations. It can also be used directly with HTTP client libraries. In this article we describe two advanced features of the HTTP protocol: execution progress and sessions.
Introduction
There is an option to receive query execution progress data with the help of X-ClickHouse-Progress headers. In order to receive progress information we need to enable the send_progress_in_http_headers setting. This can be done either for each query independently or for the whole session, and thus for all queries within that session.
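For a single query, the setting can simply be passed as a GET parameter. Here is a minimal sketch (a session-wide variant is shown later in this article); curl -v prints the response headers so the progress can be observed:

echo 'SELECT * from system.numbers limit 50000000' | \
    curl -v "http://localhost:8123/?send_progress_in_http_headers=1" --data-binary @- > /dev/null

As we will see below, without response buffering only a few progress headers may arrive before the body starts.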
Session
Let’s clarify the term session in the ClickHouse HTTP protocol. A session provides the possibility to group queries together. For example, we may specify special settings for the session, and all requests within this session will be affected. Sometimes this is more convenient than specifying those settings for each query. We group queries into a session by specifying a unique session_id – an arbitrary string chosen by the client. The client has to provide the session_id to the server with every query that should be bound to the session. To do this, we add the session_id GET parameter to the request. By default, the session is terminated after 60 seconds of inactivity, but we can change this timeout, either with the default_session_timeout setting in the server configuration XML file, or by adding the session_timeout GET parameter to the request. Only one query at a time can be executed within a session, so we can get the following error:
HTTP/1.1 500 Internal Server Error
Session is locked by a concurrent client.
if we attempt to run multiple queries concurrently within one session.
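Here is a minimal sketch of working with a session from the command line. The SESSION_ID value is an arbitrary string chosen by the client, the 300-second session_timeout is just an example value, and max_threads is used only as an illustration of a session-level setting:

SESSION_ID="my-session-$(date +%s)"      # arbitrary string chosen by the client
# set a session-level setting; it will affect all subsequent queries in this session
echo 'SET max_threads=2' | \
    curl -sS "http://localhost:8123/?session_id=$SESSION_ID&session_timeout=300" --data-binary @-
# the setting is visible to later queries bound to the same session
echo "SELECT name, value FROM system.settings WHERE name = 'max_threads'" | \
    curl -sS "http://localhost:8123/?session_id=$SESSION_ID" --data-binary @-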
Now, let’s return to the X-ClickHouse-Progress headers. In order to understand progress reporting in full detail, we need to cover one more topic.
Response Buffering
We can enable response buffering on the server side. Two parameters are provided for this purpose: buffer_size and wait_end_of_query. We can add them as GET parameters to the query URL.
buffer_size determines the number of bytes of the result to buffer in server memory. If the result body is larger than this threshold, the buffer is written to the HTTP channel as it fills up, and the remaining data is sent directly to the HTTP channel.
To ensure that the entire response is buffered, we can specify the wait_end_of_query=1 parameter. In this case, the data that does not fit in memory will be buffered in a temporary server file.
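As a quick illustration, both parameters can be combined in the query URL; the 1 MB buffer size below is an arbitrary example value:

# buffer up to 1 MB in memory and wait for the query to finish before sending the body
echo 'SELECT * from system.numbers limit 50000000' | \
    curl -v "http://localhost:8123/?buffer_size=1048576&wait_end_of_query=1" --data-binary @- > /dev/null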
Buffering is most useful in the following cases:
- To avoid situations where a query processing error occurs after the response code and HTTP headers have already been sent to the client. HTTP headers are sent at the very beginning of the data transfer, and if an error occurs during a long-running query, there is no way to change the HTTP status code (the headers have already been sent); instead, an error message is written at the end of the response body. On the client side, the error can only be detected at the parsing stage, which may be inconvenient. Getting an HTTP error code is much more straightforward.
- To report query progress. Since query progress is reported via X-ClickHouse-Progress headers, there is a restriction: all progress headers have to be sent before the actual response body data. This is a general HTTP protocol requirement – all headers must be sent before the response body. So, in order to receive a meaningful progress report, we need to ask the server to buffer the whole output, send progress reports via HTTP headers, and send the response body afterwards.
If no wait_end_of_query=1 parameter is specified, progress headers are sent only until the buffered response body reaches buffer_size (i.e. while buffering is still in progress); once body data starts to be sent directly to the HTTP channel, no additional HTTP headers can be sent – all headers have already been delivered. Thus we receive only partial progress information, followed by the response body.
Now let’s take a look at some examples.
Raw HTTP exchange. Let’s launch a long-running query on a big data set: SELECT sequential numbers from 0 to 50000000 (not including 50000000) in one request. Pay attention to the wait_end_of_query=1 and send_progress_in_http_headers=1 parameters:
HTTP REQUEST
POST /?wait_end_of_query=1&send_progress_in_http_headers=1 HTTP/1.1
Host: localhost:8123
Content-Length: 44
Content-Type: text/plain
SELECT * from system.numbers limit 50000000
HTTP RESPONSE
HTTP/1.1 200 OK
Date: Fri, 28 Sep 2018 10:14:06 GMT
Connection: Keep-Alive
Content-Type: text/tab-separated-values; charset=UTF-8
Transfer-Encoding: chunked
Keep-Alive: timeout=3
X-ClickHouse-Progress: {"read_rows":"3211264","read_bytes":"25690112","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"7077888","read_bytes":"56623104","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"10420224","read_bytes":"83361792","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"14024704","read_bytes":"112197632","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"17891328","read_bytes":"143130624","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"21692416","read_bytes":"173539328","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"25427968","read_bytes":"203423744","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"29294592","read_bytes":"234356736","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"33161216","read_bytes":"265289728","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"37027840","read_bytes":"296222720","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"40894464","read_bytes":"327155712","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"44695552","read_bytes":"357564416","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"48234496","read_bytes":"385875968","total_rows":"0"}
1000
0
1
2
--- CUT HERE ---
49999998
49999999
0
As we can see, multiple X-ClickHouse-Progress headers are returned by the server while the data is being buffered.
The server sends these headers one by one, so the client can receive and interpret them as they arrive. Once the whole response has been buffered on the server side, the response body output starts. No headers can be sent from that point on, because the response body is already being delivered to the client. Pay attention to the last item, "total_rows":"0" – since system.numbers is a special case of an endless system table, ClickHouse does not know how many rows it has to scan, so no total_rows is reported. For tables stored with more typical storage engines, such as MergeTree, total_rows is reported as well, and we can see the real progress as read_rows/total_rows.
Example:
HTTP REQUEST
POST /?wait_end_of_query=1&send_progress_in_http_headers=1 HTTP/1.1
Host: localhost:8123
Content-Length: 92
Content-Type: text/plain
SELECT AVG(toInt64OrZero(MD5(toString(eventTime)))) FROM starexp1 GROUP BY prodTime limit 1
HTTP RESPONSE
HTTP/1.1 200 OK
Date: Tue, 02 Oct 2018 16:34:31 GMT
Connection: Keep-Alive
Content-Type: text/tab-separated-values; charset=UTF-8
X-ClickHouse-Server-Display-Name: ip-172-31-8-99.eu-west-2.compute.internal
Transfer-Encoding: chunked
Keep-Alive: timeout=3
X-ClickHouse-Progress: {"read_rows":"0","read_bytes":"0","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"196608","read_bytes":"3145728","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"327680","read_bytes":"5242880","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"393216","read_bytes":"6291456","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"524288","read_bytes":"8388608","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"655360","read_bytes":"10485760","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"786432","read_bytes":"12582912","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"851968","read_bytes":"13631488","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"917504","read_bytes":"14680064","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"1048576","read_bytes":"16777216","total_rows":"2000060416"}
--- CUT HERE ---
As we can see here, ClickHouse has to scan about 2 billion rows in order to get the result.
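With total_rows available, the progress headers can be turned into a percentage on the client side. Here is a rough sketch with curl and awk, reusing the same query against the starexp1 table from the example above; the awk field positions simply rely on the header layout shown in the responses above:

# dump response headers (-D -), discard the body, and convert progress headers to percentages
echo 'SELECT AVG(toInt64OrZero(MD5(toString(eventTime)))) FROM starexp1 GROUP BY prodTime limit 1' | \
    curl -sS -o /dev/null -D - \
    "http://localhost:8123/?wait_end_of_query=1&send_progress_in_http_headers=1" --data-binary @- | \
    awk -F'"' '/X-ClickHouse-Progress/ && $12 > 0 { printf "%.1f%% (%s of %s rows)\n", 100 * $4 / $12, $4, $12 }'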
Let’s take a look at some more examples, now with the curl tool, for convenience:
- Set up a session and enable send_progress_in_http_headers=1 for the whole session. Pay attention to wait_end_of_query=1 specified for the query:

SESSION_ID=$(date +%s)
echo $SESSION_ID
echo 'SET send_progress_in_http_headers=1' | curl -v "http://localhost:8123/?wait_end_of_query=1&session_id=$SESSION_ID" --data-binary @-
echo 'SELECT * from system.numbers limit 50000000' | curl -v "http://localhost:8123/?wait_end_of_query=1&session_id=$SESSION_ID" --data-binary @- > /dev/null
- Set up a session and enable send_progress_in_http_headers=1 for the whole session. Pay attention that NO wait_end_of_query=1 is specified for the query, so there will possibly be only a partial progress report before the response body delivery starts:

SESSION_ID=$(date +%s)
echo $SESSION_ID
echo 'SET send_progress_in_http_headers=1' | curl -v "http://localhost:8123/?session_id=$SESSION_ID" --data-binary @-
echo 'SELECT sleep(3), * from system.numbers limit 500000' | curl -v "http://localhost:8123/?session_id=$SESSION_ID" --data-binary @- > /dev/null
- No session specified, a single query:

echo 'SELECT sleep(3), * from system.numbers limit 1000000' | curl -v "http://localhost:8123/?wait_end_of_query=1&send_progress_in_http_headers=1" --data-binary @- > /dev/null
Conclusion
As we can see, ClickHouse provides tools for buffering, progress reporting and session management over the HTTP protocol.
If you need to export a lot of data via the HTTP interface, it is better to avoid the wait_end_of_query setting, as it can double the time you wait for a response.
In many cases it is much easier to monitor the progress of a query by executing
SELECT * FROM system.processes WHERE query_id = ...
in another HTTP connection.
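Here is a minimal sketch of that approach. The HTTP interface accepts a query_id GET parameter, so the client can choose its own identifier for the long-running query and then poll system.processes for it from a second connection; the identifier and the selected columns below are just examples:

QUERY_ID="export-$(date +%s)"            # arbitrary identifier chosen by the client
# start the long-running query in the background, tagged with our query_id
echo 'SELECT * from system.numbers limit 50000000' | \
    curl -sS "http://localhost:8123/?query_id=$QUERY_ID" --data-binary @- > /dev/null &
sleep 1                                  # give the query a moment to start
# from another HTTP connection, check how far it has progressed
echo "SELECT query_id, elapsed, read_rows FROM system.processes WHERE query_id = '$QUERY_ID' FORMAT Vertical" | \
    curl -sS "http://localhost:8123/" --data-binary @-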