ClickHouse HTTP protocol advanced features
3 Oct 2018
The ClickHouse HTTP protocol is simple and reliable: it is the basis for the official JDBC and ODBC drivers and for many 3rd party drivers and integrations. It can also be used directly with HTTP client libraries. In this article we describe two advanced features of the HTTP protocol: execution progress and sessions.
Introduction
There is an option to receive query execution progress data with the help of X-ClickHouse-Progress headers. In order to receive progress information we need to enable the send_progress_in_http_headers setting. This can be done either for each query independently or for the whole session, and thus for all queries within that session.
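For a single query, the setting can simply be passed as a GET parameter. Here is a minimal sketch (a session-wide variant is shown later in this article); curl -v prints the response headers so the progress can be observed:

echo 'SELECT * from system.numbers limit 50000000' | \
    curl -v "http://localhost:8123/?send_progress_in_http_headers=1" --data-binary @- > /dev/null

As we will see below, without response buffering only a few progress headers may arrive before the body starts.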
Session
Let’s clarify the term session in the ClickHouse HTTP protocol. A session provides the possibility to group queries together. For example, we may specify special settings for the session, and all requests within this session will be affected. Sometimes this is more convenient than specifying those settings for each query. We group queries into a session by specifying a unique session_id – an arbitrary string chosen by the client. The client has to provide the session_id to the server with every query that should be bound to the session. To do this, we add the session_id GET parameter to the request. By default, the session is terminated after 60 seconds of inactivity, but we can change this timeout, either with the default_session_timeout setting in the server configuration XML file, or by adding the session_timeout GET parameter to the request. Only one query at a time can be executed within a session, so we can get the following error:
HTTP/1.1 500 Internal Server Error
Session is locked by a concurrent client.
if we attempt to run multiple queries concurrently within one session.
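Here is a minimal sketch of working with a session from the command line. The SESSION_ID value is an arbitrary string chosen by the client, the 300-second session_timeout is just an example value, and max_threads is used only as an illustration of a session-level setting:

SESSION_ID="my-session-$(date +%s)"      # arbitrary string chosen by the client
# set a session-level setting; it will affect all subsequent queries in this session
echo 'SET max_threads=2' | \
    curl -sS "http://localhost:8123/?session_id=$SESSION_ID&session_timeout=300" --data-binary @-
# the setting is visible to later queries bound to the same session
echo "SELECT name, value FROM system.settings WHERE name = 'max_threads'" | \
    curl -sS "http://localhost:8123/?session_id=$SESSION_ID" --data-binary @-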
Now, let’s return to the X-ClickHouse-Progress headers. In order to understand progress reporting in full detail, we need to cover one more topic.
Response Buffering
We can enable response buffering on the server side. Two parameters are provided for this purpose: buffer_size and wait_end_of_query. We can add them as GET parameters to the query URL.
buffer_size determines the number of bytes of the result to buffer in server memory. If the result body is larger than this threshold, the buffer is written to the HTTP channel as it fills up, and the remaining data is sent directly to the HTTP channel.
To ensure that the entire response is buffered, we can specify the wait_end_of_query=1 parameter. In this case, the data that does not fit in memory will be buffered in a temporary server file.
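As a quick illustration, both parameters can be combined in the query URL; the 1 MB buffer size below is an arbitrary example value:

# buffer up to 1 MB in memory and wait for the query to finish before sending the body
echo 'SELECT * from system.numbers limit 50000000' | \
    curl -v "http://localhost:8123/?buffer_size=1048576&wait_end_of_query=1" --data-binary @- > /dev/null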
Buffering is most useful in the following cases:
- To avoid situations where a query processing error occurs after the response code and HTTP headers have already been sent to the client. HTTP headers are sent at the very beginning of the data transfer, and if an error occurs during a long-running query, there is no way to change the HTTP status code (the headers have already been sent); instead, an error message is written at the end of the response body. On the client side, the error can only be detected at the parsing stage, which may be inconvenient. Getting an HTTP error code is much more straightforward.
- To report query progress. Since query progress is reported via X-ClickHouse-Progress headers, there is a restriction: all progress headers have to be sent before the actual response body data. This is a general HTTP protocol requirement – all headers must be sent before the response body. So, in order to receive a meaningful progress report, we need to ask the server to buffer the whole output, send progress reports via HTTP headers, and send the response body afterwards.
If no wait_end_of_query=1 parameter is specified, progress headers are sent only until the buffered response body reaches buffer_size (i.e. while buffering is still in progress); once body data starts to be sent directly to the HTTP channel, no additional HTTP headers can be sent – all headers have already been delivered. Thus we receive only partial progress information, followed by the response body.
Now let’s take a look at some examples.
Raw HTTP exchange. Let’s launch a long-running query on a big data set: SELECT sequential numbers from 0 to 50000000 (not including 50000000) in one request. Pay attention to the wait_end_of_query=1 and send_progress_in_http_headers=1 parameters:
HTTP REQUEST
POST /?wait_end_of_query=1&send_progress_in_http_headers=1 HTTP/1.1
Host: localhost:8123
Content-Length: 44
Content-Type: text/plain
SELECT * from system.numbers limit 50000000
HTTP RESPONSE
HTTP/1.1 200 OK
Date: Fri, 28 Sep 2018 10:14:06 GMT
Connection: Keep-Alive
Content-Type: text/tab-separated-values; charset=UTF-8
Transfer-Encoding: chunked
Keep-Alive: timeout=3
X-ClickHouse-Progress: {"read_rows":"3211264","read_bytes":"25690112","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"7077888","read_bytes":"56623104","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"10420224","read_bytes":"83361792","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"14024704","read_bytes":"112197632","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"17891328","read_bytes":"143130624","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"21692416","read_bytes":"173539328","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"25427968","read_bytes":"203423744","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"29294592","read_bytes":"234356736","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"33161216","read_bytes":"265289728","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"37027840","read_bytes":"296222720","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"40894464","read_bytes":"327155712","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"44695552","read_bytes":"357564416","total_rows":"0"}
X-ClickHouse-Progress: {"read_rows":"48234496","read_bytes":"385875968","total_rows":"0"}
1000
0
1
2
--- CUT HERE ---
49999998
49999999
0
As we can see, multiple X-ClickHouse-Progress headers are returned by the server while the data is being buffered.
The server sends these headers one by one, so the client can receive and interpret them as they arrive. Once the whole response has been buffered on the server side, the response body output starts. No headers can be sent from that point on, because the response body is already being delivered to the client. Pay attention to the last item, "total_rows":"0" – since system.numbers is a special case of an endless system table, ClickHouse does not know how many rows it has to scan, so no total_rows is reported. For tables stored with more typical storage engines, such as MergeTree, total_rows is reported as well, and we can see the real progress as read_rows/total_rows.
Example:
HTTP REQUEST
POST /?wait_end_of_query=1&send_progress_in_http_headers=1 HTTP/1.1
Host: localhost:8123
Content-Length: 92
Content-Type: text/plain
SELECT AVG(toInt64OrZero(MD5(toString(eventTime)))) FROM starexp1 GROUP BY prodTime limit 1
HTTP RESPONSE
HTTP/1.1 200 OK
Date: Tue, 02 Oct 2018 16:34:31 GMT
Connection: Keep-Alive
Content-Type: text/tab-separated-values; charset=UTF-8
X-ClickHouse-Server-Display-Name: ip-172-31-8-99.eu-west-2.compute.internal
Transfer-Encoding: chunked
Keep-Alive: timeout=3
X-ClickHouse-Progress: {"read_rows":"0","read_bytes":"0","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"196608","read_bytes":"3145728","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"327680","read_bytes":"5242880","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"393216","read_bytes":"6291456","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"524288","read_bytes":"8388608","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"655360","read_bytes":"10485760","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"786432","read_bytes":"12582912","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"851968","read_bytes":"13631488","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"917504","read_bytes":"14680064","total_rows":"2000060416"}
X-ClickHouse-Progress: {"read_rows":"1048576","read_bytes":"16777216","total_rows":"2000060416"}
--- CUT HERE ---
As we can see here, ClickHouse has to scan about 2 billion rows in order to get the result.
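With total_rows available, the progress headers can be turned into a percentage on the client side. Here is a rough sketch with curl and awk, reusing the same query against the starexp1 table from the example above; the awk field positions simply rely on the header layout shown in the responses above:

# dump response headers (-D -), discard the body, and convert progress headers to percentages
echo 'SELECT AVG(toInt64OrZero(MD5(toString(eventTime)))) FROM starexp1 GROUP BY prodTime limit 1' | \
    curl -sS -o /dev/null -D - \
    "http://localhost:8123/?wait_end_of_query=1&send_progress_in_http_headers=1" --data-binary @- | \
    awk -F'"' '/X-ClickHouse-Progress/ && $12 > 0 { printf "%.1f%% (%s of %s rows)\n", 100 * $4 / $12, $4, $12 }'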
Let’s take a look at some more examples, now with the curl tool, for convenience:
- Set up a session and enable send_progress_in_http_headers=1 for the whole session. Pay attention to wait_end_of_query=1 specified for the query:

SESSION_ID=$(date +%s)
echo $SESSION_ID
echo 'SET send_progress_in_http_headers=1' | curl -v "http://localhost:8123/?wait_end_of_query=1&session_id=$SESSION_ID" --data-binary @-
echo 'SELECT * from system.numbers limit 50000000' | curl -v "http://localhost:8123/?wait_end_of_query=1&session_id=$SESSION_ID" --data-binary @- > /dev/null
- Set up a session and enable send_progress_in_http_headers=1 for the whole session. Pay attention that NO wait_end_of_query=1 is specified for the query, so there will possibly be only a partial progress report before the response body delivery starts:

SESSION_ID=$(date +%s)
echo $SESSION_ID
echo 'SET send_progress_in_http_headers=1' | curl -v "http://localhost:8123/?session_id=$SESSION_ID" --data-binary @-
echo 'SELECT sleep(3), * from system.numbers limit 500000' | curl -v "http://localhost:8123/?session_id=$SESSION_ID" --data-binary @- > /dev/null
- No session specified, a single query:

echo 'SELECT sleep(3), * from system.numbers limit 1000000' | curl -v "http://localhost:8123/?wait_end_of_query=1&send_progress_in_http_headers=1" --data-binary @- > /dev/null
Conclusion
As we can see, ClickHouse provides tools for buffering, progress reporting and session management over the HTTP protocol.
If you need to export a lot of data via the HTTP interface, it is better to avoid the wait_end_of_query setting, as it can double the time you wait for a response.
In many cases it is much easier to monitor the progress of a query by executing
SELECT * FROM system.processes WHERE query_id = ...
in another HTTP connection.
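Here is a minimal sketch of that approach. The HTTP interface accepts a query_id GET parameter, so the client can choose its own identifier for the long-running query and then poll system.processes for it from a second connection; the identifier and the selected columns below are just examples:

QUERY_ID="export-$(date +%s)"            # arbitrary identifier chosen by the client
# start the long-running query in the background, tagged with our query_id
echo 'SELECT * from system.numbers limit 50000000' | \
    curl -sS "http://localhost:8123/?query_id=$QUERY_ID" --data-binary @- > /dev/null &
sleep 1                                  # give the query a moment to start
# from another HTTP connection, check how far it has progressed
echo "SELECT query_id, elapsed, read_rows FROM system.processes WHERE query_id = '$QUERY_ID' FORMAT Vertical" | \
    curl -sS "http://localhost:8123/" --data-binary @-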