Connecting ClickHouse to External Data Sources using the JDBC Bridge
Like a bridge over troubled water
I will lay me down
Paul Simon
ClickHouse is a piece of art. It is very fast and efficient. In real applications, however, ClickHouse is never used as a standalone – it integrates with other data sources and exposes query results to applications. OLAP is often built on top of an organization’s IT infrastructure that may include different database systems. Data from those systems needs to be loaded into ClickHouse one way or another in order to be used in analytical queries.
ClickHouse provides several generic mechanisms to talk to other databases: table functions, integration engines and external dictionaries. Table functions are convenient to run an ad-hoc query to an external system. Table engines allow it to represent external data as it was loaded into the ClickHouse table. External dictionaries store a snapshot of external data in RAM or cache.
All external systems are different and require different protocols to communicate. For most popular sources, like MySQL, PostgreSQL or MongoDB, ClickHouse has built-in connectors. For others it allows the use of ODBC or JDBC drivers that gives virtually unlimited possibilities to integrate. The problem, however, is the following. How can ClickHouse safely load and use third-party drivers that are not known in advance? In the case of JDBC, let’s also remember that ClickHouse is written in C++ but JDBC is Java.
When we faced this problem in Altinity.Cloud we did some brainstorming and found an excellent solution. Dmitry Titov from the Altinity support team prototyped it in two hours, and we deployed it to our client immediately. The approach is very elegant. Since it has been little discussed in community channels, we decided to fill the gap and write an article about it.
Bridge Pattern
Altinity.Cloud runs in Kubernetes. Kubernetes runs apps from images. The image is supposed to have all dependencies and libraries already installed. It would be very inconvenient to extend ClickHouse images and add extra connectivity libraries in there. Also, running third-party code needs to be as isolated as possible, so a problem in the driver does not crash ClickHouse. Fortunately, ClickHouse has a solution that is called a “bridge”. The idea is very similar to a bridge pattern in software design but at a higher level.
The bridge is a separate process that handles all the connectivity to external systems and provides an endpoint for ClickHouse to connect to. You can think of it as a proxy. The main advantage is that the bridge can be modified, configured, updated and maintained independently from ClickHouse itself. It can be wrapped as an image and deployed in a separate pod or container in Kubernetes without any changes to the main ClickHouse image.
ODBC or JDBC
Both ODBC Bridge and JDBC bridge are available for ClickHouse. So the next question to decide was about the first letter: O or J? ODBC seemed closer to ClickHouse, since it is written in C/C++. Altinity made a lot of contributions to the ClickHouse ODBC driver, so we’ve got quite a lot of experience there. However, ODBC is cumbersome to set up and configure. It requires setting up a driver manager first, installing a proper driver or drivers, and configuring DSNs in special files. In Kubernetes it is even more difficult, since it would require building an image containing all popular ODBC drivers. And we have not even looked at ARM support yet.
ClickHouse JDBC bridge seemed to be easier for many reasons. It is an actively developed project with a major release in mid-2021 followed by several patch releases. There are some interesting features, like schema management, named data sources and others. There is also a well documented docker image already available. Also Java has a standard way to load external libraries, so we decided to give it a try. For those who are curious about CVE-2021-44228, the log4j dependency has been completely removed in the 2.0.7 release to be safe.
Setting up ClickHouse JDBC Bridge
Setting up the ClickHouse JDBC bridge in Kubernetes is very simple. Here is a sample bridge deployment with MSSQL JDBC driver, and a service for ClickHouse to connect:
apiVersion: apps/v1
kind: Deployment
spec:
replicas: 1
selector:
matchLabels:
app: clickhouse-jdbc-bridge
template:
metadata:
labels:
app: clickhouse-jdbc-bridge
spec:
containers:
- name: clickhouse-jdbc-bridge
image: "clickhouse/jdbc-bridge"
command: ["sh", "-c", "rm -rf /app/drivers/* ; ./docker-entrypoint.sh"]
livenessProbe:
failureThreshold: 10
httpGet:
path: /metrics
port: http
scheme: HTTP
env:
- name: JDBC_DRIVERS
value: "com/microsoft/sqlserver/mssql-jdbc/10.1.0.jre8-preview/mssql-jdbc-10.1.0.jre8-preview.jar"
ports:
- containerPort: 9019
name: http
protocol: TCP
resources:
limits:
cpu: 200m
memory: 2Gi
requests:
cpu: 200m
memory: 256Mi
apiVersion: v1
kind: Service
metadata:
name: clickhouse-jdbc-bridge
labels:
app: clickhouse-jdbc-bridge
spec:
type: ClusterIP
ports:
- port: 9019
targetPort: 9019
protocol: TCP
name: http
selector:
app: clickhouse-jdbc-bridge
Note that the driver library is specified in the JDBC_DRIVERS environment variable. Multiple drivers can be separated by a comma. JDBC bridge loads configured drivers automatically from the Maven repository, which defaults to https://repo1.maven.org/maven2 but may be overwritten to any other location with MAVEN_REPO_URL environment variable.
Kubernetes ConfigMap can be used as well if we want to keep the deployment manifest separate from the list of used drivers. The same jdbc image can be used for any driver or drivers! If the driver version is updated we only need to restart.
Another possibility is to map ‘/app/drivers’ container path to a persistent volume and manage drivers entirely outside of the JDBC bridge deployment.
Connecting ClickHouse to JDBC bridge
In order to let ClickHouse know that it needs to connect to JDBC bridge we only need to add a small configuration file:
config.d/jdbc_bridge.xml:
<clickhouse>
<jdbc_bridge>
<host>clickhouse-jdbc-bridge</host>
<port>9019</port>
</jdbc_bridge>
</clickhouse>
Here, host and port should match those defined in the Kubernetes Service resource.
Once ClickHouse is configured, we can try it out. The connection can be tested with a trivial SQL statement using the jdbc table function:
SELECT * FROM
jdbc('jdbc:sqlserver://<server>:<port>;user=<user>;password=<pass>;encrypt=false', 'SELECT 1');
Note: we use the actual address of the MSSQL server here, not the bridge anymore.
If we know the database and table names, we can query tables directly:
SELECT * FROM
jdbc('jdbc:sqlserver://<server>:<port>;user=<user>;password=<pass>;encrypt=false;databaseName=test_db',
'test_table')
We can also create a table using JDBC engine:
CREATE TABLE jdbc_table
(
`col1` String,
`col2` String
) ENGINE = JDBC('jdbc:sqlserver://<server>:<port>;user=<user>;password=<password>;encrypt=false;databaseName=test_db;',
'', 'test_table')
Finally, let’s create a dictionary, but… Apparently, JDBC dictionaries are not currently supported. Fortunately, with ClickHouse there is always a workaround.
We can use a table with JDBC engine, or create a VIEW, and use it in as a dictionary source:
CREATE VIEW jdbc_dict_source AS
SELECT * FROM jdbc('jdbc:sqlserver://<server>:<port>;user=<user>;password=<password>;encrypt=false;databaseName=test_db', 'test_table')
CREATE DICTIONARY default.jdbc_dict
(
`key` String,
`value` String
)
PRIMARY KEY key
SOURCE(CLICKHOUSE(
table 'jdbc_dict_source'
))
LIFETIME(MIN 300 MAX 300)
LAYOUT(COMPLEX_KEY_HASHED())
Voila! We’ve got a dictionary connected to MSSQL server via JDBC.
Securing Database Credentials
In the examples above, user and password to MSSQL server were exposed in DDL and queries. Some DBAs would not be very happy exposing database credentials so easily. ClickHouse engineer Kseniia Sumarokova was not happy with that as well, so she added a new but not yet documented feature ‘named_collections’. Named collections allow defining a set of properties in the configuration file and use collection names instead of a connection string. This is how it could look like for a JDBC connection:
<clickhouse>
<named_collections>
<mssql_conn>
<driver>sqlserver</driver>
<host>my_host</host>
<port>my_port</port>
<user>my_user</user>
<password>my_password</password>
<database>my_db</database>
</mssql_conn>
</named_collections>
</clickhouse>
With that we could query JDBC table with a much simple query:
SELECT * FROM jdbc(mssql_conn, 'test_table')
Unfortunately, it does not work this way yet. Named collections are implemented only for MySQL, PostgreSQL, MongoDB, URL, S3 and Kafka connections, but not for generic JDBC. This looks like an easy feature to add in one of the future ClickHouse releases.
Is there another way to hide credentials? Of course, there is! It is possible to provide additional configuration to the bridge itself. Bridge supports named data sources that can be used instead of a full connection string. Data sources are loaded from the ‘/app/config/datasources’ folder. In Kubernetes we can store data source files in a ConfigMap, and map it directly to the folder:
apiVersion: v1
kind: ConfigMap
metadata:
name: clickhouse-jdbc-bridge-datasources
data:
mssql_con.json: |
{
"$schema": "../datasource.jschema",
"mssql_conn": {
"driverUrls": [
"https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/10.1.0.jre8-preview/mssql-jdbc-10.1.0.jre8-preview.jar"
],
"driverClassName": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"jdbcUrl": "jdbc:sqlserver://my_host:my_port;encrypt=false;databaseName=test_db",
"username": "my_user",
"password": "my_password"
}
}
Corresponding changes need to be done to deployment as well.
...
spec:
template:
spec:
volumes:
- name: clickhouse-jdbc-bridge-datasources
configMap:
name: clickhouse-jdbc-bridge-datasources
containers:
- name: clickhouse-jdbc-bridge
...
volumeMounts:
- name: clickhouse-jdbc-bridge-datasources
mountPath: /app/config/datasources
Once configured, we can test that bridge picked up the datasource using a special command:
SELECT * FROM jdbc('', 'show datasources')\G
Row 1:
──────
name: mssql_conn
is_alias: 0
instance: -651712816
create_datetime: 2021-12-28 14:37:44
type: jdbc
...
And use it as follows:
SELECT * FROM jdbc('mssql_conn', 'test_table')
This is secure but requires managing configuration in multiple places, which is not always convenient.
Conclusion
In this article we discussed how ClickHouse can connect to an external database, MSSQL in particular. Since ClickHouse does not have a native connector, we had to choose between ODBC and JDBC. ClickHouse can use both via a separate bridge process. JDBC was easier to set up in Kubernetes, so we followed this approach. ClickHouse JDBC bridge in Kubernetes allows plugging required drivers on demand with a simple configuration change. That extends ClickHouse connectivity options almost limitlessly. Stay tuned to learn more!