Updates and Deletes in ClickHouse

By Alexander Zaitsev on October 17th, 2018

Updates and Deletes in ClickHouse

Alexander Zaitsev ClickHouseDeleteTutorialUpdate

It’s been two years already since ClickHouse development team published an excellent blog post “How to update data in ClickHouse”. In that old times ClickHouse supported only monthly partitions, and for mutable data structures, they suggested to use pretty exotic data structures. We were all waiting for a more convenient approach, and finally, it is there: ClickHouse now supports updates in deletes! In this article, we will see how it works.

Test data

Let’s load a test table with some data:

:) select count(*) from system.columns where table='test_update';

┌─count()─┐
│     332 │
└─────────┘

:) select count(*) from test_update;

┌──count()─┐
│ 17925050 │
└──────────┘

We will be trying updates since it is more interesting. Deletes work pretty much the same.

How it works

The syntax for updates and deletes is non-standard SQL. ClickHouse team wanted to express the difference from traditional SQL: new updates and deletes are batch operations, performed asynchronously. It is even called ‘mutations’. Custom syntax highlights the difference.

ALTER TABLE <table_name> DELETE WHERE <filter>;

and

ALTER TABLE <table_name> UPDATE col1 = expr1, ... WHERE <filter>;

In our test table there is column event_status_key.

:) select event_status_key, count(*) from test_update where event_status_key in (0, 22) group by event_status_key;

┌─event_status_key─┬──count()─┐
│                0 │ 17824710 │
│               22 │     1701 │
└──────────────────┴──────────┘

Let’s consider that status 22 was a mistake and we want to fix it. This is just 0.01% of the data, but without DELETE or UPDATE, we would have to reload the table.

:) ALTER TABLE test_update UPDATE event_status_key=0 where event_status_key=22;

0 rows in set. Elapsed: 0.067 sec.

It returns immediately, but the update is asynchronous, so we do not know if data has been updated or not. Let’s check:

:) select event_status_key, count(*) from test_update where event_status_key in (0, 22) group by event_status_key;

 ┌─event_status_key─┬──count()─┐
 │                0 │ 17826411 │
 └──────────────────┴──────────┘

Seems to be working. The status of the update operation can be looked up in system.mutations table:

:) select * from system.mutations where table='test_update';

Row 1:
──────
database:                   test
table:                      test_update
mutation_id:                mutation_162.txt
command:                    UPDATE event_status_key = 0 WHERE event_status_key = 22
create_time:                2018-10-12 12:39:32
block_numbers.partition_id: ['']
block_numbers.number:       [162]
parts_to_do:                0
is_done:                    1

Another interesting insight gives system.parts table:

:) select name, active, rows, bytes_on_disk, modification_time from system.parts where table='test_update' order by modification_time;

┌─name──────────────┬─active─┬────rows─┬─bytes_on_disk─┬───modification_time─┐
│ all_1_36_2        │      0 │ 3841126 │     637611245 │ 2018-10-12 12:16:24 │
│ all_37_75_2       │      0 │ 4358144 │     598548358 │ 2018-10-12 12:16:47 │
│ all_112_117_1     │      0 │  638976 │     167899233 │ 2018-10-12 12:17:00 │
│ all_151_155_1     │      0 │  778240 │      27388052 │ 2018-10-12 12:17:29 │
│ all_76_111_2      │      0 │ 3833856 │     989762502 │ 2018-10-12 12:17:30 │
│ all_156_161_1     │      0 │  837460 │      27490891 │ 2018-10-12 12:17:43 │
│ all_118_150_2     │      0 │ 3637248 │     859673147 │ 2018-10-12 12:17:52 │
│ all_1_36_2_162    │      1 │ 3841126 │     637611232 │ 2018-10-12 12:39:32 │
│ all_37_75_2_162   │      1 │ 4358144 │     598548352 │ 2018-10-12 12:39:32 │
│ all_76_111_2_162  │      1 │ 3833856 │     989762502 │ 2018-10-12 12:39:32 │
│ all_112_117_1_162 │      1 │  638976 │     167899233 │ 2018-10-12 12:39:32 │
│ all_118_150_2_162 │      1 │ 3637248 │     859673147 │ 2018-10-12 12:39:32 │
│ all_151_155_1_162 │      1 │  778240 │      27388052 │ 2018-10-12 12:39:32 │
│ all_156_161_1_162 │      1 │  837460 │      27490891 │ 2018-10-12 12:39:32 │
└───────────────────┴────────┴─────────┴───────────────┴─────────────────────┘

It shows that every part has been touched by update. But it has been pretty fast for small test dataset.

Now let’s try to do something more difficult. There is an array column in our table keeping integer segment ids.

:) select count(*) from test_update where has(dmp_audience_ids, 31694239);
┌─count()─┐
│  228706 │
└─────────┘

Consider that we want to add additional segment to all users with segment 31694239.

:) alter table test_update update dmp_audience_ids = arrayPushBack(dmp_audience_ids, 1234567) where has(dmp_audience_ids, 31694239);

Instant response again, let’s check if data has been updated correctly.

:) select count(*) from test_update where has(dmp_audience_ids, 1234567)

┌─count()─┐
│  228706 │
└─────────┘

:) select dmp_audience_ids from test_update where has(dmp_audience_ids, 1234567) and length(dmp_audience_ids)<5 limit 1;

┌─dmp_audience_ids─────────────────────┐
│ [31694239,31694422,31694635,1234567] │
└──────────────────────────────────────┘

Perfectly works and very fast!

 

  1. why clickhouse does not support delete/update for distributed table?
    and for update non primary key columns, why does clickhouse have to rebuild every parts?

    1. Hi Weitao,
      Since updates and deletes are actually ALTER TABLE statements they are applied to local table, not a distributed one. These are operations that you should not be doing too often. If you want to do it at all cluster, you can use distributed ddl: ALTER TABLE … ON CLUSTER
      For update, ClickHouse rebuilds only updated columns in parts, that are subject to WHERE condition. In some cases all parts can be affected. But it does not rebuild all columns, only updated ones.
      Alexander

  2. By when we can expect clickhouse to add full SQL support for UPDATE/DELETE as mentioned in the article.
    Thanks

  3. Can you delete with the where condition being in query VALUES? I mean something like this:client.execute("ALTER TABLE my_table DELETE WHERE uid IN VALUES", [(12345,), (23456,)])

    1. Hi James,You can not use VALUES, but you can use IN lists, as in normal SQL query, e.g.:
      "ALTER TABLE my_table DELETE WHERE uid IN (12345, 23456)"
      What client are you using?


Leave a Reply

Your email address will not be published. Required fields are marked *

%d bloggers like this: