Does anyone use clickhouse in production? I was initially pretty impressed, but when I really put it through its paces I could OOM it as soon as I actually started querying non-trivial amounts of data:
Yep. Clickhouse is absolutely great for tons of production use cases.
Unless you try to join tables in it, in which case it will immediately explode.
More seriously, it's a columnar data store, not a relational database. It'll definitely pretend to be "postgres but faster", but that's a very thin and very leaky facade. You want to do a massively complex set of selects and conditional sums over one table with 3B rows and a TB of data? You'll get a result in tens of seconds without optimization. You want to join two tables that postgres could handle easily? You'll OOM a machine with a TB of memory.
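To make the "good" case concrete, the kind of query it chews through without breaking a sweat is roughly this (table and column names are made up for illustration):

    -- Illustrative only: conditional aggregates over one wide fact table.
    SELECT
        toDate(event_time)                     AS day,
        countIf(status = 'error')              AS errors,
        sumIf(bytes, endpoint = '/api/ingest') AS ingest_bytes,
        uniq(user_id)                          AS approx_users
    FROM events
    WHERE event_time >= now() - INTERVAL 30 DAY
    GROUP BY day
    ORDER BY day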
So: good for very specific use cases. If you have those use cases, it's great! If you don't, use something else. Many large companies have those use cases.
Yeah, I think that's a good summary. For instance, ClickBench comprises >40 queries and there's not a single join in them: https://github.com/ClickHouse/ClickBench/blob/main/clickhous...
There is the "versions benchmark," which includes a lot of queries with JOINs and compares ClickHouse performance on them: https://benchmark.clickhouse.com/versions/
The majority of our queries have joins (plus our core logic often depends on fact-table expansion with `arrayJoin()`s) before aggregations, and we're doing fine. AFAIK whenever we hit memory issues, they're mostly due to high-cardinality aggregations (especially with `uniqExact`), not joins. But I'm sure it can depend on the specifics.
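For what it's worth, the high-cardinality pain is often fixable by trading exactness for bounded memory; a rough sketch with made-up column names:

    -- uniqExact keeps every distinct value in memory, so it grows with cardinality;
    -- uniq uses a fixed-size sketch, so memory stays bounded (at the cost of a small error).
    SELECT
        uniqExact(user_id) AS exact_users,
        uniq(user_id)      AS approx_users
    FROM events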
Definitely agree with this, I think ClickHouse can do a lot with joins if you don't implement them naively. Keeping the server up-to-date is a part of it too.
They've made strides in the last year or two to implement more join algorithms and to re-order your joins automatically (including what's on the "left" and "right" of the join, which affects the performance of the algorithm).
Their release notes cover a lot of the highlights, and they have dedicated documentation on joins[1]. But we've made order-of-magnitude improvements before just by reordering our joins to align with how ClickHouse processes them.
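To give one concrete (and hedged) example of what "aligning with how ClickHouse processes them" means: the right-hand side of a JOIN is what gets built into the in-memory hash table, so keeping the smaller table on the right helps a lot, and newer versions let you pick spill-to-disk join algorithms per query. Table names below are illustrative:

    -- Small dimension table on the right; the right side is what gets materialized.
    SELECT f.user_id, d.country, sum(f.amount) AS total
    FROM fact_events AS f
    INNER JOIN dim_users AS d ON f.user_id = d.id
    GROUP BY f.user_id, d.country
    SETTINGS join_algorithm = 'grace_hash'  -- or 'partial_merge' to spill to disk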
> More seriously, it's a columnar data store, not a relational database.
Could you explain why you don't think ClickHouse is relational? The storage is an implementation detail. It affects how fast queries run but not the query model. Joins have already improved substantially and will continue to do so in the future.
The storage is not just an implementation detail because it affects how fast things run, which affects which tasks it's better or worse for. There's a reason people reach for a columnar datastore for some tasks and something like postgres or mysql for other tasks, even though both are technically capable of nearly the same queries.
It's used in production by many thousands of companies at this point. The ClickHouse Inc numbers are just a fraction of the total users.
p.s., It's also possible to break ClickHouse as you demonstrated. It used to be a lot easier.
I guess I'm curious how; I breathe on it wrong and it OOMs.
One of the tradeoffs for ClickHouse versus databases like Snowflake is that you have to have some knowledge about the internals to use it effectively. For example, Snowflake completely hides partitioning but on the other hand it does not deliver consistent, real-time response the way a well-tuned ClickHouse application can.
When you use INSERT ... SELECT in ClickHouse you do need to pay attention to the generated table partitions, as they coexist in memory before flushing to storage. The usual approach is to break up the insert into chunks so you can control how many parts are generated or to adjust the partitioning in the target table.
It's possible the problem might be somehow related to this behavior but that's just conjecture. It's usually pretty easy to work around. Meanwhile if it's a bug it will probably get fixed quickly.
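For reference, the "break it up into chunks" approach is usually just bounding the source range per INSERT so each statement only touches a few partitions; a sketch with made-up table names:

    -- Backfill one month at a time instead of one giant INSERT ... SELECT.
    INSERT INTO events_by_day
    SELECT toDate(event_time) AS day, user_id, count() AS hits
    FROM raw_events
    WHERE event_time >= '2024-01-01' AND event_time < '2024-02-01'
    GROUP BY day, user_id

    -- max_partitions_per_insert_block also acts as a guard rail here: the insert
    -- fails fast if a single block would fan out into too many partitions.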
You have to have knowledge of the internals of any database you use. Not knowing is going to cost someone a lot of money and/or performance.
One easy way to achieve this is to store really large values, e.g. 10 MB per row. Since ClickHouse operates on large blocks, you'd easily cause an OOM just by trying to read chunks of 8192 rows (the default) at a time, especially during merges, where it needs to read large blocks from several parts at once.
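If you're stuck with huge values, you can at least shrink how many rows ClickHouse pulls into memory at once; a rough sketch (table name and numbers are arbitrary):

    -- Smaller granules and merge blocks for a table with ~10 MB rows: trades some
    -- throughput for a much lower peak memory footprint on reads and merges.
    CREATE TABLE blobs
    (
        id   UInt64,
        body String
    )
    ENGINE = MergeTree
    ORDER BY id
    SETTINGS index_granularity = 256, merge_max_block_size = 256;

    -- And cap the block size on the read side for a specific query:
    SELECT length(body) FROM blobs SETTINGS max_block_size = 1024;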
You don’t need a good product to have a lot of users, just good marketing and salespeople.
Yes (via Clickhouse Cloud, which is pretty reasonably priced).
It’s important to structure your tables and queries in a way that aligns with the ordering keys, in order to minimize how much data needs to be loaded into RAM. You absolutely CANNOT just replicate your existing postgres DB and its primary keys or whatever over to CH. There are tricks like projections and incremental materialized views that can help you get the appropriate “lenses” for your queries. We use incremental MVs to, for example, continuously aggregate all-time stats about tens of billions of records (sketch below). In general, for CH, space is cheap and RAM is expensive, so it’s better to duplicate a table’s data with a different ordering key than to make an inefficient query.
As long as the queries align with the ordering keys, it is insanely fast and can serve analytics queries over truly massive amounts of data. We’ve been very impressed.
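For anyone curious, the incremental MV pattern mentioned above looks roughly like this (the schema is a made-up example, not our actual tables):

    -- Target table holds pre-aggregated state, keyed by what we query on.
    CREATE TABLE stats_by_user
    (
        user_id   UInt64,
        events    AggregateFunction(count),
        last_seen AggregateFunction(max, DateTime)
    )
    ENGINE = AggregatingMergeTree
    ORDER BY user_id;

    -- Incremental MV: runs on every insert into the raw table.
    CREATE MATERIALIZED VIEW stats_by_user_mv TO stats_by_user AS
    SELECT
        user_id,
        countState()         AS events,
        maxState(event_time) AS last_seen
    FROM raw_events
    GROUP BY user_id;

    -- Query side merges the partial states.
    SELECT user_id, countMerge(events), maxMerge(last_seen)
    FROM stats_by_user
    GROUP BY user_id;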
I find Clickhouse fascinating, really good, and also really tough to run. It's a non-linear memory hog. It probably needs 32 GB of RAM for the basics to run; otherwise it will OOM on a minimal amount of data. That said, it won't "OOM", as in crash. It will just report that the query would use too much memory and abort it.
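That abort-instead-of-crash behavior comes from the per-query memory limits, which you can tune; a sketch with arbitrary numbers:

    -- Cap a single query at ~4 GB; exceeding it raises a MEMORY_LIMIT_EXCEEDED
    -- exception for that query instead of taking the server down.
    SET max_memory_usage = 4000000000;

    -- Or let heavy GROUP BYs spill to disk instead of failing outright:
    SET max_bytes_before_external_group_by = 2000000000;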
Thanks for creating this issue, it is worth investigating!
I see you also created similar issues in Polars: https://github.com/pola-rs/polars/issues/17932 and DuckDB: https://github.com/duckdb/duckdb/issues/17066
ClickHouse has a built-in memory tracker, so even if there is not enough memory, it will stop the query and send an exception to the client, instead of crashing. It also allows fair sharing of memory between different workloads.
You need to provide more info on the issue for reproduction, e.g., how to fill the tables. 16 GB of memory should be enough even for a CROSS JOIN between a 10 billion-row and a 100-row table, because it is processed in a streaming fashion without accumulating a large amount of data in memory. The same should be true for a merge join.
However, there are places where a large buffer might be needed. For example, if you insert data into a table backed by S3 storage, it requires a buffer that can be on the order of 500 MB.
There is a possibility that your machine has 16 GB of memory, but most of it is consumed by Chrome, Slack, or Safari, and not much is left for ClickHouse server.
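If you want to see where the memory actually went, the query log records peak usage per query; something along these lines (exact columns can vary a bit between versions):

    SELECT
        event_time,
        query_duration_ms,
        formatReadableSize(memory_usage) AS peak_memory,
        substring(query, 1, 80)          AS query_head
    FROM system.query_log
    WHERE type = 'QueryFinish'
    ORDER BY memory_usage DESC
    LIMIT 10;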
Heard through the grapevine that Cloudflare uses it for their analytics.
They don't make a secret of it: https://blog.cloudflare.com/log-analytics-using-clickhouse/
Clickhouse is great, but like any database if you run it at scale someone must tend to it.