Fix: ClickHouse Not Working — MergeTree ORDER BY, INSERT Batching, LowCardinality, and Replication
Quick Answer
How to fix ClickHouse errors — table engine choice, ORDER BY for primary key, INSERT too many small parts, LowCardinality types, Nullable performance, ZooKeeper for Replicated tables, and HTTP vs Native client.
The Error
You create a table without ORDER BY and the CREATE fails:
DB::Exception: Storage MergeTree requires either ORDER BY or PRIMARY KEY.Or INSERT performance is terrible — 100 rows/second instead of millions:
# Inserting one row at a time:
for row in rows:
client.execute("INSERT INTO events VALUES", [row])
# Each call creates a small part. ClickHouse merges them in background but slows down.Or SELECT count(*) over a column is slow despite a billion-row table:
SELECT count() FROM events WHERE user_country = 'JP';
-- Scans 100 GB instead of using the index.Or queries fail with replication errors:
DB::Exception: ZooKeeper session has expired. Please try reconnecting.Why This Happens
ClickHouse is a columnar OLAP database optimized for analytics. Most issues map to:
- MergeTree engine variants.
MergeTree,ReplacingMergeTree,SummingMergeTree,AggregatingMergeTree,ReplicatedMergeTree. Each has different semantics and tradeoffs. Picking wrong gives correct-looking output but wrong behavior at scale. - INSERT model. ClickHouse expects batched inserts. Row-by-row inserts create one “part” per call, and parts merge in background. Too many parts → slow merges, errors, OOM.
- Primary key isn’t enforced. ORDER BY is the “primary key” (used for sparse index), but doesn’t dedupe —
MergeTreeallows duplicates. UseReplacingMergeTreefor dedup-on-merge. - LowCardinality dictionary encoding. Columns with few distinct values benefit from
LowCardinality(String)— 10-100x storage reduction and faster filters. Without it, every row stores the full string.
Fix 1: Always Specify ORDER BY
CREATE TABLE events (
event_time DateTime,
user_id UInt64,
event_type LowCardinality(String),
country LowCardinality(String),
revenue Float64
) ENGINE = MergeTree()
ORDER BY (event_time, user_id);ORDER BY does two things:
- Physical sort. Data is stored in this order on disk.
- Sparse index. ClickHouse skips parts that don’t match the ORDER BY prefix.
For queries like WHERE event_time > now() - INTERVAL 1 HOUR AND user_id = 42, both predicates use the ORDER BY index — fast.
Choose ORDER BY by:
- Most-selective filter first. If
event_timeis always in your WHERE, put it first. - Bounded cardinality. Avoid high-cardinality columns first (e.g. UUIDs) — they don’t compress or skip well.
- 3-5 columns max. Too many keys = wide index, slower writes.
For wider table tuning:
CREATE TABLE events (...) ENGINE = MergeTree()
ORDER BY (event_time, user_id)
PARTITION BY toYYYYMM(event_time) -- Optional: separate parts per month
TTL event_time + INTERVAL 90 DAY DELETE; -- Auto-delete old rowsPARTITION BY lets you drop entire partitions in O(1) (ALTER TABLE events DROP PARTITION 202401) — much faster than DELETE.
Pro Tip: Don’t over-partition. Each partition is a directory; thousands of small partitions hurts performance. Aim for partitions of 1-100M rows.
Fix 2: Batch INSERTs
ClickHouse expects batches. The recommended pattern:
# Bad — one INSERT per row:
for event in events:
client.execute("INSERT INTO events VALUES", [event])
# Good — one INSERT per batch:
client.execute("INSERT INTO events VALUES", events) # All at once
# Better for very large batches — async insert:
client.execute("INSERT INTO events SETTINGS async_insert=1, wait_for_async_insert=1 VALUES", events)For streaming inserts where you can’t accumulate large batches client-side, use async_insert:
INSERT INTO events SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES (now(), 1, 'click', 'JP', 0.0);ClickHouse buffers async inserts server-side and flushes them in batches (default 200ms or 10 MB). Per-row cost drops dramatically.
For very high write throughput, use Buffer engine or write to a Kafka table function — ClickHouse pulls from Kafka in batches automatically.
Common Mistake: Setting async_insert = 1 without wait_for_async_insert. Inserts return immediately, before they’re flushed. If the server crashes between buffer and flush, data is lost. Use wait_for_async_insert = 1 for durability.
Fix 3: Use LowCardinality for String Enums
-- BAD: every row stores the full string
CREATE TABLE events (event_type String, country String, ...);
-- GOOD: dictionary-encoded
CREATE TABLE events (event_type LowCardinality(String), country LowCardinality(String), ...);LowCardinality(String) stores strings as int dictionary keys. For fields with ≤ ~10K distinct values, this is 5-100x more compact and faster to filter.
Rule of thumb: use LowCardinality when you have:
- Country codes
- Event type enums
- HTTP status codes
- Service names
- Region/zone identifiers
- Any “category” string
Don’t use for:
- User IDs (high cardinality)
- URLs (often unique)
- Free-form text
Pro Tip: You can convert later: ALTER TABLE events MODIFY COLUMN event_type LowCardinality(String). But ClickHouse rewrites the column — expensive on big tables. Set it correctly at table create.
Fix 4: Avoid Nullable Unless You Need It
-- Slow: Nullable adds overhead (separate bitmap)
CREATE TABLE events (revenue Nullable(Float64), ...);
-- Fast: use a sentinel value if NULL semantics aren't critical
CREATE TABLE events (revenue Float64 DEFAULT 0, ...);Nullable(T) columns have ~2x storage overhead and slower filters. For numeric columns where 0 or -1 can serve as “no value,” skip Nullable.
For strings, use empty string instead of NULL:
CREATE TABLE events (referer String DEFAULT '', ...);If you genuinely need NULL semantics (distinct from “empty”), use Nullable but be deliberate.
Fix 5: Match Engine to Use Case
| Engine | Use case |
|---|---|
MergeTree | General-purpose. Allows duplicates. |
ReplacingMergeTree(version_col) | Deduplicate on background merge. |
SummingMergeTree(sum_col) | Sum-aggregate on background merge. |
AggregatingMergeTree | Custom aggregates via aggregate functions. |
CollapsingMergeTree(sign) | Mark rows as deleted (-1 sign). |
VersionedCollapsingMergeTree | Versioned collapsing. |
For event sourcing where you want the “latest version of each entity”:
CREATE TABLE users (
user_id UInt64,
name String,
updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;ClickHouse keeps the row with the highest updated_at for each user_id after background merges. Note: merges are async — you may see duplicates briefly. Use FINAL modifier (SELECT * FROM users FINAL) to force dedup at query time (slower).
For pre-aggregated rollups:
CREATE MATERIALIZED VIEW events_daily
ENGINE = SummingMergeTree()
ORDER BY (date, country, event_type)
AS SELECT
toDate(event_time) AS date,
country,
event_type,
count() AS event_count,
sum(revenue) AS total_revenue
FROM events
GROUP BY date, country, event_type;Each new INSERT to events triggers a small INSERT into events_daily. Background merges sum the matching rows. Queries against events_daily are massively faster than aggregating raw events.
Common Mistake: Querying a ReplacingMergeTree table without FINAL and expecting dedup. Without FINAL (or aggregate argMax), you see all versions.
Fix 6: Replication via ZooKeeper / Keeper
For HA, use ReplicatedMergeTree:
CREATE TABLE events ON CLUSTER my_cluster (...) ENGINE = ReplicatedMergeTree(
'/clickhouse/tables/{shard}/events', -- ZK path
'{replica}' -- Replica name
)
ORDER BY (event_time, user_id);{shard} and {replica} are macros defined per-server in config.
Each shard has multiple replicas. INSERTs to one replica propagate to others. SELECTs can hit any replica.
For multi-shard tables (sharded by hash), wrap in Distributed:
CREATE TABLE events_dist ON CLUSTER my_cluster AS events
ENGINE = Distributed(my_cluster, default, events, rand());Inserts to events_dist route to shards by rand() (or your hash). Queries fan out across all shards.
Note: ClickHouse Keeper is a built-in ZooKeeper replacement. Newer deployments should use Keeper — fewer dependencies, better performance, ClickHouse-native config.
For monitoring replication lag:
SELECT * FROM system.replication_queue WHERE table = 'events';A growing queue means a replica is falling behind — investigate before it becomes critical.
Fix 7: HTTP vs Native Protocol
ClickHouse exposes two protocols:
- Native (port 9000): Binary, used by official
clickhouse-clientand most language libraries. Fastest. - HTTP (port 8123): Plain HTTP with JSON or TSV responses. Easier to debug; works from any HTTP client.
For Python with the native driver:
from clickhouse_driver import Client
client = Client(host="localhost", port=9000, user="default", password="")
result = client.execute("SELECT count() FROM events")For HTTP-based:
import requests
r = requests.post(
"http://localhost:8123",
params={"query": "SELECT count() FROM events FORMAT JSON"},
)
print(r.json())For Node:
import { createClient } from "@clickhouse/client";
const client = createClient({ url: "http://localhost:8123" });
const result = await client.query({
query: "SELECT count() FROM events",
format: "JSON",
});
console.log(await result.json());The official JS client uses HTTP under the hood. There’s no widely-used native Node driver.
Pro Tip: For high-throughput INSERT, prefer native (Python). For occasional queries from a web app, HTTP is fine.
Fix 8: Common Query Patterns
Time-bucketed counts:
SELECT
toStartOfHour(event_time) AS hour,
count() AS events
FROM events
WHERE event_time >= now() - INTERVAL 1 DAY
GROUP BY hour
ORDER BY hour;Top-N per group:
SELECT country, event_type, count() AS c
FROM events
WHERE event_time >= today()
GROUP BY country, event_type
ORDER BY c DESC
LIMIT 10 BY country;LIMIT N BY column is a ClickHouse extension — returns top N per group, more efficient than window functions.
Percentile:
SELECT
quantile(0.99)(latency_ms) AS p99,
quantile(0.95)(latency_ms) AS p95,
quantile(0.50)(latency_ms) AS p50
FROM requests
WHERE event_time >= now() - INTERVAL 1 HOUR;ClickHouse’s quantile is approximate by default — fast on huge tables. For exact, use quantileExact.
JOIN large tables:
-- Use the smaller table on the right:
SELECT users.name, events.event_type
FROM events
INNER JOIN users ON users.id = events.user_id
WHERE events.event_time >= today();ClickHouse builds a hash table from the right side. Put the smaller table there.
For very large joins where neither side fits in memory:
SET join_algorithm = 'partial_merge'; -- Disk-based merge joinStill Not Working?
A few less-obvious failures:
Too many partserror. Background merges can’t keep up. Throttle inserts, increase batch size, or check disk I/O.system.partsshows part counts per table.- OOM during a large query. Set
max_memory_usageper query. For aggregations, enablemax_bytes_before_external_group_byto spill to disk. DB::Exception: Code: 159. Timeout exceeded. Increasemax_execution_timeor break the query into smaller chunks.Replication queue grows unbounded. Network split or one replica is down. Restart the lagging replica or remove it temporarily.Code: 252. DB::Exception: Too many partitions. Each partition is a directory. Don’t partition by hour if you have years of data — partition by month or week instead.SELECTreturns duplicate rows after INSERT.ReplacingMergeTreehasn’t merged yet. UseFINALorargMaxif you need dedupe immediately.Cannot bind to address. Port conflict. ClickHouse defaults to 8123 (HTTP) and 9000 (Native). Other tools (Portainer, MinIO) use 9000 too.- High cardinality
LowCardinalityfield warning. If a column has >10K distinct values,LowCardinalityadds overhead. ClickHouse can warn or error depending on settings — checklow_cardinality_max_dictionary_size.
For related analytics and OLAP database issues, see DuckDB not working, Postgres slow query, MySQL slow query optimization, and Postgres index not used.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
Was this article helpful?
Related Articles
Fix: BigQuery Not Working — Partitioning, Slots, Streaming Inserts, and Cost Surprises
How to fix BigQuery errors — query scans full table without partition filter, slot contention with on-demand pricing, streaming insert quota exceeded, DML row limits, service account auth, and INFORMATION_SCHEMA for monitoring.
Fix: DuckDB Not Working — File Lock Conflicts, Out of Memory, Extensions, and Parquet/S3 Reads
How to fix DuckDB errors — IOException database is locked, OutOfMemoryException on large queries, httpfs extension not loaded, secret manager for S3, Pandas/Polars zero-copy, and concurrent writer limits.
Fix: dbt Not Working — ref() Not Found, Schema Mismatch, and Compilation Errors
How to fix dbt errors — ref() model not found, profile not found, database relation does not exist, incremental model schema mismatch requiring full-refresh, dbt deps failure, Jinja compilation errors, and test failures.
Fix: Snowflake Not Working — Warehouse Suspend, COPY INTO, Roles, Time Travel, and Snowpipe
How to fix Snowflake errors — warehouse auto-suspend pricing, account URL format, role-vs-grant confusion, COPY INTO S3/Azure errors, Snowpipe lag, Time Travel and Fail-safe, and connection string for snowflake-connector-python.