ClickHouse Guide

Real-Time Columnar Database for Analytics


Overview

ClickHouse is a real-time columnar database management system designed for online analytical processing (OLAP). It excels at high-ingestion workloads, real-time analytics, and time-series data. ClickHouse is particularly strong for event data, log analysis, and monitoring scenarios.


Core Concepts

ClickHouse Architecture

Key Characteristics

Characteristic    Description              Benefit
Columnar          Column-oriented storage  Fast analytical queries
Real-time ingest  High write throughput    Millions of rows/second
Compression       Excellent compression    Low storage cost
SQL support       SQL-like query language  Familiar interface
Sharding          Horizontal scalability   Linear scalability
Replication       Data replication         High availability

Table Engines

MergeTree Family (Primary Engine)

MergeTree (Base Engine)

CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    amount Decimal(18, 2)
)
ENGINE = MergeTree()
ORDER BY (user_id, timestamp)
PARTITION BY toYYYYMM(timestamp)
SETTINGS index_granularity = 8192;

Key settings:

  • ORDER BY: Sort key (primary index)
  • PARTITION BY: Partition key (separate directories)
  • PRIMARY KEY: Optional (defaults to ORDER BY)
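When PRIMARY KEY is given separately, it must be a prefix of ORDER BY: the sparse index then covers only the prefix while data stays fully sorted. A minimal sketch with a hypothetical page_views table:

-- Sparse index covers (user_id, url); data is still sorted by all three columns
CREATE TABLE page_views (
    user_id UInt64,
    url String,
    ts DateTime
)
ENGINE = MergeTree()
ORDER BY (user_id, url, ts)
PRIMARY KEY (user_id, url);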

ReplacingMergeTree (Deduplication)

CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    version UInt64
)
ENGINE = ReplacingMergeTree(version)
ORDER BY (user_id, timestamp)
PARTITION BY toYYYYMM(timestamp);

Use case: Deduplicate rows that share the same sort key; the row with the highest version wins.

Gotcha: Deduplication happens on merge, not immediately.

SummingMergeTree (Pre-Aggregation)

CREATE TABLE daily_metrics (
    date Date,
    metric_name String,
    metric_value Decimal(18, 2)
)
ENGINE = SummingMergeTree(metric_value)
ORDER BY (date, metric_name);

Use case: Pre-aggregate numeric metrics (sum on merge).
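Because summation happens only when parts merge, unmerged rows can still coexist; re-aggregate at query time for exact results:

INSERT INTO daily_metrics VALUES ('2025-01-01', 'clicks', 10);
INSERT INTO daily_metrics VALUES ('2025-01-01', 'clicks', 5);

-- Both rows may be visible until a merge, so sum() again here
SELECT date, metric_name, sum(metric_value) AS metric_value
FROM daily_metrics
GROUP BY date, metric_name;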

ReplicatedMergeTree (Replication)

CREATE TABLE events_replicated (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    amount Decimal(18, 2)
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',
    '{replica}'
)
ORDER BY (user_id, timestamp)
PARTITION BY toYYYYMM(timestamp);

Use case: High availability (replicate data across nodes).


Data Types

Numeric Types

-- Unsigned integers
UInt8, UInt16, UInt32, UInt64, UInt128, UInt256
-- Signed integers
Int8, Int16, Int32, Int64, Int128, Int256
-- Floating point
Float32, Float64
-- Decimal (exact)
Decimal32(S), Decimal64(S), Decimal128(S), Decimal256(S)
-- S = scale (digits after the decimal point); precision is fixed at 9, 18, 38, 76
Decimal(P, S) -- Generic form: P = precision (1-76), picks the size automatically

Temporal Types

-- Date (2 bytes, days since 1970-01-01)
Date
-- DateTime (4 bytes, Unix timestamp, second precision)
DateTime
-- DateTime64 (8 bytes, with sub-second precision)
DateTime64(3) -- Millisecond precision
DateTime64(6) -- Microsecond precision
DateTime64(9) -- Nanosecond precision

String Types

-- String (variable length)
String
-- Fixed string (N bytes)
FixedString(N)
-- Low-cardinality string (dictionary encoding)
LowCardinality(String)
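LowCardinality pays off when a String column has few distinct values. A sketch with a hypothetical requests table:

CREATE TABLE requests (
    ts DateTime,
    status LowCardinality(String), -- e.g. 'ok', 'error', 'timeout'
    path String
)
ENGINE = MergeTree()
ORDER BY ts;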

Special Types

-- Array
Array(String)
Array(UInt32)
-- Map (key-value)
Map(String, Int32)
-- Tuple (struct)
Tuple(UInt32, String, Decimal(18, 2))
-- Nullable (allows NULL; stores an extra null mask, so prefer defaults where possible)
Nullable(UInt32)
Nullable(String)
-- UUID (16 bytes)
UUID

Performance Optimization

Primary Key Design

-- Good: High-cardinality, filter-heavy columns
ORDER BY (user_id, event_timestamp)
-- Bad: Low-cardinality column (inefficient)
ORDER BY (event_type)
-- Good: Composite key for multi-column queries
ORDER BY (user_id, event_timestamp, event_type)
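For example, a query that filters on the leading key column can prune granules via the sparse index (assuming the (user_id, event_timestamp) key above):

-- Prunes by user_id first, then by event_timestamp within matching granules
SELECT count()
FROM events
WHERE user_id = 42
  AND event_timestamp >= today();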

Partitioning

-- Good: Date-based partitioning
PARTITION BY toYYYYMM(event_timestamp)
-- Good: Bounded hash bucketing (fixed partition count)
PARTITION BY user_id % 100 -- 100 partitions
-- Bad: Too many partitions (keep the total well under ~1000)
PARTITION BY toYYYYMMDD(event_timestamp) -- 365 partitions per year

Sampling

-- Sampling key: uniformly distributed and included in the primary key
ORDER BY (intHash32(user_id), event_timestamp)
SAMPLE BY intHash32(user_id)
-- Sample query (approximate result)
SELECT *
FROM events
SAMPLE 0.1 -- 10% sample

Data Skipping

-- Skip indices (data skipping)
ALTER TABLE events
ADD INDEX idx_event_type event_type TYPE minmax GRANULARITY 4;
ALTER TABLE events
ADD INDEX idx_user_id_bloom user_id TYPE bloom_filter GRANULARITY 1;
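A skip index only covers parts written after it is created; to build it over existing data, materialize it:

ALTER TABLE events MATERIALIZE INDEX idx_event_type;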

Real-Time Ingestion

High-Throughput Inserts

-- Batch insert (recommended)
INSERT INTO events VALUES
    (now(), 1, 'click', 100.0),
    (now(), 2, 'view', 0.0),
    (now(), 3, 'purchase', 500.0);

-- Insert from file (shell)
cat events.csv | clickhouse-client --query="INSERT INTO events FORMAT CSV"

-- Insert from Kafka
CREATE TABLE events_kafka (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    amount Decimal(18, 2)
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse',
    kafka_format = 'JSONEachRow';

-- Materialize the Kafka stream into the target table
CREATE MATERIALIZED VIEW events_mv TO events
AS SELECT * FROM events_kafka;

Buffer Engine (Buffering)

-- Buffer table absorbs small writes and flushes them to the target in batches
CREATE TABLE events_buffer AS events
ENGINE = Buffer(
    currentDatabase(), -- Target database
    events,            -- Target table
    16,                -- num_layers: independent buffer buckets
    10,                -- min_time (seconds)
    100,               -- max_time (seconds)
    10000,             -- min_rows
    1000000,           -- max_rows
    10000000,          -- min_bytes
    100000000          -- max_bytes
);
-- A bucket flushes when all min_* thresholds or any max_* threshold is reached

Query Optimization

Query Profiling

-- Enable query logging
SET log_queries = 1;
SET log_queries_min_type = 'QueryFinish';
-- View query log
SELECT
    query,
    query_duration_ms,
    memory_usage,
    read_rows,
    read_bytes
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 10;

EXPLAIN Plan

-- View query plan
EXPLAIN PLAN
SELECT user_id, SUM(amount)
FROM events
WHERE event_timestamp > now() - INTERVAL 1 DAY
GROUP BY user_id;
-- View the physical execution pipeline
EXPLAIN PIPELINE
SELECT user_id, SUM(amount)
FROM events
WHERE event_timestamp > now() - INTERVAL 1 DAY
GROUP BY user_id;

Materialized Views

-- Pre-aggregate data
CREATE MATERIALIZED VIEW events_daily_mv
ENGINE = SummingMergeTree()
ORDER BY (date, user_id)
AS SELECT
    toDate(event_timestamp) AS date,
    user_id,
    sum(amount) AS total_amount,
    count() AS event_count
FROM events
GROUP BY date, user_id;
-- Query the materialized view (fast)
SELECT * FROM events_daily_mv WHERE date >= yesterday();
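Two caveats: the view only processes rows inserted after it is created, and SummingMergeTree combines rows only at merge time, so re-aggregate for exact results:

SELECT
    date,
    user_id,
    sum(total_amount) AS total_amount,
    sum(event_count) AS event_count
FROM events_daily_mv
WHERE date >= yesterday()
GROUP BY date, user_id;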

Sharding and Replication

Sharding Strategy

Distributed Table

-- Create distributed table
CREATE TABLE events_distributed AS events
ENGINE = Distributed(
    cluster_name,        -- Cluster name in config
    database_name,       -- Database
    events,              -- Local table on each shard
    cityHash64(user_id)  -- Sharding key
);
-- Query the distributed table (fans out to all shards)
SELECT * FROM events_distributed;
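The cluster itself is defined under remote_servers in the server config; what the server actually sees can be checked from SQL:

SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters
WHERE cluster = 'cluster_name';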

Replication Setup

<!-- /etc/clickhouse-server/config.d/zookeeper.xml -->
<clickhouse>
    <zookeeper>
        <node>
            <host>zookeeper1</host>
            <port>2181</port>
        </node>
        <node>
            <host>zookeeper2</host>
            <port>2181</port>
        </node>
    </zookeeper>
</clickhouse>
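Once replicated tables exist, replica health can be monitored from SQL:

SELECT database, table, is_leader, absolute_delay, queue_size
FROM system.replicas;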

Cost Optimization

Compression

-- Set a compression codec per column
ALTER TABLE events MODIFY COLUMN event_type String CODEC(ZSTD);
-- Or declare it at table creation: event_type String CODEC(ZSTD(3))
-- Options: NONE, LZ4, LZ4HC, ZSTD(1-22); LZ4 is the default
-- ZSTD provides the best compression ratio

Compression ratios:

  • LZ4: 3-5x compression, fast
  • ZSTD: 5-10x compression, slower
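To see what compression actually achieves per column, compare on-disk and raw sizes in system.columns:

SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE table = 'events';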

TTL (Data Lifecycle)

-- Delete old data
ALTER TABLE events
MODIFY TTL event_timestamp + INTERVAL 30 DAY DELETE;
-- Move old data to cold storage (disk must exist in the table's storage policy)
ALTER TABLE events
MODIFY TTL event_timestamp + INTERVAL 30 DAY TO DISK 'cold';
-- Recompress old data with a stronger codec
ALTER TABLE events
MODIFY TTL event_timestamp + INTERVAL 30 DAY RECOMPRESS CODEC(ZSTD(19));

Senior Level Gotchas

Gotcha 1: Mutation is Async

Problem: ALTER TABLE mutations are asynchronous.

-- Bad: Assume the delete is immediate
ALTER TABLE events DELETE WHERE event_timestamp < '2025-01-01';
-- SELECTs still return old data until the mutation completes
-- Good: Monitor mutation progress
SELECT mutation_id, command, parts_to_do, is_done
FROM system.mutations
WHERE database = 'db' AND table = 'events';
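Alternatively, the mutations_sync setting makes the statement block until the mutation finishes:

SET mutations_sync = 1; -- 1 = wait for this server, 2 = wait for all replicas
ALTER TABLE events DELETE WHERE event_timestamp < '2025-01-01';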

Gotcha 2: ReplacingMergeTree Deduplication Delay

Problem: Deduplication happens on merge, not immediately.

-- Bad: Expect immediate deduplication
INSERT INTO events VALUES (...);
INSERT INTO events VALUES (...); -- Duplicate
-- Both rows stay visible until a background merge runs
-- Good: Deduplicate at read time (slower) or force a merge
SELECT * FROM events FINAL;
OPTIMIZE TABLE events FINAL;

Gotcha 3: Limited Support for UPDATE/DELETE

Problem: ClickHouse is optimized for append-only.

Solutions:

  1. Use ReplacingMergeTree for deduplication
  2. Use CollapsingMergeTree for soft deletes (see the sketch after this list)
  3. Use TTL for data lifecycle
  4. Use mutations (async, expensive)
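For option 2, a minimal CollapsingMergeTree sketch with a hypothetical user_state table (sign = 1 adds a row, sign = -1 cancels it on merge):

CREATE TABLE user_state (
    user_id UInt64,
    balance Decimal(18, 2),
    sign Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY user_id;

-- "Delete" by inserting a cancelling row with the same key
INSERT INTO user_state VALUES (1, 100.00, 1);
INSERT INTO user_state VALUES (1, 100.00, -1);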

Best Practices

DO

  • Use MergeTree family engines (not Log engines for production)
  • Partition by date (toYYYYMM)
  • Order by high-cardinality, filter-heavy columns
  • Use LowCardinality for low-cardinality strings
  • Batch inserts (1000+ rows per batch)
  • Use ReplicatedMergeTree for HA
  • Use ZSTD compression for storage savings

DON’T

  • Use Log engines for production (not scalable)
  • Partition by low-cardinality columns
  • Use UPDATE/DELETE frequently (async, expensive)
  • Insert single rows (slow, inefficient)
  • Ignore TTL (data never expires)
  • Forget to shard at scale

Key Takeaways

  1. Real-time ingest: Millions of rows/second write throughput
  2. Columnar storage: Fast analytical queries
  3. MergeTree engine: Primary engine for production
  4. Partition by date: Automatic data lifecycle
  5. Sharding for scale: Horizontal scalability
  6. Replication for HA: Data replication across nodes
  7. Not for OLTP: Optimized for analytics, not transactions
