# ClickHouse Guide

Real-Time Columnar Database for Analytics

## Overview
ClickHouse is a real-time columnar database management system designed for online analytical processing (OLAP). It excels at high-ingestion workloads, real-time analytics, and time-series data. ClickHouse is particularly strong for event data, log analysis, and monitoring scenarios.
## Core Concepts

### ClickHouse Architecture

#### Key Characteristics
| Characteristic | Description | Benefit |
|---|---|---|
| Columnar | Column-oriented storage | Fast analytical queries |
| Real-time ingest | High write throughput | Millions of rows/second |
| Compression | Excellent compression | Low storage cost |
| SQL support | SQL-like query language | Familiar interface |
| Sharding | Data split across nodes | Near-linear scale-out |
| Replication | Copies of data kept on multiple nodes | High availability |
## Table Engines

### MergeTree Family (Primary Engine)

#### MergeTree (Base Engine)
```sql
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    amount Decimal(18, 2)
)
ENGINE = MergeTree()
ORDER BY (user_id, timestamp)
PARTITION BY toYYYYMM(timestamp)
SETTINGS index_granularity = 8192;
```

Key settings:
- ORDER BY: Sort key (primary index)
- PARTITION BY: Partition key (separate directories)
- PRIMARY KEY: Optional (defaults to ORDER BY)
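For example (a minimal sketch against the `events` table above), a query that filters on the leading sort-key column can prune most granules:

```sql
-- Filtering on the leading ORDER BY column (user_id) lets the primary index
-- skip granules instead of scanning the whole table
SELECT event_type, amount
FROM events
WHERE user_id = 42
  AND timestamp >= now() - INTERVAL 1 DAY;
```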
#### ReplacingMergeTree (Deduplication)
```sql
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    version UInt64
)
ENGINE = ReplacingMergeTree(version)
ORDER BY (user_id, timestamp)
PARTITION BY toYYYYMM(timestamp);
```

Use case: Deduplicate rows that share the same sort key, keeping the highest `version` (last write wins).
Gotcha: Deduplication happens on merge, not immediately.
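A small demonstration of that gotcha, with illustrative values:

```sql
-- Same sort key (user_id, timestamp), two versions: both rows stay visible
-- until a background merge runs
INSERT INTO events VALUES ('2025-01-01 00:00:00', 1, 'click', 1);
INSERT INTO events VALUES ('2025-01-01 00:00:00', 1, 'click', 2);

-- FINAL deduplicates at read time (slower, but correct before the merge)
SELECT * FROM events FINAL WHERE user_id = 1; -- only version 2 remains
```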
#### SummingMergeTree (Pre-Aggregation)
```sql
CREATE TABLE daily_metrics (
    date Date,
    metric_name String,
    metric_value Decimal(18, 2)
)
ENGINE = SummingMergeTree(metric_value)
ORDER BY (date, metric_name);
```

Use case: Pre-aggregate numeric metrics (rows with the same sort key are summed on merge).
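Because summing likewise happens only at merge time, it is safest to aggregate again at read time; a minimal sketch:

```sql
-- Rows in not-yet-merged parts may still be unsummed, so sum() at read time
SELECT date, metric_name, sum(metric_value) AS metric_value
FROM daily_metrics
GROUP BY date, metric_name;
```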
#### ReplicatedMergeTree (Replication)
```sql
CREATE TABLE events_replicated (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    amount Decimal(18, 2)
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',
    '{replica}'
)
ORDER BY (user_id, timestamp)
PARTITION BY toYYYYMM(timestamp);
```

Use case: High availability (data is replicated across nodes; `{shard}` and `{replica}` are macros defined in the server config).
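Replica health can be checked through the built-in `system.replicas` table, for example:

```sql
-- Check replica status and replication lag
SELECT database, table, is_leader, absolute_delay, queue_size
FROM system.replicas
WHERE table = 'events_replicated';
```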
## Data Types

### Numeric Types
```sql
-- Unsigned integers
UInt8, UInt16, UInt32, UInt64, UInt128, UInt256

-- Signed integers
Int8, Int16, Int32, Int64, Int128, Int256

-- Floating point
Float32, Float64

-- Decimal (exact)
Decimal32(S), Decimal64(S), Decimal128(S), Decimal256(S) -- S = scale
Decimal(P, S) -- Generic: P = precision, S = scale; picks the size automatically
```

### Temporal Types
```sql
-- Date (2 bytes, days since epoch)
Date

-- DateTime (4 bytes, Unix timestamp)
DateTime

-- DateTime64 (8 bytes, with sub-second precision)
DateTime64(3) -- Millisecond precision
DateTime64(6) -- Microsecond precision
DateTime64(9) -- Nanosecond precision
```

### String Types
```sql
-- String (variable length)
String

-- Fixed string (N bytes)
FixedString(N)

-- Low-cardinality string (dictionary encoding)
LowCardinality(String)
```
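Converting an existing column is a one-line `ALTER`; a sketch against the earlier `events` table:

```sql
-- event_type has few distinct values, so dictionary encoding pays off
ALTER TABLE events MODIFY COLUMN event_type LowCardinality(String);
```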
### Special Types

```sql
-- Array
Array(String)
Array(UInt32)

-- Map (key-value)
Map(String, Int32)

-- Tuple (struct)
Tuple(UInt32, String, Decimal(18, 2))

-- Nullable (allows NULL)
Nullable(UInt32)
Nullable(String)

-- UUID (16 bytes)
UUID
```
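A short example of querying these types; the `user_tags` table and its columns are invented for illustration:

```sql
CREATE TABLE user_tags (
    user_id UInt64,
    tags Array(String),
    attrs Map(String, String)
)
ENGINE = MergeTree()
ORDER BY user_id;

-- arrayJoin expands an array into one row per element
SELECT user_id, arrayJoin(tags) AS tag FROM user_tags;

-- Map values are accessed by key
SELECT user_id, attrs['country'] AS country FROM user_tags;
```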
## Performance Optimization
### Primary Key Design
```sql
-- Good: High-cardinality, filter-heavy columns
ORDER BY (user_id, event_timestamp)

-- Bad: A single low-cardinality column (coarse, inefficient index)
ORDER BY (event_type)

-- Good: Composite key for multi-column queries
ORDER BY (user_id, event_timestamp, event_type)
```
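To verify that a query actually uses the sort key, `EXPLAIN` can report index usage (a sketch against the `events` table, assuming a reasonably recent ClickHouse version):

```sql
-- indexes = 1 shows which parts/granules the primary index selects
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE user_id = 42;
```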
### Partitioning

```sql
-- Good: Date-based partitioning
PARTITION BY toYYYYMM(event_timestamp)

-- Acceptable: Bounded hash partitioning (fixed number of partitions)
PARTITION BY user_id % 100 -- 100 partitions

-- Bad: Too many partitions (keep the total well under ~1000)
PARTITION BY toYYYYMMDD(event_timestamp) -- 365 partitions per year
```
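Partition and part counts can be monitored via the built-in `system.parts` table:

```sql
-- Count active parts and rows per partition
SELECT partition, count() AS parts, sum(rows) AS total_rows
FROM system.parts
WHERE table = 'events' AND active
GROUP BY partition
ORDER BY partition;
```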
### Sampling

```sql
-- The table definition needs a sampling expression
ORDER BY (intHash32(user_id), event_timestamp)
SAMPLE BY intHash32(user_id)

-- Sample query (approximate result)
SELECT *
FROM events
SAMPLE 0.1 -- 10% sample
```
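Sampled aggregates can be scaled back up for a rough estimate:

```sql
-- Reading ~10% of rows, so multiply counts by 10 for an approximate total
SELECT count() * 10 AS approx_events
FROM events
SAMPLE 0.1;
```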
### Data Skipping

```sql
-- Skip indices (data skipping)
CREATE INDEX idx_event_type ON events (event_type)
TYPE minmax GRANULARITY 4;

CREATE INDEX idx_user_id_bloom ON events (user_id)
TYPE bloom_filter GRANULARITY 1;
```
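Note that a newly added skip index only applies to parts written after it was created; existing parts can be backfilled explicitly:

```sql
-- Rebuild the index for data that was already on disk
ALTER TABLE events MATERIALIZE INDEX idx_event_type;
```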
## Real-Time Ingestion

### High-Throughput Inserts
```sql
-- Batch insert (recommended)
INSERT INTO events VALUES
    (now(), 1, 'click', 100.0),
    (now(), 2, 'view', 0.0),
    (now(), 3, 'purchase', 500.0);
```

```bash
# Insert from file
cat events.csv | clickhouse-client --query="INSERT INTO events FORMAT CSV"
```
```sql
-- Insert from Kafka
CREATE TABLE events_kafka (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    amount Decimal(18, 2)
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse',
    kafka_format = 'JSONEachRow';

-- Materialize the Kafka stream into the target table
CREATE MATERIALIZED VIEW events_mv TO events
AS SELECT * FROM events_kafka;
```
### Buffer Engine (Buffering)

```sql
-- Buffer table absorbs small writes and flushes them to the target in batches
CREATE TABLE events_buffer AS events
ENGINE = Buffer(
    currentDatabase(),  -- Target database
    events,             -- Target table
    16,                 -- num_layers: independent buffer buckets
    10, 100,            -- min_time, max_time (seconds)
    10000, 1000000,     -- min_rows, max_rows
    10000000, 100000000 -- min_bytes, max_bytes
);
```

A buffer is flushed to the target table once all of the `min` thresholds, or any one of the `max` thresholds, is reached.
## Query Optimization

### Query Profiling
```sql
-- Enable query logging
SET log_queries = 1;
SET log_queries_min_type = 'QUERY_FINISH';

-- View the query log
SELECT
    query,
    query_duration_ms,
    memory_usage,
    read_rows,
    read_bytes
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 10;
```

### EXPLAIN Plan
```sql
-- View the query plan
EXPLAIN PLAN
SELECT user_id, SUM(amount)
FROM events
WHERE event_timestamp > now() - INTERVAL 1 DAY
GROUP BY user_id;

-- View the query execution pipeline
EXPLAIN PIPELINE
SELECT user_id, SUM(amount)
FROM events
WHERE event_timestamp > now() - INTERVAL 1 DAY
GROUP BY user_id;
```

### Materialized Views
```sql
-- Pre-aggregate data
CREATE MATERIALIZED VIEW events_daily_mv
ENGINE = SummingMergeTree()
ORDER BY (date, user_id)
AS SELECT
    toDate(event_timestamp) AS date,
    user_id,
    sum(amount) AS total_amount,
    count() AS event_count
FROM events
GROUP BY date, user_id;

-- Query the materialized view (fast)
SELECT * FROM events_daily_mv WHERE date >= yesterday();
```
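Because the backing SummingMergeTree collapses rows only at merge time, re-aggregating at read time guarantees exact results:

```sql
-- Re-aggregate in case some parts have not been merged yet
SELECT date, user_id, sum(total_amount) AS total_amount, sum(event_count) AS event_count
FROM events_daily_mv
WHERE date >= yesterday()
GROUP BY date, user_id;
```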
## Sharding and Replication

### Sharding Strategy

#### Distributed Table
```sql
-- Create a distributed table
CREATE TABLE events_distributed AS events
ENGINE = Distributed(
    cluster_name,        -- Cluster name from the server config
    database_name,       -- Database
    events,              -- Local table on each shard
    cityHash64(user_id)  -- Sharding key
);

-- Query the distributed table (fans out to all shards)
SELECT * FROM events_distributed;
```
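Writes can also go through the distributed table; rows are routed to a shard by the sharding key:

```sql
-- Routed to a shard by cityHash64(user_id)
INSERT INTO events_distributed VALUES (now(), 42, 'click', 1.00);
```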
### Replication Setup

Replication is coordinated through ZooKeeper (or ClickHouse Keeper), configured in the server config:

```xml
<zookeeper>
    <node>
        <host>zookeeper1</host>
        <port>2181</port>
    </node>
    <node>
        <host>zookeeper2</host>
        <port>2181</port>
    </node>
</zookeeper>
```
## Cost Optimization

### Compression
```sql
-- Compression codecs are set per column
ALTER TABLE events MODIFY COLUMN event_type String CODEC(ZSTD(3));

-- Options: NONE, LZ4, LZ4HC, ZSTD, ZSTD(1-22)
-- ZSTD generally gives the best compression ratio
```

Compression ratios:
- LZ4: 3-5x compression, fast
- ZSTD: 5-10x compression, slower
### TTL (Data Lifecycle)
```sql
-- Delete old data
ALTER TABLE events
MODIFY TTL event_timestamp + INTERVAL 30 DAY DELETE;

-- Move old data to cold storage (requires a storage policy with a 'cold' disk)
ALTER TABLE events
MODIFY TTL event_timestamp + INTERVAL 30 DAY TO DISK 'cold';

-- Recompress old data with a heavier codec
ALTER TABLE events
MODIFY TTL event_timestamp + INTERVAL 30 DAY RECOMPRESS CODEC(ZSTD(12));
```
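TTL rules can also be declared at table creation and combined; a sketch that assumes a storage policy defining a `cold` volume:

```sql
CREATE TABLE events_ttl (
    event_timestamp DateTime,
    user_id UInt64,
    event_type String
)
ENGINE = MergeTree()
ORDER BY (user_id, event_timestamp)
TTL event_timestamp + INTERVAL 30 DAY TO VOLUME 'cold', -- tier first
    event_timestamp + INTERVAL 90 DAY DELETE;           -- then expire
```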
## Senior Level Gotchas

### Gotcha 1: Mutations Are Async
Problem: ALTER TABLE mutations are asynchronous.
```sql
-- Bad: Assuming the mutation is immediate
ALTER TABLE events DELETE WHERE event_timestamp < '2025-01-01';
-- SELECT still returns old data until the mutation completes

-- Good: Monitor mutation progress
SELECT * FROM system.mutations
WHERE database = 'db' AND table = 'events' AND NOT is_done;
```
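When a statement must block until the delete is applied, the `mutations_sync` setting provides that:

```sql
-- 1 = wait for the mutation on the local server, 2 = wait on all replicas
SET mutations_sync = 1;
ALTER TABLE events DELETE WHERE event_timestamp < '2025-01-01';
```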
### Gotcha 2: ReplacingMergeTree Deduplication Delay

Problem: Deduplication happens on merge, not immediately.
```sql
-- Bad: Expecting immediate deduplication
INSERT INTO events VALUES (...);
INSERT INTO events VALUES (...); -- Duplicate
-- Both rows are visible immediately

-- Good: Force a merge, or deduplicate at read time with FINAL
SELECT * FROM events FINAL;
OPTIMIZE TABLE events FINAL;
```

### Gotcha 3: Limited Support for UPDATE/DELETE
Problem: ClickHouse is optimized for append-only workloads; row-level UPDATE and DELETE are not first-class operations.
Solutions:
- Use ReplacingMergeTree for deduplication
- Use CollapsingMergeTree for soft deletes (see the sketch after this list)
- Use TTL for data lifecycle
- Use mutations (async, expensive)
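As referenced above, a minimal CollapsingMergeTree sketch with illustrative values:

```sql
-- sign = 1 writes a state row; sign = -1 cancels a previously written row.
-- Matching +1/-1 pairs collapse away when parts merge.
CREATE TABLE user_state (
    user_id UInt64,
    balance Decimal(18, 2),
    sign Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY user_id;

INSERT INTO user_state VALUES (1, 100.00, 1);  -- initial state
INSERT INTO user_state VALUES (1, 100.00, -1); -- cancel the old state
INSERT INTO user_state VALUES (1, 250.00, 1);  -- new state
```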
## Best Practices

### DO
- Use MergeTree family engines (not Log engines for production)
- Partition by date (toYYYYMM)
- Order by high-cardinality, filter-heavy columns
- Use LowCardinality for low-cardinality strings
- Batch inserts (1000+ rows per batch)
- Use ReplicatedMergeTree for HA
- Use ZSTD compression for storage savings
### DON’T
- Use Log engines for production (not scalable)
- Partition by low-cardinality columns
- Use UPDATE/DELETE frequently (async, expensive)
- Insert single rows (slow, inefficient)
- Ignore TTL (data never expires)
- Forget to shard at scale
## Key Takeaways
- Real-time ingest: Millions of rows/second write throughput
- Columnar storage: Fast analytical queries
- MergeTree engine: Primary engine for production
- Partition by date: Automatic data lifecycle
- Sharding for scale: Horizontal scalability
- Replication for HA: Data replication across nodes
- Not for OLTP: Optimized for analytics, not transactions