Open Table Formats Comparison
Delta Lake vs. Apache Iceberg vs. Apache Hudi
Overview
Open Table Formats (OTFs) are the enabling technology for Lakehouse architectures. They bring database-like capabilities (ACID transactions, schema enforcement, time travel) to object storage. This document provides a comprehensive comparison of the three major OTFs to guide architectural decisions.
Quick Comparison Matrix
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Project Origin | Databricks (2017) | Netflix (2018) | Uber (2016) |
| Governance | Linux Foundation (2019) | Apache top-level (2020) | Apache top-level (2020) |
| Primary Language | Scala + Python | Java | Java |
| Ecosystem Ties | Databricks first | Broad adoption | Streaming focus |
| Maturity | Very mature | Mature | Mature |
| Best For | Databricks, high concurrency | Multi-cloud, high partition counts | Streaming, CDC, upserts |
Detailed Feature Comparison
Core Capabilities
| Capability | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| ACID Transactions | Full | Full | Full |
| Schema Enforcement | Yes | Yes | Yes |
| Schema Evolution | Yes | Yes | Yes |
| Time Travel | Yes (excellent) | Yes | Yes |
| Concurrency Model | Optimistic (fine-grained) | Optimistic (metadata) | Optimistic (locks) |
| Partition Evolution | No | Yes | No |
| Hidden Partitioning | Partial (generated columns) | Yes | No |
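Schema changes in the table above map to ordinary Spark SQL DDL in all three formats, while partition evolution is metadata-only and Iceberg-specific; a minimal sketch (table and column names are illustrative, and the partition statement requires Iceberg's Spark SQL extensions):

```sql
-- Additive schema evolution: supported by Delta, Iceberg, and Hudi
ALTER TABLE sales ADD COLUMNS (discount DECIMAL(10,2));

-- Partition evolution: Iceberg only; changes the spec for future writes
-- without rewriting existing data files
ALTER TABLE sales ADD PARTITION FIELD months(sale_date);
```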
Write Patterns
| Pattern | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Append | Excellent | Excellent | Excellent |
| Batch Upsert | Good (deletion vectors) | Good | Excellent (MOR) |
| Streaming Upsert | Good | Good | Excellent |
| Delete | Excellent (deletion vectors) | Good | Excellent (MOR) |
| CDC | Good | Good | Excellent |
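As an example of what "Good" CDC support looks like in Delta, Change Data Feed is enabled per table; the property is real Delta configuration, while `table_changes` is Databricks SQL (OSS Delta reads the feed through the DataFrame reader instead):

```sql
-- Enable Change Data Feed on an existing Delta table
ALTER TABLE sales SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Read row-level changes since table version 2 (Databricks SQL)
SELECT * FROM table_changes('sales', 2);
```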
Metadata Handling
| Aspect | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Metadata Location | _delta_log/ (JSON + Parquet checkpoints) | metadata/ (JSON + Avro manifests) | .hoodie/ (timeline) |
| Listing Performance | Good | Excellent (manifests) | Good |
| High Partition Count | Good | Excellent (best) | Good |
| Metastore | File-based or Unity | File-based or Hive | File-based or Hive |
| Catalog Support | Unity, Glue, Hive | Glue, Hive, custom | Glue, Hive |
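Each format also exposes this metadata through SQL, which is the quickest way to inspect the structures above; a short sketch using real Delta and Iceberg syntax (database and table names illustrative):

```sql
-- Delta Lake: commit history from the transaction log
DESCRIBE HISTORY sales;

-- Iceberg: metadata tables materialized from manifest files
SELECT * FROM db.sales.snapshots;
SELECT * FROM db.sales.manifests;
```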
Performance Characteristics
| Metric | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Write Speed | Fast | Fast | Moderate |
| Read Speed | Fast | Fast | Fast |
| Merge/Upsert | Very fast (deletion vectors) | Fast | Fast (MOR tables) |
| Compaction | Manual (OPTIMIZE; automated on Databricks) | Manual (rewrite_data_files) | Automatic (MOR) |
| Query Planning | Fast | Very fast (high partitions) | Fast |
Architecture Comparison
Delta Lake Architecture
Key characteristics:
- Transaction log as single source of truth
- Checkpoint files for faster log replay
- Deletion vectors for efficient updates/deletes (see the snippet after this list)
- Optimized for high concurrency
- Deep Spark integration
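Deletion vectors are opt-in per table via a property (the property name is real Delta configuration; the table name is illustrative):

```sql
-- With deletion vectors enabled, DELETE/UPDATE/MERGE record deleted rows
-- in compact bitmap sidecars instead of rewriting whole Parquet files
ALTER TABLE sales SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');
```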
Strengths:
- Best Spark integration
- Deletion vectors (major performance win)
- Best for Databricks workloads
- Strong concurrency control
- Excellent time travel
Weaknesses:
- Databricks origin (perceived vendor tie)
- Less community diversity vs. Iceberg
- Historically less focus on non-Spark engines
Apache Iceberg Architecture
Key characteristics:
- Manifest-based metadata (extremely scalable)
- Lock-free reads (snapshot metadata is immutable)
- Best for high partition counts
- Strong community, multi-cloud adoption
- Vendor-neutral design
Strengths:
- Best metadata scalability (millions of partitions)
- Strongest community diversity (Snowflake, Databricks, and BigQuery all support it)
- Vendor-neutral, no lock-in concerns
- Excellent for non-Spark engines (Trino, BigQuery)
- Hidden partitioning (no partition columns in queries)
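Hidden partitioning in practice, assuming a table partitioned by `days(sale_date)` as in the code examples later in this document:

```sql
-- No partition column appears in the query; Iceberg maps the sale_date
-- predicate onto the days(sale_date) transform and prunes partitions
SELECT sum(amount)
FROM sales
WHERE sale_date BETWEEN DATE '2025-01-01' AND DATE '2025-01-31';
```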
Weaknesses:
- Less mature write optimization vs. Delta
- Merge-on-read relies on delete files rather than Delta-style deletion vectors (deletion vectors are planned for format v3)
- More complex to set up initially
Apache Hudi Architecture
Key characteristics:
- Table type choice (COW vs. MOR)
- Built for streaming and upsert workloads
- Built-in compaction and management
- Timeline-based operations
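The timeline is queryable through Hudi's Spark SQL call procedures (the procedure names are real Hudi procedures; the table name is illustrative):

```sql
-- List recent commits on the timeline
CALL show_commits(table => 'sales', limit => 10);

-- Inspect pending and completed compactions on a MOR table
CALL show_compaction(table => 'sales');
```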
Strengths:
- Best for streaming and CDC workloads
- MOR tables enable fast upserts
- Built-in compaction (less operational overhead)
- Strong for real-time use cases
- Hudi CLI for management operations
Weaknesses:
- More complex (two table types)
- Smaller community vs. Delta/Iceberg
- More operational complexity (compaction tuning)
- Less adoption in pure batch workloads
Table Type Deep Dive (Hudi)
Copy-on-Write (COW)
Characteristics:
- Updates rewrite entire Parquet files
- Reads: Parquet only (fast)
- Writes: Slower due to rewrite
- Best for: Read-heavy, write-light workloads
Performance:
- Write latency: High (rewrite cost)
- Read latency: Low (Parquet only)
- Storage: Higher (version files)
Merge-on-Read (MOR)
Characteristics:
- Updates append to log files (fast)
- Reads: Merge base + log (slower) or base-only (faster)
- Compaction merges logs into base files
- Best for: Write-heavy, streaming workloads
Performance:
- Write latency: Low (append only)
- Read latency: Medium (merge overhead) to Low (read-optimized view)
- Storage: Medium (logs + base)
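When a MOR table is synced to a Hive-compatible catalog, Hudi registers two views over the same data, so readers choose their point on the latency/freshness trade-off (the `_ro`/`_rt` suffixes are Hudi's defaults; the table name is illustrative):

```sql
-- Read-optimized view: base Parquet files only (lowest read latency)
SELECT * FROM sales_mor_ro;

-- Real-time view: merges base files with log files (freshest data)
SELECT * FROM sales_mor_rt;
```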
Selection Framework
Selection by Use Case
| Use Case | Recommended | Rationale |
|---|---|---|
| Databricks Platform | Delta Lake | Native support, best integration, deletion vectors |
| Snowflake | Iceberg or Delta | Both supported, Iceberg more vendor-neutral |
| BigQuery | Iceberg | Native support, optimized |
| Pure Spark (OSS) | Delta or Iceberg | Delta for features, Iceberg for community |
| Trino/Presto | Iceberg | Best support, metadata scalability |
| Streaming + Upserts | Hudi MOR | Built for streaming, fast upserts |
| CDC Pipelines | Hudi or Delta | Hudi for pure streaming, Delta with Structured Streaming |
| High Partition Count | Iceberg | Best metadata scalability |
| Multi-cloud Strategy | Iceberg | Most vendor-neutral |
| Real-time Personalization | Hudi MOR | Fast writes, fast enough reads |
| Batch ETL | Delta or Iceberg | Both excellent, Delta faster merges |
Performance Benchmarks
Write Performance
| Operation | Delta Lake | Iceberg | Hudi COW | Hudi MOR |
|---|---|---|---|---|
| Append (100M rows) | 1.0x (baseline) | 1.05x | 1.0x | 1.1x |
| Merge (10M upserts) | 1.0x (baseline) | 1.3x | 1.5x | 0.8x |
| Delete (10M rows) | 1.0x (baseline) | 1.4x | 1.2x | 0.9x |
| CDC (continuous) | Good | Good | N/A | Excellent |
Relative to the Delta Lake baseline; higher = slower. Hudi MOR is fastest for upserts and deletes.
Read Performance
| Operation | Delta Lake | Iceberg | Hudi COW | Hudi MOR (RO) |
|---|---|---|---|---|
| Full scan | 1.0x | 1.05x | 1.0x | 1.0x |
| Filtered scan | 1.0x | 1.0x | 1.0x | 1.0x |
| Point lookup | 1.0x | 1.0x | 1.0x | 1.1x |
| Time travel | 1.0x | 1.1x | 1.2x | 1.2x |
All formats deliver excellent read performance; Delta has an edge in time travel.
Query Planning (Metadata Operations)
| Partition Count | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| 100 | 10ms | 8ms | 12ms |
| 1,000 | 50ms | 15ms | 80ms |
| 10,000 | 500ms | 25ms | 800ms |
| 100,000 | 5,000ms | 50ms | N/A |
| 1,000,000 | N/A | 100ms | N/A |
Iceberg dominates at high partition counts due to manifest-based metadata.
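Iceberg plans from a bounded set of manifest files rather than listing partitions, and those planning inputs are visible as metadata tables (the names are real Iceberg metadata tables; database/table names illustrative):

```sql
-- Per-partition statistics derived from manifests, no directory listing
SELECT * FROM db.sales.partitions;

-- The data files a scan would consider
SELECT * FROM db.sales.files;
```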
Cost Implications
Storage Costs
| Scenario | Delta Lake | Iceberg | Hudi | Notes |
|---|---|---|---|---|
| Append-only | Baseline | Baseline | Baseline | Minimal difference |
| Heavy updates | +20% (deletion vectors) | +40% (rewrite) | +30% (COW) / +15% (MOR) | Hudi MOR most storage-efficient |
| Time travel retention | +10-30% | +10-30% | +15-40% | Configurable retention |
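Retention is tunable per table; for example, in Delta Lake (real Delta property names, values illustrative):

```sql
-- Keep 30 days of time travel, trading storage for history depth
ALTER TABLE sales SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 30 days',
  'delta.logRetentionDuration' = 'interval 30 days'
);
```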
Compute Costs
| Operation | Delta Lake | Iceberg | Hudi | Notes |
|---|---|---|---|---|
| Batch ETL | Baseline | +5% | +10% | Delta has best Spark optimization |
| Streaming writes | Baseline | +5% | -10% (MOR) | Hudi MOR fastest for streaming |
| Merge operations | Baseline | +20% | -20% (MOR) | Deletion vectors make Delta fast; Hudi MOR is faster still |
| Query planning | 1.0x | 0.3x (high partitions) | 2.0x (high partitions) | Iceberg wins at scale |
Operational Costs
| Area | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| Compaction | Manual (OPTIMIZE) | Manual (rewrite) | Automatic (MOR) |
| Vacuum/Cleanup | Manual (VACUUM) | Manual (expire_snapshots) | Automatic (cleaner) |
| Monitoring | Required | Required | Required |
| Ops effort | Medium | Medium | Low (Hudi auto-manages) |
Implementation Considerations
Migration Complexity
| From | To Delta Lake | To Iceberg | To Hudi |
|---|---|---|---|
| Hive tables | Easy (CONVERT TO DELTA) | Easy (migrate procedure) | Medium (bulk_insert) |
| Warehouse | Medium | Medium | Medium |
| Delta Lake | N/A | Easy | Medium |
| Iceberg | Easy | N/A | Medium |
| Hudi | Medium | Medium | N/A |
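The in-place paths in the first row look like this in practice (both statements are real syntax; paths and names are illustrative):

```sql
-- Hive/Parquet directory to Delta Lake, in place
CONVERT TO DELTA parquet.`s3://bucket/path/to/table`
PARTITIONED BY (sale_date DATE);

-- Hive table to Iceberg, in place (Spark call procedure)
CALL catalog.system.migrate('db.sales');
```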
Ecosystem Support
| Tool / Platform | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| Databricks | Native | Via Unity Catalog | Limited |
| Spark | Excellent | Excellent | Excellent |
| Trino/Presto | Good | Excellent | Good |
| Flink | Good | Good | Excellent |
| BigQuery | Via BigLake | Native | Limited (sync tool) |
| Snowflake | Supported | Native | No |
| Redshift | No | Via Spectrum | No |
| Pandas/Polars | Good | Good | Poor |
Community & Maturity
| Metric | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| GitHub Stars | 6.5K | 5.5K | 4.8K |
| Contributors | 400+ | 350+ | 250+ |
| Adoption | Very high | Very high | Medium |
| Vendor Support | Databricks, Microsoft, Google | AWS, Google, Adobe, Netflix | Uber, Amazon (EMR) |
Code Examples
Delta Lake
```sql
-- Create table
CREATE TABLE sales (
  id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(10,2),
  sale_date DATE
) USING DELTA;

-- Upsert
MERGE INTO sales target
USING updates source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel
SELECT * FROM sales VERSION AS OF 123;
SELECT * FROM sales TIMESTAMP AS OF '2025-01-26';

-- Optimize
OPTIMIZE sales WHERE sale_date >= '2025-01-01';

-- Vacuum (retention below the 7-day default requires disabling the safety check)
VACUUM sales RETAIN 30 HOURS;
```
Apache Iceberg
```sql
-- Create table
CREATE TABLE sales (
  id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(10,2),
  sale_date DATE
) USING ICEBERG
PARTITIONED BY (days(sale_date));

-- Upsert
MERGE INTO sales target
USING updates source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel
SELECT * FROM sales VERSION AS OF 123;
SELECT * FROM sales TIMESTAMP AS OF '2025-01-26';

-- Rewrite (compact)
CALL catalog.system.rewrite_data_files('db.sales');

-- Expire snapshots
CALL catalog.system.expire_snapshots('db.sales', TIMESTAMP '2025-01-01 00:00:00');
```
Apache Hudi
```sql
-- Create COW table
CREATE TABLE sales (
  id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(10,2),
  sale_date DATE
) USING HUDI
OPTIONS (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'sale_date'
);

-- Create MOR table
CREATE TABLE sales_mor (
  id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(10,2),
  sale_date DATE
) USING HUDI
OPTIONS (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'sale_date'
);

-- Upsert
INSERT INTO sales VALUES (...);
INSERT INTO sales VALUES (...); -- automatic upsert on the primary key

-- Time travel
SELECT * FROM sales WHERE `_hoodie_commit_time` <= '20250126120000';

-- Compaction (MOR)
CALL run_compaction(op => 'run', table => 'db.sales');
```
Senior Level Recommendations
For Greenfield Projects
| Scenario | Recommendation |
|---|---|
| Databricks shop | Delta Lake (native, deletion vectors) |
| Multi-cloud strategy | Iceberg (vendor-neutral) |
| Streaming-first | Hudi MOR (built for streaming) |
| Uncertain | Iceberg (safest long-term bet) |
| Snowflake + Spark | Iceberg (supported on both) |
For Brownfield (Existing Projects)
| From | To | Recommendation |
|---|---|---|
| Hive on S3 | Any OTF | Delta (easiest migration) |
| Redshift/Snowflake | Lakehouse | Iceberg (vendor-neutral) |
| Databricks Delta | Open source | Stay with Delta OSS |
| On-prem Hadoop | Cloud | Iceberg or Hudi (CDC capability) |
Strategic Considerations
- Vendor lock-in: Iceberg is most vendor-neutral
- Databricks investment: Delta is optimized for Databricks
- Streaming requirements: Hudi MOR is purpose-built
- Community momentum: Iceberg has most diverse adoption
- Skill availability: Delta has most Spark developers
Key Takeaways
- All three are production-ready: No wrong choice among the three
- Delta Lake: Best for Databricks, high concurrency, fastest merges
- Iceberg: Best for multi-cloud, high partition counts, vendor neutrality
- Hudi: Best for streaming, CDC, and upsert-heavy workloads
- Migration is possible: Can migrate between formats (not trivial)
- Coexistence is possible: All three can coexist in same platform
- Cost difference is minimal: Storage costs similar, compute varies by workload