Lakehouse Concepts

The Architecture That Unified Data Lake and Warehouse


Overview

The Lakehouse architecture emerged between 2020 and 2025 as a paradigm shift that combines the best aspects of data lakes (low-cost storage, flexibility) with those of data warehouses (ACID transactions, schema enforcement). It is enabled by Open Table Formats (OTFs) and is now the de facto standard for modern data platforms.


Historical Context

The Evolution of Data Architecture

The Problem Lakehouse Solves

| Problem | Data Lake | Data Warehouse | Lakehouse |
| --- | --- | --- | --- |
| Storage Cost | Low ($/TB) | High (premium compute) | Low ($/TB) |
| ACID Transactions | No | Yes | Yes (via OTF) |
| Schema Enforcement | Weak | Strong | Strong |
| Data Quality | Poor (swamp risk) | Good | Good |
| BI Support | Poor | Excellent | Excellent |
| ML/AI Support | Good | Poor | Good |
| Storage Duplication | Single copy | Duplicate storage | Single copy |
| Query Performance | Variable | Excellent | Excellent |

Lakehouse Architecture

Core Components

Key Principles

  1. Unified Storage: Single copy of data in object storage
  2. Compute-Storage Separation: Independent scaling
  3. Open Formats: No vendor lock-in (Parquet + OTF)
  4. ACID Transactions: Reliable, concurrent operations
  5. Schema Enforcement: Quality at ingest
  6. BI Compatibility: Standard SQL, high performance
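
To make the principles concrete, here is a minimal sketch in Delta-Lake-flavored Spark SQL (the table name, columns, and bucket path are hypothetical): a single copy of the data lives in object storage, in an open format, with the schema enforced on every write.

CREATE TABLE sales_events (
  event_id   BIGINT,
  country    STRING,
  amount     DECIMAL(10, 2),
  event_date DATE
)
USING DELTA
LOCATION 's3://example-bucket/lakehouse/sales_events';

-- BI dashboards and ML pipelines both read this single copy; no separate warehouse load is required.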

The Open Table Format Enabler

What OTFs Provide

How OTFs Work

Traditional Hive:

table/
├── dt=2025-01-27/
│   ├── part-00001.parquet
│   ├── part-00002.parquet
│   └── ...
├── dt=2025-01-28/
│   ├── part-00001.parquet
│   └── ...
  • No transaction log
  • No atomic commits
  • No schema enforcement
  • Manual “list files” for discovery

Lakehouse with OTF:

table/
├── _delta_log/ (or metadata/)
│   ├── 000000.json
│   ├── 000001.json
│   └── ...
├── dt=2025-01-27/
│   ├── part-00001.parquet
│   └── ...
  • Transaction log tracks all changes
  • Atomic commits (all-or-nothing)
  • Schema enforcement and evolution
  • Metadata-driven queries
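
Because every change flows through the transaction log, table state can be inspected with metadata queries instead of listing files. A small sketch in Delta Lake's Spark SQL (the commands are Delta-specific; Iceberg and Hudi expose similar metadata tables, and the table name is hypothetical):

DESCRIBE HISTORY sales_events;   -- one row per committed transaction: version, timestamp, operation
DESCRIBE DETAIL sales_events;    -- current snapshot: location, format, file count, total size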

Core Capabilities

1. ACID Transactions

What it means:

  • Atomicity: Commits are all-or-nothing
  • Consistency: Schema and constraints enforced
  • Isolation: Readers don’t see partial writes
  • Durability: Committed data is permanent

Why it matters:

  • Multiple concurrent writers don’t corrupt data
  • Queries see consistent snapshots
  • Failed writes don’t leave partial data

Example scenario:

-- Without ACID (Hive):
-- Writer writes 5 files, crashes after 3
-- Reader sees partial data - INCONSISTENT
-- With ACID (Lakehouse):
-- Writer commits transaction
-- Either all 5 files visible, or none
-- Reader always sees consistent state

2. Schema Enforcement & Evolution

Schema Enforcement:

-- Table expects: col1 INT, col2 STRING
INSERT INTO table VALUES (1, 'valid'); -- Success
INSERT INTO table VALUES ('invalid', 2); -- Rejected

Schema Evolution:

-- Add column - non-breaking
ALTER TABLE table ADD COLUMN col3 STRING;
-- Drop column - breaking (needs explicit handling)
ALTER TABLE table DROP COLUMN col2;

Types of Evolution:

| Operation | Safe? | Action Required |
| --- | --- | --- |
| Add column | Yes | None |
| Rename column | No | Update downstream queries |
| Change type | No | Full rewrite or cast |
| Drop column | No | Update downstream queries |

3. Time Travel

What it means: Query previous versions of data.

Use cases:

  • Debugging data issues
  • ML model reproducibility
  • Audit and compliance
  • Rollback from bad data

Example:

-- Current data
SELECT * FROM table;
-- Data as of a specific version
SELECT * FROM table VERSION AS OF 12345;
-- Data as of a specific point in time
SELECT * FROM table TIMESTAMP AS OF '2025-01-26';
-- Version history (Iceberg metadata table; Delta uses DESCRIBE HISTORY)
SELECT * FROM table.history;

Cost Implication: Time travel retains old versions. Typical retention: 30 days. Long retention increases storage costs.
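
Retention is usually a table-level setting. A sketch using Delta Lake table properties (property names and defaults are Delta-specific; Iceberg and Hudi use different knobs, and the table name is hypothetical):

ALTER TABLE sales_events SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 30 days',          -- how far back version history is kept
  'delta.deletedFileRetentionDuration' = 'interval 7 days'    -- how long removed data files stay available for time travel
);

-- Physically delete data files outside the retention window to reclaim storage.
VACUUM sales_events;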

4. Incremental Updates

Problem: In data lakes, updating a single record required rewriting the entire partition.

Solution: OTFs support efficient MERGE, UPDATE, DELETE operations.

MERGE INTO target
USING updates
ON target.id = updates.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Performance comparison:

| Operation | Hive (Rewrite) | Lakehouse (Merge) |
| --- | --- | --- |
| Update 1M rows in 1TB | Rewrite 1TB (~30 min, $50) | Update 1M rows (~5 min, $5) |

5. Concurrency Control

Challenge: Multiple readers/writers accessing same data.

Solution: Optimistic concurrency control.

  • Writers assume no conflicts
  • On commit, detect conflicts
  • Retry if conflict occurred

Implementation varies:

  • Delta Lake: Fine-grained, high concurrency
  • Iceberg: Optimistic, metadata-level locking
  • Hudi: Optimistic, with external lock providers (ZooKeeper, Hive Metastore)

Data Layout & Organization

Medallion Architecture

Cost consideration: 2-3x storage for multiple layers, justified by:

  • Faster queries (Gold is optimized)
  • Reduced reprocessing (replay from Bronze)
  • Better data quality (errors caught upstream)

Optimization: Use views for Silver where appropriate to reduce storage duplication.
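
A minimal sketch of the three layers in Delta-flavored Spark SQL (all table, view, and column names are hypothetical), including a view-based Silver layer as suggested above:

-- Bronze: raw payloads as ingested, append-only, kept for replay
CREATE TABLE bronze_orders (raw_payload STRING, ingest_ts TIMESTAMP) USING DELTA;

-- Silver: cleaned and typed; a view avoids duplicating storage when the transformation is cheap
CREATE VIEW silver_orders AS
SELECT CAST(get_json_object(raw_payload, '$.order_id') AS BIGINT)        AS order_id,
       CAST(get_json_object(raw_payload, '$.amount')   AS DECIMAL(10,2)) AS amount,
       ingest_ts
FROM bronze_orders
WHERE get_json_object(raw_payload, '$.order_id') IS NOT NULL;

-- Gold: aggregated, query-optimized table for BI
CREATE TABLE gold_daily_revenue USING DELTA AS
SELECT DATE(ingest_ts) AS dt, SUM(amount) AS revenue
FROM silver_orders
GROUP BY DATE(ingest_ts);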

Partitioning Strategy

Best practices:

  • Partition by filter-heavy, low-to-moderate-cardinality columns (typically date)
  • Target: 100M-1B rows per partition for optimal pruning
  • Avoid: Over-partitioning (thousands of small partitions)
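
A sketch of a date-partitioned table in Delta-flavored Spark SQL (names are hypothetical):

CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  country STRING,
  dt      DATE
)
USING DELTA
PARTITIONED BY (dt);   -- a single filter-heavy, low-cardinality column; resist adding hour/country/product

-- Filters on the partition column read only the matching partitions:
SELECT COUNT(*) FROM page_views WHERE dt = DATE '2025-01-27';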

Performance Features

Data Skipping

How it works: OTFs maintain statistics (min/max) per file for each column.

Result: 50-90% I/O reduction for selective queries.
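
A hedged sketch of what those statistics look like, using Iceberg's files metadata table (Iceberg-flavored Spark SQL; Delta stores equivalent per-file stats in its transaction log, and the table name is hypothetical):

-- Per-file row counts and per-column lower/upper bounds:
SELECT file_path, record_count, lower_bounds, upper_bounds
FROM analytics.events.files;

-- A selective predicate is compared against those bounds, and files whose
-- min/max range cannot contain matching rows are never read:
SELECT COUNT(*) FROM analytics.events WHERE user_id = 4221;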

Z-Ordering / Clustering

Problem: Partitioning organizes data along a single dimension, so data skipping is effective only for filters on that one column.

Solution: Multi-dimensional clustering enables data skipping on multiple columns.

How it works:

  1. Interleave bits from multiple columns (Z-order curve)
  2. Sort data by Z-order value
  3. Store in sorted order

Example: Z-order on (country, date)

  • Query filtering on country: skip irrelevant files
  • Query filtering on date: skip irrelevant files
  • Query filtering on both: even better skipping

Cost: 1.5-2x write cost for 5-10x query improvement on filtered queries.
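
A sketch using Delta Lake's OPTIMIZE ... ZORDER BY (Iceberg achieves a similar effect with a table sort order, Hudi with clustering; the table is hypothetical and not partitioned by either column):

OPTIMIZE events
ZORDER BY (country, event_date);

-- After clustering, filters on either column, or both, can skip most files:
SELECT COUNT(*) FROM events
WHERE country = 'DE' AND event_date >= DATE '2025-01-01';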

File Size Optimization

| File Size | Read Performance | Metadata Overhead | Recommendation |
| --- | --- | --- | --- |
| < 128MB | Poor (many small reads) | High | Too small |
| 128MB-1GB | Good | Manageable | Optimal |
| > 1GB | Good | Low | Good for large scans |
| > 10GB | No additional benefit | Low | Parallelization limited |

Target: Aim for 256MB-512MB Parquet files for most workloads.
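
Compaction is typically scheduled rather than left to chance. A sketch of the two common forms (table and catalog names are hypothetical):

-- Delta Lake: bin-pack small files into larger ones
OPTIMIZE page_views;

-- Iceberg (Spark procedure): rewrite small files, targeting ~512MB outputs
CALL my_catalog.system.rewrite_data_files(
  table   => 'analytics.page_views',
  options => map('target-file-size-bytes', '536870912')
);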


Migration Strategy

From Data Warehouse to Lakehouse

Migration approaches:

| Approach | Timeline | Risk | Cost |
| --- | --- | --- | --- |
| Big Bang | Weeks | High | Medium (dual-write period) |
| Phased | Months | Low | High (run both systems) |
| Hybrid | Months | Medium | Medium (federation) |

Cost Analysis

Storage Cost Comparison

Scenario: 100TB data, growing 10TB/month

| Architecture | Storage | Duplication | Monthly Cost |
| --- | --- | --- | --- |
| Data Warehouse | 100TB | Warehouse storage | $2,300 |
| Data Lake + Warehouse | 200TB | Both | $1,150 (lake) + $2,300 (warehouse) = $3,450 |
| Lakehouse | 100TB | Single copy | $1,150 |

Savings: Lakehouse saves $1,150-2,300/month ($13,800-27,600/year) per 100TB.

Query Cost Comparison

| Scenario | Warehouse | Lakehouse | Savings |
| --- | --- | --- | --- |
| Full table scan | $10.00 | $3.00 | 70% |
| Filtered query | $2.00 | $0.50 | 75% |
| Large aggregation | $50.00 | $15.00 | 70% |

Why: Lakehouse uses cheaper storage + flexible compute (spot instances, serverless).

Total Cost of Ownership

Typical TCO breakdown:

  • Lakehouse: 40% storage, 40% compute, 15% ops, 5% vendor fees
  • Warehouse: 30% storage, 40% compute, 10% ops, 20% vendor premium

Senior Level Considerations

When to Use Lakehouse

Good fit:

  • Analytics + ML on same data
  • Need for both batch and streaming
  • Cost-sensitive workloads
  • Multi-cloud strategy
  • Custom processing requirements

Consider warehouse instead:

  • Pure BI, no ML
  • Want fully managed (no ops)
  • Small data (< 10TB)
  • Need premium support

Anti-Patterns

Anti-Pattern 1: Using Lakehouse as a “better Hive”

  • Wrong: Just storing Parquet files without OTF
  • Right: Use OTF for ACID, schema enforcement

Anti-Pattern 2: Ignoring file size optimization

  • Wrong: Letting small files accumulate
  • Right: Continuous compaction, target 256MB+ files

Anti-Pattern 3: Over-partitioning

  • Wrong: Partition by date, hour, country, product
  • Right: Partition by date, use Z-order for other columns

Anti-Pattern 4: Skipping time travel configuration

  • Wrong: Keep default retention and let old snapshots accumulate, inflating storage and table metadata
  • Right: Configure retention based on compliance needs

Architecture Decision Template

When deciding on Lakehouse:

| Question | Consideration |
| --- | --- |
| Data volume? | > 10TB strongly favors Lakehouse |
| Workload types? | Analytics + ML = Lakehouse; pure BI = Warehouse OK |
| Concurrency? | High concurrent writes = Delta; high partition count = Iceberg |
| Team skills? | Open-source skills = Lakehouse; SQL-only = Warehouse |
| Cost sensitivity? | Cost-sensitive = Lakehouse; budget for premium = Warehouse |
| Multi-cloud? | Lakehouse enables portability; Warehouse = vendor lock-in |

Key Takeaways

  1. Lakehouse = Lake + Warehouse: Best of both worlds via OTF
  2. Cost savings: 40-60% by eliminating duplicate storage
  3. Open formats: No vendor lock-in, Parquet + OTF
  4. ACID transactions: Reliable, concurrent operations
  5. File size matters: Target 256MB-512MB to avoid small files problem
  6. Z-order for multi-column: Enables multi-dimensional data skipping
  7. Migration is gradual: Can coexist with warehouse during transition

Next: Open Table Formats Comparison