Lakehouse Concepts
The Architecture That Unified Data Lake and Warehouse
Overview
The Lakehouse architecture represents the paradigm shift of 2020-2025 that combines the best aspects of data lakes (low-cost storage, flexibility) with those of data warehouses (ACID transactions, schema enforcement). It is enabled by Open Table Formats (OTFs) and is now the de facto standard for modern data platforms.
Historical Context
The Evolution of Data Architecture
The Problem Lakehouse Solves
| Problem | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Storage Cost | Low ($/TB) | High (premium pricing) | Low ($/TB) |
| ACID Transactions | No | Yes | Yes (via OTF) |
| Schema Enforcement | Weak | Strong | Strong |
| Data Quality | Poor (swamp risk) | Good | Good |
| BI Support | Poor | Excellent | Excellent |
| ML/AI Support | Good | Poor | Good |
| Storage Duplication | Single copy | Duplicated (lake + warehouse in practice) | Single copy |
| Query Performance | Variable | Excellent | Excellent |
Lakehouse Architecture
Core Components
Key Principles
- Unified Storage: Single copy of data in object storage
- Compute-Storage Separation: Independent scaling
- Open Formats: No vendor lock-in (Parquet + OTF)
- ACID Transactions: Reliable, concurrent operations
- Schema Enforcement: Quality at ingest
- BI Compatibility: Standard SQL, high performance
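A minimal sketch of these principles in practice, using Delta Lake flavored Spark SQL; the table name and S3 path are illustrative assumptions:

```sql
-- Single copy of data in object storage, in an open format (Parquet + OTF).
-- Any engine that speaks the format can attach to the same table.
CREATE TABLE sales (
  order_id   BIGINT,
  country    STRING,
  order_date DATE,
  amount     DECIMAL(10, 2)
)
USING DELTA                                 -- open table format
LOCATION 's3://my-bucket/lakehouse/sales';  -- low-cost object storage

-- Standard SQL with ACID guarantees and schema enforcement on write
INSERT INTO sales VALUES (1, 'DE', DATE '2025-01-27', 19.99);
```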
The Open Table Format Enabler
What OTFs Provide
How OTFs Work
Traditional Hive:
```
table/
├── dt=2025-01-27/
│   ├── part-00001.parquet
│   ├── part-00002.parquet
│   └── ...
├── dt=2025-01-28/
│   ├── part-00001.parquet
│   └── ...
```
- No transaction log
- No atomic commits
- No schema enforcement
- Manual “list files” for discovery
Lakehouse with OTF:
```
table/
├── _delta_log/  (or metadata/)
│   ├── 000000.json
│   ├── 000001.json
│   └── ...
├── dt=2025-01-27/
│   ├── part-00001.parquet
│   └── ...
```
- Transaction log tracks all changes
- Atomic commits (all-or-nothing)
- Schema enforcement and evolution
- Metadata-driven queries
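Because every change goes through the log, table state can be inspected from metadata alone. A short sketch using Delta Lake SQL (other formats expose similar metadata tables); `sales` is the illustrative table from above:

```sql
-- Commit history: operation, timestamp, and version of each change
DESCRIBE HISTORY sales;

-- Current layout: number of files, total size, location, format
DESCRIBE DETAIL sales;
```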
Core Capabilities
1. ACID Transactions
What it means:
- Atomicity: Commits are all-or-nothing
- Consistency: Schema and constraints enforced
- Isolation: Readers don’t see partial writes
- Durability: Committed data is permanent
Why it matters:
- Multiple concurrent writers don’t corrupt data
- Queries see consistent snapshots
- Failed writes don’t leave partial data
Example scenario:
```sql
-- Without ACID (Hive):
-- Writer writes 5 files, crashes after 3
-- Reader sees partial data - INCONSISTENT

-- With ACID (Lakehouse):
-- Writer commits transaction
-- Either all 5 files visible, or none
-- Reader always sees consistent state
```
2. Schema Enforcement & Evolution
Schema Enforcement:
```sql
-- Table expects: col1 INT, col2 STRING
INSERT INTO table VALUES (1, 'valid');    -- Success
INSERT INTO table VALUES ('invalid', 2);  -- Rejected
```
Schema Evolution:
```sql
-- Add column - non-breaking
ALTER TABLE table ADD COLUMN col3 STRING;

-- Drop column - breaking (needs explicit handling)
ALTER TABLE table DROP COLUMN col2;
```
Types of Evolution:
| Operation | Safe? | Action Required |
|---|---|---|
| Add column | Yes | None |
| Rename column | No | Update downstream queries |
| Change type | No | Full rewrite or cast |
| Drop column | No | Update downstream queries |
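As a concrete example of the "Action Required" column: in Delta Lake a rename only becomes safe once name-based column mapping is enabled. This is a hedged, Delta-specific sketch using a hypothetical `customers` table:

```sql
-- Enable name-based column mapping so a rename does not break existing data files
ALTER TABLE customers SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion'   = '2',
  'delta.minWriterVersion'   = '5'
);

-- The rename itself is metadata-only, but downstream queries still need updating
ALTER TABLE customers RENAME COLUMN zip TO postal_code;
```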
3. Time Travel
What it means: Query previous versions of data.
Use cases:
- Debugging data issues
- ML model reproducibility
- Audit and compliance
- Rollback from bad data
Example:
```sql
-- Current data
SELECT * FROM table;

-- Data as of yesterday
SELECT * FROM table VERSION AS OF 12345;
SELECT * FROM table TIMESTAMP AS OF '2025-01-26';

-- History
SELECT * FROM table.history;
```
Cost Implication: Time travel retains old versions. Typical retention: 30 days. Long retention increases storage costs.
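How retention is configured depends on the format. A hedged Delta Lake example that pins history to roughly 30 days and then removes unreferenced files (table name reused from the earlier sketch):

```sql
-- Keep ~30 days of commit history and of deleted data files
ALTER TABLE sales SET TBLPROPERTIES (
  'delta.logRetentionDuration'         = 'interval 30 days',
  'delta.deletedFileRetentionDuration' = 'interval 30 days'
);

-- Physically delete files no longer referenced by any retained version
VACUUM sales RETAIN 720 HOURS;  -- 720 hours = 30 days
```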
4. Incremental Updates
Problem: In data lakes, updating a record required rewriting entire partition.
Solution: OTFs support efficient MERGE, UPDATE, DELETE operations.
```sql
MERGE INTO target
USING updates
ON target.id = updates.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
Performance comparison:
| Operation | Hive (Rewrite) | Lakehouse (Merge) |
|---|---|---|
| Update 1M rows in 1TB | Rewrite 1TB (~30 min, $50) | Update 1M rows (~5 min, $5) |
5. Concurrency Control
Challenge: Multiple readers/writers accessing same data.
Solution: Optimistic concurrency control.
- Writers assume no conflicts
- On commit, detect conflicts
- Retry if conflict occurred
Implementation varies:
- Delta Lake: Fine-grained, high concurrency
- Iceberg: Optimistic, metadata-level locking
- Hudi: Lock-based (ZooKeeper, Hive Metastore)
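A hedged sketch of how an optimistic-concurrency conflict plays out at the SQL level; exact error and retry behavior differ by engine and format:

```sql
-- Writer A and Writer B both start from snapshot version 41 of the table.

-- Writer A commits first; the table moves to version 42.
DELETE FROM sales WHERE order_date < DATE '2024-01-01';

-- Writer B's commit is then validated against version 42. If it rewrote any
-- of the same files, a concurrent-modification conflict is raised and the
-- write must be retried from the new snapshot; otherwise it commits as 43.
UPDATE sales SET country = 'DE' WHERE country = 'DEU';
```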
Data Layout & Organization
Medallion Architecture
Cost consideration: 2-3x storage for multiple layers, justified by:
- Faster queries (Gold is optimized)
- Reduced reprocessing (replay from Bronze)
- Better data quality (errors caught upstream)
Optimization: Use views for Silver where appropriate to reduce storage duplication.
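A hedged sketch of the Bronze → Silver → Gold flow in Spark SQL with Delta, with Silver expressed as a view to avoid a third physical copy; all table, view, and field names are illustrative:

```sql
-- Bronze: raw events as ingested (append-only)
CREATE TABLE bronze_orders (
  raw_payload STRING,
  ingested_at TIMESTAMP
) USING DELTA;

-- Silver: cleaned/conformed view over Bronze (no extra storage)
CREATE VIEW silver_orders AS
SELECT
  CAST(get_json_object(raw_payload, '$.order_id') AS BIGINT)       AS order_id,
  CAST(get_json_object(raw_payload, '$.amount') AS DECIMAL(10, 2)) AS amount,
  to_date(get_json_object(raw_payload, '$.order_date'))            AS order_date
FROM bronze_orders
WHERE get_json_object(raw_payload, '$.order_id') IS NOT NULL;

-- Gold: materialized, query-optimized aggregate for BI
CREATE TABLE gold_daily_revenue USING DELTA AS
SELECT order_date, SUM(amount) AS revenue
FROM silver_orders
GROUP BY order_date;
```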
Partitioning Strategy
Best practices:
- Partition by filter-heavy columns with low-to-moderate cardinality (typically date)
- Target: 100M-1B rows per partition for optimal pruning
- Avoid: Over-partitioning (thousands of small partitions)
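A hedged example of a date-partitioned table and the query pattern that benefits from partition pruning (Delta/Spark SQL syntax, names carried over from the earlier sketch):

```sql
-- Partition only on the column most queries filter by
CREATE TABLE sales_partitioned (
  order_id   BIGINT,
  country    STRING,
  order_date DATE,
  amount     DECIMAL(10, 2)
)
USING DELTA
PARTITIONED BY (order_date);

-- The date predicate prunes whole partitions before any data files are read
SELECT country, SUM(amount)
FROM sales_partitioned
WHERE order_date BETWEEN DATE '2025-01-01' AND DATE '2025-01-31'
GROUP BY country;
```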
Performance Features
Data Skipping
How it works: OTFs maintain statistics (min/max) per file for each column.
Result: 50-90% I/O reduction for selective queries.
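For instance, a selective point lookup can be answered while opening only a handful of files (sketch, reusing the illustrative table from above):

```sql
-- Each data file records per-column min/max values in table metadata.
-- Files whose [min, max] range for order_id excludes 1234567 are never read.
SELECT * FROM sales WHERE order_id = 1234567;
```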
Z-Ordering / Clustering
Problem: Partitioning (or sorting) clusters data along only one dimension, so data skipping is effective for that one column.
Solution: Multi-dimensional clustering enables data skipping on multiple columns.
How it works:
- Interleave bits from multiple columns (Z-order curve)
- Sort data by Z-order value
- Store in sorted order
Example: Z-order on (country, date)
- Query filtering on country: skip irrelevant files
- Query filtering on date: skip irrelevant files
- Query filtering on both: even better skipping
Cost: 1.5-2x write cost for 5-10x query improvement on filtered queries.
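In Delta Lake this is exposed through OPTIMIZE; a hedged example clustering the illustrative table from above on the two filter columns:

```sql
-- Rewrite files so rows with similar (country, order_date) values are co-located
OPTIMIZE sales
ZORDER BY (country, order_date);

-- Afterwards, filters on either column (or both) can skip more files
SELECT SUM(amount)
FROM sales
WHERE country = 'DE' AND order_date >= DATE '2025-01-01';
```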
File Size Optimization
| File Size | Read Performance | Metadata Overhead | Recommendation |
|---|---|---|---|
| < 128MB | Poor (many small reads) | High | Too small |
| 128MB-1GB | Good | Manageable | Optimal |
| > 1GB | Good | Low | Good for large scans |
| > 10GB | No additional benefit | Low | Parallelization limited |
Target: Aim for 256MB-512MB Parquet files for most workloads.
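A hedged Delta Lake sketch of steering files toward that range: compact small files with OPTIMIZE and, where supported, hint the writer toward a target file size. The property name and value format are Databricks-specific assumptions; other engines expose equivalent knobs:

```sql
-- Bin-pack many small files into larger ones (compaction);
-- the default target size depends on the engine and is usually configurable.
OPTIMIZE sales;

-- Optional (Databricks Delta): steer newly written files toward ~256MB.
-- Verify the property and accepted value format for your engine.
ALTER TABLE sales SET TBLPROPERTIES ('delta.targetFileSize' = '256mb');
```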
Migration Strategy
From Data Warehouse to Lakehouse
Migration approaches:
| Approach | Timeline | Risk | Cost |
|---|---|---|---|
| Big Bang | Weeks | High | Medium (dual-write period) |
| Phased | Months | Low | High (run both systems) |
| Hybrid | Months | Medium | Medium (federation) |
Cost Analysis
Storage Cost Comparison
Scenario: 100TB data, growing 10TB/month
| Architecture | Storage | Copies | Monthly Cost |
|---|---|---|---|
| Data Warehouse | 100TB | Single copy (warehouse storage) | $2,300 |
| Data Lake + Warehouse | 200TB | Two copies (lake + warehouse) | $1,150 (lake) + $2,300 (warehouse) = $3,450 |
| Lakehouse | 100TB | Single copy (object storage) | $1,150 |
Savings: Lakehouse saves $1,150/month versus a standalone warehouse and $2,300/month versus a lake-plus-warehouse setup, i.e. $13,800-27,600/year per 100TB.
Query Cost Comparison
| Scenario | Warehouse | Lakehouse | Savings |
|---|---|---|---|
| Full table scan | $10.00 | $3.00 | 70% |
| Filtered query | $2.00 | $0.50 | 75% |
| Large aggregation | $50.00 | $15.00 | 70% |
Why: Lakehouse uses cheaper storage + flexible compute (spot instances, serverless).
Total Cost of Ownership
Typical TCO breakdown:
- Lakehouse: 40% storage, 40% compute, 15% ops, 5% vendor fees
- Warehouse: 30% storage, 40% compute, 10% ops, 20% vendor premium
Senior Level Considerations
When to Use Lakehouse
Good fit:
- Analytics + ML on same data
- Need for both batch and streaming
- Cost-sensitive workloads
- Multi-cloud strategy
- Custom processing requirements
Consider warehouse instead:
- Pure BI, no ML
- Want fully managed (no ops)
- Small data (< 10TB)
- Need premium support
Anti-Patterns
Anti-Pattern 1: Using Lakehouse as a “better Hive”
- Wrong: Just storing Parquet files without OTF
- Right: Use OTF for ACID, schema enforcement
Anti-Pattern 2: Ignoring file size optimization
- Wrong: Letting small files accumulate
- Right: Continuous compaction, target 256MB+ files
Anti-Pattern 3: Over-partitioning
- Wrong: Partition by date, hour, country, product
- Right: Partition by date, use Z-order for other columns
Anti-Pattern 4: Skipping time travel configuration
- Wrong: Leave retention at defaults and let old versions accumulate in storage and metadata
- Right: Configure retention based on compliance needs
Architecture Decision Template
When deciding on Lakehouse:
| Question | Consideration |
|---|---|
| Data volume? | > 10TB strongly favors Lakehouse |
| Workload types? | Analytics + ML = Lakehouse; Pure BI = Warehouse OK |
| Concurrency? | High concurrent writes = Delta; High partition count = Iceberg |
| Team skills? | Open-source skills = Lakehouse; SQL-only = Warehouse |
| Cost sensitivity? | Cost-sensitive = Lakehouse; Budget for premium = Warehouse |
| Multi-cloud? | Lakehouse enables portability; Warehouse = vendor lock-in |
Key Takeaways
- Lakehouse = Lake + Warehouse: Best of both worlds via OTF
- Cost savings: 40-60% by eliminating duplicate storage
- Open formats: No vendor lock-in, Parquet + OTF
- ACID transactions: Reliable, concurrent operations
- File size matters: Target 256MB-512MB to avoid small files problem
- Z-order for multi-column: Enables multi-dimensional data skipping
- Migration is gradual: Can coexist with warehouse during transition