Lakehouse Concepts

The Architecture That Unified Data Lake and Warehouse


Overview

The Lakehouse architecture emerged between 2020 and 2025 as a paradigm shift that combines the best aspects of data lakes (low-cost storage, flexibility) with those of data warehouses (ACID transactions, schema enforcement). It is enabled by Open Table Formats (OTFs) and is now the de facto standard for modern data platforms.


Historical Context

The Evolution of Data Architecture

The Problem Lakehouse Solves

| Problem | Data Lake | Data Warehouse | Lakehouse |
| --- | --- | --- | --- |
| Storage Cost | Low ($/TB) | High (premium compute) | Low ($/TB) |
| ACID Transactions | No | Yes | Yes (via OTF) |
| Schema Enforcement | Weak | Strong | Strong |
| Data Quality | Poor (swamp risk) | Good | Good |
| BI Support | Poor | Excellent | Excellent |
| ML/AI Support | Good | Poor | Good |
| Storage Duplication | Single copy | Duplicate storage | Single copy |
| Query Performance | Variable | Excellent | Excellent |

Lakehouse Architecture

Core Components

Key Principles

  1. Unified Storage: Single copy of data in object storage
  2. Compute-Storage Separation: Independent scaling
  3. Open Formats: No vendor lock-in (Parquet + OTF)
  4. ACID Transactions: Reliable, concurrent operations
  5. Schema Enforcement: Quality at ingest
  6. BI Compatibility: Standard SQL, high performance
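
To make the principles concrete, here is a minimal sketch in Delta-Lake-flavored Spark SQL (the table name, columns, and bucket path are hypothetical): a single copy of the data lives in object storage, in an open format, with the schema enforced on every write.

CREATE TABLE sales_events (
  event_id   BIGINT,
  country    STRING,
  amount     DECIMAL(10, 2),
  event_date DATE
)
USING DELTA
LOCATION 's3://example-bucket/lakehouse/sales_events';

-- BI dashboards and ML pipelines both read this single copy; no separate warehouse load is required.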

The Open Table Format Enabler

What OTFs Provide

How OTFs Work

Traditional Hive:

table/
├── dt=2025-01-27/
│   ├── part-00001.parquet
│   ├── part-00002.parquet
│   └── ...
├── dt=2025-01-28/
│   ├── part-00001.parquet
│   └── ...
  • No transaction log
  • No atomic commits
  • No schema enforcement
  • Manual “list files” for discovery

Lakehouse with OTF:

table/
├── _delta_log/ (or metadata/)
│   ├── 000000.json
│   ├── 000001.json
│   └── ...
├── dt=2025-01-27/
│   ├── part-00001.parquet
│   └── ...
  • Transaction log tracks all changes
  • Atomic commits (all-or-nothing)
  • Schema enforcement and evolution
  • Metadata-driven queries
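
Because every change flows through the transaction log, table state can be inspected with metadata queries instead of listing files. A small sketch in Delta Lake's Spark SQL (the commands are Delta-specific; Iceberg and Hudi expose similar metadata tables, and the table name is hypothetical):

DESCRIBE HISTORY sales_events;   -- one row per committed transaction: version, timestamp, operation
DESCRIBE DETAIL sales_events;    -- current snapshot: location, format, file count, total size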

Core Capabilities

1. ACID Transactions

What it means:

  • Atomicity: Commits are all-or-nothing
  • Consistency: Schema and constraints enforced
  • Isolation: Readers don’t see partial writes
  • Durability: Committed data is permanent

Why it matters:

  • Multiple concurrent writers don’t corrupt data
  • Queries see consistent snapshots
  • Failed writes don’t leave partial data

Example scenario:

-- Without ACID (Hive):
-- Writer writes 5 files, crashes after 3
-- Reader sees partial data - INCONSISTENT
-- With ACID (Lakehouse):
-- Writer commits transaction
-- Either all 5 files visible, or none
-- Reader always sees consistent state

2. Schema Enforcement & Evolution

Schema Enforcement:

-- Table expects: col1 INT, col2 STRING
INSERT INTO table VALUES (1, 'valid'); -- Success
INSERT INTO table VALUES ('invalid', 2); -- Rejected

Schema Evolution:

-- Add column - non-breaking
ALTER TABLE table ADD COLUMN col3 STRING;
-- Drop column - breaking (needs explicit handling)
ALTER TABLE table DROP COLUMN col2;

Types of Evolution:

| Operation | Safe? | Action Required |
| --- | --- | --- |
| Add column | Yes | None |
| Rename column | No | Update downstream queries |
| Change type | No | Full rewrite or cast |
| Drop column | No | Update downstream queries |

3. Time Travel

What it means: Query previous versions of data.

Use cases:

  • Debugging data issues
  • ML model reproducibility
  • Audit and compliance
  • Rollback from bad data

Example:

-- Current data
SELECT * FROM table;
-- Data as of a specific version
SELECT * FROM table VERSION AS OF 12345;
-- Data as of a specific point in time
SELECT * FROM table TIMESTAMP AS OF '2025-01-26';
-- Version history (Iceberg metadata table; Delta uses DESCRIBE HISTORY)
SELECT * FROM table.history;

Cost Implication: Time travel retains old versions. Typical retention: 30 days. Long retention increases storage costs.
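
Retention is usually a table-level setting. A sketch using Delta Lake table properties (property names and defaults are Delta-specific; Iceberg and Hudi use different knobs, and the table name is hypothetical):

ALTER TABLE sales_events SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 30 days',          -- how far back version history is kept
  'delta.deletedFileRetentionDuration' = 'interval 7 days'    -- how long removed data files stay available for time travel
);

-- Physically delete data files outside the retention window to reclaim storage.
VACUUM sales_events;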

4. Incremental Updates

Problem: In data lakes, updating a single record required rewriting the entire partition.

Solution: OTFs support efficient MERGE, UPDATE, DELETE operations.

MERGE INTO target
USING updates
ON target.id = updates.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Performance comparison:

| Operation | Hive (Rewrite) | Lakehouse (Merge) |
| --- | --- | --- |
| Update 1M rows in 1TB | Rewrite 1TB (~30 min, $50) | Update 1M rows (~5 min, $5) |

5. Concurrency Control

Challenge: Multiple readers/writers accessing same data.

Solution: Optimistic concurrency control.

  • Writers assume no conflicts
  • On commit, detect conflicts
  • Retry if conflict occurred

Implementation varies:

  • Delta Lake: Fine-grained, high concurrency
  • Iceberg: Optimistic, metadata-level locking
  • Hudi: Optimistic, with external lock providers (ZooKeeper, Hive Metastore)

Data Layout & Organization

Medallion Architecture

Cost consideration: 2-3x storage for multiple layers, justified by:

  • Faster queries (Gold is optimized)
  • Reduced reprocessing (replay from Bronze)
  • Better data quality (errors caught upstream)

Optimization: Use views for Silver where appropriate to reduce storage duplication.
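
A minimal sketch of the three layers in Delta-flavored Spark SQL (all table, view, and column names are hypothetical), including a view-based Silver layer as suggested above:

-- Bronze: raw payloads as ingested, append-only, kept for replay
CREATE TABLE bronze_orders (raw_payload STRING, ingest_ts TIMESTAMP) USING DELTA;

-- Silver: cleaned and typed; a view avoids duplicating storage when the transformation is cheap
CREATE VIEW silver_orders AS
SELECT CAST(get_json_object(raw_payload, '$.order_id') AS BIGINT)        AS order_id,
       CAST(get_json_object(raw_payload, '$.amount')   AS DECIMAL(10,2)) AS amount,
       ingest_ts
FROM bronze_orders
WHERE get_json_object(raw_payload, '$.order_id') IS NOT NULL;

-- Gold: aggregated, query-optimized table for BI
CREATE TABLE gold_daily_revenue USING DELTA AS
SELECT DATE(ingest_ts) AS dt, SUM(amount) AS revenue
FROM silver_orders
GROUP BY DATE(ingest_ts);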

Partitioning Strategy

Best practices:

  • Partition by filter-heavy, low-to-moderate-cardinality columns (typically date)
  • Target: 100M-1B rows per partition for optimal pruning
  • Avoid: Over-partitioning (thousands of small partitions)
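
A sketch of a date-partitioned table in Delta-flavored Spark SQL (names are hypothetical):

CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  country STRING,
  dt      DATE
)
USING DELTA
PARTITIONED BY (dt);   -- a single filter-heavy, low-cardinality column; resist adding hour/country/product

-- Filters on the partition column read only the matching partitions:
SELECT COUNT(*) FROM page_views WHERE dt = DATE '2025-01-27';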

Performance Features

Data Skipping

How it works: OTFs maintain statistics (min/max) per file for each column.

Result: 50-90% I/O reduction for selective queries.
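
A hedged sketch of what those statistics look like, using Iceberg's files metadata table (Iceberg-flavored Spark SQL; Delta stores equivalent per-file stats in its transaction log, and the table name is hypothetical):

-- Per-file row counts and per-column lower/upper bounds:
SELECT file_path, record_count, lower_bounds, upper_bounds
FROM analytics.events.files;

-- A selective predicate is compared against those bounds, and files whose
-- min/max range cannot contain matching rows are never read:
SELECT COUNT(*) FROM analytics.events WHERE user_id = 4221;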

Z-Ordering / Clustering

Problem: Partitioning organizes data along a single dimension, so data skipping is effective only for filters on that one column.

Solution: Multi-dimensional clustering enables data skipping on multiple columns.

How it works:

  1. Interleave bits from multiple columns (Z-order curve)
  2. Sort data by Z-order value
  3. Store in sorted order

Example: Z-order on (country, date)

  • Query filtering on country: skip irrelevant files
  • Query filtering on date: skip irrelevant files
  • Query filtering on both: even better skipping

Cost: 1.5-2x write cost for 5-10x query improvement on filtered queries.
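
A sketch using Delta Lake's OPTIMIZE ... ZORDER BY (Iceberg achieves a similar effect with a table sort order, Hudi with clustering; the table is hypothetical and not partitioned by either column):

OPTIMIZE events
ZORDER BY (country, event_date);

-- After clustering, filters on either column, or both, can skip most files:
SELECT COUNT(*) FROM events
WHERE country = 'DE' AND event_date >= DATE '2025-01-01';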

File Size Optimization

| File Size | Read Performance | Metadata Overhead | Recommendation |
| --- | --- | --- | --- |
| < 128MB | Poor (many small reads) | High | Too small |
| 128MB-1GB | Good | Manageable | Optimal |
| > 1GB | Good | Low | Good for large scans |
| > 10GB | No additional benefit | Low | Parallelization limited |

Target: Aim for 256MB-512MB Parquet files for most workloads.
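
Compaction is typically scheduled rather than left to chance. A sketch of the two common forms (table and catalog names are hypothetical):

-- Delta Lake: bin-pack small files into larger ones
OPTIMIZE page_views;

-- Iceberg (Spark procedure): rewrite small files, targeting ~512MB outputs
CALL my_catalog.system.rewrite_data_files(
  table   => 'analytics.page_views',
  options => map('target-file-size-bytes', '536870912')
);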


Migration Strategy

From Data Warehouse to Lakehouse

Migration approaches:

| Approach | Timeline | Risk | Cost |
| --- | --- | --- | --- |
| Big Bang | Weeks | High | Medium (dual-write period) |
| Phased | Months | Low | High (run both systems) |
| Hybrid | Months | Medium | Medium (federation) |

Cost Analysis

Storage Cost Comparison

Scenario: 100TB data, growing 10TB/month

| Architecture | Storage | Duplication | Monthly Cost |
| --- | --- | --- | --- |
| Data Warehouse | 100TB | Warehouse storage | $2,300 |
| Data Lake + Warehouse | 200TB | Both | $1,150 (lake) + $2,300 (warehouse) = $3,450 |
| Lakehouse | 100TB | Single copy | $1,150 |

Savings: Lakehouse saves $1,150-2,300/month ($13,800-27,600/year) per 100TB.

Query Cost Comparison

| Scenario | Warehouse | Lakehouse | Savings |
| --- | --- | --- | --- |
| Full table scan | $10.00 | $3.00 | 70% |
| Filtered query | $2.00 | $0.50 | 75% |
| Large aggregation | $50.00 | $15.00 | 70% |

Why: Lakehouse uses cheaper storage + flexible compute (spot instances, serverless).

Total Cost of Ownership

Typical TCO breakdown:

  • Lakehouse: 40% storage, 40% compute, 15% ops, 5% vendor fees
  • Warehouse: 30% storage, 40% compute, 10% ops, 20% vendor premium

Senior Level Considerations

When to Use Lakehouse

Good fit:

  • Analytics + ML on same data
  • Need for both batch and streaming
  • Cost-sensitive workloads
  • Multi-cloud strategy
  • Custom processing requirements

Consider warehouse instead:

  • Pure BI, no ML
  • Want fully managed (no ops)
  • Small data (< 10TB)
  • Need premium support

Anti-Patterns

Anti-Pattern 1: Using Lakehouse as a “better Hive”

  • Wrong: Just storing Parquet files without OTF
  • Right: Use OTF for ACID, schema enforcement

Anti-Pattern 2: Ignoring file size optimization

  • Wrong: Letting small files accumulate
  • Right: Continuous compaction, target 256MB+ files

Anti-Pattern 3: Over-partitioning

  • Wrong: Partition by date, hour, country, product
  • Right: Partition by date, use Z-order for other columns

Anti-Pattern 4: Skipping time travel configuration

  • Wrong: Keep default retention and let old snapshots accumulate, inflating storage and table metadata
  • Right: Configure retention based on compliance needs

Architecture Decision Template

When deciding on Lakehouse:

| Question | Consideration |
| --- | --- |
| Data volume? | > 10TB strongly favors Lakehouse |
| Workload types? | Analytics + ML = Lakehouse; pure BI = Warehouse OK |
| Concurrency? | High concurrent writes = Delta; high partition count = Iceberg |
| Team skills? | Open-source skills = Lakehouse; SQL-only = Warehouse |
| Cost sensitivity? | Cost-sensitive = Lakehouse; budget for premium = Warehouse |
| Multi-cloud? | Lakehouse enables portability; Warehouse = vendor lock-in |

Key Takeaways

  1. Lakehouse = Lake + Warehouse: Best of both worlds via OTF
  2. Cost savings: 40-60% by eliminating duplicate storage
  3. Open formats: No vendor lock-in, Parquet + OTF
  4. ACID transactions: Reliable, concurrent operations
  5. File size matters: Target 256MB-512MB to avoid small files problem
  6. Z-order for multi-column: Enables multi-dimensional data skipping
  7. Migration is gradual: Can coexist with warehouse during transition

Next: Open Table Formats Comparison