Compute-Storage Separation
Disaggregated Architecture for Modern Data Platforms
Overview
Compute-storage separation is the foundational architecture pattern enabling modern cloud data platforms. By decoupling storage (object storage) from compute (processing engines), organizations can independently scale each layer, optimize costs, and avoid vendor lock-in.
The Architecture Shift
Legacy Architecture (Pre-2015)
Problems:
- Storage and compute must scale together
- Vendor lock-in (proprietary formats)
- High costs (premium pricing for bundled solution)
- Inflexible (can’t optimize independently)
Modern Architecture (Lakehouse)
Benefits:
- Independent scaling
- Open formats (no lock-in)
- Cost optimization (right-size each layer)
- Flexibility (choose best tool for job)
Core Concepts
Object Storage as Foundation
Why object storage:
- Object stores such as S3 provide: virtually unlimited capacity, eleven nines (99.999999999%) of designed durability, low cost per TB
- Separation enables: multiple compute engines reading the same data (see the sketch below)
- Open formats: Parquet/Iceberg ensure any engine can read the same files
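A minimal sketch of that idea: two different engines read the same Parquet files without copying them. The bucket layout and column names are hypothetical; it assumes configured AWS credentials, a Spark cluster with the S3A connector, and PyArrow built with S3 support.

```python
# Two engines reading the same Parquet files from one S3 prefix.
# Bucket path and columns are hypothetical.
import pyarrow.dataset as ds
from pyspark.sql import SparkSession

PATH = "s3://example-data-lake/silver/sales/orders/"  # hypothetical layout

# Engine 1: Spark for heavy batch processing over the files.
spark = SparkSession.builder.appName("batch").getOrCreate()
orders = spark.read.parquet(PATH.replace("s3://", "s3a://"))
orders.groupBy("order_date").count().show(5)

# Engine 2: plain PyArrow for a quick local look at the very same files.
dataset = ds.dataset(PATH, format="parquet")
print(dataset.head(5))
```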
Independent Scaling
Cost impact: for variable workloads, paying for average rather than peak capacity can reduce compute costs by roughly 50-70% (see the sketch below).
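An illustrative back-of-the-envelope calculation behind that range. Every number here (node counts, hourly rate, active fraction) is an assumption for illustration, not a benchmark or vendor price.

```python
# Coupled systems are provisioned for peak load 24/7; a separated compute
# layer can run near average load and suspend when idle. All numbers assumed.
PEAK_NODES = 30          # nodes needed at peak
AVG_NODES = 12           # nodes needed on average
NODE_COST_PER_HOUR = 2.0 # USD, assumed
HOURS_PER_MONTH = 730
ACTIVE_FRACTION = 0.8    # share of hours the separated cluster actually runs

coupled = PEAK_NODES * NODE_COST_PER_HOUR * HOURS_PER_MONTH
separated = AVG_NODES * NODE_COST_PER_HOUR * HOURS_PER_MONTH * ACTIVE_FRACTION

print(f"coupled:   ${coupled:,.0f}/mo")
print(f"separated: ${separated:,.0f}/mo")
print(f"savings:   {1 - separated / coupled:.0%}")   # ~68% with these inputs
```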
Architecture Patterns
Pattern 1: Multi-Engine Access
Benefits:
- Right tool for each workload
- No data duplication (single copy)
- Independent scaling per engine (see the sketch below)
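A minimal sketch of the pattern, assuming a Spark session already wired to a shared Iceberg catalog named `lake` (see the configuration sketch under Storage Layer Design) and a Trino cluster pointed at the same metastore. The table, host, and bucket names are hypothetical.

```python
# Pattern 1 in miniature: Spark writes one Iceberg table through a shared
# catalog; Trino answers BI queries against the very same table, no copy.
import trino  # pip install trino
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl").getOrCreate()

# Spark: heavy batch transform, written once as an Iceberg table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.daily_orders
    USING iceberg AS
    SELECT order_date, count(*) AS orders
    FROM parquet.`s3a://example-data-lake/raw/orders/`
    GROUP BY order_date
""")

# Trino: interactive BI query over the same files, via the same catalog.
conn = trino.dbapi.connect(host="trino.internal", port=8080, user="bi",
                           catalog="iceberg", schema="analytics")
cur = conn.cursor()
cur.execute("SELECT * FROM daily_orders ORDER BY order_date DESC LIMIT 7")
print(cur.fetchall())
```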
Pattern 2: Compute Isolation
Benefits:
- Isolated performance (no noisy neighbors)
- Separate cost centers
- Independent failure domains
Pattern 3: Right-Sizing by Workload
Cost optimization (a rough break-even sketch follows this list):
- Heavy workloads: Provisioned clusters (cheaper at scale)
- Light workloads: Small clusters or single machines
- Intermittent workloads: Serverless (no idle cost)
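A rough break-even sketch between the provisioned and serverless options. The hourly rates are assumptions chosen for illustration, not vendor pricing; the shape of the result is what matters.

```python
# Provisioned clusters cost money whether busy or idle; serverless charges a
# premium rate but only while running. All prices assumed.
PROVISIONED_PER_HOUR = 16.0
SERVERLESS_PER_HOUR = 40.0
HOURS_PER_MONTH = 730

def monthly_cost(busy_hours: float) -> tuple[float, float]:
    provisioned = PROVISIONED_PER_HOUR * HOURS_PER_MONTH   # always on
    serverless = SERVERLESS_PER_HOUR * busy_hours          # pay per use
    return provisioned, serverless

for busy in (50, 200, 292, 500):   # 292 h is the break-even with these rates
    p, s = monthly_cost(busy)
    winner = "serverless" if s < p else "provisioned"
    print(f"{busy:>4} busy h/mo -> provisioned ${p:,.0f}, serverless ${s:,.0f} ({winner} wins)")
```

With these assumed rates, intermittent workloads (well under ~292 busy hours a month) favor serverless, while heavy steady workloads favor a provisioned cluster, which matches the guidance above.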
Implementation Considerations
Storage Layer Design
Requirements (wired together in the sketch below):
- Object storage: S3, GCS, or Azure Data Lake
- Open formats: Parquet + OTF (Delta/Iceberg/Hudi)
- Catalog: Metastore (Glue, Hive, Unity Catalog)
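One way the three requirements come together, sketched as a Spark session configuration: S3 for objects, Iceberg over Parquet as the table format, Glue as the catalog. The catalog name and bucket are hypothetical, and it assumes the Iceberg Spark runtime and AWS bundle jars are available to the cluster.

```python
# Storage-layer wiring: object storage + open table format + shared catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")          # shared catalog
    .config("spark.sql.catalog.lake.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")               # object storage
    .config("spark.sql.catalog.lake.warehouse",
            "s3://example-data-lake/warehouse/")                # hypothetical bucket
    .getOrCreate()
)

# Tables created under this catalog are Parquet files managed by Iceberg,
# stored in S3, and discoverable by any engine that reads the same catalog.
spark.sql("CREATE DATABASE IF NOT EXISTS lake.analytics")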
Compute Layer Design
Requirements (a rough sizing heuristic follows this list):
- Engine selection: Match engine to workload
- Sizing: Right-size for workload
- Isolation: Separate clusters for separate workloads
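A rough sizing heuristic for the batch case: pick the node count from input volume, per-node throughput, and a target runtime. The throughput figure is an assumption; replace it with measurements from your own jobs.

```python
# Right-size a batch cluster from data volume and a target SLA.
import math

def size_cluster(input_tb: float,
                 gb_per_node_hour: float = 150.0,  # assumed scan+transform rate
                 target_hours: float = 2.0,
                 min_nodes: int = 2,
                 max_nodes: int = 200) -> int:
    needed = (input_tb * 1024) / (gb_per_node_hour * target_hours)
    return max(min_nodes, min(max_nodes, math.ceil(needed)))

# Heavy nightly ETL and a small interactive workload get very different clusters.
print(size_cluster(input_tb=30))    # ~103 nodes with these assumptions
print(size_cluster(input_tb=0.5))   # floors at the 2-node minimum
```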
Cost Analysis
Traditional Warehouse vs. Lakehouse
Scenario: 100TB data, variable query load
| Dimension | Traditional Warehouse | Lakehouse (Separated) |
|---|---|---|
| Storage | 100TB @ $23/TB = $2,300/mo | 100TB @ $23/TB = $2,300/mo |
| Compute (average) | 100 DWU @ $2,000/mo | Spark: $500 + Trino: $300 = $800/mo |
| Compute (peak) | Included | Scale to $2,000 for peak hours |
| Total | $4,300/mo | $3,100/mo (28% savings) |
Note: Lakehouse savings come from:
- No premium warehouse pricing
- Right-sized compute for average load
- Scale to zero when not in use
- Use spot instances (60-80% discount)
Scaling Cost Comparison
Result: in a bundled warehouse, compute spend tends to grow with data volume because compute and storage are provisioned together; in a lakehouse, storage cost grows linearly while compute is sized to query load and can stay roughly flat (modeled below).
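A toy model of that behavior, calibrated so the 100 TB row matches the table above. The assumption that warehouse compute is provisioned per TB held, and the flat lakehouse compute figure, are illustrative simplifications, not pricing.

```python
# Toy scaling model: storage grows with data everywhere; warehouse compute is
# assumed to scale with data volume, lakehouse compute with (constant) query load.
STORAGE_PER_TB = 23                # $/TB-month, as in the table above
WAREHOUSE_COMPUTE_PER_TB = 20      # assumed: compute provisioned per TB held
LAKEHOUSE_COMPUTE_FLAT = 800       # assumed: driven by query load, not volume

for tb in (100, 250, 500, 1000):
    warehouse = tb * STORAGE_PER_TB + tb * WAREHOUSE_COMPUTE_PER_TB
    lakehouse = tb * STORAGE_PER_TB + LAKEHOUSE_COMPUTE_FLAT
    print(f"{tb:>5} TB  warehouse ${warehouse:>7,}/mo   lakehouse ${lakehouse:>7,}/mo")
```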
Senior Level Considerations
When to Use Compute-Storage Separation
Ideal for:
- Variable workloads (bursty, seasonal)
- Multiple engine types (batch + interactive + streaming)
- Cost-sensitive environments
- Multi-cloud strategy
- Data science + analytics on same data
Consider alternatives when:
- Steady, predictable workloads (a provisioned warehouse may be simpler)
- Small data volumes (under ~10 TB), where a managed warehouse is simpler
- Need for a fully managed service (no ops burden)
- SQL-only users (a warehouse offers a better out-of-the-box UX)
Implementation Challenges
Challenge 1: Data Movement
- Problem: Traditional warehouse stores data internally
- Solution: Use object storage as single source of truth
- Gotcha: Avoid duplication between warehouse and lake
Challenge 2: Catalog Management
- Problem: Multiple engines need consistent catalog
- Solution: Centralized catalog (Glue, Unity, Hive); see the sketch below
- Gotcha: Catalog becomes single point of failure
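A sketch of yet another engine consuming the same centralized catalog: PyIceberg resolves the table through Glue and reads the underlying Parquet straight from S3. The table name is hypothetical, and it assumes pyiceberg[glue] is installed and AWS credentials are configured.

```python
# No Spark or Trino cluster involved: the catalog alone is enough to find
# and scan the data.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lake", **{"type": "glue"})   # shared Glue catalog
orders = catalog.load_table("analytics.orders")      # hypothetical table

arrow_table = orders.scan().to_arrow()
print(arrow_table.num_rows)
```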
Challenge 3: Security
- Problem: Object-store permissions are easy to misconfigure, and every engine needs consistent access controls
- Solution: IAM roles, bucket policies, encryption (see the sketch below)
- Gotcha: Complex IAM policies across engines
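A baseline lockdown sketch for a hypothetical lake bucket using standard boto3 calls: block public access, enforce default encryption, and deny non-TLS requests. Per-engine IAM role policies still have to be layered on top of this.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"   # hypothetical bucket

# 1. Block any form of public access.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True, "IgnorePublicAcls": True,
        "BlockPublicPolicy": True, "RestrictPublicBuckets": True,
    },
)

# 2. Encrypt everything at rest by default.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
    },
)

# 3. Refuse plaintext (non-TLS) access from any engine.
s3.put_bucket_policy(
    Bucket=BUCKET,
    Policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }),
)
```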
Challenge 4: Performance
- Problem: Network latency vs. local storage
- Solution: Colocation, caching, right-sizing
- Gotcha: Cold starts in serverless
Migration Strategy
From Warehouse to Lakehouse
Migration Approaches
| Approach | Timeline | Risk | Cost |
|---|---|---|---|
| Big Bang | Weeks | High | Low (short dual-write) |
| Phased | Months | Low | High (long dual-write) |
| Hybrid | Months | Medium | Medium (federation) |
Recommendation: Start with new workloads in lakehouse, migrate existing gradually.
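One way to run the dual-write phase of a phased or hybrid migration: each batch lands in the lakehouse table and is also pushed to the legacy warehouse over JDBC until consumers move off it. The catalog, table, and connection details below are hypothetical, and the Iceberg catalog `lake` is assumed to be configured as in the Storage Layer Design sketch.

```python
# Dual-write during migration: lakehouse is the new system of record,
# the legacy warehouse keeps receiving data until it can be retired.
from pyspark.sql import DataFrame

def dual_write(df: DataFrame, table: str) -> None:
    # New system of record: Iceberg table on object storage.
    df.writeTo(f"lake.analytics.{table}").createOrReplace()

    # Legacy consumer: existing warehouse, reached over JDBC.
    (df.write.format("jdbc")
       .option("url", "jdbc:postgresql://warehouse.internal:5432/analytics")
       .option("dbtable", table)
       .option("user", "loader")
       .option("password", "***")   # placeholder; use a secrets manager
       .mode("overwrite")
       .save())
```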
Best Practices
DO
```yaml
# 1. Use object storage as the single source of truth
storage:
  type: s3
  bucket: data-lake
  path: s3://data-lake/{tier}/{database}/{table}/

# 2. Separate compute by workload
compute:
  etl:
    engine: spark
    nodes: 100
    instance: r5.8xlarge
  bi:
    engine: trino
    nodes: 10
    instance: r5.2xlarge

# 3. Use open formats
formats:
  storage: parquet
  table_format: iceberg
```
DON’T
```yaml
# 1. Don't duplicate storage
#    Bad:  copy data into the warehouse
#    Good: query it directly from object storage

# 2. Don't use the wrong engine for the workload
#    Bad:  Spark for interactive BI
#    Good: Trino for BI, Spark for ETL

# 3. Don't ignore network costs
#    Bad:  cross-region queries
#    Good: colocate compute with data
```
Key Takeaways
- Decoupling enables flexibility: Choose best tool for each workload
- Independent scaling: Scale storage and compute independently
- Cost optimization: Right-size each layer, use spot instances
- Open formats: Parquet + OTF prevents vendor lock-in
- Multiple engines: Spark for ETL, Trino for BI, Flink for streaming
- Object storage foundation: S3/GCS/ADLS as single source of truth
- Migration is gradual: Can coexist with warehouse during transition