
Compute-Storage Separation

Disaggregated Architecture for Modern Data Platforms


Overview

Compute-storage separation is the foundational architecture pattern enabling modern cloud data platforms. By decoupling storage (object storage) from compute (processing engines), organizations can independently scale each layer, optimize costs, and avoid vendor lock-in.


The Architecture Shift

Legacy Architecture (Pre-2015)

Problems:

  • Storage and compute must scale together
  • Vendor lock-in (proprietary formats)
  • High costs (premium pricing for bundled solution)
  • Inflexible (can’t optimize independently)

Modern Architecture (Lakehouse)

Benefits:

  • Independent scaling
  • Open formats (no lock-in)
  • Cost optimization (right-size each layer)
  • Flexibility (choose best tool for job)

Core Concepts

Object Storage as Foundation

Why object storage:

  • S3 and its peers provide: virtually unlimited capacity, eleven-nines (99.999999999%) designed durability, low per-GB cost
  • Separation enables: Multiple compute engines accessing same data
  • Open formats: Parquet/Iceberg ensure compatibility
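The "multiple engines, one copy" idea can be sketched in a few lines. Everything below is an illustrative toy, not a real API: `ObjectStore` stands in for S3/GCS/ADLS, and the two engine classes stand in for a batch engine (Spark-like) and an interactive engine (Trino-like) reading the same object.

```python
# Toy sketch of compute-storage separation: one storage layer,
# several independent "engines" reading the same objects.
# All names are illustrative stand-ins, not a real library.

class ObjectStore:
    """Stand-in for S3/GCS/ADLS: a flat key -> bytes namespace."""
    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

class BatchEngine:
    """Stand-in for a Spark-like engine: full scans of objects."""
    def __init__(self, store: ObjectStore):
        self.store = store

    def count_rows(self, key: str) -> int:
        return len(self.store.get(key).decode().splitlines())

class InteractiveEngine:
    """Stand-in for a Trino-like engine: quick lookups on the same copy."""
    def __init__(self, store: ObjectStore):
        self.store = store

    def first_row(self, key: str) -> str:
        return self.store.get(key).decode().splitlines()[0]

store = ObjectStore()
store.put("lake/sales/part-0.csv", b"order_id,amount\n1,10\n2,25\n")

# Two engines, zero copies: both read the single object in `store`.
spark_like = BatchEngine(store)
trino_like = InteractiveEngine(store)
print(spark_like.count_rows("lake/sales/part-0.csv"))  # -> 3 (header + 2 rows)
print(trino_like.first_row("lake/sales/part-0.csv"))   # -> order_id,amount
```

In a real deployment the open file/table formats (Parquet, Iceberg) play the role of the shared `bytes` here: a stable on-disk contract every engine can parse.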

Independent Scaling

Cost impact: For variable workloads, separating compute from storage can reduce compute costs by 50-70%, because clusters can shrink or shut down entirely when idle.
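A back-of-envelope model shows where a figure in that range comes from. The hourly rate and usage hours below are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope model for the 50-70% savings claim: a bursty
# workload that needs full capacity only a few hours per day.
# All numbers are illustrative assumptions.

hourly_rate = 10.0       # $/hour for a full-size cluster (assumed)
hours_per_month = 730

# Coupled architecture: the cluster runs 24/7 regardless of load.
coupled_cost = hourly_rate * hours_per_month

# Separated architecture: full-size cluster 6 h/day for batch ETL,
# plus a quarter-size cluster 8 h/day for interactive queries.
separated_cost = (hourly_rate * 6 * 30) + (hourly_rate * 0.25 * 8 * 30)

savings = 1 - separated_cost / coupled_cost
print(f"coupled: ${coupled_cost:,.0f}/mo, separated: ${separated_cost:,.0f}/mo")
print(f"savings: {savings:.0%}")  # -> 67%
```

The steadier the workload, the smaller this gap gets, which is why the "when to use" guidance later in this page points steady workloads back toward a warehouse.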


Architecture Patterns

Pattern 1: Multi-Engine Access

Benefits:

  • Right tool for each workload
  • No data duplication (single copy)
  • Independent scaling per engine

Pattern 2: Compute Isolation

Benefits:

  • Isolated performance (no noisy neighbors)
  • Separate cost centers
  • Independent failure domains

Pattern 3: Right-Sizing by Workload

Cost optimization:

  • Heavy workloads: Provisioned clusters (cheaper at scale)
  • Light workloads: Small clusters or single machines
  • Intermittent workloads: Serverless (no idle cost)
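The right-sizing rules above can be encoded as a simple decision helper. This is a hypothetical heuristic for illustration; real sizing decisions also weigh SLAs, data volume, and concurrency:

```python
# Hypothetical helper encoding the right-sizing rules above.
# Thresholds are illustrative, not recommendations.

def choose_compute(avg_hours_per_day: float, peak_to_avg: float) -> str:
    """Map a workload profile to a compute model."""
    if avg_hours_per_day >= 12 and peak_to_avg < 2:
        return "provisioned"    # heavy and steady: cheaper at scale
    if avg_hours_per_day < 1:
        return "serverless"     # intermittent: no idle cost
    return "small-cluster"      # light but regular

print(choose_compute(20, 1.2))  # -> provisioned
print(choose_compute(0.5, 5))   # -> serverless
print(choose_compute(4, 2))     # -> small-cluster
```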

Implementation Considerations

Storage Layer Design

Requirements:

  1. Object storage: S3, GCS, or Azure Data Lake
  2. Open formats: Parquet files plus an open table format (Delta Lake, Iceberg, or Hudi)
  3. Catalog: Metastore (Glue, Hive, Unity Catalog)
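The catalog's job can be reduced to one idea: a shared mapping from logical table names to physical locations and formats, so every engine resolves the same files. The dict below is a deliberately minimal sketch; real catalogs (Glue, Hive, Unity Catalog) also track schemas, partitions, snapshots, and permissions:

```python
# Minimal sketch of the catalog layer: logical table name -> physical
# location + format. Names and paths here are illustrative.

catalog = {
    "sales.orders": {
        "location": "s3://data-lake/gold/sales/orders/",
        "format": "iceberg",
    },
}

def resolve(table: str) -> str:
    """Every engine queries the same catalog, so every engine
    reads the same underlying files."""
    return catalog[table]["location"]

print(resolve("sales.orders"))  # -> s3://data-lake/gold/sales/orders/
```

This shared mapping is also why the catalog becomes a single point of failure, a gotcha called out under Implementation Challenges below.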

Compute Layer Design

Requirements:

  1. Engine selection: Match engine to workload
  2. Sizing: Right-size for workload
  3. Isolation: Separate clusters for separate workloads

Cost Analysis

Traditional Warehouse vs. Lakehouse

Scenario: 100TB data, variable query load

Dimension         | Traditional Warehouse      | Lakehouse (Separated)
Storage           | 100TB @ $23/TB = $2,300/mo | 100TB @ $23/TB = $2,300/mo
Compute (average) | 100 DWU @ $2,000/mo        | Spark: $500 + Trino: $300 = $800/mo
Compute (peak)    | Included                   | Scale up to $2,000 during peak hours
Total             | $4,300/mo                  | $3,100/mo (28% savings)
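The totals follow directly from the scenario's assumed prices:

```python
# Reproducing the scenario's arithmetic (prices are the table's assumptions).
storage = 100 * 23                       # 100 TB at $23/TB, same in both

warehouse_total = storage + 2000         # $2,000/mo bundled warehouse compute
lakehouse_total = storage + 500 + 300    # right-sized Spark + Trino

savings = 1 - lakehouse_total / warehouse_total
print(warehouse_total, lakehouse_total)  # -> 4300 3100
print(f"{savings:.0%}")                  # -> 28%
```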

Note: Lakehouse savings come from:

  • No premium warehouse pricing
  • Right-sized compute for average load
  • Scale to zero when not in use
  • Use spot instances (60-80% discount)

Scaling Cost Comparison

Result: As data grows, warehouse compute costs grow with it, because compute is bundled with storage. Lakehouse compute costs stay roughly flat, because compute is sized to query load rather than data volume.


Senior Level Considerations

When to Use Compute-Storage Separation

Ideal for:

  • Variable workloads (bursty, seasonal)
  • Multiple engine types (batch + interactive + streaming)
  • Cost-sensitive environments
  • Multi-cloud strategy
  • Data science + analytics on same data

Consider alternatives when:

  • Steady, predictable workloads (a warehouse may be simpler to operate)
  • Small data (< 10TB), where a managed warehouse is usually simpler
  • A fully managed, zero-ops service is required
  • SQL-only users (a warehouse offers a better out-of-the-box UX)

Implementation Challenges

Challenge 1: Data Movement

  • Problem: Traditional warehouse stores data internally
  • Solution: Use object storage as single source of truth
  • Gotcha: Avoid duplication between warehouse and lake

Challenge 2: Catalog Management

  • Problem: Multiple engines need consistent catalog
  • Solution: Centralized catalog (Glue, Unity, Hive)
  • Gotcha: Catalog becomes single point of failure

Challenge 3: Security

  • Problem: Object storage sits outside the engines, so access control is easy to misconfigure
  • Solution: IAM roles, bucket policies, encryption
  • Gotcha: Complex IAM policies across engines

Challenge 4: Performance

  • Problem: Network latency vs. local storage
  • Solution: Colocation, caching, right-sizing
  • Gotcha: Cold starts in serverless
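The caching mitigation amounts to memoizing remote reads so repeated scans of hot objects skip the network. A minimal sketch, where `fetch_from_store` is a stand-in for a real S3/GCS GET:

```python
# Sketch of read-through caching over an object store.
# `fetch_from_store` is a hypothetical stand-in for a remote GET;
# the counter makes the cache effect visible.
from functools import lru_cache

NETWORK_CALLS = 0

def fetch_from_store(key: str) -> bytes:
    """Pretend remote GET; counts calls."""
    global NETWORK_CALLS
    NETWORK_CALLS += 1
    return f"data-for-{key}".encode()

@lru_cache(maxsize=1024)
def read_object(key: str) -> bytes:
    return fetch_from_store(key)

for _ in range(100):
    read_object("lake/dim/customers.parquet")

print(NETWORK_CALLS)  # -> 1: only the first read hit the "network"
```

Production engines do the same thing at a larger scale (local SSD caches, Alluxio-style cache layers), trading a little cluster-local storage for far fewer round trips to the object store.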

Migration Strategy

From Warehouse to Lakehouse

Migration Approaches

Approach | Timeline | Risk   | Cost
Big Bang | Weeks    | High   | Low (short dual-write period)
Phased   | Months   | Low    | High (long dual-write period)
Hybrid   | Months   | Medium | Medium (federation)

Recommendation: Start with new workloads in lakehouse, migrate existing gradually.


Best Practices

DO

# 1. Use object storage as single source of truth
storage:
  type: s3
  bucket: data-lake
  path: s3://data-lake/{tier}/{database}/{table}/

# 2. Separate compute by workload
compute:
  etl:
    engine: spark
    nodes: 100
    instance: r5.8xlarge
  bi:
    engine: trino
    nodes: 10
    instance: r5.2xlarge

# 3. Use open formats
formats:
  storage: parquet
  table_format: iceberg

DON’T

# 1. Don't duplicate storage
# Bad: Copy to warehouse
# Good: Query directly from object storage
# 2. Don't use wrong engine for workload
# Bad: Use Spark for interactive BI
# Good: Use Trino for BI, Spark for ETL
# 3. Don't ignore network costs
# Bad: Cross-region queries
# Good: Colocate compute with data

Key Takeaways

  1. Decoupling enables flexibility: Choose best tool for each workload
  2. Independent scaling: Scale storage and compute independently
  3. Cost optimization: Right-size each layer, use spot instances
  4. Open formats: Parquet plus an open table format (Iceberg, Delta, Hudi) prevents vendor lock-in
  5. Multiple engines: Spark for ETL, Trino for BI, Flink for streaming
  6. Object storage foundation: S3/GCS/ADLS as single source of truth
  7. Migration is gradual: Can coexist with warehouse during transition

Next: Module 2: Computing & Processing