Compute-Storage Separation
Disaggregated Architecture for Modern Data Platforms
Overview
Compute-storage separation is the foundational architecture pattern enabling modern cloud data platforms. By decoupling storage (object storage) from compute (processing engines), organizations can independently scale each layer, optimize costs, and avoid vendor lock-in.
The Architecture Shift
Legacy Architecture (Pre-2015)
Problems:
- Storage and compute must scale together
- Vendor lock-in (proprietary formats)
- High costs (premium pricing for bundled solution)
- Inflexible (can’t optimize independently)
Modern Architecture (Lakehouse)
Benefits:
- Independent scaling
- Open formats (no lock-in)
- Cost optimization (right-size each layer)
- Flexibility (choose best tool for job)
Core Concepts
Object Storage as Foundation
Why object storage:
- Object stores such as S3 provide: virtually unlimited capacity, eleven nines (99.999999999%) of designed durability, low cost per TB
- Separation enables: multiple compute engines reading the same data (see the sketch below)
- Open formats: Parquet/Iceberg ensure any engine can read the same files
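A minimal sketch of that idea: two different engines read the same Parquet files without copying them. The bucket layout and column names are hypothetical; it assumes configured AWS credentials, a Spark cluster with the S3A connector, and PyArrow built with S3 support.

```python
# Two engines reading the same Parquet files from one S3 prefix.
# Bucket path and columns are hypothetical.
import pyarrow.dataset as ds
from pyspark.sql import SparkSession

PATH = "s3://example-data-lake/silver/sales/orders/"  # hypothetical layout

# Engine 1: Spark for heavy batch processing over the files.
spark = SparkSession.builder.appName("batch").getOrCreate()
orders = spark.read.parquet(PATH.replace("s3://", "s3a://"))
orders.groupBy("order_date").count().show(5)

# Engine 2: plain PyArrow for a quick local look at the very same files.
dataset = ds.dataset(PATH, format="parquet")
print(dataset.head(5))
```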
Independent Scaling
Cost impact: for variable workloads, paying for average rather than peak capacity can reduce compute costs by roughly 50-70% (see the sketch below).
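An illustrative back-of-the-envelope calculation behind that range. Every number here (node counts, hourly rate, active fraction) is an assumption for illustration, not a benchmark or vendor price.

```python
# Coupled systems are provisioned for peak load 24/7; a separated compute
# layer can run near average load and suspend when idle. All numbers assumed.
PEAK_NODES = 30          # nodes needed at peak
AVG_NODES = 12           # nodes needed on average
NODE_COST_PER_HOUR = 2.0 # USD, assumed
HOURS_PER_MONTH = 730
ACTIVE_FRACTION = 0.8    # share of hours the separated cluster actually runs

coupled = PEAK_NODES * NODE_COST_PER_HOUR * HOURS_PER_MONTH
separated = AVG_NODES * NODE_COST_PER_HOUR * HOURS_PER_MONTH * ACTIVE_FRACTION

print(f"coupled:   ${coupled:,.0f}/mo")
print(f"separated: ${separated:,.0f}/mo")
print(f"savings:   {1 - separated / coupled:.0%}")   # ~68% with these inputs
```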
Architecture Patterns
Pattern 1: Multi-Engine Access
Benefits:
- Right tool for each workload
- No data duplication (single copy)
- Independent scaling per engine (see the sketch below)
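A minimal sketch of the pattern, assuming a Spark session already wired to a shared Iceberg catalog named `lake` (see the configuration sketch under Storage Layer Design) and a Trino cluster pointed at the same metastore. The table, host, and bucket names are hypothetical.

```python
# Pattern 1 in miniature: Spark writes one Iceberg table through a shared
# catalog; Trino answers BI queries against the very same table, no copy.
import trino  # pip install trino
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl").getOrCreate()

# Spark: heavy batch transform, written once as an Iceberg table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.daily_orders
    USING iceberg AS
    SELECT order_date, count(*) AS orders
    FROM parquet.`s3a://example-data-lake/raw/orders/`
    GROUP BY order_date
""")

# Trino: interactive BI query over the same files, via the same catalog.
conn = trino.dbapi.connect(host="trino.internal", port=8080, user="bi",
                           catalog="iceberg", schema="analytics")
cur = conn.cursor()
cur.execute("SELECT * FROM daily_orders ORDER BY order_date DESC LIMIT 7")
print(cur.fetchall())
```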
Pattern 2: Compute Isolation
Benefits:
- Isolated performance (no noisy neighbors)
- Separate cost centers
- Independent failure domains
Pattern 3: Right-Sizing by Workload
Cost optimization (a rough break-even sketch follows this list):
- Heavy workloads: Provisioned clusters (cheaper at scale)
- Light workloads: Small clusters or single machines
- Intermittent workloads: Serverless (no idle cost)
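A rough break-even sketch between the provisioned and serverless options. The hourly rates are assumptions chosen for illustration, not vendor pricing; the shape of the result is what matters.

```python
# Provisioned clusters cost money whether busy or idle; serverless charges a
# premium rate but only while running. All prices assumed.
PROVISIONED_PER_HOUR = 16.0
SERVERLESS_PER_HOUR = 40.0
HOURS_PER_MONTH = 730

def monthly_cost(busy_hours: float) -> tuple[float, float]:
    provisioned = PROVISIONED_PER_HOUR * HOURS_PER_MONTH   # always on
    serverless = SERVERLESS_PER_HOUR * busy_hours          # pay per use
    return provisioned, serverless

for busy in (50, 200, 292, 500):   # 292 h is the break-even with these rates
    p, s = monthly_cost(busy)
    winner = "serverless" if s < p else "provisioned"
    print(f"{busy:>4} busy h/mo -> provisioned ${p:,.0f}, serverless ${s:,.0f} ({winner} wins)")
```

With these assumed rates, intermittent workloads (well under ~292 busy hours a month) favor serverless, while heavy steady workloads favor a provisioned cluster, which matches the guidance above.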
Implementation Considerations
Storage Layer Design
Requirements (wired together in the sketch below):
- Object storage: S3, GCS, or Azure Data Lake
- Open formats: Parquet + OTF (Delta/Iceberg/Hudi)
- Catalog: Metastore (Glue, Hive, Unity Catalog)
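One way the three requirements come together, sketched as a Spark session configuration: S3 for objects, Iceberg over Parquet as the table format, Glue as the catalog. The catalog name and bucket are hypothetical, and it assumes the Iceberg Spark runtime and AWS bundle jars are available to the cluster.

```python
# Storage-layer wiring: object storage + open table format + shared catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")          # shared catalog
    .config("spark.sql.catalog.lake.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")               # object storage
    .config("spark.sql.catalog.lake.warehouse",
            "s3://example-data-lake/warehouse/")                # hypothetical bucket
    .getOrCreate()
)

# Tables created under this catalog are Parquet files managed by Iceberg,
# stored in S3, and discoverable by any engine that reads the same catalog.
spark.sql("CREATE DATABASE IF NOT EXISTS lake.analytics")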
Compute Layer Design
Requirements (a rough sizing heuristic follows this list):
- Engine selection: Match engine to workload
- Sizing: Right-size for workload
- Isolation: Separate clusters for separate workloads
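A rough sizing heuristic for the batch case: pick the node count from input volume, per-node throughput, and a target runtime. The throughput figure is an assumption; replace it with measurements from your own jobs.

```python
# Right-size a batch cluster from data volume and a target SLA.
import math

def size_cluster(input_tb: float,
                 gb_per_node_hour: float = 150.0,  # assumed scan+transform rate
                 target_hours: float = 2.0,
                 min_nodes: int = 2,
                 max_nodes: int = 200) -> int:
    needed = (input_tb * 1024) / (gb_per_node_hour * target_hours)
    return max(min_nodes, min(max_nodes, math.ceil(needed)))

# Heavy nightly ETL and a small interactive workload get very different clusters.
print(size_cluster(input_tb=30))    # ~103 nodes with these assumptions
print(size_cluster(input_tb=0.5))   # floors at the 2-node minimum
```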
Cost Analysis
Traditional Warehouse vs. Lakehouse
Scenario: 100TB data, variable query load
| Dimension | Traditional Warehouse | Lakehouse (Separated) |
|---|---|---|
| Storage | 100TB @ $23/TB = $2,300/mo | 100TB @ $23/TB = $2,300/mo |
| Compute (average) | 100 DWU @ $2,000/mo | Spark: $500 + Trino: $300 = $800/mo |
| Compute (peak) | Included | Scale to $2,000 for peak hours |
| Total | $4,300/mo | $3,100/mo (28% savings) |
Note: Lakehouse savings come from:
- No premium warehouse pricing
- Right-sized compute for average load
- Scale to zero when not in use
- Use spot instances (60-80% discount)
Scaling Cost Comparison
Result: in a bundled warehouse, compute spend tends to grow with data volume because compute and storage are provisioned together; in a lakehouse, storage cost grows linearly while compute is sized to query load and can stay roughly flat (modeled below).
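A toy model of that behavior, calibrated so the 100 TB row matches the table above. The assumption that warehouse compute is provisioned per TB held, and the flat lakehouse compute figure, are illustrative simplifications, not pricing.

```python
# Toy scaling model: storage grows with data everywhere; warehouse compute is
# assumed to scale with data volume, lakehouse compute with (constant) query load.
STORAGE_PER_TB = 23                # $/TB-month, as in the table above
WAREHOUSE_COMPUTE_PER_TB = 20      # assumed: compute provisioned per TB held
LAKEHOUSE_COMPUTE_FLAT = 800       # assumed: driven by query load, not volume

for tb in (100, 250, 500, 1000):
    warehouse = tb * STORAGE_PER_TB + tb * WAREHOUSE_COMPUTE_PER_TB
    lakehouse = tb * STORAGE_PER_TB + LAKEHOUSE_COMPUTE_FLAT
    print(f"{tb:>5} TB  warehouse ${warehouse:>7,}/mo   lakehouse ${lakehouse:>7,}/mo")
```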
Senior Level Considerations
When to Use Compute-Storage Separation
Ideal for:
- Variable workloads (bursty, seasonal)
- Multiple engine types (batch + interactive + streaming)
- Cost-sensitive environments
- Multi-cloud strategy
- Data science + analytics on same data
Consider alternatives when:
- Steady, predictable workloads (a provisioned warehouse may be simpler)
- Small data volumes (under ~10 TB), where a managed warehouse is simpler
- Need for a fully managed service (no ops burden)
- SQL-only users (a warehouse offers a better out-of-the-box UX)
Implementation Challenges
Challenge 1: Data Movement
- Problem: Traditional warehouse stores data internally
- Solution: Use object storage as single source of truth
- Gotcha: Avoid duplication between warehouse and lake
Challenge 2: Catalog Management
- Problem: Multiple engines need consistent catalog
- Solution: Centralized catalog (Glue, Unity, Hive); see the sketch below
- Gotcha: Catalog becomes single point of failure
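A sketch of yet another engine consuming the same centralized catalog: PyIceberg resolves the table through Glue and reads the underlying Parquet straight from S3. The table name is hypothetical, and it assumes pyiceberg[glue] is installed and AWS credentials are configured.

```python
# No Spark or Trino cluster involved: the catalog alone is enough to find
# and scan the data.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lake", **{"type": "glue"})   # shared Glue catalog
orders = catalog.load_table("analytics.orders")      # hypothetical table

arrow_table = orders.scan().to_arrow()
print(arrow_table.num_rows)
```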
Challenge 3: Security
- Problem: Object-store permissions are easy to misconfigure, and every engine needs consistent access controls
- Solution: IAM roles, bucket policies, encryption (see the sketch below)
- Gotcha: Complex IAM policies across engines
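A baseline lockdown sketch for a hypothetical lake bucket using standard boto3 calls: block public access, enforce default encryption, and deny non-TLS requests. Per-engine IAM role policies still have to be layered on top of this.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"   # hypothetical bucket

# 1. Block any form of public access.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True, "IgnorePublicAcls": True,
        "BlockPublicPolicy": True, "RestrictPublicBuckets": True,
    },
)

# 2. Encrypt everything at rest by default.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
    },
)

# 3. Refuse plaintext (non-TLS) access from any engine.
s3.put_bucket_policy(
    Bucket=BUCKET,
    Policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }),
)
```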
Challenge 4: Performance
- Problem: Network latency vs. local storage
- Solution: Colocation, caching, right-sizing
- Gotcha: Cold starts in serverless
Migration Strategy
From Warehouse to Lakehouse
Migration Approaches
| Approach | Timeline | Risk | Cost |
|---|---|---|---|
| Big Bang | Weeks | High | Low (short dual-write) |
| Phased | Months | Low | High (long dual-write) |
| Hybrid | Months | Medium | Medium (federation) |
Recommendation: Start with new workloads in lakehouse, migrate existing gradually.
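One way to run the dual-write phase of a phased or hybrid migration: each batch lands in the lakehouse table and is also pushed to the legacy warehouse over JDBC until consumers move off it. The catalog, table, and connection details below are hypothetical, and the Iceberg catalog `lake` is assumed to be configured as in the Storage Layer Design sketch.

```python
# Dual-write during migration: lakehouse is the new system of record,
# the legacy warehouse keeps receiving data until it can be retired.
from pyspark.sql import DataFrame

def dual_write(df: DataFrame, table: str) -> None:
    # New system of record: Iceberg table on object storage.
    df.writeTo(f"lake.analytics.{table}").createOrReplace()

    # Legacy consumer: existing warehouse, reached over JDBC.
    (df.write.format("jdbc")
       .option("url", "jdbc:postgresql://warehouse.internal:5432/analytics")
       .option("dbtable", table)
       .option("user", "loader")
       .option("password", "***")   # placeholder; use a secrets manager
       .mode("overwrite")
       .save())
```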
Best Practices
DO
```yaml
# 1. Use object storage as the single source of truth
storage:
  type: s3
  bucket: data-lake
  path: s3://data-lake/{tier}/{database}/{table}/

# 2. Separate compute by workload
compute:
  etl:
    engine: spark
    nodes: 100
    instance: r5.8xlarge
  bi:
    engine: trino
    nodes: 10
    instance: r5.2xlarge

# 3. Use open formats
formats:
  storage: parquet
  table_format: iceberg
```
DON’T
```yaml
# 1. Don't duplicate storage
#    Bad:  copy data into the warehouse
#    Good: query it directly from object storage

# 2. Don't use the wrong engine for the workload
#    Bad:  Spark for interactive BI
#    Good: Trino for BI, Spark for ETL

# 3. Don't ignore network costs
#    Bad:  cross-region queries
#    Good: colocate compute with data
```
Key Takeaways
- Decoupling enables flexibility: Choose best tool for each workload
- Independent scaling: Scale storage and compute independently
- Cost optimization: Right-size each layer, use spot instances
- Open formats: Parquet + OTF prevents vendor lock-in
- Multiple engines: Spark for ETL, Trino for BI, Flink for streaming
- Object storage foundation: S3/GCS/ADLS as single source of truth
- Migration is gradual: Can coexist with warehouse during transition