Module 1: Modern Data Architecture
Overview
This module covers the foundational technologies enabling modern data platforms at scale. Understanding the Lakehouse paradigm, Open Table Formats, and modern compute engines is essential for Principal-level architecture decisions.
Module Contents
Core Architecture
| Document | Description | Key Topics |
|---|---|---|
| Lakehouse Concepts | Lakehouse vs. Lake vs. Warehouse | ACID transactions, schema enforcement, unified platform |
| Compute-Storage Separation | Disaggregated architecture | Independent scaling, cost implications |
Table Formats & Storage
| Document | Description | Key Topics |
|---|---|---|
| Open Table Formats | Delta vs. Iceberg vs. Hudi | Comparison matrix, selection criteria |
| Storage Formats | Parquet, ORC, Avro deep dive | Compression, encoding, performance |
| Partitioning Strategies | Partitioning, Z-Ordering, clustering | Small files problem, query optimization |
Compute Engines
| Document | Description | Key Topics |
|---|---|---|
| Compute Engines | Spark, Trino, DuckDB, ClickHouse | Use cases, trade-offs, cost |
Learning Objectives
After completing this module, you will be able to:
- Explain Lakehouse Architecture - Benefits, trade-offs, and implementation patterns
- Select Table Formats - Delta vs. Iceberg vs. Hudi based on requirements
- Choose Storage Formats - Parquet vs. ORC vs. Avro for specific scenarios
- Design Partitioning - Optimal strategies for query patterns and cost
- Select Compute Engines - Spark vs. Trino vs. DuckDB based on workload
- Optimize Cost - Storage tiering, file sizing, and compute selection
The Lakehouse Paradigm
Why Lakehouse Matters
| Capability | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Transactions | No ACID | ACID built-in | ACID via OTF |
| Schema Enforcement | Schema-on-read only | Schema-on-write | Both modes |
| Quality | Data swamp risk | High quality | Enforced quality |
| Cost | Low | High (duplicate storage) | Low (single copy) |
| BI Support | Poor | Excellent | Excellent |
Cost Impact: Eliminating duplicate storage (lake + warehouse) typically saves 40-60% of total data platform costs.
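A minimal sketch of the two Lakehouse properties from the table above, ACID writes and schema enforcement, using PySpark with open-source Delta Lake. It assumes the `delta-spark` package is installed; the table path and data are illustrative.

```python
from pyspark.sql import SparkSession

# Spark session with the open-source Delta Lake extension enabled.
spark = (
    SparkSession.builder.appName("lakehouse-acid-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

orders = spark.createDataFrame(
    [(1, "2024-01-01", 120.50), (2, "2024-01-01", 75.00)],
    ["order_id", "order_date", "amount"],
)

# ACID append: readers only ever see fully committed table versions.
orders.write.format("delta").mode("append").save("/tmp/lakehouse/orders")

# Schema enforcement: a frame whose `amount` column is a string is rejected
# instead of silently landing in the table (the classic data-swamp failure).
bad = spark.createDataFrame(
    [(3, "2024-01-02", "not-a-number")],
    ["order_id", "order_date", "amount"],
)
try:
    bad.write.format("delta").mode("append").save("/tmp/lakehouse/orders")
except Exception as exc:
    print("Rejected by schema enforcement:", type(exc).__name__)
```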
Technology Selection Framework
Open Table Formats: The Big Three
Quick Selection Guide:
| Use Case | Recommended | Alternative |
|---|---|---|
| Databricks Platform | Delta Lake | Iceberg (via Unity Catalog) |
| Streaming Ingest + Updates | Hudi (Merge-on-Read) | Delta (with deletion vectors) |
| High Partition Count | Iceberg | Hudi |
| Multi-cloud Strategy | Iceberg | Delta (open source) |
| Spark Ecosystem | Delta (best Spark integration) | Any OTF (all have solid Spark support) |
| Trino/Presto Queries | Iceberg | Delta (supported) |
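The format choice mostly surfaces at table-creation time. A hedged sketch of the same logical table declared as Delta and as Iceberg in Spark SQL, assuming the Delta-enabled Spark session from the earlier sketch plus an Iceberg catalog registered under the illustrative name `ice`:

```python
# Delta table partitioned by an explicit column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_delta (
        event_id BIGINT, event_date DATE, country STRING, payload STRING
    ) USING delta
    PARTITIONED BY (event_date)
""")

# Iceberg table using a hidden-partitioning transform instead of a raw column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS ice.db.events_iceberg (
        event_id BIGINT, event_ts TIMESTAMP, country STRING, payload STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")
```

Downstream queries are largely unchanged; what differs is metadata handling, partition evolution, and which engines integrate best, which is what the selection guide above keys on.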
Storage Format Selection
| Format | Best For | Avoid When |
|---|---|---|
| Parquet | Analytics, columnar queries | Row-heavy operations, write-heavy |
| ORC | Hive, Presto/Trino | Non-Hive ecosystems |
| Avro | Streaming, schema evolution | Analytical queries |
| JSON/CSV | Debugging, interchange | Production analytics (10-100x cost) |
Cost Rule of Thumb: Compressed Parquet typically takes 85-90% less storage than CSV and 60-70% less than JSON. Never use JSON/CSV for production analytics storage.
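A small sketch of that rule of thumb using pandas and PyArrow. The file names are illustrative and the exact savings depend on the data, but columnar encoding plus compression is where the reduction comes from.

```python
import os
import pandas as pd

df = pd.read_csv("events.csv")                       # raw text landing file
df.to_parquet("events.parquet", compression="zstd")  # typed, columnar, compressed

csv_mb = os.path.getsize("events.csv") / 1e6
pq_mb = os.path.getsize("events.parquet") / 1e6
print(f"CSV: {csv_mb:.1f} MB, Parquet: {pq_mb:.1f} MB, "
      f"reduction: {1 - pq_mb / csv_mb:.0%}")
```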
Compute Engine Selection
Quick Selection:
| Scenario | Engine | Why |
|---|---|---|
| Local dev, < 100GB | DuckDB | Fast, simple, no cluster |
| Batch ETL at scale | Spark | Throughput, ecosystem |
| Interactive BI queries | Trino | Low latency, federation |
| Real-time analytics | ClickHouse | High ingest rates, sub-second aggregations |
| Serverless needed | BigQuery/Snowflake | Managed, auto-scale |
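For the "local dev, < 100GB" row, a minimal DuckDB sketch that queries Parquet in place with no cluster; the path and column names are illustrative.

```python
import duckdb

con = duckdb.connect()
daily = con.execute("""
    SELECT event_date, count(*) AS events, sum(amount) AS revenue
    FROM read_parquet('data/events/*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").fetchdf()
print(daily.head())
```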
Cost Optimization Lens
Every decision in this module has cost implications:
Storage Costs
| Decision | Impact |
|---|---|
| Format choice | Parquet: 1x baseline; JSON roughly 3x, CSV roughly 7-10x (see Storage Format Selection) |
| Compression codec | Zstd: ~20% smaller than Snappy |
| File sizing | Optimal (128MB-1GB files): 1x; unmanaged small files inflate scan costs 10-100x |
| Partitioning | Well-designed: 1x; over-partitioning adds metadata overhead and slows queries |
| Tiering | Hot/Warm/Cold tiering: 30-70% savings on infrequently accessed data |
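A hedged sketch of the codec row: the same DataFrame written with Snappy (Spark's Parquet default) and with Zstd, so the savings can be measured on your own data before standardizing. Paths are illustrative, and it assumes a Spark session as in the earlier sketches.

```python
df = spark.read.parquet("/data/silver/events")

# Same data, two codecs; compare the output directory sizes afterwards
# (e.g. with `du -sh`) to measure the actual difference on your data.
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/codec_test/snappy")
df.write.mode("overwrite").option("compression", "zstd").parquet("/tmp/codec_test/zstd")
```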
Compute Costs
| Decision | Impact |
|---|---|
| Engine choice | Spark: 1x baseline; match the engine to the workload (see Compute Engine Selection) rather than running everything on a general-purpose cluster |
| Cluster sizing | Right-sized: 1x; over-provisioned clusters bill for idle capacity |
| Spot instances | Spot: 60-80% discount vs. on-demand for fault-tolerant workloads |
| Caching | Cached data/results: 90%+ cost reduction on repeated queries |
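A back-of-the-envelope sketch of how the compute levers compound. Every rate and cluster size below is an assumption for illustration, not a quoted price.

```python
on_demand_rate = 2.00        # $/node-hour, assumed
nodes, hours = 20, 6         # assumed nightly batch footprint

baseline = on_demand_rate * nodes * hours   # on-demand, over-provisioned
right_sized = baseline * 0.5                # assume half the capacity was idle
with_spot = right_sized * (1 - 0.70)        # assume a 70% spot discount

print(f"baseline ${baseline:.0f}, right-sized ${right_sized:.0f}, "
      f"right-sized + spot ${with_spot:.0f} per run")
```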
Network Costs
| Decision | Impact |
|---|---|
| Colocation | Same region: 1x (intra-region traffic is typically free); cross-region placement adds per-GB transfer charges |
| Federation | Federated queries against remote sources pay transfer costs (on the order of $0.01/GB) on every scan |
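The same kind of arithmetic for network placement; the per-GB rate below is an assumed cross-region transfer price, so substitute your provider's actual pricing.

```python
scanned_tb = 10
egress_per_gb = 0.01          # $/GB, assumed cross-region transfer rate

same_region = 0.0             # intra-region traffic is typically not billed
cross_region = scanned_tb * 1000 * egress_per_gb
print(f"same region: ${same_region:.0f}, cross-region: ${cross_region:,.0f} per full scan")
```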
Architecture Patterns
Pattern 1: Bronze-Silver-Gold
Cost Implications: 2-3x storage cost for multiple layers, justified by:
- Reduced reprocessing costs
- Faster time-to-insight
- Better data quality
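A compact sketch of the Bronze-Silver-Gold flow in PySpark with Delta. Paths, schema, and cleaning rules are illustrative, and it assumes a Delta-enabled Spark session as in the earlier sketches.

```python
from pyspark.sql import functions as F

# Bronze: land the raw data as-is.
bronze = spark.read.json("/landing/orders/")
bronze.write.format("delta").mode("append").save("/lake/bronze/orders")

# Silver: enforce types, deduplicate, drop obviously bad records.
silver = (
    spark.read.format("delta").load("/lake/bronze/orders")
    .withColumn("amount", F.col("amount").cast("double"))
    .withColumn("order_date", F.to_date("order_date"))
    .dropDuplicates(["order_id"])
    .filter(F.col("amount").isNotNull())
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold: a business-level aggregate ready for BI.
gold = silver.groupBy("order_date").agg(
    F.count("*").alias("orders"),
    F.sum("amount").alias("revenue"),
)
gold.write.format("delta").mode("overwrite").save("/lake/gold/daily_revenue")
```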
Pattern 2: Unified Lakehouse
Cost Benefits:
- Single copy of data (no lake + warehouse duplication)
- Independent compute scaling
- Right-size compute per workload
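A hedged sketch of "single copy, many engines": Spark maintains the Silver table from the previous sketch, while DuckDB reads the same files for interactive work instead of loading a second copy into a warehouse. It assumes DuckDB's `delta` extension is available; the path is illustrative.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL delta; LOAD delta;")
top_days = con.execute("""
    SELECT order_date, sum(amount) AS revenue
    FROM delta_scan('/lake/silver/orders')
    GROUP BY order_date
    ORDER BY revenue DESC
    LIMIT 10
""").fetchdf()
print(top_days)
```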
Senior Level Gotchas
Gotcha 1: The Small Files Death Spiral
Problem: Streaming creates thousands of small files → Slow queries → Expensive scans → More files from retries.
Solution:
- Continuous compaction (OPTIMIZE in Delta, rewrite_data_files in Iceberg)
- Target file size: 128MB-1GB for Parquet
- Monitor file count metrics
Cost Impact: Unchecked small files can increase query costs by 10-100x.
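A hedged sketch of the compaction jobs named above, run as a regular maintenance task. Table paths, the `ice` catalog name, and the assumption that the Delta table is partitioned by `order_date` are illustrative.

```python
# Delta Lake: bin-pack small files, limited here to recent partitions
# (assumes the table is partitioned by order_date).
spark.sql("""
    OPTIMIZE delta.`/lake/silver/orders`
    WHERE order_date >= date_sub(current_date(), 7)
""")

# Apache Iceberg: the rewrite_data_files maintenance procedure,
# targeting roughly 512 MB output files.
spark.sql("""
    CALL ice.system.rewrite_data_files(
        table => 'db.orders',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```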
Gotcha 2: Partitioning Overkill
Problem: Partition by date + hour + country + product → Thousands of partitions → Metadata explosion.
Solution:
- Partition by low-cardinality, filter-heavy columns (typically date)
- Use Z-Ordering/clustering for other dimensions
- Target: millions of rows (roughly 1GB or more of data) per partition, not thousands
Cost Impact: Over-partitioning adds enough metadata and small-file overhead that an unpartitioned full-table scan can end up faster than the partitioned query.
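A sketch of that split in Delta syntax: one coarse partition column (date), with Z-Ordering covering the other filter dimensions instead of additional partition keys. Paths and columns are illustrative, assuming a Delta-enabled Spark session.

```python
events = spark.read.format("delta").load("/lake/silver/events_raw")

# Single low-cardinality partition key instead of date + hour + country + product.
(
    events.write.format("delta")
    .partitionBy("event_date")
    .mode("overwrite")
    .save("/lake/silver/events")
)

# Cluster the data files by the remaining filter-heavy columns.
spark.sql("""
    OPTIMIZE delta.`/lake/silver/events`
    ZORDER BY (country, product_id)
""")
```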
Gotcha 3: The JSON Legacy Trap
Problem: “We’ll store as JSON for flexibility” → 10-20x storage + query costs → Hard to migrate later.
Solution:
- Start with Parquet + schema enforcement
- Use JSON only for raw landing (Bronze)
- Convert immediately to typed storage
Cost Impact: Migrating 100TB from JSON to Parquet costs ~$5K-10K in compute, saves $50K-100K annually.
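A short sketch of "JSON only in Bronze, typed storage immediately after": read the raw landing files with an explicit schema and persist them in a typed, columnar table. Paths and fields are illustrative.

```python
from pyspark.sql.types import StructType, StructField, LongType, StringType, DoubleType

schema = StructType([
    StructField("order_id", LongType()),
    StructField("order_date", StringType()),
    StructField("amount", DoubleType()),
])

# Bronze: raw JSON stays in the landing zone only.
raw = spark.read.schema(schema).json("/landing/orders/")

# Convert immediately to typed, compressed, columnar storage.
raw.write.format("delta").mode("append").save("/lake/silver/orders")
```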
Gotcha 4: Wrong OTF Choice
Problem: Choose Iceberg for Databricks, or Delta for high-partition Trino workload → Suboptimal performance.
Solution: Use selection framework above.
Cost Impact: 20-50% performance difference = direct cost impact.
Pre-Assessment
Before proceeding, ask yourself:
- Can I explain why Lakehouse replaced both lake and warehouse?
- Can I articulate the differences between Delta, Iceberg, and Hudi?
- Do I understand when to use Parquet vs. ORC vs. Avro?
- Can I design a partitioning strategy for a given query pattern?
- Do I know when to use Spark vs. Trino vs. DuckDB?
- Can I estimate the cost impact of format/partitioning decisions?
- Have I designed a system addressing the small files problem?
If you answered “No” to 3+ questions: Study this module deeply.
If you answered “Yes” to 6+ questions: Focus on Module 7 (Cost Optimization) for advanced patterns.
Study Order
- Start Here → Lakehouse Concepts
- Then → Open Table Formats
- Then → Storage Formats
- Then → Partitioning Strategies
- Then → Compute Engines
- Finally → Compute-Storage Separation
Module Dependencies
This module is foundational. Understanding Lakehouse and OTF is prerequisite for all subsequent modules.
Next Steps
After completing this module:
- Proceed to Module 2 (Computing & Processing) for deep dive on execution engines
- Or skip to Module 7 (Performance & Cost Optimization) if you need immediate cost knowledge
- Or jump to Case Studies (Module 8) to see architecture patterns in action
Estimated Time to Complete Module 1: 6-8 hours