
Module 1: Modern Data Architecture


Overview

This module covers the foundational technologies enabling modern data platforms at scale. Understanding the Lakehouse paradigm, Open Table Formats, and modern compute engines is essential for Principal-level architecture decisions.


Module Contents

Core Architecture

| Document | Description | Key Topics |
| --- | --- | --- |
| Lakehouse Concepts | Lakehouse vs. Lake vs. Warehouse | ACID transactions, schema enforcement, unified platform |
| Compute-Storage Separation | Disaggregated architecture | Independent scaling, cost implications |

Table Formats & Storage

| Document | Description | Key Topics |
| --- | --- | --- |
| Open Table Formats | Delta vs. Iceberg vs. Hudi | Comparison matrix, selection criteria |
| Storage Formats | Parquet, ORC, Avro deep dive | Compression, encoding, performance |
| Partitioning Strategies | Partitioning, Z-Ordering, clustering | Small files problem, query optimization |

Compute Engines

| Document | Description | Key Topics |
| --- | --- | --- |
| Compute Engines | Spark, Trino, DuckDB, ClickHouse | Use cases, trade-offs, cost |

Learning Objectives

After completing this module, you will be able to:

  1. Explain Lakehouse Architecture - Benefits, trade-offs, and implementation patterns
  2. Select Table Formats - Delta vs. Iceberg vs. Hudi based on requirements
  3. Choose Storage Formats - Parquet vs. ORC vs. Avro for specific scenarios
  4. Design Partitioning - Optimal strategies for query patterns and cost
  5. Select Compute Engines - Spark vs. Trino vs. DuckDB based on workload
  6. Optimize Cost - Storage tiering, file sizing, and compute selection

The Lakehouse Paradigm

Why Lakehouse Matters

| Problem | Lake Solution | Warehouse Solution | Lakehouse Solution |
| --- | --- | --- | --- |
| Transactions | No ACID | ACID built-in | ACID via OTF |
| Schema Enforcement | Schema-on-read only | Schema-on-write | Both modes |
| Quality | Data swamp risk | High quality | Enforced quality |
| Cost | Low | High (duplicate storage) | Low (single copy) |
| BI Support | Poor | Excellent | Excellent |

Cost Impact: Eliminating duplicate storage (lake + warehouse) typically saves 40-60% of total data platform costs.
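
The "ACID via OTF" row above is the key mechanism; here is a minimal PySpark sketch of an atomic MERGE upsert into a Delta table. Table names are illustrative, the Delta Lake runtime is assumed to be installed, and Iceberg and Hudi expose equivalent atomic writes.

```python
# Minimal sketch of "ACID via OTF": an atomic upsert into a Delta table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-acid").getOrCreate()

# The whole upsert commits as one transaction: concurrent readers see either
# the previous table snapshot or the new one, never a partially applied merge.
spark.sql("""
    MERGE INTO customers AS t
    USING customer_updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```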


Technology Selection Framework

Open Table Formats: The Big Three

Quick Selection Guide:

| Use Case | Recommended | Alternative |
| --- | --- | --- |
| Databricks Platform | Delta Lake | Iceberg (via Unity Catalog) |
| Streaming Ingest + Updates | Hudi MOR | Delta (deletion vectors) |
| High Partition Count | Iceberg | Hudi |
| Multi-cloud Strategy | Iceberg | Delta (open source) |
| Spark Ecosystem | Any OTF | Delta (best Spark integration) |
| Trino/Presto Queries | Iceberg | Delta (supported) |
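
As a concrete reference point for the table above, here is a minimal sketch of registering the same dataset as a Delta table and as an Iceberg table from Spark SQL. It assumes the Delta Lake and Iceberg Spark runtime packages are installed and that an Iceberg catalog (here called `iceberg_cat`) has been configured; all names are illustrative.

```python
# Minimal sketch: the same schema declared as a Delta table and an Iceberg table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("otf-demo").getOrCreate()

# Delta Lake: the format is selected via USING DELTA.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_delta (
        order_id BIGINT, order_date DATE, amount DOUBLE
    ) USING DELTA
    PARTITIONED BY (order_date)
""")

# Iceberg: tables live under a configured catalog and support partition transforms.
spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg_cat.db.sales_iceberg (
        order_id BIGINT, order_date DATE, amount DOUBLE
    ) USING iceberg
    PARTITIONED BY (days(order_date))
""")
```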

Storage Format Selection

| Format | Best For | Avoid When |
| --- | --- | --- |
| Parquet | Analytics, columnar queries | Row-heavy operations, write-heavy workloads |
| ORC | Hive, Presto/Trino | Non-Hive ecosystems |
| Avro | Streaming, schema evolution | Analytical queries |
| JSON/CSV | Debugging, interchange | Production analytics (10-100x cost) |

Cost Rule of Thumb: Parquet compression typically saves 85-90% vs. CSV, and 60-70% vs. JSON. Never use JSON/CSV for production analytics storage.
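
A minimal sketch of putting that rule into practice: converting raw CSV to Zstd-compressed Parquet with PySpark. Paths and the partition column are illustrative; Zstd is available as a Parquet codec in recent Spark releases (Snappy is the safe fallback).

```python
# Minimal sketch: one-off conversion of raw CSV into compressed, partitioned Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

raw = (spark.read
       .option("header", True)
       .option("inferSchema", True)   # acceptable for a one-off; prefer explicit schemas in pipelines
       .csv("s3://my-bucket/raw/events/*.csv"))

(raw.write
    .mode("overwrite")
    .option("compression", "zstd")    # columnar + compression is where the 85-90% savings come from
    .partitionBy("event_date")        # assumes an event_date column exists in the source
    .parquet("s3://my-bucket/curated/events/"))
```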


Compute Engine Selection

Quick Selection:

| Scenario | Engine | Why |
| --- | --- | --- |
| Local dev, < 100 GB | DuckDB | Fast, simple, no cluster |
| Batch ETL at scale | Spark | Throughput, ecosystem |
| Interactive BI queries | Trino | Low latency, federation |
| Real-time analytics | ClickHouse | Ingestion performance |
| Serverless needed | BigQuery/Snowflake | Managed, auto-scale |
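
To make the first row concrete, here is a minimal DuckDB sketch for analyzing Parquet files locally with no cluster at all. The file path and columns are illustrative.

```python
# Minimal sketch: DuckDB querying Parquet files in place for local development.
import duckdb

con = duckdb.connect()  # in-memory database; no cluster, no services

rows = con.execute("""
    SELECT event_date, count(*) AS events, sum(amount) AS revenue
    FROM read_parquet('data/events/*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").fetchall()

for row in rows:
    print(row)
```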

Cost Optimization Lens

Every decision in this module has cost implications:

Storage Costs

| Decision | Impact |
| --- | --- |
| Format choice | Parquet: 1x (baseline) |
| Compression codec | Zstd: -20% vs. Snappy |
| File sizing | Optimal: 100% (baseline) |
| Partitioning | Optimal: 1x (baseline) |
| Tiering | Hot/Warm/Cold: 30-70% savings |
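
A back-of-the-envelope sketch of the tiering row. The per-GB prices and the hot/warm/cold split are assumed figures for illustration only; plug in your provider's current pricing.

```python
# Rough arithmetic: tiering 100 TB vs. keeping everything in hot storage.
TOTAL_GB = 100 * 1024

price = {"hot": 0.023, "warm": 0.0125, "cold": 0.004}   # assumed $/GB-month
split = {"hot": 0.20, "warm": 0.30, "cold": 0.50}       # assumed data distribution

all_hot = TOTAL_GB * price["hot"]
tiered = sum(TOTAL_GB * split[t] * price[t] for t in price)

print(f"All hot: ${all_hot:,.0f}/month")
print(f"Tiered:  ${tiered:,.0f}/month")
print(f"Savings: {1 - tiered / all_hot:.0%}")   # lands inside the 30-70% range above
```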

Compute Costs

| Decision | Impact |
| --- | --- |
| Engine choice | Spark: $1.00 (reference point) |
| Cluster sizing | Right-sized: 1x (baseline) |
| Spot instances | Spot: 60-80% discount |
| Caching | Cached: 90%+ cost reduction |

Network Costs

| Decision | Impact |
| --- | --- |
| Colocation | Same region: 1x |
| Federation | Direct: $0.01/GB |

Architecture Patterns

Pattern 1: Bronze-Silver-Gold

Cost Implications: 2-3x storage cost for multiple layers (a minimal flow is sketched below), justified by:

  • Reduced reprocessing costs
  • Faster time-to-insight
  • Better data quality
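
A minimal sketch of the pattern: raw JSON landed in Bronze, cleaned and typed in Silver, aggregated in Gold. Paths, column names, and the use of Delta tables are illustrative assumptions, not a prescribed implementation.

```python
# Minimal Bronze-Silver-Gold flow in PySpark with Delta storage.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion").getOrCreate()

# Bronze: raw landing, stored as-is.
bronze = spark.read.json("s3://my-bucket/landing/orders/")
bronze.write.format("delta").mode("append").save("s3://my-bucket/bronze/orders")

# Silver: cleaned, typed, deduplicated.
silver = (spark.read.format("delta").load("s3://my-bucket/bronze/orders")
          .dropDuplicates(["order_id"])
          .withColumn("order_date", F.to_date("order_ts")))
silver.write.format("delta").mode("overwrite").save("s3://my-bucket/silver/orders")

# Gold: business-level aggregates for BI.
gold = silver.groupBy("order_date").agg(F.sum("amount").alias("daily_revenue"))
gold.write.format("delta").mode("overwrite").save("s3://my-bucket/gold/daily_revenue")
```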

Pattern 2: Unified Lakehouse

Cost Benefits:

  • Single copy of data (no lake + warehouse duplication)
  • Independent compute scaling
  • Right-size compute per workload

Senior Level Gotchas

Gotcha 1: The Small Files Death Spiral

Problem: Streaming creates thousands of small files → Slow queries → Expensive scans → More files from retries.

Solution:

  • Continuous compaction (OPTIMIZE in Delta, rewrite_data_files in Iceberg)
  • Target file size: 128MB-1GB for Parquet
  • Monitor file count metrics

Cost Impact: Unchecked small files can increase query costs by 10-100x.
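
A minimal sketch of the compaction step named above: OPTIMIZE is Delta Lake SQL, and rewrite_data_files is an Iceberg Spark procedure. Table and catalog names are illustrative, and the matching runtime packages are assumed to be installed.

```python
# Minimal compaction sketch for Delta and Iceberg tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction").getOrCreate()

# Delta Lake: bin-pack small files; the optional WHERE predicate must
# reference partition columns so only recent partitions are rewritten.
spark.sql("OPTIMIZE orders WHERE order_date >= '2024-01-01'")

# Iceberg: rewrite small data files toward a ~512 MB target size.
spark.sql("""
    CALL iceberg_cat.system.rewrite_data_files(
        table => 'db.orders',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```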

Gotcha 2: Partitioning Overkill

Problem: Partition by date + hour + country + product → Thousands of partitions → Metadata explosion.

Solution:

  • Partition by low-cardinality, filter-heavy columns (typically date)
  • Use Z-Ordering/clustering for other dimensions (see the sketch below)
  • Target: Thousands to millions of rows per partition

Cost Impact: Over-partitioning can make simple full-table scans faster than partitioned queries.
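
A minimal sketch of "partition coarsely, cluster the rest" in Delta Lake SQL: partition on date only, then Z-order the remaining filter columns within each partition. Table and column names are illustrative.

```python
# Minimal sketch: coarse partitioning plus Z-Ordering instead of four partition keys.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zorder").getOrCreate()

# Instead of PARTITIONED BY (event_date, hour, country, product_id):
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id BIGINT, event_date DATE, hour INT,
        country STRING, product_id BIGINT, amount DOUBLE
    ) USING DELTA
    PARTITIONED BY (event_date)
""")

# Co-locate rows for the remaining filter dimensions within each date partition.
spark.sql("OPTIMIZE events ZORDER BY (country, product_id)")
```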

Gotcha 3: The JSON Legacy Trap

Problem: “We’ll store as JSON for flexibility” → 10-20x storage + query costs → Hard to migrate later.

Solution:

  • Start with Parquet + schema enforcement
  • Use JSON only for raw landing (Bronze)
  • Convert immediately to typed storage

Cost Impact: Migrating 100TB from JSON to Parquet costs ~$5K-10K in compute, saves $50K-100K annually.
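
A minimal sketch of keeping JSON confined to Bronze: read the raw files with an explicit schema and immediately persist them as compressed Parquet. Paths, fields, and types are illustrative assumptions.

```python
# Minimal sketch: JSON in the landing zone only, typed Parquet everywhere else.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, LongType, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("json-to-typed").getOrCreate()

# Explicit schema = enforcement at read time, no silent type drift.
schema = StructType([
    StructField("order_id", LongType(), nullable=False),
    StructField("customer_id", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
    StructField("order_ts", TimestampType(), nullable=True),
])

raw = spark.read.schema(schema).json("s3://my-bucket/bronze/orders/")

(raw.write
    .mode("overwrite")
    .option("compression", "zstd")
    .parquet("s3://my-bucket/silver/orders/"))
```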

Gotcha 4: Wrong OTF Choice

Problem: Choosing Iceberg on a Databricks platform, or Delta for a high-partition Trino workload → Suboptimal performance.

Solution: Use selection framework above.

Cost Impact: 20-50% performance difference = direct cost impact.


Pre-Assessment

Before proceeding, ask yourself:

  • Can I explain why Lakehouse replaced both lake and warehouse?
  • Can I articulate the differences between Delta, Iceberg, and Hudi?
  • Do I understand when to use Parquet vs. ORC vs. Avro?
  • Can I design a partitioning strategy for a given query pattern?
  • Do I know when to use Spark vs. Trino vs. DuckDB?
  • Can I estimate the cost impact of format/partitioning decisions?
  • Have I designed a system addressing the small files problem?

If you answered “No” to 3+ questions: Study this module deeply.

If you answered “Yes” to 6+ questions: Focus on Module 7 (Cost Optimization) for advanced patterns.


Study Order

  1. Start Here: Lakehouse Concepts
  2. Then: Open Table Formats
  3. Then: Storage Formats
  4. Then: Partitioning Strategies
  5. Then: Compute Engines
  6. Finally: Compute-Storage Separation

Module Dependencies

This module is foundational. Understanding Lakehouse and OTF is prerequisite for all subsequent modules.


Next Steps

After completing this module:

  1. Proceed to Module 2 (Computing & Processing) for deep dive on execution engines
  2. Or skip to Module 7 (Performance & Cost Optimization) if you need immediate cost knowledge
  3. Or jump to Case Studies (Module 8) to see architecture patterns in action

Estimated Time to Complete Module 1: 6-8 hours