
Module 1: Modern Data Architecture


Overview

This module covers the foundational technologies enabling modern data platforms at scale. Understanding the Lakehouse paradigm, Open Table Formats, and modern compute engines is essential for Principal-level architecture decisions.


Module Contents

Core Architecture

| Document | Description | Key Topics |
| --- | --- | --- |
| Lakehouse Concepts | Lakehouse vs. Lake vs. Warehouse | ACID transactions, schema enforcement, unified platform |
| Compute-Storage Separation | Disaggregated architecture | Independent scaling, cost implications |

Table Formats & Storage

| Document | Description | Key Topics |
| --- | --- | --- |
| Open Table Formats | Delta vs. Iceberg vs. Hudi | Comparison matrix, selection criteria |
| Storage Formats | Parquet, ORC, Avro deep dive | Compression, encoding, performance |
| Partitioning Strategies | Partitioning, Z-Ordering, clustering | Small files problem, query optimization |

Compute Engines

| Document | Description | Key Topics |
| --- | --- | --- |
| Compute Engines | Spark, Trino, DuckDB, ClickHouse | Use cases, trade-offs, cost |

Learning Objectives

After completing this module, you will be able to:

  1. Explain Lakehouse Architecture - Benefits, trade-offs, and implementation patterns
  2. Select Table Formats - Delta vs. Iceberg vs. Hudi based on requirements
  3. Choose Storage Formats - Parquet vs. ORC vs. Avro for specific scenarios
  4. Design Partitioning - Optimal strategies for query patterns and cost
  5. Select Compute Engines - Spark vs. Trino vs. DuckDB based on workload
  6. Optimize Cost - Storage tiering, file sizing, and compute selection

The Lakehouse Paradigm

Why Lakehouse Matters

| Problem | Lake Solution | Warehouse Solution | Lakehouse Solution |
| --- | --- | --- | --- |
| Transactions | No ACID | ACID built-in | ACID via OTF |
| Schema Enforcement | Schema-on-read only | Schema-on-write | Both modes |
| Quality | Data swamp risk | High quality | Enforced quality |
| Cost | Low | High (duplicate storage) | Low (single copy) |
| BI Support | Poor | Excellent | Excellent |

Cost Impact: Eliminating duplicate storage (lake + warehouse) typically saves 40-60% of total data platform costs.
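
The "ACID via OTF" row above is the key mechanism; here is a minimal PySpark sketch of an atomic MERGE upsert into a Delta table. Table names are illustrative, the Delta Lake runtime is assumed to be installed, and Iceberg and Hudi expose equivalent atomic writes.

```python
# Minimal sketch of "ACID via OTF": an atomic upsert into a Delta table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-acid").getOrCreate()

# The whole upsert commits as one transaction: concurrent readers see either
# the previous table snapshot or the new one, never a partially applied merge.
spark.sql("""
    MERGE INTO customers AS t
    USING customer_updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```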


Technology Selection Framework

Open Table Formats: The Big Three

Quick Selection Guide:

| Use Case | Recommended | Alternative |
| --- | --- | --- |
| Databricks Platform | Delta Lake | Iceberg (via Unity Catalog) |
| Streaming Ingest + Updates | Hudi MOR | Delta (deletion vectors) |
| High Partition Count | Iceberg | Hudi |
| Multi-cloud Strategy | Iceberg | Delta (open source) |
| Spark Ecosystem | Any OTF | Delta (best Spark integration) |
| Trino/Presto Queries | Iceberg | Delta (supported) |
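
As a concrete reference point for the table above, here is a minimal sketch of registering the same dataset as a Delta table and as an Iceberg table from Spark SQL. It assumes the Delta Lake and Iceberg Spark runtime packages are installed and that an Iceberg catalog (here called `iceberg_cat`) has been configured; all names are illustrative.

```python
# Minimal sketch: the same schema declared as a Delta table and an Iceberg table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("otf-demo").getOrCreate()

# Delta Lake: the format is selected via USING DELTA.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_delta (
        order_id BIGINT, order_date DATE, amount DOUBLE
    ) USING DELTA
    PARTITIONED BY (order_date)
""")

# Iceberg: tables live under a configured catalog and support partition transforms.
spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg_cat.db.sales_iceberg (
        order_id BIGINT, order_date DATE, amount DOUBLE
    ) USING iceberg
    PARTITIONED BY (days(order_date))
""")
```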

Storage Format Selection

| Format | Best For | Avoid When |
| --- | --- | --- |
| Parquet | Analytics, columnar queries | Row-heavy operations, write-heavy workloads |
| ORC | Hive, Presto/Trino | Non-Hive ecosystems |
| Avro | Streaming, schema evolution | Analytical queries |
| JSON/CSV | Debugging, interchange | Production analytics (10-100x cost) |

Cost Rule of Thumb: Parquet compression typically saves 85-90% vs. CSV, and 60-70% vs. JSON. Never use JSON/CSV for production analytics storage.
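
A minimal sketch of putting that rule into practice: converting raw CSV to Zstd-compressed Parquet with PySpark. Paths and the partition column are illustrative; Zstd is available as a Parquet codec in recent Spark releases (Snappy is the safe fallback).

```python
# Minimal sketch: one-off conversion of raw CSV into compressed, partitioned Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

raw = (spark.read
       .option("header", True)
       .option("inferSchema", True)   # acceptable for a one-off; prefer explicit schemas in pipelines
       .csv("s3://my-bucket/raw/events/*.csv"))

(raw.write
    .mode("overwrite")
    .option("compression", "zstd")    # columnar + compression is where the 85-90% savings come from
    .partitionBy("event_date")        # assumes an event_date column exists in the source
    .parquet("s3://my-bucket/curated/events/"))
```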


Compute Engine Selection

Quick Selection:

| Scenario | Engine | Why |
| --- | --- | --- |
| Local dev, < 100 GB | DuckDB | Fast, simple, no cluster |
| Batch ETL at scale | Spark | Throughput, ecosystem |
| Interactive BI queries | Trino | Low latency, federation |
| Real-time analytics | ClickHouse | Ingestion performance |
| Serverless needed | BigQuery/Snowflake | Managed, auto-scale |
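
To make the first row concrete, here is a minimal DuckDB sketch for analyzing Parquet files locally with no cluster at all. The file path and columns are illustrative.

```python
# Minimal sketch: DuckDB querying Parquet files in place for local development.
import duckdb

con = duckdb.connect()  # in-memory database; no cluster, no services

rows = con.execute("""
    SELECT event_date, count(*) AS events, sum(amount) AS revenue
    FROM read_parquet('data/events/*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").fetchall()

for row in rows:
    print(row)
```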

Cost Optimization Lens

Every decision in this module has cost implications:

Storage Costs

| Decision | Impact |
| --- | --- |
| Format choice | Parquet: 1x (baseline) |
| Compression codec | Zstd: -20% vs. Snappy |
| File sizing | Optimal: 100% (baseline) |
| Partitioning | Optimal: 1x (baseline) |
| Tiering | Hot/Warm/Cold: 30-70% savings |
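
A back-of-the-envelope sketch of the tiering row. The per-GB prices and the hot/warm/cold split are assumed figures for illustration only; plug in your provider's current pricing.

```python
# Rough arithmetic: tiering 100 TB vs. keeping everything in hot storage.
TOTAL_GB = 100 * 1024

price = {"hot": 0.023, "warm": 0.0125, "cold": 0.004}   # assumed $/GB-month
split = {"hot": 0.20, "warm": 0.30, "cold": 0.50}       # assumed data distribution

all_hot = TOTAL_GB * price["hot"]
tiered = sum(TOTAL_GB * split[t] * price[t] for t in price)

print(f"All hot: ${all_hot:,.0f}/month")
print(f"Tiered:  ${tiered:,.0f}/month")
print(f"Savings: {1 - tiered / all_hot:.0%}")   # lands inside the 30-70% range above
```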

Compute Costs

| Decision | Impact |
| --- | --- |
| Engine choice | Spark: $1.00 (reference point) |
| Cluster sizing | Right-sized: 1x (baseline) |
| Spot instances | Spot: 60-80% discount |
| Caching | Cached: 90%+ cost reduction |

Network Costs

| Decision | Impact |
| --- | --- |
| Colocation | Same region: 1x |
| Federation | Direct: $0.01/GB |

Architecture Patterns

Pattern 1: Bronze-Silver-Gold

Cost Implications: 2-3x storage cost for multiple layers (a minimal flow is sketched below), justified by:

  • Reduced reprocessing costs
  • Faster time-to-insight
  • Better data quality
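
A minimal sketch of the pattern: raw JSON landed in Bronze, cleaned and typed in Silver, aggregated in Gold. Paths, column names, and the use of Delta tables are illustrative assumptions, not a prescribed implementation.

```python
# Minimal Bronze-Silver-Gold flow in PySpark with Delta storage.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion").getOrCreate()

# Bronze: raw landing, stored as-is.
bronze = spark.read.json("s3://my-bucket/landing/orders/")
bronze.write.format("delta").mode("append").save("s3://my-bucket/bronze/orders")

# Silver: cleaned, typed, deduplicated.
silver = (spark.read.format("delta").load("s3://my-bucket/bronze/orders")
          .dropDuplicates(["order_id"])
          .withColumn("order_date", F.to_date("order_ts")))
silver.write.format("delta").mode("overwrite").save("s3://my-bucket/silver/orders")

# Gold: business-level aggregates for BI.
gold = silver.groupBy("order_date").agg(F.sum("amount").alias("daily_revenue"))
gold.write.format("delta").mode("overwrite").save("s3://my-bucket/gold/daily_revenue")
```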

Pattern 2: Unified Lakehouse

Cost Benefits:

  • Single copy of data (no lake + warehouse duplication)
  • Independent compute scaling
  • Right-size compute per workload

Senior Level Gotchas

Gotcha 1: The Small Files Death Spiral

Problem: Streaming creates thousands of small files → Slow queries → Expensive scans → More files from retries.

Solution:

  • Continuous compaction (OPTIMIZE in Delta, rewrite_data_files in Iceberg)
  • Target file size: 128MB-1GB for Parquet
  • Monitor file count metrics

Cost Impact: Unchecked small files can increase query costs by 10-100x.
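
A minimal sketch of the compaction step named above: OPTIMIZE is Delta Lake SQL, and rewrite_data_files is an Iceberg Spark procedure. Table and catalog names are illustrative, and the matching runtime packages are assumed to be installed.

```python
# Minimal compaction sketch for Delta and Iceberg tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction").getOrCreate()

# Delta Lake: bin-pack small files; the optional WHERE predicate must
# reference partition columns so only recent partitions are rewritten.
spark.sql("OPTIMIZE orders WHERE order_date >= '2024-01-01'")

# Iceberg: rewrite small data files toward a ~512 MB target size.
spark.sql("""
    CALL iceberg_cat.system.rewrite_data_files(
        table => 'db.orders',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```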

Gotcha 2: Partitioning Overkill

Problem: Partition by date + hour + country + product → Thousands of partitions → Metadata explosion.

Solution:

  • Partition by low-cardinality, filter-heavy columns (typically date)
  • Use Z-Ordering/clustering for other dimensions (see the sketch below)
  • Target: Thousands to millions of rows per partition

Cost Impact: Over-partitioning can make simple full-table scans faster than partitioned queries.
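
A minimal sketch of "partition coarsely, cluster the rest" in Delta Lake SQL: partition on date only, then Z-order the remaining filter columns within each partition. Table and column names are illustrative.

```python
# Minimal sketch: coarse partitioning plus Z-Ordering instead of four partition keys.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zorder").getOrCreate()

# Instead of PARTITIONED BY (event_date, hour, country, product_id):
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id BIGINT, event_date DATE, hour INT,
        country STRING, product_id BIGINT, amount DOUBLE
    ) USING DELTA
    PARTITIONED BY (event_date)
""")

# Co-locate rows for the remaining filter dimensions within each date partition.
spark.sql("OPTIMIZE events ZORDER BY (country, product_id)")
```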

Gotcha 3: The JSON Legacy Trap

Problem: “We’ll store as JSON for flexibility” → 10-20x storage + query costs → Hard to migrate later.

Solution:

  • Start with Parquet + schema enforcement
  • Use JSON only for raw landing (Bronze)
  • Convert immediately to typed storage

Cost Impact: Migrating 100TB from JSON to Parquet costs ~$5K-10K in compute, saves $50K-100K annually.
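
A minimal sketch of keeping JSON confined to Bronze: read the raw files with an explicit schema and immediately persist them as compressed Parquet. Paths, fields, and types are illustrative assumptions.

```python
# Minimal sketch: JSON in the landing zone only, typed Parquet everywhere else.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, LongType, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("json-to-typed").getOrCreate()

# Explicit schema = enforcement at read time, no silent type drift.
schema = StructType([
    StructField("order_id", LongType(), nullable=False),
    StructField("customer_id", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
    StructField("order_ts", TimestampType(), nullable=True),
])

raw = spark.read.schema(schema).json("s3://my-bucket/bronze/orders/")

(raw.write
    .mode("overwrite")
    .option("compression", "zstd")
    .parquet("s3://my-bucket/silver/orders/"))
```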

Gotcha 4: Wrong OTF Choice

Problem: Choosing Iceberg on a Databricks platform, or Delta for a high-partition Trino workload → Suboptimal performance.

Solution: Use selection framework above.

Cost Impact: 20-50% performance difference = direct cost impact.


Pre-Assessment

Before proceeding, ask yourself:

  • Can I explain why Lakehouse replaced both lake and warehouse?
  • Can I articulate the differences between Delta, Iceberg, and Hudi?
  • Do I understand when to use Parquet vs. ORC vs. Avro?
  • Can I design a partitioning strategy for a given query pattern?
  • Do I know when to use Spark vs. Trino vs. DuckDB?
  • Can I estimate the cost impact of format/partitioning decisions?
  • Have I designed a system addressing the small files problem?

If you answered “No” to 3+ questions: Study this module deeply.

If you answered “Yes” to 6+ questions: Focus on Module 7 (Cost Optimization) for advanced patterns.


Study Order

  1. Start Here: Lakehouse Concepts
  2. Then: Open Table Formats
  3. Then: Storage Formats
  4. Then: Partitioning Strategies
  5. Then: Compute Engines
  6. Finally: Compute-Storage Separation

Module Dependencies

This module is foundational. Understanding Lakehouse and OTF is prerequisite for all subsequent modules.


Next Steps

After completing this module:

  1. Proceed to Module 2 (Computing & Processing) for deep dive on execution engines
  2. Or skip to Module 7 (Performance & Cost Optimization) if you need immediate cost knowledge
  3. Or jump to Case Studies (Module 8) to see architecture patterns in action

Estimated Time to Complete Module 1: 6-8 hours