Module 3: Cloud Infrastructure
Overview
This module covers cloud infrastructure for data platforms: provider comparison (AWS, GCP, Azure), managed data services, infrastructure as code, orchestration, and containerization. Understanding cloud-native patterns and selecting the right services are critical skills for Principal-level architecture.
Module Contents
Cloud Services
| Document | Description | Key Topics |
|---|---|---|
| Cloud Provider Comparison | AWS vs. GCP vs. Azure | Services, pricing, ecosystem |
| Data Warehouse Services | Redshift, BigQuery, Snowflake, Databricks | Architecture, optimization, migration |
| Infrastructure as Code | Terraform, Ansible | Modules, state, CI/CD |
| Orchestration | Airflow, Dagster, Prefect, K8s | Workflows, deployments, monitoring |
| Containerization | Docker, Kubernetes for data | Images, pods, scaling |
Detailed Breakdown
Data Warehouse Services (5 files)
- Redshift Guide - AWS data warehouse, distribution, WLM
- BigQuery Guide - Serverless warehouse, partitioning, ML
- Snowflake Guide - Multi-cloud warehouse, time travel, cloning
- Databricks Guide - Lakehouse platform, Delta Lake, MLflow
- Data Warehouse Comparison - Feature comparison, selection guide
Infrastructure as Code (3 files)
- Terraform Guide - Cloud provisioning, modules, state
- Ansible Guide - Configuration management, roles, playbooks
Orchestration (5 files)
- Airflow Guide - Modern Airflow, TaskFlow API, task groups
- Dagster Guide - Data-aware orchestration, assets, IO managers
- Prefect Guide - Modern orchestration, flows, state handling
- Kubernetes Orchestration - Cloud-native orchestration, operators
Containerization (3 files)
- Docker Guide - Containerization, Dockerfile, Compose
- Kubernetes for Data - Spark, Airflow, Jupyter on K8s
Cloud Provider Comparison
Data Warehouse Services Comparison
| Service | Strength | Weakness | Approx. cost per TB |
|---|---|---|---|
| BigQuery | Serverless, fast | Less control | $5.00 |
| Snowflake | Multi-cloud, features | Expensive | $3.00-6.00 |
| Redshift | AWS integration | Operational overhead | $2.50-5.00 |
| Databricks SQL | Lakehouse native | Newer | $0.50-2.00 |
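To make the per-TB figures in the table above concrete, here is a minimal sketch that turns them into a rough monthly range for a given data volume. It assumes the figures are in USD per TB of data processed, which the table does not state explicitly, and that cost scales linearly with volume. The names `COST_PER_TB_USD` and `estimate_monthly_cost` are hypothetical, introduced only for illustration. Real pricing depends on region, edition, and commitment level, so treat the output as an order-of-magnitude comparison rather than a quote.

```python
from typing import Dict, Tuple

# Illustrative (low, high) per-TB prices in USD, copied from the table above.
# Assumption: "per TB" means per TB processed; the source does not specify.
COST_PER_TB_USD: Dict[str, Tuple[float, float]] = {
    "BigQuery": (5.00, 5.00),
    "Snowflake": (3.00, 6.00),
    "Redshift": (2.50, 5.00),
    "Databricks SQL": (0.50, 2.00),
}


def estimate_monthly_cost(service: str, tb_per_month: float) -> Tuple[float, float]:
    """Return a (low, high) monthly cost estimate in USD, assuming linear scaling."""
    low, high = COST_PER_TB_USD[service]
    return (low * tb_per_month, high * tb_per_month)


# Example: compare all four services at a hypothetical 40 TB per month.
for service in COST_PER_TB_USD:
    low, high = estimate_monthly_cost(service, 40)
    print(f"{service}: ${low:,.2f} to ${high:,.2f} per month")
```

A range rather than a single number is returned on purpose: three of the four rows in the table are ranges, and collapsing them to a midpoint would hide most of the uncertainty.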
Orchestration Tool Selection
Cost Optimization
Infrastructure Cost Strategies
| Strategy | Savings | Complexity |
|---|---|---|
| Spot instances | 60-80% | Low |
| Reserved instances | 30-50% | Low |
| Right-sizing | 20-40% | Medium |
| Serverless | Variable | Low |
| Multi-region | Optimized egress | High |
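As a worked example of the savings ranges in the table above, the sketch below applies one strategy at a time to a baseline monthly spend. It assumes the percentages apply to the full baseline passed in (in practice they apply only to the eligible share of spend) and that strategies are not stacked. `SAVINGS_PCT` and `cost_after_savings` are hypothetical names used for illustration, and only the rows with numeric savings are included.

```python
from typing import Dict, Tuple

# (low, high) savings in percent, copied from the numeric rows of the table above.
SAVINGS_PCT: Dict[str, Tuple[int, int]] = {
    "Spot instances": (60, 80),
    "Reserved instances": (30, 50),
    "Right-sizing": (20, 40),
}


def cost_after_savings(baseline_usd: float, strategy: str) -> Tuple[float, float]:
    """Return (best_case, worst_case) spend in USD after applying one strategy.

    Assumption: the saving applies to the whole baseline, with no stacking.
    """
    low_pct, high_pct = SAVINGS_PCT[strategy]
    best_case = baseline_usd * (100 - high_pct) / 100   # highest saving
    worst_case = baseline_usd * (100 - low_pct) / 100   # lowest saving
    return (best_case, worst_case)


# Example: a hypothetical 10,000 USD monthly compute bill.
for strategy in SAVINGS_PCT:
    best, worst = cost_after_savings(10_000, strategy)
    print(f"{strategy}: ${best:,.2f} to ${worst:,.2f} per month")
```

Separating the eligible share of spend from the total bill is the main refinement a real estimate would need, since spot and reserved pricing typically cover compute only.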
FinOps for Data Platforms
Learning Objectives
After completing this module, you will be able to:
- Compare cloud providers: AWS vs. GCP vs. Azure for data platforms
- Select managed services: BigQuery, Snowflake, Redshift, Databricks
- Implement IaC: Terraform patterns for data infrastructure
- Select orchestration: Airflow vs. Dagster vs. Prefect vs. K8s
- Containerize data workloads: Docker, Kubernetes for data
- Optimize cloud costs: Spot instances, right-sizing, serverless
Module Dependencies
Next Steps
- Review Cloud Provider Comparison
- Study Data Warehouse Services
- Learn Infrastructure as Code
- Explore Orchestration
- Review Containerization
Estimated Time to Complete Module 3: 8-10 hours