AWS vs. GCP vs. Azure for Data Platforms
Overview
This document compares the three major cloud providers (AWS, GCP, Azure) for data platform workloads. While all three provide similar services, differences in maturity, pricing, and ecosystem can significantly impact architecture decisions.
Quick Comparison Matrix

| Service Category | AWS | GCP | Azure |
|---|---|---|---|
| Object Storage | S3 | GCS | Blob Storage |
| Data Warehouse | Redshift | BigQuery | Synapse |
| Streaming | MSK (Kafka) | Pub/Sub | Event Hubs |
| Orchestration | MWAA (Airflow) | Composer (Airflow) | Data Factory |
| Notebooks | SageMaker | Vertex AI | Synapse |
| ETL Service | Glue | Dataflow | Data Factory |
| Managed Spark | EMR | Dataproc | Synapse |
| Metadata | Glue Catalog | Data Catalog | Purview |

Service-by-Service Comparison
Object Storage
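
| Feature | S3 | GCS | Blob Storage |
|---|---|---|---|
| Latency | Low | Low | Low |
| API | REST (SOAP is legacy) | REST (JSON and XML APIs) | REST |
| Tiering | Standard → IA → Glacier | Standard → Nearline → Coldline → Archive | Hot → Cool → Archive |
| Cost (per TB/month) | $23 | $20 | $18 |
| Min charge | None | None | None |
| Best for | Ecosystem | Price | Enterprise |

Winner: Blob Storage (lowest list price in this table), S3 (most mature)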
Data Warehouses

| Feature | Redshift | BigQuery | Synapse |
|---|---|---|---|
| Model | Provisioned clusters | Serverless | Provisioned + Serverless |
| Cost (per TB queried) | $5 (Spectrum; RA3 clusters are billed per node-hour) | $5 | $5-6 |
| Concurrency | Limited by WLM | High | Limited |
| Ecosystem | AWS-native | Google-native | Azure-native |
| Best for | Predictable workloads | Variable workloads | Enterprise |

Winner: BigQuery (serverless, excellent performance)
Streaming

| Feature | MSK (Kafka) | Pub/Sub | Event Hubs |
|---|---|---|---|
| Protocol | Kafka | Proprietary | AMQP 1.0 (Kafka-compatible endpoint available) |
| Scaling | Manual rebalance | Auto | Auto |
| Retention | Configurable | 7 days (default) | 1-90 days |
| Cost (per TB) | ~$25-50 | ~$10-20 | ~$25-50 |
| Best for | Kafka workloads | Google ecosystem | Enterprise |

Winner: Pub/Sub (lowest cost, auto-scaling)
Managed Spark

| Feature | EMR | Dataproc | Synapse |
|---|---|---|---|
| Spark Version | Latest | Latest | Latest |
| Pricing | Instance-based | Instance-based | vCore-based |
| Spot Instances | Yes | Preemptible | Low priority |
| Management | Customer managed | Customer managed | Fully managed |
| Best for | Ecosystem | Price | Ease of use |

Winner: Dataproc (lowest cost), Synapse (easiest)
Pricing Comparison
Storage Costs (per TB/month)

| Tier | AWS | GCP | Azure |
|---|---|---|---|
| Hot | $23 | $20 | $18 |
| Cool | $12 (IA) | $10 (Nearline) | $10 |
| Cold | $4 (Glacier) | $4 (Coldline) | $2 (Archive) |
| Archive | $1 (Deep Archive) | $1 (Archive) | N/A |

Winner: Azure (lowest storage costs)
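
To see how the tier prices combine for a real data mix, the per-TB figures from the storage table can be turned into a small calculator. This is a minimal sketch: the prices are the approximate list prices quoted above, the tier names (`hot`, `cool`, `cold`, `archive`) mirror the table rows, and the 100/300/600 TB mix is a hypothetical workload chosen only for illustration.

```python
# Per-TB monthly prices (USD) copied from the storage table above.
# None marks a tier that the table lists as N/A.
STORAGE_PRICE_PER_TB = {
    "AWS": {"hot": 23, "cool": 12, "cold": 4, "archive": 1},
    "GCP": {"hot": 20, "cool": 10, "cold": 4, "archive": 1},
    "Azure": {"hot": 18, "cool": 10, "cold": 2, "archive": None},
}


def monthly_storage_cost(provider, tb_by_tier):
    """Return the monthly storage bill in USD for a {tier: terabytes} mix."""
    prices = STORAGE_PRICE_PER_TB[provider]
    total = 0.0
    for tier, terabytes in tb_by_tier.items():
        price = prices[tier]
        if price is None:
            raise ValueError(f"{provider} has no {tier!r} tier in the table")
        total += price * terabytes
    return total


# Hypothetical data mix: 100 TB hot, 300 TB cool, 600 TB cold.
mix = {"hot": 100, "cool": 300, "cold": 600}
for provider in STORAGE_PRICE_PER_TB:
    print(provider, monthly_storage_cost(provider, mix))
```

The calculation ignores request charges, retrieval fees, and minimum storage durations, all of which can dominate the bill for colder tiers.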
Compute Costs (similar instance)

| Instance Type | AWS | GCP | Azure |
|---|---|---|---|
| General (8 vCPU, 32 GB) | $0.30/hour | $0.26/hour | $0.29/hour |
| Memory (8 vCPU, 64 GB) | $0.60/hour | $0.50/hour | $0.55/hour |
| Spot Discount | 70% | 80% | 60% |

Winner: GCP (lowest compute, highest spot discount)
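
Clusters rarely run entirely on spot capacity, so the headline discount overstates the savings. The sketch below blends the on-demand and spot prices for the general-purpose instance in the table. The function name `blended_hourly_rate` and the `spot_fraction` parameter are illustrative, and the discounts are the approximate figures quoted above rather than guaranteed rates.

```python
# Hourly on-demand prices (USD) and spot discounts for the general-purpose
# instance (8 vCPU, 32 GB), copied from the compute table above.
ON_DEMAND_PER_HOUR = {"AWS": 0.30, "GCP": 0.26, "Azure": 0.29}
SPOT_DISCOUNT = {"AWS": 0.70, "GCP": 0.80, "Azure": 0.60}


def blended_hourly_rate(provider, spot_fraction):
    """Average hourly price per node when spot_fraction of nodes use spot capacity."""
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    on_demand = ON_DEMAND_PER_HOUR[provider]
    spot = on_demand * (1.0 - SPOT_DISCOUNT[provider])
    return spot_fraction * spot + (1.0 - spot_fraction) * on_demand


# Example: half of the worker nodes on spot capacity.
for provider in ON_DEMAND_PER_HOUR:
    print(provider, round(blended_hourly_rate(provider, 0.5), 4))
```

Spot capacity can be reclaimed at short notice, so the spot fraction is usually limited to stateless or easily retried workers.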
Ecosystem and Maturity
Data Engineering Ecosystem
Third-Party Integration

| Platform | Partner Ecosystem | Data Tools Available |
|---|---|---|
| AWS | Largest (Snowflake, Databricks, etc.) | All tools support AWS |
| GCP | Growing | Most tools support GCP |
| Azure | Strong enterprise | Most tools support Azure |

Selection Framework
Decision Guide

| Scenario | Recommended | Rationale |
|---|---|---|
| Cost-sensitive | GCP | Lowest compute costs, competitive storage costs |
| Ecosystem | AWS | Broadest tooling |
| BigQuery-centric | GCP | Best warehouse |
| Enterprise | Azure | Enterprise features, hybrid |
| Existing Microsoft | Azure | Integration with O365, Teams |
| Open-source focus | GCP | Open-source contributions |
| Multi-cloud | AWS or GCP | Best multi-cloud support |

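
When several scenarios apply at once, the guide needs a tie-break. One possible encoding is an ordered rule list where the first match wins. The ordering below is an assumption: it follows the "existing investment > features > cost" priority from the key takeaways, and the flag names (`existing_microsoft`, `cost_sensitive`, and so on) are hypothetical labels for the table rows.

```python
# (scenario flag, recommendation) pairs taken from the decision guide.
# ASSUMPTION: the order encodes "existing investment > features > cost".
RULES = [
    ("existing_microsoft", "Azure"),   # existing investment
    ("bigquery_centric", "GCP"),       # feature-driven scenarios
    ("enterprise_hybrid", "Azure"),
    ("broad_ecosystem", "AWS"),
    ("open_source_focus", "GCP"),
    ("multi_cloud", "AWS or GCP"),
    ("cost_sensitive", "GCP"),         # cost comes last
]


def recommend(requirements):
    """Return the recommendation for the first matching flag, or None."""
    for flag, provider in RULES:
        if flag in requirements:
            return provider
    return None


# A cost-sensitive team that already runs on Microsoft tooling.
print(recommend({"cost_sensitive", "existing_microsoft"}))
```

A first-match list keeps the priority explicit and easy to reorder; a weighted score would be an alternative if scenarios should trade off against each other.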
Data Warehouse Comparison
Deep Dive: BigQuery vs. Redshift vs. Snowflake

| Feature | BigQuery | Redshift | Snowflake |
|---|---|---|---|
| Architecture | Serverless | Provisioned | Serverless |
| Pricing | $5/TB queried | $5/TB (cluster cost) | $2-6/TB |
| Concurrency | Unlimited (serverless) | Limited by cluster | Unlimited |
| ML Integration | Vertex AI | SageMaker | Snowpark |
| Best For | Variable workloads | Predictable workloads | Multi-cloud |

Deep Dive: MSK vs. Pub/Sub vs. Event Hubs
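
| Feature | MSK | Pub/Sub | Event Hubs |
|---|---|---|---|
| Protocol | Kafka (open source) | Proprietary | AMQP 1.0 (Kafka-compatible endpoint available) |
| Latency | 10-50 ms | 50-200 ms | 50-200 ms |
| Scaling | Manual rebalance | Auto | Auto |
| Retention | Configurable (can be set to unlimited) | 7 days (default) | 1-90 days |
| Cost | $0.12/GB/month | $0.08/GB/month | $0.15/GB/month |
| Best For | Kafka workloads | Cost, simplicity | Enterprise |
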
Cost Optimization by Provider
AWS Cost Optimization
- Use S3 Intelligent Tiering
- Lifecycle policies to Glacier
- S3 One Zone-IA (about 20% cheaper than Standard-IA)
- Spot instances (70% discount)
- Reserved instances (1-3 year terms)
- Auto-scaling for variable workloads
- Redshift RA3 with managed storage
- Concurrency scaling (each cluster earns up to one hour of free credits per day; usage beyond that is billed)
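
The lifecycle-policy item above can be sketched with boto3. This is a minimal example, not a recommended production policy: the helper names, the `raw/events/` prefix, the bucket name, and the 30/90-day thresholds are all hypothetical. Only `put_bucket_lifecycle_configuration` and the configuration keys come from the S3 API.

```python
def build_lifecycle_config(prefix, ia_after_days=30, glacier_after_days=90):
    """Build an S3 lifecycle configuration that tiers objects under `prefix`."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("glacier_after_days must be greater than ia_after_days")
    rule_name = prefix.strip("/").replace("/", "-") or "all"
    return {
        "Rules": [
            {
                "ID": f"tier-{rule_name}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(bucket, config):
    """Send the configuration to S3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


config = build_lifecycle_config("raw/events/")
# apply_lifecycle("example-data-lake-bucket", config)  # hypothetical bucket name
```

Note that `put_bucket_lifecycle_configuration` replaces the bucket's entire lifecycle configuration, so existing rules need to be included in the new document if they should be kept.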
GCP Cost Optimization
- GCS Autoclass (automatic tiering)
- Nearline/Coldline for older data
- Regional storage (cheaper)
- Spot VMs, the successor to Preemptible VMs (roughly 80% discount)
- Committed use discounts
- Autoscaler for Dataproc
- BigQuery on-demand pricing
- Capacity commitments for predictability
- BigQuery slots for isolation
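
Choosing between on-demand pricing and a capacity commitment comes down to a break-even volume. The sketch below uses the $5 per TB queried figure from this document. The $2,000 monthly commitment is a hypothetical number picked for illustration, not a quoted BigQuery price, and the function names are illustrative.

```python
ON_DEMAND_USD_PER_TB = 5.0  # per-TB-queried figure used in this document


def break_even_tb_scanned(monthly_commitment_usd,
                          on_demand_usd_per_tb=ON_DEMAND_USD_PER_TB):
    """TB scanned per month at which on-demand cost equals the commitment."""
    if on_demand_usd_per_tb <= 0:
        raise ValueError("on_demand_usd_per_tb must be positive")
    return monthly_commitment_usd / on_demand_usd_per_tb


def cheaper_option(tb_scanned_per_month, monthly_commitment_usd):
    """Return 'on-demand' or 'capacity', whichever costs less for this volume."""
    on_demand_cost = tb_scanned_per_month * ON_DEMAND_USD_PER_TB
    return "on-demand" if on_demand_cost <= monthly_commitment_usd else "capacity"


# HYPOTHETICAL commitment of $2,000/month.
print(break_even_tb_scanned(2000))   # TB/month where both options cost the same
print(cheaper_option(100, 2000))
print(cheaper_option(500, 2000))
```

The comparison assumes the committed capacity can actually serve the workload; if queries queue for slots, the cheaper option on paper may not meet latency requirements.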
Azure Cost Optimization
- Lifecycle management policies
- Azure Blob storage LRS (cheapest)
- Azure Spot VMs, the successor to low-priority VMs (roughly 60% discount)
- Synapse serverless SQL pools for unpredictable workloads
- Dedicated SQL pools for predictable workloads
- Power BI Premium integration
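
A lifecycle management policy for Blob Storage is a JSON document attached to the storage account. The sketch below builds one that moves block blobs to Cool and then Archive by age. The rule name, the `datalake/raw` prefix, and the 30/90-day thresholds are hypothetical; the JSON keys follow the lifecycle policy schema as I understand it, so validate the output against the current Azure documentation before applying it.

```python
import json


def build_blob_lifecycle_policy(prefix, cool_after_days=30, archive_after_days=90):
    """Build a lifecycle policy that tiers block blobs under `prefix` by age."""
    if archive_after_days <= cool_after_days:
        raise ValueError("archive_after_days must be greater than cool_after_days")
    return {
        "rules": [
            {
                "enabled": True,
                "name": "tier-by-age",  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {
                                "daysAfterModificationGreaterThan": cool_after_days
                            },
                            "tierToArchive": {
                                "daysAfterModificationGreaterThan": archive_after_days
                            },
                        }
                    },
                    # prefixMatch entries begin with the container name.
                    "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
                },
            }
        ]
    }


print(json.dumps(build_blob_lifecycle_policy("datalake/raw"), indent=2))
```

The resulting JSON can be attached to a storage account through the portal, the CLI, or an SDK.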
Multi-Cloud Strategies
Why Multi-Cloud?
Multi-Cloud Patterns

| Pattern | Description | Complexity |
|---|---|---|
| Replication | Copy data across clouds | High |
| Federation | Query across clouds | Medium |
| Hybrid | On-prem + cloud | Medium |
| Best-of-breed | Different services per cloud | High |

Migration Considerations
AWS → GCP Migration

| Service | AWS | GCP | Complexity |
|---|---|---|---|
| Storage | S3 | GCS | Low (gsutil) |
| Warehouse | Redshift | BigQuery | Medium |
| Streaming | MSK | Pub/Sub | High (protocol change) |
| ETL | Glue | Dataflow | High (code rewrite) |

AWS → Azure Migration

| Service | AWS | Azure | Complexity |
|---|---|---|---|
| Storage | S3 | Blob Storage | Low (AzCopy) |
| Warehouse | Redshift | Synapse | Medium |
| Streaming | MSK | Event Hubs | High (protocol change) |
| ETL | Glue | Data Factory | Medium |

Senior Level Considerations
Vendor Lock-In

| Provider | Lock-in Risk | Mitigation |
|---|---|---|
| AWS | High (proprietary services) | Use open standards |
| GCP | Medium (more open) | Use open standards |
| Azure | High (Microsoft ecosystem) | Use open standards |

Hidden Costs

| Provider | Hidden Costs to Watch |
|---|---|
| AWS | Data transfer out, NAT Gateway |
| GCP | Egress, BigQuery slots |
| Azure | Bandwidth, support plans |

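
Egress is the hidden cost that most often surprises teams during a migration or a multi-cloud replication setup. The sketch below estimates a one-time transfer bill and how long storage or compute savings would take to pay it back. The $0.09/GB rate and the $900 monthly savings are assumptions for illustration only, since this document quotes no egress prices, and 1 TB is treated as 1000 GB for simplicity.

```python
GB_PER_TB = 1000  # simplifying assumption: decimal terabytes


def egress_cost_usd(tb_transferred, usd_per_gb):
    """One-time cost of moving tb_transferred terabytes out of a provider."""
    if tb_transferred < 0 or usd_per_gb < 0:
        raise ValueError("inputs must be non-negative")
    return tb_transferred * GB_PER_TB * usd_per_gb


def months_to_recoup(egress_usd, monthly_savings_usd):
    """Months of savings needed to pay back the egress bill, or None if never."""
    if monthly_savings_usd <= 0:
        return None
    return egress_usd / monthly_savings_usd


ASSUMED_USD_PER_GB = 0.09  # HYPOTHETICAL rate; check the provider's price list
bill = egress_cost_usd(200, ASSUMED_USD_PER_GB)
print(round(bill, 2), round(months_to_recoup(bill, 900), 1))
```

Real egress pricing is tiered by volume and destination, so a flat per-GB rate is only a first-order estimate.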
Key Takeaways
- AWS: Largest ecosystem, mature services, moderate pricing
- GCP: Lowest compute costs, best warehouse (BigQuery), strong analytics
- Azure: Enterprise features, Microsoft integration
- Cost: GCP compute is roughly 10-20% cheaper than AWS/Azure at the prices quoted here; Azure has the lowest storage list prices
- Selection: Existing investment > features > cost
- Multi-cloud: Possible but adds complexity
- Open standards: Mitigate lock-in with Parquet and open table formats (OTF) such as Apache Iceberg, Delta Lake, or Apache Hudi