
Cloud Provider Comparison

AWS vs. GCP vs. Azure for Data Platforms


Overview

This document compares the three major cloud providers (AWS, GCP, Azure) for data platform workloads. While all three provide similar services, differences in maturity, pricing, and ecosystem can significantly impact architecture decisions.


Quick Comparison Matrix

| Service Category | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Object Storage | S3 | GCS | Blob Storage |
| Data Warehouse | Redshift | BigQuery | Synapse |
| Streaming | MSK (Kafka) | Pub/Sub | Event Hubs |
| Orchestration | MWAA (Airflow) | Composer (Airflow) | Data Factory |
| Notebooks | SageMaker | Vertex AI | Synapse |
| ETL Service | Glue | Dataflow | Data Factory |
| Managed Spark | EMR | Dataproc | Synapse |
| Metadata | Glue Catalog | Data Catalog | Purview |

Service-by-Service Comparison

Object Storage

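| Feature | S3 | GCS | Blob Storage |
| --- | --- | --- | --- |
| Latency | Low | Low | Low |
| API | REST (SOAP deprecated) | REST (JSON and XML APIs) | REST |
| Tiering | Standard → IA → Glacier | Standard → Nearline → Coldline → Archive | Hot → Cool → Archive |
| Cost (per TB/month) | $23 | $20 | $18 |
| Min charge | None on the standard tier | None on the standard tier | None on the hot tier |
| Best for | Ecosystem | Price | Enterprise |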

Winner: Blob Storage (lowest list price in the table above), S3 (most mature)

Data Warehouses

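| Feature | Redshift | BigQuery | Synapse |
| --- | --- | --- | --- |
| Model | Provisioned clusters (serverless option available) | Serverless | Provisioned + Serverless |
| Cost (per TB queried) | $5 (Spectrum scans; RA3 clusters bill per node-hour) | $5 | $5-6 |
| Concurrency | Limited by WLM | High | Limited |
| Ecosystem | AWS-native | Google-native | Azure-native |
| Best for | Predictable workloads | Variable workloads | Enterprise |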

Winner: BigQuery (serverless, excellent performance)

Streaming Platforms

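| Feature | MSK (Kafka) | Pub/Sub | Event Hubs |
| --- | --- | --- | --- |
| Protocol | Kafka | Proprietary | AMQP 1.0, Kafka-compatible endpoint |
| Scaling | Manual rebalance | Auto | Auto |
| Retention | Configurable | 7 days (default) | 1-90 days |
| Cost (per TB) | ~$25-50 | ~$10-20 | ~$25-50 |
| Best for | Kafka workloads | Google ecosystem | Enterprise |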

Winner: Pub/Sub (lowest cost, auto-scaling)

Managed Spark

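| Feature | EMR | Dataproc | Synapse |
| --- | --- | --- | --- |
| Spark Version | Latest | Latest | Latest |
| Pricing | Instance-based | Instance-based | vCore-hour based |
| Spot Instances | Yes | Spot (formerly preemptible) VMs | Low priority |
| Management | Customer managed | Customer managed | Fully managed |
| Best for | Ecosystem | Price | Ease of use |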

Winner: Dataproc (lowest cost), Synapse (easiest)


Pricing Comparison

Storage Costs (per TB/month)

| Tier | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Hot | $23 | $20 | $18 |
| Cool | $12 (IA) | $10 (Nearline) | $10 |
| Cold | $4 (Glacier) | $4 (Coldline) | $2 (Archive) |
| Archive | $1 (Deep Archive) | $1 (Archive) | N/A |

Winner: Azure (lowest storage costs)
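
To make the tier table concrete, the following is a minimal Python sketch that estimates a monthly storage bill for a dataset split across tiers. The prices are copied from the table above. The `PRICES` dictionary, the `monthly_storage_cost` helper, and the 100 TB example split are illustrative assumptions, and a real bill would also include request, retrieval, and egress charges.

```python
# Approximate list prices in USD per TB-month, copied from the table above.
# None marks a tier that the table lists as N/A.
PRICES = {
    "aws":   {"hot": 23, "cool": 12, "cold": 4, "archive": 1},
    "gcp":   {"hot": 20, "cool": 10, "cold": 4, "archive": 1},
    "azure": {"hot": 18, "cool": 10, "cold": 2, "archive": None},
}


def monthly_storage_cost(provider, tb_by_tier):
    """Estimate the monthly storage cost in USD (hypothetical helper)."""
    total = 0.0
    for tier, terabytes in tb_by_tier.items():
        price = PRICES[provider][tier]
        if price is None:
            raise ValueError(f"{provider} has no '{tier}' price in the table")
        total += price * terabytes
    return total


# Example: 100 TB split as 20 TB hot, 30 TB cool, 50 TB cold.
layout = {"hot": 20, "cool": 30, "cold": 50}
for provider in PRICES:
    print(provider, monthly_storage_cost(provider, layout))
```

Because most of the example data sits in the cooler tiers, the gap between providers is driven more by cool and cold prices than by the headline hot-tier price.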

Compute Costs (similar instance)

| Instance Type | AWS | GCP | Azure |
| --- | --- | --- | --- |
| General (8 vCPU, 32GB) | $0.30/hour | $0.26/hour | $0.29/hour |
| Memory (8 vCPU, 64GB) | $0.60/hour | $0.50/hour | $0.55/hour |
| Spot Discount | 70% | 80% | 60% |

Winner: GCP (lowest compute, highest spot discount)
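
The on-demand price and the spot discount interact, so a blended estimate is often more useful than either number alone. The sketch below uses the general-purpose prices and spot discounts from the table above. The `monthly_compute_cost` helper, its `spot_fraction` parameter, and the 10-node example are assumptions made for illustration, and actual spot discounts fluctuate with capacity.

```python
# On-demand hourly prices (USD, 8 vCPU / 32GB) and spot discounts from the table above.
ON_DEMAND = {"aws": 0.30, "gcp": 0.26, "azure": 0.29}
SPOT_DISCOUNT = {"aws": 0.70, "gcp": 0.80, "azure": 0.60}


def monthly_compute_cost(provider, nodes, hours, spot_fraction=0.0):
    """Estimate cluster cost when `spot_fraction` of node-hours run on spot capacity."""
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    on_demand = ON_DEMAND[provider]
    spot = on_demand * (1 - SPOT_DISCOUNT[provider])
    # Weighted average of the on-demand and spot hourly rates.
    blended = on_demand * (1 - spot_fraction) + spot * spot_fraction
    return round(nodes * hours * blended, 2)


# Example: 10 nodes running 730 hours, half of the node-hours on spot capacity.
for provider in ON_DEMAND:
    print(provider, monthly_compute_cost(provider, nodes=10, hours=730, spot_fraction=0.5))
```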


Ecosystem and Maturity

Data Engineering Ecosystem

Third-Party Integration

| Platform | Partner Ecosystem | Data Tools Available |
| --- | --- | --- |
| AWS | Largest (Snowflake, Databricks, etc.) | All tools support AWS |
| GCP | Growing | Most tools support GCP |
| Azure | Strong enterprise | Most tools support Azure |

Selection Framework

Decision Guide

| Scenario | Recommended | Rationale |
| --- | --- | --- |
| Cost-sensitive | GCP | Lowest compute costs, competitive storage |
| Ecosystem | AWS | Broadest tooling |
| BigQuery-centric | GCP | Best warehouse |
| Enterprise | Azure | Enterprise features, hybrid |
| Existing Microsoft | Azure | Integration with O365, Teams |
| Open-source focus | GCP | Open-source contributions |
| Multi-cloud | AWS or GCP | Best multi-cloud support |
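
When several scenarios apply at once, the guide above can be treated as a simple vote. The following is a rough sketch of that idea, not a scoring method from this document: the scenario keys, the `GUIDE` mapping, and the `tally` helper are hypothetical names, and every scenario is weighted equally, which a real decision would not do.

```python
from collections import Counter

# Scenario -> recommended provider(s), transcribed from the decision guide above.
GUIDE = {
    "cost_sensitive": ["GCP"],
    "ecosystem": ["AWS"],
    "bigquery_centric": ["GCP"],
    "enterprise": ["Azure"],
    "existing_microsoft": ["Azure"],
    "open_source_focus": ["GCP"],
    "multi_cloud": ["AWS", "GCP"],
}


def tally(scenarios):
    """Count how often each provider is recommended for the given scenarios."""
    votes = Counter()
    for scenario in scenarios:
        votes.update(GUIDE[scenario])
    return votes.most_common()


# Example: a cost-sensitive, open-source-leaning team that already uses Microsoft tools.
print(tally(["cost_sensitive", "open_source_focus", "existing_microsoft"]))
```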

Data Warehouse Comparison

Deep Dive: BigQuery vs. Redshift vs. Snowflake

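| Feature | BigQuery | Redshift | Snowflake |
| --- | --- | --- | --- |
| Architecture | Serverless | Provisioned | Virtual warehouses (storage and compute separated) |
| Pricing | $5/TB queried | ~$5/TB (amortized cluster cost) | ~$2-6/TB (billed in credits) |
| Concurrency | High (slot-based) | Limited by cluster | High (multi-cluster warehouses) |
| ML Integration | Vertex AI | SageMaker | Snowpark |
| Best For | Variable workloads | Predictable workloads | Multi-cloud |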

Streaming Platform Comparison

Deep Dive: MSK vs. Pub/Sub vs. Event Hubs

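| Feature | MSK | Pub/Sub | Event Hubs |
| --- | --- | --- | --- |
| Protocol | Kafka (open source) | Proprietary | AMQP 1.0, Kafka-compatible endpoint |
| Latency | 10-50ms | 50-200ms | 50-200ms |
| Scaling | Manual rebalance | Auto | Auto |
| Retention | Configurable (can be unlimited) | 7 days (default) | 1-90 days |
| Cost | $0.12/GB/month | $0.08/GB/month | $0.15/GB/month |
| Best For | Kafka workloads | Cost, simplicity | Enterprise |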

Cost Optimization by Provider

AWS Cost Optimization

strategies:
  storage:
    - Use S3 Intelligent-Tiering
    - Lifecycle policies to Glacier
    - S3 One Zone-IA (20% cheaper than Standard-IA)
  compute:
    - Spot instances (70% discount)
    - Reserved instances (1-3 year terms)
    - Auto-scaling for variable workloads
  data_warehouse:
    - Redshift RA3 with managed storage
    - AQUA for acceleration
    - Concurrency scaling (one free hour of credits per day, billed beyond that)
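
As an illustration of the "lifecycle policies to Glacier" item, the sketch below builds an S3 lifecycle configuration that moves objects to Standard-IA and then to Glacier. The dictionary layout follows the S3 lifecycle configuration format. The `build_lifecycle_config` helper, the rule ID, the `raw/` prefix, and the 30/90-day thresholds are assumptions chosen for the example.

```python
import json


def build_lifecycle_config(prefix="", ia_after_days=30, glacier_after_days=90):
    """Build an S3 lifecycle configuration dict (hypothetical helper)."""
    # S3 requires objects to be at least 30 days old before moving to Standard-IA.
    if ia_after_days < 30:
        raise ValueError("Standard-IA transitions need at least 30 days")
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": "tier-down-aging-data",  # illustrative rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = build_lifecycle_config(prefix="raw/")
print(json.dumps(config, indent=2))
```

With boto3, a dictionary of this shape can be passed as the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call. The sketch stops short of that so it runs without AWS credentials.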

GCP Cost Optimization

strategies:
  storage:
    - GCS Autoclass (automatic tiering)
    - Nearline/Coldline for older data
    - Regional storage (cheaper than multi-region)
  compute:
    - Spot (formerly preemptible) VMs (80% discount)
    - Committed use discounts
    - Autoscaler for Dataproc
  data_warehouse:
    - BigQuery on-demand pricing for sporadic queries
    - Capacity commitments for predictability
    - BigQuery slot reservations for workload isolation
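
The choice between on-demand pricing and a capacity commitment comes down to a break-even scan volume. A minimal sketch, assuming the $5 per TB on-demand figure used in this document's warehouse tables: the `monthly_commitment_usd` value is a placeholder that must come from a current price list, and the helper names are hypothetical.

```python
# On-demand price per TB scanned, as used in this document's warehouse tables.
ON_DEMAND_USD_PER_TB = 5.0


def on_demand_cost(tb_scanned):
    """Monthly on-demand cost for the given scan volume."""
    return tb_scanned * ON_DEMAND_USD_PER_TB


def break_even_tb(monthly_commitment_usd):
    """Scan volume above which a fixed commitment becomes cheaper."""
    return monthly_commitment_usd / ON_DEMAND_USD_PER_TB


def cheaper_option(tb_scanned, monthly_commitment_usd):
    """Return which pricing model costs less for this month's scan volume."""
    if on_demand_cost(tb_scanned) <= monthly_commitment_usd:
        return "on-demand"
    return "capacity"


# Example with a placeholder commitment of 2,000 USD per month.
commitment = 2000.0
print(break_even_tb(commitment))        # TB scanned per month at break-even
print(cheaper_option(150, commitment))  # light month
print(cheaper_option(600, commitment))  # heavy month
```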

Azure Cost Optimization

strategies:
  storage:
    - Lifecycle management policies
    - Cool/Archive tier
    - Azure Blob storage LRS (cheapest redundancy option)
  compute:
    - Spot VMs, formerly low-priority (60% discount)
    - Reserved instances
    - Azure Hybrid Benefit
  data_warehouse:
    - Synapse serverless for unpredictable workloads
    - Dedicated pools for predictable workloads
    - Power BI Premium integration

Multi-Cloud Strategies

Why Multi-Cloud?

Multi-Cloud Patterns

| Pattern | Description | Complexity |
| --- | --- | --- |
| Replication | Copy data across clouds | High |
| Federation | Query across clouds | Medium |
| Hybrid | On-prem + cloud | Medium |
| Best-of-breed | Different services per cloud | High |

Migration Considerations

AWS → GCP Migration

| Service | AWS | GCP | Complexity |
| --- | --- | --- | --- |
| Storage | S3 | GCS | Low (gsutil) |
| Warehouse | Redshift | BigQuery | Medium |
| Streaming | MSK | Pub/Sub | High (protocol change) |
| ETL | Glue | Dataflow | High (code rewrite) |

AWS → Azure Migration

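| Service | AWS | Azure | Complexity |
| --- | --- | --- | --- |
| Storage | S3 | Blob Storage | Low (AzCopy) |
| Warehouse | Redshift | Synapse | Medium |
| Streaming | MSK | Event Hubs | High (protocol change unless the Kafka-compatible endpoint is used) |
| ETL | Glue | Data Factory | Medium |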

Senior Level Considerations

Vendor Lock-In

| Provider | Lock-in Risk | Mitigation |
| --- | --- | --- |
| AWS | High (proprietary services) | Use open standards |
| GCP | Medium (more open) | Use open standards |
| Azure | High (Microsoft ecosystem) | Use open standards |

Hidden Costs

| Provider | Hidden Costs to Watch |
| --- | --- |
| AWS | Data transfer out, NAT Gateway |
| GCP | Egress, BigQuery slots |
| Azure | Bandwidth, support plans |
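
Data transfer charges are easy to underestimate because they are billed per GB rather than per TB. The sketch below shows the arithmetic only. The `egress_cost` helper is hypothetical, and the 0.09 USD per GB rate is an assumed placeholder that is not taken from this document; actual rates vary by provider, region, destination, and volume tier.

```python
def egress_cost(tb_transferred, usd_per_gb):
    """Estimate data-transfer-out cost in USD (hypothetical helper, 1 TB = 1000 GB)."""
    if tb_transferred < 0 or usd_per_gb < 0:
        raise ValueError("inputs must be non-negative")
    return tb_transferred * 1000 * usd_per_gb


# Example: moving 10 TB out of a cloud at an assumed 0.09 USD per GB.
ASSUMED_RATE_USD_PER_GB = 0.09  # placeholder, check the current price list
transfer = egress_cost(10, ASSUMED_RATE_USD_PER_GB)

# For comparison: storing the same 10 TB for a month at $23 per TB (hot tier, AWS column).
storage = 10 * 23

print(round(transfer, 2), storage)
```

Under these assumed numbers, moving a dataset out once costs several times what it costs to store it for a month, which is why replication-heavy multi-cloud designs need an explicit transfer budget.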

Key Takeaways

  1. AWS: Largest ecosystem, mature services, moderate pricing
  2. GCP: Lowest compute costs, BigQuery as the standout warehouse, strong analytics
  3. Azure: Enterprise features, Microsoft integration
  4. Cost: GCP compute is roughly 10-20% cheaper than AWS/Azure in the tables above; Azure has the lowest storage list prices
  5. Selection: Existing investment > features > cost
  6. Multi-cloud: Possible but adds complexity
  7. Open standards: Mitigate lock-in with Parquet and open table formats (Iceberg, Delta Lake, Hudi)
