Data Warehouse Services Comparison
Redshift vs. BigQuery vs. Snowflake vs. Databricks
Overview
This document compares four major cloud analytics platforms: Amazon Redshift, Google BigQuery, Snowflake, and Databricks (strictly speaking a lakehouse platform rather than a pure warehouse). Understanding the trade-offs between these platforms is critical for senior-level architecture decisions.
Quick Comparison Matrix
| Feature | Redshift | BigQuery | Snowflake | Databricks |
|---|---|---|---|---|
| Cloud | AWS only | GCP only | AWS/GCP/Azure | AWS/GCP/Azure |
| Architecture | Shared-nothing | Serverless | Multi-cluster | Lakehouse |
| Storage | Managed | Managed | Managed | S3/ADLS/GCS |
| Format | Columnar | Capacitor | Columnar | Delta (Parquet) |
| Pricing | $5/TB + cluster | $5/TB | $2-6/TB | Serverless/Cluster |
| Concurrency | Limited by cluster | Unlimited | Multi-cluster | Auto-scaling |
| Setup Time | Hours | Minutes | Minutes | Minutes |
| Time Travel | None | 7 days | 90 days | Unlimited (Delta) |
Architecture Comparison
Redshift Architecture
Characteristics:
- Provisioned clusters (manual scaling)
- Leader + compute nodes
- Shared-nothing architecture
- Limited data lake support (external tables via Redshift Spectrum)
BigQuery Architecture
Characteristics:
- Fully serverless (no clusters)
- Auto-scaling query execution
- 24-hour query cache
- No infrastructure management
Snowflake Architecture
Characteristics:
- Compute-storage separation
- Multi-cluster warehouses
- Zero-copy cloning
- 90-day time travel
Databricks Architecture
Characteristics:
- Lakehouse (data lake + warehouse)
- Delta Lake (ACID on data lake)
- Open format (Parquet)
- ML and AI integration
Pricing Comparison
Compute Pricing
The figures below are approximate list prices used for illustration; vendors change pricing regularly, so verify against the current pricing pages before making decisions.
| Service | Pricing Model | Cost per TB | Notes |
|---|---|---|---|
| Redshift | Cluster + per TB | $5 + cluster cost | RA3 nodes: $1.97-$19.34/hour |
| BigQuery | Per TB scanned | $5.00 | On-demand |
| Snowflake | Per TB (warehouse) | $2.00-6.00 | Based on warehouse size |
| Databricks | Cluster/Serverless | $0.50-2.00 | Serverless: $0.50/TB |
Storage Pricing
| Service | Hot Storage | Cold Storage | Archive |
|---|---|---|---|
| Redshift | Included | N/A | N/A |
| BigQuery | $0.02/GB/month | $0.01/GB/month | N/A |
| Snowflake | $40/TB/month | Included | Included |
| Databricks | S3/ADLS/GCS pricing | Same | Same |
Total Cost of Ownership (TCO)
Scenario: 100 TB scanned per month, 500 TB stored
| Service | Compute Cost | Storage Cost | Total Monthly |
|---|---|---|---|
| Redshift | $500 + $3,600 (cluster) | Included | $4,100 |
| BigQuery | $500 | $10,000 | $10,500 |
| Snowflake | $400 | $20,000 | $20,400 |
| Databricks | $100 | $10,000 | $10,100 |
Winner: Redshift is cheapest in this steady-usage scenario because storage is bundled with the cluster; Databricks wins when storage dominates, since it pays raw object-storage rates.
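The TCO table above can be reproduced with a small cost model. The per-unit rates below are the approximate list prices this document assumes (e.g. $0.02/GB ≈ $20/TB for BigQuery storage, the midpoint of Snowflake's $2-6/TB compute range), not vendor quotes:

```python
# Illustrative TCO model for the scenario above (100 TB scanned/month,
# 500 TB stored). Rates are this document's assumed list prices.

RATES = {
    #             ($/TB scanned, $/TB stored/month, fixed $/month)
    "Redshift":   (5.00,  0.00, 3600.00),  # storage bundled into cluster cost
    "BigQuery":   (5.00, 20.00,    0.00),  # $0.02/GB/month ~= $20/TB/month
    "Snowflake":  (4.00, 40.00,    0.00),  # midpoint of $2-6/TB compute
    "Databricks": (1.00, 20.00,    0.00),  # serverless midpoint; S3-class storage
}

def monthly_tco(service, tb_scanned, tb_stored):
    """Monthly cost = scan charges + storage charges + fixed cluster cost."""
    scan_rate, store_rate, fixed = RATES[service]
    return tb_scanned * scan_rate + tb_stored * store_rate + fixed

for svc in RATES:
    print(f"{svc:11s} ${monthly_tco(svc, 100, 500):>9,.2f}")
```

Swapping in your own scan volume and storage footprint quickly shows where the crossover points sit: the Redshift fixed cluster cost dominates at low volumes, while per-TB storage dominates for the others.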
Performance Comparison
Query Performance
The figures below are illustrative, order-of-magnitude numbers for a representative workload; actual performance depends heavily on data layout, tuning, and warehouse sizing.
| Benchmark | Redshift | BigQuery | Snowflake | Databricks |
|---|---|---|---|---|
| Simple SELECT | 10s | 5s | 8s | 12s |
| Aggregation | 30s | 15s | 25s | 35s |
| JOIN | 45s | 20s | 35s | 40s |
| Complex Query | 60s | 30s | 50s | 55s |
Winner: BigQuery (fastest), followed by Snowflake
Concurrency
Winner: BigQuery (unlimited), followed by Snowflake
Feature Comparison
Core Features
| Feature | Redshift | BigQuery | Snowflake | Databricks |
|---|---|---|---|---|
| Time Travel | ❌ | ✅ 7 days | ✅ 90 days | ✅ Unlimited (Delta) |
| Cloning | ❌ | ❌ | ✅ Zero-copy | ✅ Zero-copy |
| Data Sharing | ❌ | ❌ | ✅ Native | ✅ Delta Sharing |
| ML Support | ✅ Redshift ML (via SageMaker) | ✅ BigQuery ML | ✅ Snowpark ML | ✅ MLflow (native) |
| Streaming | ❌ | ✅ BigQuery Streaming | ✅ Snowpipe | ✅ Structured Streaming |
| ACID Transactions | ✅ | ❌ | ✅ | ✅ (Delta) |
| Schema Evolution | ❌ | ❌ | ❌ | ✅ |
Winner: Snowflake and Databricks for features
Use Case Selection
When to Use Redshift
Best For:
- AWS-centric workloads
- Predictable query patterns
- Cost-sensitive (with reserved instances)
- Integration with AWS ecosystem (Glue, EMR)
Avoid When:
- Need serverless (use BigQuery)
- Need multi-cloud (use Snowflake)
- Need ML (use Databricks)
When to Use BigQuery
Best For:
- GCP-centric workloads
- Variable/unpredictable workloads
- Serverless preference (no clusters)
- Ad-hoc analytics
Avoid When:
- Need multi-cloud (use Snowflake)
- Need advanced features (use Snowflake)
- Need native data lake (use Databricks)
When to Use Snowflake
Best For:
- Multi-cloud strategy
- Data sharing requirements
- Advanced features (time travel, cloning)
- Enterprise workloads
Avoid When:
- Cost-sensitive (use BigQuery or Redshift)
- Need open format (use Databricks)
- Need native ML (use Databricks)
When to Use Databricks
Best For:
- Lakehouse architecture (data lake + warehouse)
- ML and AI workloads
- Open format preference (Parquet)
- ELT (extract, load, transform) patterns
Avoid When:
- Need pure warehouse (use Snowflake)
- Need serverless (use BigQuery)
- Cost-sensitive (use Redshift with reserved)
Migration Considerations
Migrating to Redshift
From: Oracle, SQL Server, PostgreSQL
Migration Path:
- Use AWS Schema Conversion Tool (SCT)
- Migrate schema with DMS
- Migrate data with DMS (full load + CDC)
- Optimize: DISTKEY, SORTKEY, WLM
- Test and validate
Complexity: Medium (schema conversion required)
Migrating to BigQuery
From: Redshift, Snowflake, Oracle
Migration Path:
- Assess with BigQuery Migration Assistant
- Convert schema (SQL dialect differences)
- Migrate data (GCS → BigQuery)
- Optimize: Partitioning, clustering
- Update BI tools
Complexity: Medium (SQL dialect conversion)
Migrating to Snowflake
From: Redshift, BigQuery, Oracle
Migration Path:
- Assess with Snowflake Migration Assessment
- Convert schema (minimal changes)
- Migrate data (Snowpipe or bulk)
- Optimize: Clustering keys, micro-partitions
- Update integrations
Complexity: Low (SQL compatibility)
Migrating to Databricks
From: Spark, Redshift, Snowflake
Migration Path:
- Convert to Delta Lake format
- Rewrite ETL as Spark jobs
- Migrate data (S3/ADLS/GCS → Delta)
- Optimize: Partitioning, Z-ORDER
- Set up Unity Catalog
Complexity: High (requires Spark expertise)
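Every migration path above ends with a test-and-validate step. A minimal sketch of that step, assuming you can export matching tables from the source and target warehouses to CSV (the file names in the usage comment are hypothetical), is to compare row counts and an order-independent content digest:

```python
import csv
import hashlib

def table_fingerprint(csv_path):
    """Return (row_count, digest) for a CSV extract. The digest XOR-folds
    a hash of each canonicalized row, so row order does not matter."""
    count, digest = 0, 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            count += 1
            canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
            digest ^= int(hashlib.sha256(canonical.encode()).hexdigest(), 16)
    return count, digest

# Hypothetical extracts from the source warehouse and the new target:
# src = table_fingerprint("orders_source.csv")
# dst = table_fingerprint("orders_target.csv")
# assert src == dst, "row count or content mismatch after migration"
```

This catches dropped or mutated rows cheaply; for large tables you would normally sample partitions or compare per-partition aggregates instead of hashing every row.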
Senior Level Considerations
Vendor Lock-In
Lock-in risk roughly tracks how proprietary the storage is: Databricks (open Parquet/Delta files in your own object store) carries the least, followed by Snowflake and BigQuery with their managed proprietary formats, with Redshift (AWS-only) the most.
Hidden Costs
| Service | Hidden Costs to Watch |
|---|---|
| Redshift | Cluster idle time, concurrency scaling |
| BigQuery | On-demand pricing (use capacity commitments) |
| Snowflake | Warehouse idle time, storage cost |
| Databricks | Cluster idle time, DBU pricing |
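The most common hidden cost in the table above is paying for provisioned compute that sits idle. A rough sketch of that cost (the hourly rate and utilization figures in the example are hypothetical):

```python
def idle_cost(hourly_rate, hours_on_per_day, hours_busy_per_day, days=30):
    """Monthly spend on compute that is running but doing no work.
    Serverless options (e.g. BigQuery on-demand) make this zero by design."""
    idle_hours = max(hours_on_per_day - hours_busy_per_day, 0) * days
    return idle_hours * hourly_rate

# e.g. a cluster at $5/hour, on 24h/day but busy only 8h:
# idle_cost(5.0, 24, 8)  -> 2400.0 ($2,400/month of pure idle time)
```

Auto-suspend (Snowflake warehouses, Databricks clusters) and pausing Redshift clusters off-hours attack exactly this term.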
Decision Framework
Primary Considerations
- Cloud strategy: single-cloud (Redshift, BigQuery) vs. multi-cloud (Snowflake, Databricks)
- Workload shape: predictable (Redshift with reserved capacity) vs. variable (BigQuery serverless)
- ML/AI requirements: native (Databricks) vs. bolt-on (Redshift via SageMaker)
- Format openness: open Parquet/Delta (Databricks) vs. proprietary managed storage
- Cost sensitivity: idle compute, per-TB scan charges, and storage rates (see Hidden Costs above)
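As a first cut, the "Avoid When" rules from the use-case sections above can be condensed into a shortlist filter. The requirement tags and function name below are hypothetical labels for this sketch, not part of any vendor SDK:

```python
def shortlist(needs):
    """Return platforms not ruled out by any required capability.
    `needs` is a set of requirement tags (hypothetical names) drawn from
    the 'Avoid When' rules: each platform is excluded if it is weak on
    something the team needs."""
    ruled_out = {
        "Redshift":   {"serverless", "multi-cloud", "native-ml"},
        "BigQuery":   {"multi-cloud", "native-data-lake"},
        "Snowflake":  {"open-format", "native-ml"},
        "Databricks": {"serverless", "pure-warehouse"},
    }
    return sorted(p for p, blockers in ruled_out.items()
                  if not (needs & blockers))

# e.g. a multi-cloud requirement leaves two candidates:
# shortlist({"multi-cloud"})  -> ['Databricks', 'Snowflake']
```

This only narrows the field; the remaining candidates should then be ranked on the cost, performance, and lock-in takeaways below.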
Key Takeaways
- Redshift: Best for AWS, predictable workloads, cost optimization
- BigQuery: Best for serverless, variable workloads, simplicity
- Snowflake: Best for multi-cloud, advanced features, enterprise
- Databricks: Best for lakehouse, ML/AI, open formats
- Cost (per the TCO scenario above): Redshift (cheapest with reserved capacity), then Databricks, BigQuery, and Snowflake (most expensive)
- Performance: BigQuery (fastest), Snowflake, Redshift, Databricks
- Features: Snowflake and Databricks (most features)
- Lock-in: Databricks (least), Snowflake, BigQuery, Redshift (most)
Back to Module 3