Amazon Redshift Guide
AWS Data Warehouse Service
Overview
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It’s designed for high-performance analysis using SQL and business intelligence tools, with optimized storage and compute for structured and semi-structured data.
Redshift Architecture
Node Architecture
Key Components:
- Leader Node: Manages client connections, query planning, and result aggregation
- Compute Nodes: Execute query steps in parallel and store data
- Slices: Parallel processing units within compute nodes
- Node Types: DC2 (dense compute), RA3 (managed storage)
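To make slices concrete, the STV_SLICES system view shows how slices map to compute nodes on a running cluster (a read-only check; output is cluster-specific):

```sql
-- Each row is one slice; "node" identifies the compute node it belongs to
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;
```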
Redshift Table Types
Distribution Styles
Distribution Selection:
| Style | Use Case | Example |
|---|---|---|
| AUTO | Let Redshift decide | Default for new tables |
| EVEN | No obvious distribution key | Logs, events |
| KEY | Large table joins | fact_sales DISTKEY (customer_id) |
| ALL | Small dimension tables | dim_customer DISTSTYLE ALL |
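To check whether a chosen DISTKEY is skewing data across slices, SVV_TABLE_INFO reports each table's distribution style and row skew. A minimal sketch (the threshold of 4 is illustrative):

```sql
-- skew_rows compares the fullest slice to the emptiest slice;
-- values far above 1 suggest a poor DISTKEY choice
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE skew_rows > 4
ORDER BY skew_rows DESC;
```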
Sort Keys
```sql
-- Compound sort key (best when queries filter on the leading sort key column)
CREATE TABLE events (
    event_id   BIGINT,
    event_time TIMESTAMP,
    user_id    BIGINT,
    event_type VARCHAR(50)
)
DISTSTYLE AUTO
SORTKEY (event_time);
```
```sql
-- Interleaved sort key (best when different queries filter on different columns)
CREATE TABLE sales (
    sale_id     BIGINT,
    sale_date   DATE,
    customer_id BIGINT,
    product_id  BIGINT,
    amount      DECIMAL(18,2)
)
DISTSTYLE AUTO
INTERLEAVED SORTKEY (sale_date, customer_id, product_id);
```
Sort Key Selection:
- Compound: Use when WHERE clauses filter on the leading sort key column(s), such as date ranges
- Interleaved: Use when different queries filter on different columns with no dominant prefix; note that interleaved keys add VACUUM REINDEX overhead, so compound is preferred in most cases
- Best Practice: Choose frequently filtered columns (dates, IDs) and keep the sort key short
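Sort keys can also be changed after table creation. A sketch using ALTER TABLE against the events table defined above:

```sql
-- Switch to a compound sort key on event_time
ALTER TABLE events ALTER SORTKEY (event_time);

-- Or let Redshift choose and maintain the sort key automatically
ALTER TABLE events ALTER SORTKEY AUTO;
```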
Redshift Performance Tuning
Query Optimization
```sql
-- 1. Use appropriate distribution style
CREATE TABLE fact_sales (
    sale_id     BIGINT IDENTITY(1,1),
    customer_id BIGINT NOT NULL,
    product_id  BIGINT NOT NULL,
    sale_date   DATE NOT NULL,
    amount      DECIMAL(18,2) NOT NULL
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
```
```sql
-- 2. Create materialized views for complex queries
CREATE MATERIALIZED VIEW mv_customer_daily_sales AS
SELECT
    customer_id,
    sale_date,
    COUNT(*)    AS sales_count,
    SUM(amount) AS total_amount
FROM fact_sales
GROUP BY customer_id, sale_date;
```
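Unless auto refresh is enabled in the CREATE MATERIALIZED VIEW statement, the view is not kept current as fact_sales changes, so refresh it after loads. Using the view from the example above:

```sql
-- Recompute (incrementally where possible) the precomputed aggregates
REFRESH MATERIALIZED VIEW mv_customer_daily_sales;
```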
```sql
-- 3. Use approximate functions for large datasets
SELECT
    customer_id,
    COUNT(*) AS exact_count,
    APPROXIMATE COUNT(DISTINCT product_id) AS distinct_products,  -- faster than an exact distinct count
    AVG(amount) AS avg_amount
FROM fact_sales
WHERE sale_date >= '2025-01-01'
GROUP BY customer_id;
```
```sql
-- 4. Use WHERE before GROUP BY (filter early)
SELECT
    customer_id,
    SUM(amount) AS total_amount
FROM fact_sales
WHERE sale_date >= '2025-01-01'  -- filter first
  AND amount > 100               -- filter first
GROUP BY customer_id;
```
Workload Management (WLM)
Redshift does not create WLM queues with SQL. Manual WLM queues are defined as a JSON array in the cluster parameter group (the `wlm_json_configuration` parameter) via the console, CLI, or API. A configuration roughly equivalent to the queues described here (ETL, BI, ad-hoc analytics) looks like this; `max_execution_time` is in milliseconds, and the built-in superuser queue used for maintenance does not need to be defined:

```json
[
  {"query_group": ["etl"],       "query_concurrency": 10, "memory_percent_to_use": 30, "max_execution_time": 3600000},
  {"query_group": ["bi"],        "query_concurrency": 20, "memory_percent_to_use": 40, "max_execution_time": 60000},
  {"query_group": ["analytics"], "query_concurrency": 5,  "memory_percent_to_use": 20, "max_execution_time": 600000}
]
```

A session routes its queries to a queue by setting the matching query group, for example `SET query_group TO 'etl';` before running ETL statements (and `RESET query_group;` afterwards).

Concurrency Scaling
Enable Concurrency Scaling:
Concurrency scaling is enabled per WLM queue rather than with SQL: in the console, or by adding `"concurrency_scaling": "auto"` to that queue's entry in `wlm_json_configuration`:

```json
{"query_group": ["bi"], "query_concurrency": 20, "memory_percent_to_use": 40, "concurrency_scaling": "auto"}
```
Concurrency scaling clusters:
- Are added automatically when queries start queuing during load spikes
- Handle bursty workloads without resizing the main cluster
- Accrue free usage credits (one hour for every 24 hours the main cluster runs), so most eligible queries incur no extra cost

Redshift Spectrum
Querying S3 Data
```sql
-- Create external schema for S3 data
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```
```sql
-- Create external table for JSON data in S3
CREATE EXTERNAL TABLE spectrum_schema.web_logs (
    log_id      BIGINT,
    user_id     BIGINT,
    event_time  TIMESTAMP,
    url         VARCHAR(1024),
    status_code INT
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
STORED AS TEXTFILE
LOCATION 's3://bucket/web-logs/';
```
```sql
-- Query S3 data directly (no loading required)
SELECT
    user_id,
    COUNT(*) AS page_views,
    SUM(CASE WHEN status_code = 200 THEN 1 ELSE 0 END) AS successful_requests
FROM spectrum_schema.web_logs
WHERE event_time >= '2025-01-01'
GROUP BY user_id;
```
Use Cases:
- Lakehouse pattern: Query S3 data lake + Redshift local tables
- Cold data: Keep historical data in S3, hot data in Redshift
- Cost optimization: Reduce Redshift storage by offloading to S3
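A minimal sketch of the lakehouse pattern, joining the S3-backed web_logs external table to a hypothetical local dimension table dim_user (the table and its user_segment column are illustrative):

```sql
-- Hot data (dim_user, stored in Redshift) joined to cold data (web_logs, read from S3)
SELECT
    d.user_segment,
    COUNT(*) AS page_views
FROM spectrum_schema.web_logs AS w
JOIN dim_user AS d ON d.user_id = w.user_id
WHERE w.event_time >= '2025-01-01'
GROUP BY d.user_segment;
```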
Redshift ML
Machine Learning with SQL
```sql
-- Create ML model from training data
CREATE MODEL model_customer_churn
FROM (
    SELECT customer_id, age, tenure, monthly_spend, support_calls, churned
    FROM training_data
)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (
    S3_BUCKET 'bucket',
    MAX_RUNTIME 7200
);
```
```sql
-- Use the model for prediction
SELECT
    customer_id,
    predict_churn(age, tenure, monthly_spend, support_calls) AS churn_probability,
    CASE
        WHEN predict_churn(age, tenure, monthly_spend, support_calls) > 0.7 THEN 'high_risk'
        WHEN predict_churn(age, tenure, monthly_spend, support_calls) > 0.4 THEN 'medium_risk'
        ELSE 'low_risk'
    END AS risk_level
FROM customer_data
WHERE tenure < 12;
```
Redshift ML Features:
- No ML expertise: Train models with SQL
- AutoML: Automatic model selection and hyperparameter tuning
- Inference: Use models directly in SQL queries
- Use cases: Churn prediction, fraud detection, forecasting
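Training runs asynchronously, so check on the model before calling its prediction function. SHOW MODEL reports the training status and, once training finishes, validation metrics for the model created above:

```sql
-- Shows model status, target, prediction function name, and accuracy metrics
SHOW MODEL model_customer_churn;
```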
Redshift Cost Optimization
Node Type Selection
| Node Type | vCPU | Memory | Storage | On-Demand Price/Hour | Use Case |
|---|---|---|---|---|---|
| dc2.large | 2 | 15 GB | 160 GB SSD | $0.25 | Development |
| dc2.8xlarge | 32 | 244 GB | 2.56 TB SSD | $4.80 | Production |
| ra3.xlplus | 4 | 32 GB | 32 TB managed | $1.97 | Large storage |
| ra3.4xlarge | 12 | 96 GB | 128 TB managed | $4.84 | Production |
| ra3.16xlarge | 48 | 384 GB | 128 TB managed | $19.34 | High throughput |

Prices are approximate on-demand rates and vary by region; check current pricing before sizing.
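Node sizing starts from the current data footprint. SVV_TABLE_INFO reports per-table sizes in 1 MB blocks, so summing it gives a rough total (a sketch; numbers are cluster-specific):

```sql
-- Approximate total user-table footprint in GB (size is in 1 MB blocks)
SELECT SUM(size) / 1024.0 AS total_gb
FROM svv_table_info;
```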
Cost Optimization Strategy:
Reserved Instances
Reserved node pricing (illustrative, dc2.8xlarge):
- 1-Year reserved, partial upfront: $4.80/hour on-demand drops to roughly $2.40/hour (about 50% savings)
- 3-Year reserved, all upfront: $4.80/hour on-demand drops to roughly $1.50/hour (about 68% savings)

Recommendations:
- Use reserved nodes for steady production workloads
- Use on-demand for development and testing
- Redshift has no Spot option; for intermittent workloads, pause and resume the cluster or consider Redshift Serverless

Storage Optimization
```sql
-- 1. Use appropriate encoding (compression)
CREATE TABLE compressed_events (
    event_id   BIGINT      ENCODE AZ64,
    user_id    BIGINT      ENCODE ZSTD,
    event_time TIMESTAMP   ENCODE DELTA,
    event_type VARCHAR(50) ENCODE LZO,
    payload    SUPER       ENCODE ZSTD
)
DISTSTYLE AUTO
SORTKEY (event_time);

-- 2. Use columnar storage for analytics
--    Redshift automatically uses columnar storage

-- 3. Use SORTKEY for data skipping
--    Redshift skips blocks that don't match the WHERE clause

-- 4. VACUUM to reclaim space
VACUUM FULL fact_sales;   -- reclaims space from deleted/updated rows

-- 5. ANALYZE to update statistics
ANALYZE fact_sales;       -- updates query planner statistics
```
Redshift Monitoring
CloudWatch Metrics
```python
import boto3

cloudwatch = boto3.client('cloudwatch')

# Critical Redshift metrics
metrics = [
    'CPUUtilization',             # CPU usage
    'NetworkReceiveThroughput',   # Network in
    'NetworkTransmitThroughput',  # Network out
    'DatabaseConnections',        # Active connections
    'ReadIOPS',                   # Read I/O operations
    'WriteIOPS',                  # Write I/O operations
    'ReadLatency',                # Read latency
    'WriteLatency',               # Write latency
    'ReadThroughput',             # Read throughput
    'WriteThroughput',            # Write throughput
    'MaintenanceMode',            # Is the cluster in maintenance?
    'PercentageDiskSpaceUsed',    # Disk usage
]

# Alert on high CPU
cloudwatch.put_metric_alarm(
    AlarmName='redshift-high-cpu',
    MetricName='CPUUtilization',
    Namespace='AWS/Redshift',
    Statistic='Average',
    Period=300,
    EvaluationPeriods=2,
    Threshold=80,
    ComparisonOperator='GreaterThanThreshold',
    Dimensions=[{'Name': 'ClusterIdentifier', 'Value': 'my-cluster'}],  # replace 'my-cluster' with your cluster identifier
)
```
Query Monitoring
```sql
-- Query the system tables for query history
SELECT
    query,
    userid,
    querytxt,
    starttime,
    endtime,
    DATEDIFF(seconds, starttime, endtime) AS duration_seconds,
    aborted
FROM stl_query
WHERE starttime >= '2025-01-27'
ORDER BY starttime DESC
LIMIT 100;
```
```sql
-- Find slow queries (group by query text so repeated executions aggregate)
SELECT
    querytxt,
    AVG(DATEDIFF(seconds, starttime, endtime)) AS avg_duration_seconds,
    COUNT(*) AS execution_count
FROM stl_query
WHERE starttime >= '2025-01-27'
GROUP BY querytxt
ORDER BY avg_duration_seconds DESC
LIMIT 10;
```
```sql
-- Find queries with high disk I/O (STL_SCAN has no query text, so join to STL_QUERY)
SELECT
    s.query,
    q.querytxt,
    SUM(s.bytes) AS total_bytes,
    COUNT(*) AS num_scans
FROM stl_scan s
JOIN stl_query q ON q.query = s.query
WHERE s.starttime >= '2025-01-27'
GROUP BY s.query, q.querytxt
ORDER BY total_bytes DESC
LIMIT 10;
```
Redshift Security
Encryption
```sql
-- 1. Encryption at rest
--    Enable KMS/HSM encryption when creating the cluster, or modify an existing cluster
--    (console, CLI, or API; there is no SQL statement for this)

-- 2. Encryption in transit (SSL)
--    Force SSL for client connections by setting require_ssl = true
--    in the cluster's parameter group (this is a cluster setting, not a per-user option)

-- 3. Column-level protection
--    Redshift has no per-column ENCRYPT attribute; protect sensitive columns with
--    dynamic data masking policies, column-level GRANTs, or by encrypting values
--    before loading
```
Access Control
```sql
-- 1. Create user groups
CREATE GROUP data_engineers;
CREATE GROUP data_analysts;
CREATE GROUP readonly_users;

-- 2. Grant schema-level permissions
GRANT USAGE  ON SCHEMA public TO GROUP data_engineers;
GRANT CREATE ON SCHEMA public TO GROUP data_engineers;

GRANT SELECT ON ALL TABLES IN SCHEMA public TO GROUP data_analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO GROUP readonly_users;
```
```sql
-- 3. Row-level security (RLS)
CREATE TABLE sales_rep_data (
    rep_id      BIGINT,
    customer_id BIGINT,
    sale_amount DECIMAL(18,2)
);

-- Define an RLS policy (each rep sees only their own rows)
CREATE RLS POLICY rep_policy
WITH (rep_id BIGINT)
USING (rep_id = current_user_id);

-- Attach the policy and turn RLS on for the table
-- (RLS policies attach to roles or users, not groups; analyst_role is illustrative)
ATTACH RLS POLICY rep_policy ON sales_rep_data TO ROLE analyst_role;
ALTER TABLE sales_rep_data ROW LEVEL SECURITY ON;
```
```sql
-- 4. Column-level masking via a view (expose only the masked values)
CREATE VIEW masked_customer_data AS
SELECT
    customer_id,
    '***' || SUBSTRING(email FROM POSITION('@' IN email)) AS masked_email,  -- hide the local part of the email
    '***-***-' || RIGHT(phone_number, 4)                  AS masked_phone   -- keep only the last 4 digits
FROM customers;
```
Redshift Migration
Schema Migration
1. Assess compatibility: use the AWS Schema Conversion Tool (SCT) for Oracle, SQL Server, PostgreSQL, and MySQL sources.
2. Migrate the schema: extract DDL from the source database and apply Redshift-specific optimizations (DISTKEY for frequently joined columns, SORTKEY for frequently filtered columns, ENCODE for compression).
3. Migrate the data: use AWS DMS (Database Migration Service) for full load plus CDC (change data capture) with minimal downtime.
4. Optimize post-migration: run VACUUM to reclaim space, ANALYZE to update statistics, and create materialized views for frequent queries.

Best Practices
DO
```sql
-- 1. Use DISTKEY for frequently joined columns
CREATE TABLE fact_sales (
    sale_id     BIGINT,
    customer_id BIGINT DISTKEY,   -- column-level DISTKEY implies DISTSTYLE KEY
    sale_date   DATE SORTKEY,
    amount      DECIMAL(18,2)
);

-- 2. Use SORTKEY for frequently filtered columns
CREATE TABLE events (
    event_id   BIGINT,
    event_time TIMESTAMP,
    user_id    BIGINT,
    event_type VARCHAR(50)
)
DISTSTYLE AUTO
SORTKEY (event_time);

-- 3. Use appropriate encoding
CREATE TABLE compressed_data (
    id         BIGINT        ENCODE AZ64,
    data       VARCHAR(1024) ENCODE LZO,
    created_at TIMESTAMP     ENCODE DELTA   -- avoid "timestamp" as a column name; it is a reserved word
);

-- 4. Use WLM for workload management
--    Create queues for ETL, BI, and analytics

-- 5. Use Spectrum for cold data
--    Keep hot data in Redshift, cold data in S3
```
DON’T
- Don't use SELECT *; select only the columns you need.
- Don't use CROSS JOIN; it is expensive, use an INNER JOIN with a join condition instead.
- Don't use ORDER BY without LIMIT; sorting a full result set is expensive for large tables.
- Don't use VARCHAR(MAX) unnecessarily; size VARCHAR columns to the data.
- Don't ignore VACUUM and ANALYZE; run them regularly for consistent performance.

Key Takeaways
- Architecture: Leader node (planning) + compute nodes (execution)
- Distribution: EVEN, KEY, ALL, AUTO - choose based on join patterns
- Sort Keys: Compound (single column) vs Interleaved (multi-column)
- Performance: Use WLM, materialized views, appropriate distribution/sort keys
- Spectrum: Query S3 data directly without loading
- ML: Train and use ML models with SQL (Redshift ML)
- Cost: Use RA3 for large storage, reserved instances for production
- Security: Encryption at rest, SSL in transit, RLS for row-level security