
Amazon Redshift Guide

AWS Data Warehouse Service


Overview

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It’s designed for high-performance analysis using SQL and business intelligence tools, with optimized storage and compute for structured and semi-structured data.


Redshift Architecture

Node Architecture

Key Components:

  • Leader Node: Manages client connections, query planning, result aggregation
  • Compute Nodes: Execute queries, store data
  • Slices: Parallel processing units within compute nodes
  • Node Types: DC2 (dense compute), RA3 (managed storage)
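
You can see this layout from SQL. A minimal sketch (assuming access to system views): count the slices on each compute node via the STV_SLICES system view.

-- One row per slice; grouping shows how many parallel slices each node runs
SELECT node, COUNT(*) AS slices
FROM stv_slices
GROUP BY node
ORDER BY node;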

Redshift Table Types

Distribution Styles

Distribution Selection:

Style | Use Case                     | Example
AUTO  | Let Redshift decide          | Default for new tables
EVEN  | No obvious distribution key  | Logs, events
KEY   | Large table joins            | fact_sales DISTKEY (customer_id)
ALL   | Small dimension tables       | dim_customer DISTSTYLE ALL
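
As a sketch of how these styles combine in practice (table and column names here are illustrative): a small dimension is replicated with DISTSTYLE ALL, while the large fact table is distributed on the join key so matching rows land on the same node.

-- Small dimension: a full copy on every node avoids redistribution at join time
CREATE TABLE dim_product (
product_id BIGINT,
product_name VARCHAR(256)
)
DISTSTYLE ALL;
-- Large fact table: distribute on the join key to co-locate matching rows
CREATE TABLE fact_orders (
order_id BIGINT,
product_id BIGINT DISTKEY,
amount DECIMAL(18,2)
);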

Sort Keys

-- Compound sort key (best when queries filter on the leading sort column)
CREATE TABLE events (
event_id BIGINT,
event_time TIMESTAMP,
user_id BIGINT,
event_type VARCHAR(50)
)
DISTSTYLE AUTO
COMPOUND SORTKEY (event_time);
-- Interleaved sort key (best when queries filter on different columns)
CREATE TABLE sales (
sale_id BIGINT,
sale_date DATE,
customer_id BIGINT,
product_id BIGINT,
amount DECIMAL(18,2)
)
DISTSTYLE AUTO
INTERLEAVED SORTKEY (sale_date, customer_id, product_id);

Sort Key Selection:

  • Compound: best when queries filter on the leading sort column(s); the default and right choice for most tables
  • Interleaved: gives each sort column equal weight; useful when queries filter on different columns unpredictably
  • Best Practice: Choose frequently filtered columns (dates, IDs)
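
For example, with the compound-sorted events table above, a range filter on the leading sort column lets Redshift skip blocks whose zone maps fall outside the requested range:

-- Only blocks whose event_time range overlaps January 2025 are scanned
SELECT event_type, COUNT(*) AS event_count
FROM events
WHERE event_time BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY event_type;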

Redshift Performance Tuning

Query Optimization

-- 1. Use appropriate distribution style
CREATE TABLE fact_sales (
sale_id BIGINT IDENTITY(1,1),
customer_id BIGINT NOT NULL,
product_id BIGINT NOT NULL,
sale_date DATE NOT NULL,
amount DECIMAL(18,2) NOT NULL
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
-- 2. Create materialized views for complex queries
CREATE MATERIALIZED VIEW mv_customer_daily_sales AS
SELECT
customer_id,
sale_date,
COUNT(*) as sales_count,
SUM(amount) as total_amount
FROM fact_sales
GROUP BY customer_id, sale_date;
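-- Materialized views must be refreshed as base tables change (or be created
-- with AUTO REFRESH YES); a minimal refresh example:
REFRESH MATERIALIZED VIEW mv_customer_daily_sales;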
-- 3. Use approximate functions for large datasets
SELECT
customer_id,
COUNT(*) as exact_count,
APPROXIMATE COUNT(DISTINCT product_id) as distinct_products, -- HyperLogLog estimate, much faster
AVG(amount) as avg_amount
FROM fact_sales
WHERE sale_date >= '2025-01-01'
GROUP BY customer_id;
-- 4. Filter early: push predicates into WHERE so rows are pruned before aggregation
SELECT
customer_id,
SUM(amount) as total_amount
FROM fact_sales
WHERE sale_date >= '2025-01-01' -- Filter first
AND amount > 100 -- Filter first
GROUP BY customer_id;

Workload Management (WLM)

-- WLM queues are not created with SQL DDL; they are defined as JSON in the
-- cluster parameter group's wlm_json_configuration parameter (console, CLI, or API).
-- A manual WLM configuration for the workloads below (max_execution_time is in
-- milliseconds; a reserved superuser queue exists in addition to these):
-- Queue 1: ETL loads (high concurrency)
-- Queue 2: BI queries (low latency)
-- Queue 3: Ad-hoc analytics (medium latency)
-- Queue 4: default queue for everything else
[
  { "query_group": ["etl"],       "query_concurrency": 10, "memory_percent_to_use": 30, "max_execution_time": 3600000 },
  { "query_group": ["bi"],        "query_concurrency": 20, "memory_percent_to_use": 40, "max_execution_time": 60000 },
  { "query_group": ["analytics"], "query_concurrency": 5,  "memory_percent_to_use": 20, "max_execution_time": 600000 },
  {                               "query_concurrency": 5,  "memory_percent_to_use": 10 }
]
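
Queries are routed to a queue by setting the session's query group to one of the groups defined above; a brief sketch:

-- Run this session's statements in the ETL queue, then return to the default
SET query_group TO 'etl';
-- ... ETL statements here ...
RESET query_group;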

Concurrency Scaling

Enable Concurrency Scaling:

-- Concurrency scaling is enabled per WLM queue, not with SQL: set
-- "concurrency_scaling": "auto" on the queue's entry in wlm_json_configuration,
-- and cap usage with the max_concurrency_scaling_clusters cluster parameter.
-- Concurrency scaling clusters:
-- - Are added automatically during load spikes
-- - Handle bursty workloads without over-provisioning the main cluster
-- - Accrue free usage credits (roughly one hour per day the main cluster runs),
--   which cover most bursts
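
To verify that bursts are actually being offloaded, STL_QUERY records a concurrency_scaling_status per query; a quick check:

-- Status 1 means the query ran on a concurrency scaling cluster
SELECT concurrency_scaling_status, COUNT(*) AS query_count
FROM stl_query
WHERE starttime >= DATEADD(day, -1, GETDATE())
GROUP BY concurrency_scaling_status;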

Redshift Spectrum

Querying S3 Data

-- Create external schema for S3 data
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
-- Create external table for S3 data
CREATE EXTERNAL TABLE spectrum_schema.web_logs (
log_id BIGINT,
user_id BIGINT,
event_time TIMESTAMP,
url VARCHAR(1024),
status_code INT
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'ignore.malformed.json' = 'true'
)
STORED AS TEXTFILE
LOCATION 's3://bucket/web-logs/';
-- Query S3 data directly (no loading required)
SELECT
user_id,
COUNT(*) as page_views,
SUM(CASE WHEN status_code = 200 THEN 1 ELSE 0 END) as successful_requests
FROM spectrum_schema.web_logs
WHERE event_time >= '2025-01-01'
GROUP BY user_id;

Use Cases:

  • Lakehouse pattern: Query S3 data lake + Redshift local tables
  • Cold data: Keep historical data in S3, hot data in Redshift
  • Cost optimization: Reduce Redshift storage by offloading to S3
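
The lakehouse pattern above boils down to joining an external table with a local one; a sketch, assuming a hypothetical local users table:

-- Cold clickstream data in S3 joined with a hot local dimension
SELECT u.plan_type, COUNT(*) AS page_views
FROM spectrum_schema.web_logs l
JOIN users u ON u.user_id = l.user_id
WHERE l.event_time >= '2025-01-01'
GROUP BY u.plan_type;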

Redshift ML

Machine Learning with SQL

-- Create ML model from training data
CREATE MODEL model_customer_churn
FROM (
SELECT
customer_id,
age,
tenure,
monthly_spend,
support_calls,
churned
FROM training_data
)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (
S3_BUCKET 'bucket',
MAX_RUNTIME 7200
);
-- Use model for prediction
SELECT
customer_id,
predict_churn(age, tenure, monthly_spend, support_calls) as churn_probability,
CASE
WHEN predict_churn(age, tenure, monthly_spend, support_calls) > 0.7 THEN 'high_risk'
WHEN predict_churn(age, tenure, monthly_spend, support_calls) > 0.4 THEN 'medium_risk'
ELSE 'low_risk'
END as risk_level
FROM customer_data
WHERE tenure < 12;

Redshift ML Features:

  • No ML expertise: Train models with SQL
  • AutoML: Automatic model selection and hyperparameter tuning
  • Inference: Use models directly in SQL queries
  • Use cases: Churn prediction, fraud detection, forecasting
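
Training runs asynchronously; the model's status, target, prediction function, and training metrics can be inspected once it is created:

-- Reports model state, function name, and accuracy metrics for the model above
SHOW MODEL model_customer_churn;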

Redshift Cost Optimization

Node Type Selection

Node Type    | vCPU | Memory | Storage        | Cost/Hour | Use Case
DC2.Large    | 2    | 15 GB  | 160 GB SSD     | $0.25     | Development
DC2.8XLarge  | 32   | 244 GB | 2.56 TB SSD    | $4.80     | Production
RA3.XLPlus   | 4    | 32 GB  | 32 TB managed  | $1.97     | Large storage
RA3.4XLarge  | 12   | 96 GB  | 128 TB managed | $4.84     | Production
RA3.16XLarge | 48   | 384 GB | 128 TB managed | $19.34    | High throughput

Cost Optimization Strategy:

Reserved Instances

-- 1-Year Reserved Instance (Partial Upfront)
-- DC2.8XLarge: $4.80/hour on-demand → $2.40/hour reserved
-- Savings: 50%
-- 3-Year Reserved Instance (All Upfront)
-- DC2.8XLarge: $4.80/hour on-demand → $1.50/hour reserved
-- Savings: 68%
-- Recommendation:
-- - Use reserved instances for production workloads
-- - Use on-demand for development/testing
-- - Pause clusters during idle periods to avoid paying for unused on-demand hours

Storage Optimization

-- 1. Use appropriate encoding (compression)
CREATE TABLE compressed_events (
event_id BIGINT ENCODE AZ64,
user_id BIGINT ENCODE ZSTD,
event_time TIMESTAMP ENCODE DELTA,
event_type VARCHAR(50) ENCODE LZO,
payload SUPER ENCODE ZSTD
)
DISTSTYLE AUTO
SORTKEY (event_time);
-- 2. Use columnar storage for analytics
-- Redshift automatically uses columnar storage
-- 3. Use SORTKEY for data skipping
-- Redshift skips blocks that don't match WHERE clause
-- 4. VACUUM to reclaim space
VACUUM FULL fact_sales; -- Reclaims space from deleted/updated rows
-- 5. ANALYZE to update statistics
ANALYZE fact_sales; -- Updates query planner statistics
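
If you are unsure which encodings to pick, Redshift can recommend them from a sample of existing data:

-- Suggests an encoding per column based on a data sample (the table must contain rows)
ANALYZE COMPRESSION compressed_events;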

Redshift Monitoring

CloudWatch Metrics

import boto3
cloudwatch = boto3.client('cloudwatch')
# Critical Redshift metrics
metrics = [
'CPUUtilization', # CPU usage
'NetworkReceiveThroughput', # Network in
'NetworkTransmitThroughput', # Network out
'DatabaseConnections', # Active connections
'ReadIOPS', # Read I/O operations
'WriteIOPS', # Write I/O operations
'ReadLatency', # Read latency
'WriteLatency', # Write latency
'ReadThroughput', # Read throughput
'WriteThroughput', # Write throughput
'MaintenanceMode', # Is cluster in maintenance?
'PercentageDiskSpaceUsed' # Disk usage
]
# Alert on sustained high CPU for a specific cluster
# ('my-redshift-cluster' is a placeholder cluster identifier)
cloudwatch.put_metric_alarm(
AlarmName='redshift-high-cpu',
MetricName='CPUUtilization',
Namespace='AWS/Redshift',
Dimensions=[{'Name': 'ClusterIdentifier', 'Value': 'my-redshift-cluster'}],
Statistic='Average',
Period=300,
EvaluationPeriods=2,
Threshold=80,
ComparisonOperator='GreaterThanThreshold'
)

Query Monitoring

-- Query the system tables for query history
SELECT
query,
userid,
querytxt,
starttime,
endtime,
datediff(seconds, starttime, endtime) as duration_seconds,
aborted
FROM STL_QUERY
WHERE starttime >= '2025-01-27'
ORDER BY starttime DESC
LIMIT 100;
-- Find slow queries (aggregate by query text across executions)
SELECT
querytxt,
AVG(datediff(seconds, starttime, endtime)) as avg_duration,
COUNT(*) as execution_count
FROM STL_QUERY
WHERE starttime >= '2025-01-27'
GROUP BY querytxt
ORDER BY avg_duration DESC
LIMIT 10;
-- Find queries with high disk I/O (join to STL_QUERY for the query text)
SELECT
s.query,
q.querytxt,
SUM(s.bytes) as total_bytes,
COUNT(*) as num_scans
FROM STL_SCAN s
JOIN STL_QUERY q ON q.query = s.query
WHERE s.starttime >= '2025-01-27'
GROUP BY s.query, q.querytxt
ORDER BY total_bytes DESC
LIMIT 10;

Redshift Security

Encryption

-- 1. Enable encryption at rest
-- Enable during cluster creation or modify existing cluster
-- 2. Require SSL for client connections
-- Redshift has no per-user REQUIRE SSL clause; set require_ssl = true in the
-- cluster parameter group so all clients must connect over SSL.
-- 3. Protecting sensitive columns
-- There is no ENCRYPT column attribute; cluster encryption (KMS or HSM) covers
-- all data at rest, and individual columns can be protected with masked views
-- or dynamic data masking (see Access Control below).

Access Control

-- 1. Create user groups
CREATE GROUP data_engineers;
CREATE GROUP data_analysts;
CREATE GROUP readonly_users;
-- 2. Grant schema-level permissions
GRANT USAGE ON SCHEMA public TO GROUP data_engineers;
GRANT CREATE ON SCHEMA public TO GROUP data_engineers;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO GROUP data_analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO GROUP readonly_users;
-- 3. Row-level security (RLS)
CREATE TABLE sales_rep_data (
rep_name VARCHAR(128),
customer_id BIGINT,
sale_amount DECIMAL(18,2)
);
-- Create an RLS policy, attach it to a role (policies attach to roles or
-- users, not groups), then switch RLS on for the table
CREATE RLS POLICY rep_policy
WITH (rep_name VARCHAR(128))
USING (rep_name = current_user);
ATTACH RLS POLICY rep_policy ON sales_rep_data TO ROLE analyst_role;
ALTER TABLE sales_rep_data ROW LEVEL SECURITY ON;
-- 4. Column-level masking via a view (expose only the masked values;
-- Redshift regex functions are POSIX and do not support lookarounds)
CREATE VIEW masked_customer_data AS
SELECT
customer_id,
LEFT(email, 1) || '****' || SUBSTRING(email, CHARINDEX('@', email)) as masked_email,
'***-***-' || RIGHT(phone_number, 4) as masked_phone
FROM customers;
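
To make the masking effective, grant analysts the view and revoke direct access to the base table; a sketch using the group names from above:

-- Analysts read only the masked view, not the raw customers table
GRANT SELECT ON masked_customer_data TO GROUP data_analysts;
REVOKE ALL ON customers FROM GROUP data_analysts;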

Redshift Migration

Schema Migration

-- 1. Assess compatibility
-- Use AWS Schema Conversion Tool (SCT) for:
-- - Oracle → Redshift
-- - SQL Server → Redshift
-- - PostgreSQL → Redshift
-- - MySQL → Redshift
-- 2. Migrate schema
-- Extract DDL from source database
-- Apply Redshift-specific optimizations:
-- - DISTKEY for frequently joined columns
-- - SORTKEY for frequently filtered columns
-- - ENCODE for compression
-- 3. Migrate data
-- Use AWS DMS (Database Migration Service) for:
-- - Full load + CDC (Change Data Capture)
-- - Minimal downtime
-- - Schema conversion
-- 4. Optimize post-migration
-- - VACUUM to reclaim space
-- - ANALYZE to update statistics
-- - Create materialized views for frequent queries
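
A quick post-migration health check is to scan SVV_TABLE_INFO for skewed or unsorted tables before deciding where to spend VACUUM and re-keying effort:

-- Tables with high skew_rows or unsorted percentages are tuning candidates
SELECT "table", diststyle, skew_rows, unsorted, tbl_rows
FROM svv_table_info
ORDER BY skew_rows DESC NULLS LAST
LIMIT 20;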

Best Practices

DO

-- 1. Use DISTKEY for frequently joined columns
-- (a column-level DISTKEY implies DISTSTYLE KEY)
CREATE TABLE fact_sales (
sale_id BIGINT,
customer_id BIGINT DISTKEY,
sale_date DATE SORTKEY,
amount DECIMAL(18,2)
);
-- 2. Use SORTKEY for frequently filtered columns
CREATE TABLE events (
event_id BIGINT,
event_time TIMESTAMP SORTKEY,
user_id BIGINT,
event_type VARCHAR(50)
)
DISTSTYLE AUTO;
-- 3. Use appropriate encoding
CREATE TABLE compressed_data (
id BIGINT ENCODE AZ64,
data VARCHAR(1024) ENCODE LZO,
created_at TIMESTAMP ENCODE DELTA -- "timestamp" is a reserved word, so use another name
);
-- 4. Use WLM for workload management
-- Create queues for ETL, BI, analytics
-- 5. Use Spectrum for cold data
-- Keep hot data in Redshift, cold data in S3

DON’T

-- 1. Don't use SELECT *
-- Select only needed columns
-- 2. Don't use CROSS JOIN
-- Cartesian products are expensive; join with an explicit ON condition
-- 3. Don't use ORDER BY without LIMIT
-- Sorting an unbounded result set is expensive on large tables
-- 4. Don't declare oversized VARCHAR columns by default
-- Very wide columns (e.g. VARCHAR(65535)) waste memory during query execution
-- 5. Don't ignore VACUUM and ANALYZE
-- Run regularly for performance
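
A compact example applying several of these rules at once:

-- Explicit columns, early filter, and a bounded sort
SELECT customer_id, amount
FROM fact_sales
WHERE sale_date >= '2025-01-01'
ORDER BY amount DESC
LIMIT 100;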

Key Takeaways

  1. Architecture: Leader node (planning) + compute nodes (execution)
  2. Distribution: EVEN, KEY, ALL, AUTO - choose based on join patterns
  3. Sort Keys: Compound (filters on leading sort columns) vs Interleaved (equal weight across columns)
  4. Performance: Use WLM, materialized views, appropriate distribution/sort keys
  5. Spectrum: Query S3 data directly without loading
  6. ML: Train and use ML models with SQL (Redshift ML)
  7. Cost: Use RA3 for large storage, reserved instances for production
  8. Security: Encryption at rest, SSL in transit, RLS for row-level security

Back to Module 3