Amazon Redshift Guide
AWS Data Warehouse Service
Overview
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It’s designed for high-performance analysis using SQL and business intelligence tools, with optimized storage and compute for structured and semi-structured data.
Redshift Architecture
Node Architecture
Key Components:
- Leader Node: Manages client connections, query planning, and result aggregation
- Compute Nodes: Execute query steps in parallel and store data
- Slices: Parallel processing units within compute nodes
- Node Types: DC2 (dense compute), RA3 (managed storage)
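To make slices concrete, the STV_SLICES system view shows how slices map to compute nodes on a running cluster (a read-only check; output is cluster-specific):

```sql
-- Each row is one slice; "node" identifies the compute node it belongs to
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;
```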
Redshift Table Types
Distribution Styles
Distribution Selection:
| Style | Use Case | Example |
|---|---|---|
| AUTO | Let Redshift decide | Default for new tables |
| EVEN | No obvious distribution key | Logs, events |
| KEY | Large table joins | fact_sales DISTKEY (customer_id) |
| ALL | Small dimension tables | dim_customer DISTSTYLE ALL |
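To check whether a chosen DISTKEY is skewing data across slices, SVV_TABLE_INFO reports each table's distribution style and row skew. A minimal sketch (the threshold of 4 is illustrative):

```sql
-- skew_rows compares the fullest slice to the emptiest slice;
-- values far above 1 suggest a poor DISTKEY choice
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE skew_rows > 4
ORDER BY skew_rows DESC;
```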
Sort Keys
```sql
-- Compound sort key (best when queries filter on the leading sort key column)
CREATE TABLE events (
    event_id   BIGINT,
    event_time TIMESTAMP,
    user_id    BIGINT,
    event_type VARCHAR(50)
)
DISTSTYLE AUTO
SORTKEY (event_time);
```
```sql
-- Interleaved sort key (best when different queries filter on different columns)
CREATE TABLE sales (
    sale_id     BIGINT,
    sale_date   DATE,
    customer_id BIGINT,
    product_id  BIGINT,
    amount      DECIMAL(18,2)
)
DISTSTYLE AUTO
INTERLEAVED SORTKEY (sale_date, customer_id, product_id);
```
Sort Key Selection:
- Compound: Use when WHERE clauses filter on the leading sort key column(s), such as date ranges
- Interleaved: Use when different queries filter on different columns with no dominant prefix; note that interleaved keys add VACUUM REINDEX overhead, so compound is preferred in most cases
- Best Practice: Choose frequently filtered columns (dates, IDs) and keep the sort key short
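Sort keys can also be changed after table creation. A sketch using ALTER TABLE against the events table defined above:

```sql
-- Switch to a compound sort key on event_time
ALTER TABLE events ALTER SORTKEY (event_time);

-- Or let Redshift choose and maintain the sort key automatically
ALTER TABLE events ALTER SORTKEY AUTO;
```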
Redshift Performance Tuning
Query Optimization
```sql
-- 1. Use appropriate distribution style
CREATE TABLE fact_sales (
    sale_id     BIGINT IDENTITY(1,1),
    customer_id BIGINT NOT NULL,
    product_id  BIGINT NOT NULL,
    sale_date   DATE NOT NULL,
    amount      DECIMAL(18,2) NOT NULL
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
```
```sql
-- 2. Create materialized views for complex queries
CREATE MATERIALIZED VIEW mv_customer_daily_sales AS
SELECT
    customer_id,
    sale_date,
    COUNT(*)    AS sales_count,
    SUM(amount) AS total_amount
FROM fact_sales
GROUP BY customer_id, sale_date;
```
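Unless auto refresh is enabled in the CREATE MATERIALIZED VIEW statement, the view is not kept current as fact_sales changes, so refresh it after loads. Using the view from the example above:

```sql
-- Recompute (incrementally where possible) the precomputed aggregates
REFRESH MATERIALIZED VIEW mv_customer_daily_sales;
```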
```sql
-- 3. Use approximate functions for large datasets
SELECT
    customer_id,
    COUNT(*) AS exact_count,
    APPROXIMATE COUNT(DISTINCT product_id) AS distinct_products,  -- faster than an exact distinct count
    AVG(amount) AS avg_amount
FROM fact_sales
WHERE sale_date >= '2025-01-01'
GROUP BY customer_id;
```
```sql
-- 4. Use WHERE before GROUP BY (filter early)
SELECT
    customer_id,
    SUM(amount) AS total_amount
FROM fact_sales
WHERE sale_date >= '2025-01-01'  -- filter first
  AND amount > 100               -- filter first
GROUP BY customer_id;
```
Workload Management (WLM)
Redshift does not create WLM queues with SQL. Manual WLM queues are defined as a JSON array in the cluster parameter group (the `wlm_json_configuration` parameter) via the console, CLI, or API. A configuration roughly equivalent to the queues described here (ETL, BI, ad-hoc analytics) looks like this; `max_execution_time` is in milliseconds, and the built-in superuser queue used for maintenance does not need to be defined:

```json
[
  {"query_group": ["etl"],       "query_concurrency": 10, "memory_percent_to_use": 30, "max_execution_time": 3600000},
  {"query_group": ["bi"],        "query_concurrency": 20, "memory_percent_to_use": 40, "max_execution_time": 60000},
  {"query_group": ["analytics"], "query_concurrency": 5,  "memory_percent_to_use": 20, "max_execution_time": 600000}
]
```

A session routes its queries to a queue by setting the matching query group, for example `SET query_group TO 'etl';` before running ETL statements (and `RESET query_group;` afterwards).

Concurrency Scaling
Enable Concurrency Scaling:
Concurrency scaling is enabled per WLM queue rather than with SQL: in the console, or by adding `"concurrency_scaling": "auto"` to that queue's entry in `wlm_json_configuration`:

```json
{"query_group": ["bi"], "query_concurrency": 20, "memory_percent_to_use": 40, "concurrency_scaling": "auto"}
```
Concurrency scaling clusters:
- Are added automatically when queries start queuing during load spikes
- Handle bursty workloads without resizing the main cluster
- Accrue free usage credits (one hour for every 24 hours the main cluster runs), so most eligible queries incur no extra cost

Redshift Spectrum
Querying S3 Data
```sql
-- Create external schema for S3 data
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```
```sql
-- Create external table for JSON data in S3
CREATE EXTERNAL TABLE spectrum_schema.web_logs (
    log_id      BIGINT,
    user_id     BIGINT,
    event_time  TIMESTAMP,
    url         VARCHAR(1024),
    status_code INT
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
STORED AS TEXTFILE
LOCATION 's3://bucket/web-logs/';
```
```sql
-- Query S3 data directly (no loading required)
SELECT
    user_id,
    COUNT(*) AS page_views,
    SUM(CASE WHEN status_code = 200 THEN 1 ELSE 0 END) AS successful_requests
FROM spectrum_schema.web_logs
WHERE event_time >= '2025-01-01'
GROUP BY user_id;
```
Use Cases:
- Lakehouse pattern: Query S3 data lake + Redshift local tables
- Cold data: Keep historical data in S3, hot data in Redshift
- Cost optimization: Reduce Redshift storage by offloading to S3
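A minimal sketch of the lakehouse pattern, joining the S3-backed web_logs external table to a hypothetical local dimension table dim_user (the table and its user_segment column are illustrative):

```sql
-- Hot data (dim_user, stored in Redshift) joined to cold data (web_logs, read from S3)
SELECT
    d.user_segment,
    COUNT(*) AS page_views
FROM spectrum_schema.web_logs AS w
JOIN dim_user AS d ON d.user_id = w.user_id
WHERE w.event_time >= '2025-01-01'
GROUP BY d.user_segment;
```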
Redshift ML
Machine Learning with SQL
```sql
-- Create ML model from training data
CREATE MODEL model_customer_churn
FROM (
    SELECT customer_id, age, tenure, monthly_spend, support_calls, churned
    FROM training_data
)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (
    S3_BUCKET 'bucket',
    MAX_RUNTIME 7200
);
```
```sql
-- Use the model for prediction
SELECT
    customer_id,
    predict_churn(age, tenure, monthly_spend, support_calls) AS churn_probability,
    CASE
        WHEN predict_churn(age, tenure, monthly_spend, support_calls) > 0.7 THEN 'high_risk'
        WHEN predict_churn(age, tenure, monthly_spend, support_calls) > 0.4 THEN 'medium_risk'
        ELSE 'low_risk'
    END AS risk_level
FROM customer_data
WHERE tenure < 12;
```
Redshift ML Features:
- No ML expertise: Train models with SQL
- AutoML: Automatic model selection and hyperparameter tuning
- Inference: Use models directly in SQL queries
- Use cases: Churn prediction, fraud detection, forecasting
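Training runs asynchronously, so check on the model before calling its prediction function. SHOW MODEL reports the training status and, once training finishes, validation metrics for the model created above:

```sql
-- Shows model status, target, prediction function name, and accuracy metrics
SHOW MODEL model_customer_churn;
```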
Redshift Cost Optimization
Node Type Selection
| Node Type | vCPU | Memory | Storage | On-Demand Price/Hour | Use Case |
|---|---|---|---|---|---|
| dc2.large | 2 | 15 GB | 160 GB SSD | $0.25 | Development |
| dc2.8xlarge | 32 | 244 GB | 2.56 TB SSD | $4.80 | Production |
| ra3.xlplus | 4 | 32 GB | 32 TB managed | $1.97 | Large storage |
| ra3.4xlarge | 12 | 96 GB | 128 TB managed | $4.84 | Production |
| ra3.16xlarge | 48 | 384 GB | 128 TB managed | $19.34 | High throughput |

Prices are approximate on-demand rates and vary by region; check current pricing before sizing.
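Node sizing starts from the current data footprint. SVV_TABLE_INFO reports per-table sizes in 1 MB blocks, so summing it gives a rough total (a sketch; numbers are cluster-specific):

```sql
-- Approximate total user-table footprint in GB (size is in 1 MB blocks)
SELECT SUM(size) / 1024.0 AS total_gb
FROM svv_table_info;
```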
Cost Optimization Strategy:
Reserved Instances
Reserved node pricing (illustrative, dc2.8xlarge):
- 1-Year reserved, partial upfront: $4.80/hour on-demand drops to roughly $2.40/hour (about 50% savings)
- 3-Year reserved, all upfront: $4.80/hour on-demand drops to roughly $1.50/hour (about 68% savings)

Recommendations:
- Use reserved nodes for steady production workloads
- Use on-demand for development and testing
- Redshift has no Spot option; for intermittent workloads, pause and resume the cluster or consider Redshift Serverless

Storage Optimization
```sql
-- 1. Use appropriate encoding (compression)
CREATE TABLE compressed_events (
    event_id   BIGINT      ENCODE AZ64,
    user_id    BIGINT      ENCODE ZSTD,
    event_time TIMESTAMP   ENCODE DELTA,
    event_type VARCHAR(50) ENCODE LZO,
    payload    SUPER       ENCODE ZSTD
)
DISTSTYLE AUTO
SORTKEY (event_time);

-- 2. Use columnar storage for analytics
--    Redshift automatically uses columnar storage

-- 3. Use SORTKEY for data skipping
--    Redshift skips blocks that don't match the WHERE clause

-- 4. VACUUM to reclaim space
VACUUM FULL fact_sales;   -- reclaims space from deleted/updated rows

-- 5. ANALYZE to update statistics
ANALYZE fact_sales;       -- updates query planner statistics
```
Redshift Monitoring
CloudWatch Metrics
```python
import boto3

cloudwatch = boto3.client('cloudwatch')

# Critical Redshift metrics
metrics = [
    'CPUUtilization',             # CPU usage
    'NetworkReceiveThroughput',   # Network in
    'NetworkTransmitThroughput',  # Network out
    'DatabaseConnections',        # Active connections
    'ReadIOPS',                   # Read I/O operations
    'WriteIOPS',                  # Write I/O operations
    'ReadLatency',                # Read latency
    'WriteLatency',               # Write latency
    'ReadThroughput',             # Read throughput
    'WriteThroughput',            # Write throughput
    'MaintenanceMode',            # Is the cluster in maintenance?
    'PercentageDiskSpaceUsed',    # Disk usage
]

# Alert on high CPU
cloudwatch.put_metric_alarm(
    AlarmName='redshift-high-cpu',
    MetricName='CPUUtilization',
    Namespace='AWS/Redshift',
    Statistic='Average',
    Period=300,
    EvaluationPeriods=2,
    Threshold=80,
    ComparisonOperator='GreaterThanThreshold',
    Dimensions=[{'Name': 'ClusterIdentifier', 'Value': 'my-cluster'}],  # replace 'my-cluster' with your cluster identifier
)
```
Query Monitoring
```sql
-- Query the system tables for query history
SELECT
    query,
    userid,
    querytxt,
    starttime,
    endtime,
    DATEDIFF(seconds, starttime, endtime) AS duration_seconds,
    aborted
FROM stl_query
WHERE starttime >= '2025-01-27'
ORDER BY starttime DESC
LIMIT 100;
```
```sql
-- Find slow queries (group by query text so repeated executions aggregate)
SELECT
    querytxt,
    AVG(DATEDIFF(seconds, starttime, endtime)) AS avg_duration_seconds,
    COUNT(*) AS execution_count
FROM stl_query
WHERE starttime >= '2025-01-27'
GROUP BY querytxt
ORDER BY avg_duration_seconds DESC
LIMIT 10;
```
```sql
-- Find queries with high disk I/O (STL_SCAN has no query text, so join to STL_QUERY)
SELECT
    s.query,
    q.querytxt,
    SUM(s.bytes) AS total_bytes,
    COUNT(*) AS num_scans
FROM stl_scan s
JOIN stl_query q ON q.query = s.query
WHERE s.starttime >= '2025-01-27'
GROUP BY s.query, q.querytxt
ORDER BY total_bytes DESC
LIMIT 10;
```
Redshift Security
Encryption
```sql
-- 1. Encryption at rest
--    Enable KMS/HSM encryption when creating the cluster, or modify an existing cluster
--    (console, CLI, or API; there is no SQL statement for this)

-- 2. Encryption in transit (SSL)
--    Force SSL for client connections by setting require_ssl = true
--    in the cluster's parameter group (this is a cluster setting, not a per-user option)

-- 3. Column-level protection
--    Redshift has no per-column ENCRYPT attribute; protect sensitive columns with
--    dynamic data masking policies, column-level GRANTs, or by encrypting values
--    before loading
```
Access Control
```sql
-- 1. Create user groups
CREATE GROUP data_engineers;
CREATE GROUP data_analysts;
CREATE GROUP readonly_users;

-- 2. Grant schema-level permissions
GRANT USAGE  ON SCHEMA public TO GROUP data_engineers;
GRANT CREATE ON SCHEMA public TO GROUP data_engineers;

GRANT SELECT ON ALL TABLES IN SCHEMA public TO GROUP data_analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO GROUP readonly_users;
```
```sql
-- 3. Row-level security (RLS)
CREATE TABLE sales_rep_data (
    rep_id      BIGINT,
    customer_id BIGINT,
    sale_amount DECIMAL(18,2)
);

-- Define an RLS policy (each rep sees only their own rows)
CREATE RLS POLICY rep_policy
WITH (rep_id BIGINT)
USING (rep_id = current_user_id);

-- Attach the policy and turn RLS on for the table
-- (RLS policies attach to roles or users, not groups; analyst_role is illustrative)
ATTACH RLS POLICY rep_policy ON sales_rep_data TO ROLE analyst_role;
ALTER TABLE sales_rep_data ROW LEVEL SECURITY ON;
```
```sql
-- 4. Column-level masking via a view (expose only the masked values)
CREATE VIEW masked_customer_data AS
SELECT
    customer_id,
    '***' || SUBSTRING(email FROM POSITION('@' IN email)) AS masked_email,  -- hide the local part of the email
    '***-***-' || RIGHT(phone_number, 4)                  AS masked_phone   -- keep only the last 4 digits
FROM customers;
```
Redshift Migration
Schema Migration
1. Assess compatibility: use the AWS Schema Conversion Tool (SCT) for Oracle, SQL Server, PostgreSQL, and MySQL sources.
2. Migrate the schema: extract DDL from the source database and apply Redshift-specific optimizations (DISTKEY for frequently joined columns, SORTKEY for frequently filtered columns, ENCODE for compression).
3. Migrate the data: use AWS DMS (Database Migration Service) for full load plus CDC (change data capture) with minimal downtime.
4. Optimize post-migration: run VACUUM to reclaim space, ANALYZE to update statistics, and create materialized views for frequent queries.

Best Practices
DO
```sql
-- 1. Use DISTKEY for frequently joined columns
CREATE TABLE fact_sales (
    sale_id     BIGINT,
    customer_id BIGINT DISTKEY,   -- column-level DISTKEY implies DISTSTYLE KEY
    sale_date   DATE SORTKEY,
    amount      DECIMAL(18,2)
);

-- 2. Use SORTKEY for frequently filtered columns
CREATE TABLE events (
    event_id   BIGINT,
    event_time TIMESTAMP,
    user_id    BIGINT,
    event_type VARCHAR(50)
)
DISTSTYLE AUTO
SORTKEY (event_time);

-- 3. Use appropriate encoding
CREATE TABLE compressed_data (
    id         BIGINT        ENCODE AZ64,
    data       VARCHAR(1024) ENCODE LZO,
    created_at TIMESTAMP     ENCODE DELTA   -- avoid "timestamp" as a column name; it is a reserved word
);

-- 4. Use WLM for workload management
--    Create queues for ETL, BI, and analytics

-- 5. Use Spectrum for cold data
--    Keep hot data in Redshift, cold data in S3
```
DON’T
- Don't use SELECT *; select only the columns you need.
- Don't use CROSS JOIN; it is expensive, use an INNER JOIN with a join condition instead.
- Don't use ORDER BY without LIMIT; sorting a full result set is expensive for large tables.
- Don't use VARCHAR(MAX) unnecessarily; size VARCHAR columns to the data.
- Don't ignore VACUUM and ANALYZE; run them regularly for consistent performance.

Key Takeaways
- Architecture: Leader node (planning) + compute nodes (execution)
- Distribution: EVEN, KEY, ALL, AUTO - choose based on join patterns
- Sort Keys: Compound (single column) vs Interleaved (multi-column)
- Performance: Use WLM, materialized views, appropriate distribution/sort keys
- Spectrum: Query S3 data directly without loading
- ML: Train and use ML models with SQL (Redshift ML)
- Cost: Use RA3 for large storage, reserved instances for production
- Security: Encryption at rest, SSL in transit, RLS for row-level security