
Snowflake Guide

Multi-Cloud Data Warehouse


Overview

Snowflake is a fully managed, multi-cloud data warehouse built on a unique architecture that separates compute and storage. It enables elastic performance, zero-copy cloning, and secure data sharing across clouds and organizations.
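
Because compute (virtual warehouses) and storage are decoupled, you can resize or suspend compute without touching stored data. A minimal sketch, assuming a warehouse named etl_wh (the name is illustrative):

-- Create a warehouse; storage is managed separately and is unaffected
CREATE WAREHOUSE etl_wh
WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE;
-- Scale compute up or down on demand; no data movement is involved
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE';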


Snowflake Architecture

Multi-Cluster Architecture

Key Innovations:

  • Compute-Storage Separation: Scale independently
  • Multi-Cluster Warehouses: Auto-scale for concurrency
  • Zero-Copy Cloning: Instant data duplication
  • Time Travel: Query historical data
  • Data Sharing: Share data across accounts

Snowflake Table Design

Clustering Keys

-- Clustering for optimized queries
CREATE TABLE sales (
sale_id BIGINT IDENTITY(1,1),
customer_id BIGINT NOT NULL,
product_id BIGINT NOT NULL,
sale_date DATE NOT NULL,
amount DECIMAL(18,2) NOT NULL,
region VARCHAR(50)
)
CLUSTER BY (sale_date, customer_id);
-- Add or change the clustering key on an existing table
-- (Automatic Clustering then maintains it in the background)
ALTER TABLE sales CLUSTER BY (sale_date, customer_id);
-- Check clustering depth
SELECT
SYSTEM$CLUSTERING_INFORMATION('sales');
-- Benefits:
-- - Data skipping (fewer micro-partitions scanned)
-- - Faster queries
-- - Lower cost (fewer credits used)

Materialized Views

-- Materialized view for pre-computed results
CREATE MATERIALIZED VIEW mv_customer_daily_sales AS
SELECT
customer_id,
sale_date,
COUNT(*) as sales_count,
SUM(amount) as total_amount,
AVG(amount) as avg_amount
FROM sales
GROUP BY customer_id, sale_date;
-- Query materialized view
SELECT * FROM mv_customer_daily_sales
WHERE sale_date >= '2025-01-01';
-- Benefits:
-- - Pre-computed results
-- - Automatic refresh
-- - Query rewrite (automatic)
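
Because of automatic query rewrite, even an aggregate written against the base table may be served from the materialized view when the optimizer finds it cheaper; a hedged example using the tables defined above:

-- May be transparently rewritten to use mv_customer_daily_sales
SELECT customer_id, SUM(amount) AS total_amount
FROM sales
WHERE sale_date >= '2025-01-01'
GROUP BY customer_id;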

Snowflake Performance Optimization

Query Optimization

-- 1. Use appropriate warehouse size
-- X-Small (1 credit/hr) - Development, testing
-- Medium (4 credits/hr) - Production analytics
-- Large (8 credits/hr) - Heavy workloads
-- 4X-Large (128 credits/hr) - Maximum performance
-- 2. Use result caching
-- Snowflake automatically caches query results for 24 hours
-- Re-running the same query = 0 credits (cache hit)
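-- Minimal illustration (result reuse requires identical query text,
-- sufficient privileges, and unchanged underlying data):
SELECT region, SUM(amount) FROM sales GROUP BY region; -- first run scans the table
SELECT region, SUM(amount) FROM sales GROUP BY region; -- repeat run is served from the result cache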
-- 3. Use materialized views
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT
sale_date,
SUM(amount) as total_revenue,
COUNT(*) as sales_count
FROM sales
GROUP BY sale_date;
-- 4. Use clustering keys
-- Reduces micro-partitions scanned
-- 5. Use search optimization
ALTER TABLE sales ADD SEARCH OPTIMIZATION;
-- Benefits:
-- - Faster selective point lookups (equality and IN predicates)
-- - Fewer micro-partitions scanned, so less warehouse time
-- - Best for queries like SELECT ... WHERE col = value on large tables

Warehouse Sizing

Warehouse Selection:

Warehouse    Credits/Hour   Use Case
X-Small      1              Development, testing
Small        2              Light ETL, small teams
Medium       4              Production analytics
Large        8              Heavy workloads
X-Large      16             High concurrency
2X-Large     32             Enterprise workloads
3X-Large     64             Very large workloads
4X-Large     128            Maximum performance
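
Sizing is not permanent: each step up roughly doubles both compute and credits per hour, so a common pattern is to start small and resize when queries queue or run long. A sketch (the warehouse name is illustrative):

-- Resize an existing warehouse up for a heavy batch window, then back down
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
-- ... run the heavy batch ...
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'MEDIUM';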

Snowflake Cost Optimization

Pricing Model

Snowflake bills compute as credits consumed per second while a virtual warehouse runs (with a 60-second minimum each time it resumes), storage at a flat rate per compressed terabyte per month, and serverless features (Automatic Clustering, Search Optimization, Query Acceleration) separately. The dollar price per credit depends on edition and cloud region. As a rough example, a Medium warehouse (4 credits/hour) running 2 hours a day uses about 240 credits over a 30-day month; at an illustrative $3 per credit that is roughly $720 of compute, before storage.

Cost Optimization Strategies

-- 1. Use auto-suspend
ALTER WAREHOUSE compute_wh SET
AUTO_SUSPEND = 300 -- Suspend after 5 minutes of inactivity
AUTO_RESUME = TRUE; -- Auto-resume when a query arrives
-- 2. Use multi-cluster warehouses for concurrency
-- (Multi-cluster warehouses require Enterprise Edition or higher)
CREATE WAREHOUSE analytics_wh
WAREHOUSE_SIZE = 'MEDIUM'
MAX_CLUSTER_COUNT = 10
MIN_CLUSTER_COUNT = 1
SCALING_POLICY = 'STANDARD'; -- Start extra clusters as queries queue; 'ECONOMY' conserves credits
-- 3. Use smaller warehouses for development
-- Use X-Small for dev, Medium for production
-- 4. Use result caching
-- Cached results = 0 credits
-- 5. Use the Query Acceleration Service
-- Offloads eligible large scans to serverless compute
ALTER WAREHOUSE compute_wh SET
ENABLE_QUERY_ACCELERATION = TRUE
QUERY_ACCELERATION_MAX_SCALE_FACTOR = 4;
-- Benefits:
-- - Speeds up outlier queries with large scans
-- - Serverless: you pay only when acceleration is actually used
-- - Can let you keep a smaller warehouse size

Snowflake Time Travel

Query Historical Data

-- Time Travel: Query historical data
-- Retention: 1 day by default; up to 90 days for permanent tables on Enterprise Edition and higher
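-- Retention is a table/schema/database parameter; extending it beyond
-- 1 day requires Enterprise Edition or higher:
ALTER TABLE sales SET DATA_RETENTION_TIME_IN_DAYS = 90;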
-- Query data from 1 hour ago
SELECT * FROM sales
AT(OFFSET => -60 * 60) -- 1 hour ago
WHERE sale_date = '2025-01-27';
-- Query data from a specific timestamp
SELECT * FROM sales
AT(TIMESTAMP => '2025-01-27 09:00:00'::TIMESTAMP)
WHERE sale_date >= '2025-01-27';
-- Restore dropped table
UNDROP TABLE sales;
-- Clone table at a specific point in time
CREATE TABLE sales_010125 CLONE sales
AT(TIMESTAMP => '2025-01-01 00:00:00'::TIMESTAMP);
-- Benefits:
-- - Restore accidental deletes
-- - Audit data changes
-- - Debug production issues

Snowflake Zero-Copy Cloning

Instant Data Duplication

-- Clone entire database
CREATE DATABASE dev CLONE prod;
-- Clone schema
CREATE SCHEMA dev.analytics CLONE prod.analytics;
-- Clone table
CREATE TABLE sales_backup CLONE sales;
-- Clone at a specific point in time
CREATE TABLE sales_yesterday CLONE sales
BEFORE(TIMESTAMP => '2025-01-26 00:00:00'::TIMESTAMP);
-- Benefits:
-- - Instant (no data copy)
-- - No additional storage cost (initially)
-- - Independent modifications
-- - Use cases: Dev/test, data sandbox, backup
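
The clone is immediately writable and diverges from the source; only micro-partitions changed after the clone consume extra storage. For instance, using the sales_backup clone created above:

-- Changes to the clone do not affect the original table
UPDATE sales_backup SET amount = 0 WHERE sale_id = 1;
-- The source is untouched
SELECT amount FROM sales WHERE sale_id = 1;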

Snowflake Data Sharing

Secure Data Sharing

-- Create a share
CREATE SHARE sales_share;
-- Grant access to objects
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.sales TO SHARE sales_share;
-- Add consumers
ALTER SHARE sales_share ADD ACCOUNTS = consumer_account_1, consumer_account_2;
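-- On the consumer account, the share is mounted as a read-only database
-- (provider_account is an illustrative provider account identifier)
CREATE DATABASE sales_shared FROM SHARE provider_account.sales_share;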
-- Benefits:
-- - No data copying
-- - Real-time access
-- - Secure (grant-based)
-- - Monetize data (data marketplace)

Snowflake Streams and Tasks

Change Data Capture (CDC)

-- 1. Create a stream on a table
CREATE STREAM sales_stream ON TABLE sales;
-- 2. Query the stream for changes
SELECT * FROM sales_stream;
-- Stream metadata:
-- - METADATA$ACTION: INSERT, DELETE
-- - METADATA$ISUPDATE: TRUE/FALSE
-- - METADATA$ROW_ID: Unique row identifier
-- 3. Create a task for scheduled execution
CREATE TASK merge_sales_task
WAREHOUSE = compute_wh
SCHEDULE = 'USING CRON 0 * * * * America/Los_Angeles' -- Top of every hour
AS
MERGE INTO sales_target t
USING sales_stream s
ON t.sale_id = s.sale_id
WHEN MATCHED AND METADATA$ACTION = 'DELETE' THEN
DELETE
WHEN MATCHED AND METADATA$ACTION = 'INSERT' THEN
UPDATE SET amount = s.amount
WHEN NOT MATCHED AND METADATA$ACTION = 'INSERT' THEN
INSERT (sale_id, customer_id, product_id, sale_date, amount)
VALUES (s.sale_id, s.customer_id, s.product_id, s.sale_date, s.amount);
-- 4. Resume the task (tasks are created in a suspended state)
ALTER TASK merge_sales_task RESUME;
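-- Optional: make the task skip runs with nothing to process by adding
--   WHEN SYSTEM$STREAM_HAS_DATA('sales_stream')
-- to the task definition; the same function can also be checked ad hoc:
SELECT SYSTEM$STREAM_HAS_DATA('sales_stream'); -- TRUE if the stream has unconsumed changes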
-- Benefits:
-- - Real-time CDC
-- - No additional ETL tools
-- - Native Snowflake functionality

Snowflake Security

Access Control

-- 1. Role-based access control (RBAC)
CREATE ROLE data_engineer;
CREATE ROLE data_analyst;
CREATE ROLE readonly_user;
-- Grant privileges
GRANT ROLE data_engineer TO USER user1;
GRANT ROLE data_analyst TO USER user2;
GRANT ROLE readonly_user TO USER user3;
-- Grant schema access
GRANT USAGE ON DATABASE sales_db TO ROLE data_engineer;
GRANT CREATE ON SCHEMA sales_db.public TO ROLE data_engineer;
-- Grant table access
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE data_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE readonly_user;
-- 2. Row-level security (RLS)
CREATE ROW ACCESS POLICY us_only_policy AS (region_column VARCHAR) RETURNS BOOLEAN ->
CASE
WHEN CURRENT_ROLE() = 'ADMIN' THEN TRUE -- Admin can see all rows
WHEN region_column = 'US' THEN TRUE
ELSE FALSE
END;
-- Apply RLS to table
ALTER TABLE sales ADD ROW ACCESS POLICY us_only_policy ON (region);
-- 3. Column-level security
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('ADMIN', 'SUPPORT') THEN val
ELSE REGEXP_REPLACE(val, '^.*@', '*****@') -- Mask the local part, keep the domain
END;
-- Apply masking policy
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;

Encryption

-- Snowflake automatic encryption:
-- - All data encrypted at rest (AES-256)
-- - All data encrypted in transit (TLS)
-- - No configuration required
-- Customer-managed keys via Tri-Secret Secure (Business Critical edition)
-- Your key lives in the cloud KMS (AWS KMS, Google Cloud KMS, Azure Key Vault)
-- and is combined with a Snowflake-managed key; contact Snowflake to enable it

Snowflake Monitoring

Query Performance

-- 1. Query history
SELECT
query_id,
user_name,
query_text,
start_time,
end_time,
total_elapsed_time / 1000 as duration_seconds,
warehouse_name,
warehouse_size,
execution_status
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(days, -7, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;
-- 2. Identify expensive queries
SELECT
query_text,
SUM(total_elapsed_time) / 1000 as total_duration_seconds,
COUNT(*) as execution_count,
AVG(total_elapsed_time) / 1000 as avg_duration_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(days, -7, CURRENT_TIMESTAMP())
GROUP BY query_text
ORDER BY total_duration_seconds DESC
LIMIT 10;
-- 3. Warehouse utilization (credits)
SELECT
warehouse_name,
SUM(credits_used) as total_credits,
AVG(credits_used) as avg_credits_per_hour
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(days, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY total_credits DESC;
-- For concurrency, see snowflake.account_usage.warehouse_load_history
-- (avg_running and avg_queued_load per interval)

Snowflake vs. Alternatives

Feature        Snowflake                    BigQuery            Redshift
Architecture   Compute-storage separation   Serverless          Shared-nothing
Clouds         AWS, GCP, Azure              GCP only            AWS only
Pricing        $2-6/TB                      $5/TB scanned       $5/TB + cluster
Concurrency    Multi-cluster                Unlimited           Limited by cluster
Time Travel    Up to 90 days                7 days              None
Data Sharing   Native shares                Analytics Hub       Datashares
Best For       Multi-cloud, data sharing    Serverless, cost    AWS ecosystem

Best Practices

DO

-- 1. Use clustering keys
CREATE TABLE sales (
sale_id BIGINT,
customer_id BIGINT,
sale_date DATE
)
CLUSTER BY (sale_date, customer_id);
-- 2. Use auto-suspend
ALTER WAREHOUSE compute_wh SET AUTO_SUSPEND = 300;
-- 3. Use result caching
-- Query caching is automatic
-- 4. Use streams and tasks for CDC
CREATE STREAM sales_stream ON TABLE sales;
-- 5. Use zero-copy cloning for dev/test
CREATE DATABASE dev CLONE prod;

DON’T

-- 1. Don't over-provision warehouses
-- Use auto-scaling for variable workloads
-- 2. Don't ignore clustering
-- Unclustered tables → Poor performance
-- 3. Don't forget auto-suspend
-- Idle warehouses waste credits
-- 4. Don't use SELECT * on large tables
-- Select only needed columns (see the sketch after this list)
-- 5. Don't ignore query history
-- Monitor for expensive queries
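
For point 4 above, selecting only the needed columns lets Snowflake's columnar storage skip the rest, and filtering on the clustering key prunes micro-partitions; a small sketch:

-- Prefer explicit columns and a filter on the clustering key
SELECT sale_date, customer_id, amount
FROM sales
WHERE sale_date >= '2025-01-01';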

Key Takeaways

  1. Multi-cloud: AWS, GCP, Azure support
  2. Compute-Storage separation: Scale independently
  3. Zero-Copy Cloning: Instant data duplication
  4. Time Travel: Query historical data (1-90 days)
  5. Data Sharing: Secure, real-time data sharing
  6. Streams and Tasks: Native CDC functionality
  7. Cost: $2-6/TB, use auto-suspend and multi-cluster
  8. Security: RBAC, RLS, masking, automatic encryption

Back to Module 3