Snowflake Guide
Multi-Cloud Data Warehouse
Overview
Snowflake is a fully managed, multi-cloud data warehouse built on a unique architecture that separates compute and storage. It enables elastic performance, zero-copy cloning, and secure data sharing across clouds and organizations.
Snowflake Architecture
Multi-Cluster Architecture
Key Innovations:
- Compute-Storage Separation: Scale compute and storage independently (see the sketch after this list)
- Multi-Cluster Warehouses: Auto-scale for concurrency
- Zero-Copy Cloning: Instant data duplication
- Time Travel: Query historical data
- Data Sharing: Share data across accounts
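As a minimal sketch of compute-storage separation (the warehouse names are illustrative, not from this guide), two independently sized virtual warehouses can run against the same stored data at the same time:

```sql
-- Two warehouses, sized and suspended independently, reading the same storage.
CREATE WAREHOUSE IF NOT EXISTS etl_wh      -- illustrative name
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;

CREATE WAREHOUSE IF NOT EXISTS bi_wh       -- illustrative name
  WAREHOUSE_SIZE = 'X-SMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

-- Resizing one warehouse has no effect on the other or on the stored data.
ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'SMALL';
```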
Snowflake Table Design
Clustering Keys
```sql
-- Clustering keys for optimized queries
CREATE TABLE sales (
    sale_id     BIGINT IDENTITY(1,1),
    customer_id BIGINT NOT NULL,
    product_id  BIGINT NOT NULL,
    sale_date   DATE NOT NULL,
    amount      DECIMAL(18,2) NOT NULL,
    region      VARCHAR(50)
)
CLUSTER BY (sale_date, customer_id);

-- Add a clustering key to an existing table
-- (Automatic Clustering then maintains it in the background)
ALTER TABLE sales CLUSTER BY (sale_date, customer_id);

-- Check clustering depth
SELECT SYSTEM$CLUSTERING_INFORMATION('sales');
```

Benefits:
- Data skipping (fewer micro-partitions scanned)
- Faster queries
- Lower cost (fewer credits used)

Materialized Views
```sql
-- Materialized view for pre-computed results
CREATE MATERIALIZED VIEW mv_customer_daily_sales AS
SELECT customer_id,
       sale_date,
       COUNT(*)    AS sales_count,
       SUM(amount) AS total_amount,
       AVG(amount) AS avg_amount
FROM sales
GROUP BY customer_id, sale_date;

-- Query the materialized view directly
SELECT *
FROM mv_customer_daily_sales
WHERE sale_date >= '2025-01-01';
```

Benefits:
- Pre-computed results
- Automatic background refresh
- Automatic query rewrite (the optimizer can use the view even when you query the base table)

Snowflake Performance Optimization
Query Optimization
```sql
-- 1. Use an appropriately sized warehouse
--    X-Small  (1 credit/hr)    - Development
--    Medium   (4 credits/hr)   - Testing
--    Large    (8 credits/hr)   - Production
--    4X-Large (128 credits/hr) - Heavy workloads

-- 2. Use result caching
--    Snowflake automatically caches query results for 24 hours
--    Re-running the same query = 0 credits (cache hit)
```
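When benchmarking, the result cache can hide the true cost of a query. A small sketch using the `USE_CACHED_RESULT` session parameter (the warehouse and query are illustrative):

```sql
-- Disable the result cache for this session to measure real compute time.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
SELECT COUNT(*) FROM sales WHERE sale_date >= '2025-01-01';

-- Re-enable it; an identical query re-run within 24 hours is then served
-- from the result cache and consumes no warehouse credits.
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
SELECT COUNT(*) FROM sales WHERE sale_date >= '2025-01-01';
```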
```sql
-- 3. Use materialized views
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT sale_date,
       SUM(amount) AS total_revenue,
       COUNT(*)    AS sales_count
FROM sales
GROUP BY sale_date;

-- 4. Use clustering keys
--    Reduces the number of micro-partitions scanned
```
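To see clustering pay off, filter on the leading clustering key; a short sketch (the benefit shows up as fewer partitions scanned in the query profile):

```sql
-- With CLUSTER BY (sale_date, customer_id), this range filter on sale_date
-- lets Snowflake prune micro-partitions instead of scanning the whole table.
SELECT customer_id, SUM(amount) AS weekly_spend
FROM sales
WHERE sale_date BETWEEN '2025-01-01' AND '2025-01-07'
GROUP BY customer_id;
```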
```sql
-- 5. Use the search optimization service
ALTER TABLE sales ADD SEARCH OPTIMIZATION;
```

Benefits:
- Faster point lookups (equality filters on selective columns)
- Better performance for selective subqueries
- Useful for queries such as SELECT ... WHERE col = value

Warehouse Sizing
Warehouse Selection (a resize example follows the table):
| Warehouse | Credits/Hour | Use Case |
|---|---|---|
| X-Small | 1 | Development, testing |
| Small | 2 | Light ETL, small teams |
| Medium | 4 | Production analytics |
| Large | 8 | Heavy workloads |
| X-Large | 16 | High concurrency |
| 2X-Large | 32 | Enterprise workloads |
| 4X-Large | 128 | Maximum performance |
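Warehouses can be resized at any time, even while running, so you can move up or down this table as the workload changes. A minimal sketch (the warehouse name is taken from the earlier examples):

```sql
-- Scale up for a heavy batch window, then back down afterwards.
ALTER WAREHOUSE compute_wh SET WAREHOUSE_SIZE = 'LARGE';

-- ... run the heavy workload ...

ALTER WAREHOUSE compute_wh SET WAREHOUSE_SIZE = 'X-SMALL';
```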
Snowflake Cost Optimization
Pricing Model

Snowflake bills compute per second in warehouse credits (with a 60-second minimum each time a warehouse starts or resumes) and bills storage separately at a flat monthly rate per compressed terabyte.
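To translate credits into an approximate bill, you can aggregate the ACCOUNT_USAGE metering view; the $3.00-per-credit rate below is purely illustrative, since the actual rate depends on edition, cloud, and region:

```sql
-- Approximate compute spend per warehouse over the last 30 days.
SELECT warehouse_name,
       SUM(credits_used)        AS credits,
       SUM(credits_used) * 3.00 AS approx_usd   -- assumed illustrative rate
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC;
```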
Cost Optimization Strategies
```sql
-- 1. Use auto-suspend
ALTER WAREHOUSE compute_wh SET
  AUTO_SUSPEND = 300   -- Suspend after 5 minutes of inactivity
  AUTO_RESUME = TRUE;  -- Auto-resume on the next query

-- 2. Use multi-cluster warehouses for concurrency (Enterprise Edition)
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 10
  SCALING_POLICY    = 'STANDARD';  -- Add/remove clusters based on load
```
```sql
-- 3. Use smaller warehouses for development
--    e.g., X-Small for dev, Medium for production

-- 4. Use result caching
--    Cached results = 0 credits

-- 5. Use the query acceleration service
--    Offloads eligible portions of large scans to serverless compute
ALTER WAREHOUSE compute_wh SET
  ENABLE_QUERY_ACCELERATION = TRUE
  QUERY_ACCELERATION_MAX_SCALE_FACTOR = 4;
```
Benefits:
- Acceleration compute is billed only when it is actually used
- Speeds up outlier queries without resizing the whole warehouse
- Reduces overall warehouse uptime

Snowflake Time Travel
Query Historical Data
Time Travel lets you query, clone, and restore data as it existed in the past. Retention defaults to 1 day; Standard Edition supports at most 1 day, while Enterprise Edition and higher support up to 90 days, configurable per object.
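Retention is set with the `DATA_RETENTION_TIME_IN_DAYS` parameter; a short sketch (the 30-day value is illustrative and requires Enterprise Edition or higher):

```sql
-- Extend Time Travel retention for this table (the account, database, and
-- schema levels accept the same parameter).
ALTER TABLE sales SET DATA_RETENTION_TIME_IN_DAYS = 30;

-- Check the current setting.
SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN TABLE sales;
```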
```sql
-- Query data as of 1 hour ago
SELECT *
FROM sales AT(OFFSET => -60 * 60)   -- offset in seconds
WHERE sale_date = '2025-01-27';

-- Query data as of a specific timestamp
SELECT *
FROM sales AT(TIMESTAMP => '2025-01-27 09:00:00'::TIMESTAMP)
WHERE sale_date >= '2025-01-27';

-- Restore a dropped table
UNDROP TABLE sales;

-- Clone a table as of a specific point in time
CREATE TABLE sales_010125 CLONE sales
  AT(TIMESTAMP => '2025-01-01 00:00:00'::TIMESTAMP);
```
Benefits:
- Restore accidental deletes
- Audit data changes
- Debug production issues

Snowflake Zero-Copy Cloning
Instant Data Duplication
```sql
-- Clone an entire database
CREATE DATABASE dev CLONE prod;

-- Clone a schema
CREATE SCHEMA dev.public CLONE prod.public;

-- Clone a table
CREATE TABLE sales_backup CLONE sales;

-- Clone a table as of a specific point in time
CREATE TABLE sales_yesterday CLONE sales
  BEFORE(TIMESTAMP => '2025-01-26 00:00:00'::TIMESTAMP);
```
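Because a clone initially shares all micro-partitions with its source, it occupies essentially no extra storage until rows are modified. A sketch for checking this via the ACCOUNT_USAGE storage view (note that the view can lag by a couple of hours):

```sql
-- ACTIVE_BYTES for a fresh clone stays near zero; CLONE_GROUP_ID ties the
-- clone back to its source table.
SELECT table_catalog, table_schema, table_name, active_bytes, clone_group_id
FROM snowflake.account_usage.table_storage_metrics
WHERE table_name IN ('SALES', 'SALES_BACKUP');
```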
Benefits:
- Instant (no data copy)
- No additional storage cost initially
- Clone and source can be modified independently
- Use cases: dev/test environments, data sandboxes, backups

Snowflake Data Sharing
Secure Data Sharing
```sql
-- Create a share
CREATE SHARE sales_share;

-- Grant access to objects
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.sales TO SHARE sales_share;

-- Add consumer accounts
ALTER SHARE sales_share ADD ACCOUNTS = consumer_account_1, consumer_account_2;
```
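On the consumer side, the share surfaces as a read-only database; a sketch, with the provider account identifier being illustrative:

```sql
-- Run in the consumer account: mount the share as a database and query it.
CREATE DATABASE sales_shared FROM SHARE provider_account.sales_share;

SELECT COUNT(*) FROM sales_shared.public.sales;
```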
Benefits:
- No data copying
- Real-time access for consumers
- Secure, grant-based access
- Monetize data via the Snowflake Marketplace

Snowflake Streams and Tasks
Change Data Capture (CDC)
```sql
-- 1. Create a stream on a table
CREATE STREAM sales_stream ON TABLE sales;

-- 2. Query the stream for changes (reading a stream does not advance it;
--    only DML that consumes the stream inside a transaction does)
SELECT * FROM sales_stream;

-- Stream metadata columns:
--   METADATA$ACTION:   INSERT or DELETE
--   METADATA$ISUPDATE: TRUE when the row pair represents an update
--   METADATA$ROW_ID:   Unique row identifier
```
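A quick way to see the metadata columns in action: update a row in the base table and inspect the stream (the `sale_id` value is illustrative). A standard stream reports the update as a DELETE/INSERT pair with METADATA$ISUPDATE = TRUE:

```sql
-- Modify one row, then look at how the change is represented in the stream.
UPDATE sales SET amount = amount * 1.1 WHERE sale_id = 42;

SELECT sale_id, amount, METADATA$ACTION, METADATA$ISUPDATE
FROM sales_stream
WHERE sale_id = 42;
```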
```sql
-- 3. Create a task for scheduled processing of the stream
CREATE TASK merge_sales_task
  WAREHOUSE = compute_wh
  SCHEDULE = 'USING CRON 0 * * * * America/Los_Angeles'  -- Every hour, on the hour
AS
  MERGE INTO sales_target t
  USING sales_stream s
    ON t.sale_id = s.sale_id
  WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' THEN DELETE
  WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    UPDATE SET amount = s.amount
  WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    INSERT (sale_id, customer_id, product_id, sale_date, amount)
    VALUES (s.sale_id, s.customer_id, s.product_id, s.sale_date, s.amount);

-- 4. Resume the task (tasks are created in a suspended state)
ALTER TASK merge_sales_task RESUME;
```
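To confirm the task is actually firing, the INFORMATION_SCHEMA table function TASK_HISTORY shows recent and scheduled runs:

```sql
-- Recent executions of the task, newest first.
SELECT name, state, scheduled_time, completed_time, error_message
FROM TABLE(information_schema.task_history(task_name => 'MERGE_SALES_TASK'))
ORDER BY scheduled_time DESC
LIMIT 20;
```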
Benefits:
- Real-time CDC
- No additional ETL tools
- Native Snowflake functionality

Snowflake Security
Access Control
```sql
-- 1. Role-based access control (RBAC)
CREATE ROLE data_engineer;
CREATE ROLE data_analyst;
CREATE ROLE readonly_user;

-- Grant roles to users
GRANT ROLE data_engineer TO USER user1;
GRANT ROLE data_analyst TO USER user2;
GRANT ROLE readonly_user TO USER user3;

-- Grant database and schema privileges
GRANT USAGE ON DATABASE sales_db TO ROLE data_engineer;
GRANT USAGE ON SCHEMA sales_db.public TO ROLE data_engineer;
GRANT CREATE TABLE, CREATE VIEW ON SCHEMA sales_db.public TO ROLE data_engineer;

-- Grant table access
-- (Reader roles also need USAGE on the database and schema to query tables.)
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE data_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE readonly_user;
```
```sql
-- 2. Row-level security: row access policy
CREATE ROW ACCESS POLICY us_only_policy
  AS (region_column VARCHAR) RETURNS BOOLEAN ->
    CASE
      WHEN CURRENT_ROLE() = 'ADMIN' THEN TRUE   -- Admin can see all rows
      WHEN region_column = 'US'     THEN TRUE
      ELSE FALSE
    END;

-- Apply the row access policy to the table
ALTER TABLE sales ADD ROW ACCESS POLICY us_only_policy ON (region);
```
```sql
-- 3. Column-level security: masking policy
CREATE MASKING POLICY email_mask
  AS (val STRING) RETURNS STRING ->
    CASE
      WHEN CURRENT_ROLE() IN ('ADMIN', 'SUPPORT') THEN val
      ELSE REGEXP_REPLACE(val, '^[^@]+', '*****')   -- Mask the local part of the email
    END;

-- Apply the masking policy to a column
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;
```

Encryption
Snowflake encrypts everything automatically:
- All data encrypted at rest (AES-256)
- All data encrypted in transit (TLS)
- No configuration required

For customer-managed keys (often called BYOK), Snowflake offers Tri-Secret Secure on Business Critical edition, which combines a Snowflake-maintained key with a key you hold in your cloud provider's KMS (AWS KMS, Azure Key Vault, or Google Cloud KMS).

Snowflake Monitoring
Query Performance
```sql
-- 1. Query history for the last 7 days
-- (Per-warehouse credit usage comes from WAREHOUSE_METERING_HISTORY,
--  not from query_history.)
SELECT query_id,
       user_name,
       warehouse_name,
       query_text,
       start_time,
       end_time,
       total_elapsed_time / 1000 AS duration_seconds,
       execution_status
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;
```
```sql
-- 2. Identify expensive queries
SELECT query_text,
       SUM(total_elapsed_time) / 1000 AS total_duration_seconds,
       COUNT(*)                       AS execution_count,
       AVG(total_elapsed_time) / 1000 AS avg_duration_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY query_text
ORDER BY total_duration_seconds DESC
LIMIT 10;
```
```sql
-- 3. Warehouse utilization and load
-- (Sustained queued load suggests the warehouse is undersized or needs more clusters.)
SELECT warehouse_name,
       AVG(avg_running)     AS avg_running_queries,
       AVG(avg_queued_load) AS avg_queued_queries
FROM snowflake.account_usage.warehouse_load_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name;
```

Snowflake vs. Alternatives
| Feature | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Architecture | Compute-Storage separation | Serverless | Shared-nothing |
| Clouds | AWS, GCP, Azure | GCP only | AWS only |
| Pricing | $2-6/TB | $5/TB | $5/TB + cluster |
| Concurrency | Multi-cluster | Unlimited | Limited by cluster |
| Time Travel | Up to 90 days | Up to 7 days | None |
| Data Sharing | Native (Secure Data Sharing) | Analytics Hub | Datashares |
| Best For | Multi-cloud, data sharing | Serverless, cost | AWS ecosystem |
Best Practices
DO
```sql
-- 1. Use clustering keys
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE
)
CLUSTER BY (sale_date, customer_id);

-- 2. Use auto-suspend
ALTER WAREHOUSE compute_wh SET AUTO_SUSPEND = 300;

-- 3. Use result caching
--    (Query result caching is automatic)

-- 4. Use streams and tasks for CDC
CREATE STREAM sales_stream ON TABLE sales;

-- 5. Use zero-copy cloning for dev/test
CREATE DATABASE dev CLONE prod;
```

DON’T
```sql
-- 1. Don't over-provision warehouses
--    Use auto-scaling (multi-cluster) for variable workloads

-- 2. Don't ignore clustering
--    Unclustered large tables lead to full scans and poor performance

-- 3. Don't forget auto-suspend
--    Idle warehouses waste credits

-- 4. Don't use SELECT * on large tables
--    Select only the columns you need

-- 5. Don't ignore query history
--    Monitor for expensive and frequently repeated queries
```

Key Takeaways
- Multi-cloud: AWS, GCP, Azure support
- Compute-Storage separation: Scale independently
- Zero-Copy Cloning: Instant data duplication
- Time Travel: Query historical data (1 day by default, up to 90 days on Enterprise Edition)
- Data Sharing: Secure, real-time data sharing
- Streams and Tasks: Native CDC functionality
- Cost: $2-6/TB, use auto-suspend and multi-cluster
- Security: RBAC, RLS, masking, automatic encryption
Back to Module 3