Google BigQuery Guide
Serverless Data Warehouse
Overview
Google BigQuery is a fully managed, serverless, petabyte-scale data warehouse. It enables SQL-based analysis on massive datasets with zero infrastructure management, excellent performance, and seamless integration with the Google Cloud ecosystem.
BigQuery Architecture
Serverless Architecture
Key Components:
- Dremel: Distributed query execution engine
- Colossus: Distributed file system (Google’s internal)
- Capacitor: Columnar storage format
- Jupiter: High-bandwidth network (1 Petabit/s)
Serverless Benefits:
- No cluster provisioning
- Auto-scaling query execution
- No infrastructure management
- Pay per query, with no idle costs (see the sketch below)
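As a quick illustration, running a query from the Python client needs no provisioning step at all. A minimal sketch, assuming the google-cloud-bigquery library is installed and that a dataset.events table exists (the table name is illustrative):

from google.cloud import bigquery

# No cluster to provision: create a client and run SQL directly
client = bigquery.Client()  # Uses the default project and credentials

query = "SELECT event_type, COUNT(*) AS n FROM dataset.events GROUP BY event_type"
for row in client.query(query).result():  # BigQuery allocates slots automatically
    print(row.event_type, row.n)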
BigQuery Table Design
Partitioning
-- 1. Ingestion-time partitioning (automatic)
CREATE TABLE dataset.events (
  event_id INT64,
  user_id INT64,
  event_type STRING,
  payload JSON
)
PARTITION BY _PARTITIONDATE
OPTIONS (
  partition_expiration_days = 30,   -- Auto-delete old partitions
  require_partition_filter = true   -- Require partition filter in queries
);
-- 2. Time-based partitioning (column-based)
CREATE TABLE dataset.sales (
  sale_id INT64,
  customer_id INT64,
  sale_date DATE,
  amount NUMERIC
)
PARTITION BY sale_date
OPTIONS (
  partition_expiration_days = 365
);
-- 3. Integer range partitioning
CREATE TABLE dataset.users (
  user_id INT64,
  username STRING,
  country STRING
)
PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 3000000, 1000000));

Clustering
-- Clustered table for optimized queries
CREATE TABLE dataset.events_clustered (
  event_id INT64,
  event_time TIMESTAMP,
  user_id INT64,
  event_type STRING,
  payload JSON
)
PARTITION BY DATE(event_time)
CLUSTER BY user_id, event_type   -- Co-locate related rows within partitions
OPTIONS (
  partition_expiration_days = 30
);
-- Benefits:
-- - Faster queries (data skipping)
-- - Lower cost (fewer bytes scanned)
-- - Automatic maintenance
-- Best practice: partition by date, cluster by frequently filtered (often high-cardinality) columns

Nested and Repeated Fields
-- Nested and repeated fields (semi-structured data)
CREATE TABLE dataset.user_events (
  user_id INT64,
  events ARRAY<STRUCT<
    event_time TIMESTAMP,
    event_type STRING,
    properties ARRAY<STRUCT<
      key STRING,
      value STRING
    >>
  >>
);
-- Query nested fields
SELECT
  user_id,
  event.event_time,
  event.event_type,
  prop.key,
  prop.value
FROM dataset.user_events,
UNNEST(events) AS event,
UNNEST(event.properties) AS prop
WHERE event.event_type = 'page_view';
-- Benefits:
-- - No need for separate tables
-- - Preserve data relationships
-- - Efficient storage (columnar)

BigQuery Performance Optimization
Query Optimization
-- 1. Use partition pruning
SELECT user_id, COUNT(*) AS event_count
FROM dataset.events_clustered
WHERE event_time >= '2025-01-01'   -- Partition filter
  AND event_time < '2025-02-01'
GROUP BY user_id;
-- 2. Use clustering for data skipping
SELECT user_id, event_type, COUNT(*) AS event_count
FROM dataset.events_clustered
WHERE event_time >= '2025-01-01'
  AND user_id = 12345        -- Clustered column (data skipping)
  AND event_type = 'click'   -- Clustered column (data skipping)
GROUP BY user_id, event_type;
-- 3. Use SELECT * sparingly
-- Bad: scans every column
SELECT * FROM dataset.events_clustered;
-- Good: select only the columns you need
SELECT user_id, event_time, event_type FROM dataset.events_clustered;
-- 4. Use approximate functions for large datasets
SELECT
  event_type,
  COUNT(*) AS event_count,
  COUNT(DISTINCT user_id) AS exact_distinct_users,
  APPROX_COUNT_DISTINCT(user_id) AS approx_distinct_users   -- Faster, small error tolerance
FROM dataset.events_clustered
GROUP BY event_type;
-- 5. Use materialized views for pre-computed results
CREATE MATERIALIZED VIEW dataset.mv_daily_sales AS
SELECT
  sale_date,
  customer_id,
  COUNT(*) AS sales_count,
  SUM(amount) AS total_amount,
  AVG(amount) AS avg_amount
FROM dataset.sales
GROUP BY sale_date, customer_id;
-- Query the materialized view (BigQuery can also rewrite matching queries automatically)
SELECT * FROM dataset.mv_daily_sales
WHERE sale_date >= '2025-01-01';

Execution Details
from google.cloud import bigquery
# Get query execution plan
client = bigquery.Client()
job = client.query("SELECT user_id, COUNT(*) FROM dataset.events GROUP BY user_id")
# Get execution details
job.result()  # Wait for completion
# Print query statistics
print(f"Bytes processed: {job.total_bytes_processed}")
print(f"Bytes billed: {job.total_bytes_billed}")
print(f"Slot time: {job.slot_millis} ms")
print(f"Cache hit: {job.cache_hit}")
# Detailed execution stats
for stage in job.query_plan:
    print(f"Stage: {stage.name}")
    print(f"  Steps: {len(stage.steps)}")
    print(f"  Avg wait ratio: {stage.wait_ratio_avg}")

BigQuery Cost Optimization
Pricing Model

- On-demand: billed per TB of data scanned by each query; storage is billed separately per GB-month
- Capacity-based: reserve slots via commitments for a predictable monthly cost (see Capacity Commitments below)
- Ingestion: streaming inserts are billed per volume ingested, while batch load jobs from Cloud Storage are free
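Because on-demand queries are billed by bytes scanned, it is worth estimating a query before running it. A minimal sketch using the client library's dry-run mode; the table name and the $5/TB rate are illustrative assumptions, not current list prices:

from google.cloud import bigquery

client = bigquery.Client()

# Dry run: BigQuery validates the query and reports the bytes it would scan, without running it
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT user_id, COUNT(*) FROM dataset.events_clustered GROUP BY user_id",
    job_config=job_config,
)

# Estimate on-demand cost (illustrative $5 per TB rate; check current pricing)
tb_scanned = job.total_bytes_processed / 1e12
print(f"Would scan {job.total_bytes_processed} bytes (~${tb_scanned * 5.00:.4f} on-demand)")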
Cost Optimization Strategies
-- 1. Partition and cluster for cost reduction
-- Partitioned and clustered tables scan less data → lower cost
-- 2. Use _PARTITIONTIME to prune ingestion-time partitioned tables
SELECT user_id, COUNT(*) AS event_count
FROM dataset.events
WHERE _PARTITIONTIME >= TIMESTAMP('2025-01-01')   -- Partition filter
  AND _PARTITIONTIME < TIMESTAMP('2025-02-01')
GROUP BY user_id;
-- 3. Use materialized views
-- BigQuery automatically rewrites eligible queries to use materialized views
-- Reduces cost (pre-computed results)
-- 4. Use query caching
-- Repeated identical queries can be served from cache at no cost
SELECT * FROM dataset.events_clustered WHERE event_time >= '2025-01-01';
-- 5. Limit data scanned
-- Bad: scans the entire table
SELECT * FROM dataset.large_table;
-- Good: the partition/date filter limits the data scanned
-- (note: LIMIT by itself does not reduce bytes scanned)
SELECT * FROM dataset.large_table
WHERE date_column >= '2025-01-01'
LIMIT 1000;
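A related guardrail is to cap how much a single query may bill. The sketch below sets maximum_bytes_billed so BigQuery fails the job rather than scanning past the limit; the 1 GB ceiling and table name are illustrative assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Fail the query instead of billing more than ~1 GB (illustrative ceiling)
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
job = client.query(
    "SELECT * FROM dataset.large_table WHERE date_column >= '2025-01-01'",
    job_config=job_config,
)
rows = job.result()  # Raises an error if the query would exceed the byte limit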
Capacity Commitments

from google.cloud import bigquery_reservation_v1 as reservations
# Create a capacity commitment (slots)
client = reservations.ReservationServiceClient()
# Annual commitment (discounted relative to on-demand)
commitment = reservations.CapacityCommitment()
commitment.slot_count = 100
commitment.plan = reservations.CapacityCommitment.CommitmentPlan.ANNUAL
response = client.create_capacity_commitment(
    parent="projects/my-project/locations/us",
    capacity_commitment=commitment,
)
# Benefits:
# - Predictable monthly cost
# - Discounted relative to on-demand for steady, heavy workloads
# - Dedicated slots for your own workloads
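Purchased slots are put to use by creating a reservation and assigning projects to it. A minimal sketch with the bigquery_reservation_v1 client; the project, location, reservation name, and slot counts are illustrative assumptions:

from google.cloud import bigquery_reservation_v1 as reservations

client = reservations.ReservationServiceClient()
parent = "projects/my-project/locations/us"

# Create a reservation that draws on the committed slots
reservation = reservations.Reservation(slot_capacity=100, ignore_idle_slots=False)
reservation = client.create_reservation(
    parent=parent,
    reservation_id="prod-reservation",
    reservation=reservation,
)

# Route a project's query jobs to the reservation
assignment = reservations.Assignment(
    job_type=reservations.Assignment.JobType.QUERY,
    assignee="projects/my-project",
)
client.create_assignment(parent=reservation.name, assignment=assignment)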
BigQuery ML

Machine Learning with SQL
-- 1. Train a model (logistic regression)
CREATE OR REPLACE MODEL `dataset.customer_churn_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churned'],
  max_iterations = 100,
  learn_rate = 0.4,
  l1_reg = 0.1,
  l2_reg = 0.1
) AS
SELECT
  age,
  tenure,
  monthly_spend,
  support_calls,
  churned
FROM `project.training_data`
WHERE training_split = 'train';
-- 2. Evaluate the model
SELECT *
FROM ML.EVALUATE(
  MODEL `dataset.customer_churn_model`,
  (
    SELECT age, tenure, monthly_spend, support_calls, churned
    FROM `project.training_data`
    WHERE training_split = 'test'
  )
);
-- 3. Predict with the model
SELECT
  customer_id,
  predicted_churned,
  predicted_churned_probs
FROM ML.PREDICT(
  MODEL `dataset.customer_churn_model`,
  (
    SELECT customer_id, age, tenure, monthly_spend, support_calls
    FROM `project.customers`
    WHERE tenure < 12
  )
);
-- 4. Explain predictions (per-row feature attributions)
SELECT *
FROM ML.EXPLAIN_PREDICT(
  MODEL `dataset.customer_churn_model`,
  (
    SELECT customer_id, age, tenure, monthly_spend, support_calls
    FROM `project.customers`
    LIMIT 10
  )
);

Supported ML Models
| Model Type | Use Case | model_type Option |
|---|---|---|
| Linear Regression | Forecasting | LINEAR_REG |
| Logistic Regression | Classification | LOGISTIC_REG |
| K-Means | Clustering | KMEANS |
| Matrix Factorization | Recommendations | MATRIX_FACTORIZATION |
| Boosted Trees (XGBoost) | Classification/Regression | BOOSTED_TREE_CLASSIFIER / BOOSTED_TREE_REGRESSOR |
| TensorFlow (imported) | Deep learning | TENSORFLOW |
BigQuery Streaming
Streaming Inserts
from google.cloud import bigquery
# Streaming insert (low latency)
client = bigquery.Client()
table_id = "project.dataset.events"
# Stream data
rows_to_insert = [
    {"event_id": 1, "user_id": 123, "event_type": "click", "event_time": "2025-01-27T10:00:00"},
    {"event_id": 2, "user_id": 456, "event_type": "view", "event_time": "2025-01-27T10:00:01"},
]
errors = client.insert_rows_json(table_id, rows_to_insert)
if errors:
    print(f"Insert errors: {errors}")
# Streaming inserts:
# - Rows are available for query almost immediately
# - Billed per volume ingested (roughly $0.01 per 200 MB for the legacy insertAll API)
# - Subject to throughput quotas (on the order of 100,000 rows per second per table by default)

Best Practices for Streaming
-- 1. Use partitioned tables for streaming
-- Partition by ingestion time for streaming workloads
-- 2. Use INSERT DML instead of streaming inserts when low latency is not required
INSERT INTO `project.dataset.events` (event_id, user_id, event_type, event_time)
VALUES
  (1, 123, 'click', '2025-01-27T10:00:00'),
  (2, 456, 'view', '2025-01-27T10:00:01');
-- 3. Use load jobs for bulk data (cheaper than streaming)
-- Load from Cloud Storage: free (batch load jobs use a shared slot pool)
-- Streaming inserts: billed per volume ingested
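A batch load job from Cloud Storage can replace streaming for bulk ingestion. A minimal sketch with the google-cloud-bigquery client; the bucket path, file format, and destination table are illustrative assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Load Parquet files from Cloud Storage into a table (batch load jobs are free)
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.parquet",   # Illustrative bucket path
    "project.dataset.events",
    job_config=job_config,
)
load_job.result()  # Wait for the load to complete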
BigQuery Security

Access Control
-- 1. Role-based access control
-- Grant dataset access
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `project.dataset`
TO "user:user@example.com";
-- Grant table access
GRANT `roles/bigquery.dataEditor`
ON TABLE `project.dataset.events`
TO "group:data-engineers@example.com";
-- 2. Column-level masking via a view
-- (managed column-level security uses policy tags; see below)
CREATE OR REPLACE VIEW `project.dataset.users_masked` AS
SELECT
  user_id,
  -- Mask email for everyone except an allow-listed user
  CASE
    WHEN SESSION_USER() = 'admin@example.com' THEN email
    ELSE TO_HEX(SHA256(email))
  END AS masked_email,
  phone_number
FROM `source_dataset.users`;
-- 3. Row-level security (row access policy)
CREATE ROW ACCESS POLICY us_sales_filter
ON `project.dataset.sales`
GRANT TO ("group:us-sales@example.com")
FILTER USING (region = 'US');
-- Note: policy tags (Data Catalog) provide managed column-level security
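Managed column-level security works by attaching Data Catalog policy tags to columns and controlling access to the tags with IAM. A minimal sketch that tags a column via a schema update; the taxonomy/policy-tag path and column name are illustrative assumptions:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("project.dataset.users")

# Attach a policy tag to the email column (taxonomy path is illustrative)
policy = bigquery.PolicyTagList(
    names=["projects/my-project/locations/us/taxonomies/123/policyTags/456"]
)
new_schema = []
for field in table.schema:
    if field.name == "email":
        field = bigquery.SchemaField(
            field.name, field.field_type, mode=field.mode, policy_tags=policy
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])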
Encryption

# Customer-managed encryption keys (CMEK)
from google.cloud import bigquery
client = bigquery.Client()
table = client.get_table("project.dataset.events")
# Configure CMEK
table.encryption_configuration = bigquery.EncryptionConfiguration(
    kms_key_name="projects/my-project/locations/us/keyRings/my-keyRing/cryptoKeys/my-key"
)
table = client.update_table(table, ["encryption_configuration"])
# Benefits:
# - Control your own encryption keys
# - Rotate keys as needed
# - Help meet compliance requirements (e.g., SOC 2, HIPAA)

BigQuery Data Transfer
Automated Data Loading
from google.cloud import bigquery_datatransfer_v1 as datatransfer
# Create transfer config
client = datatransfer.DataTransferServiceClient()
transfer_config = {
    "destination_dataset_id": "my_dataset",
    "display_name": "Daily GCS Load",
    "data_source_id": "google_cloud_storage",   # GCS → BigQuery
    "params": {
        "destination_table_name_template": "events_{run_date}",
        "file_format": "PARQUET",
        "write_disposition": "WRITE_TRUNCATE",
    },
    "schedule_options": {
        "disable_auto_scheduling": False,   # Keep the automatic schedule
    },
    "schedule": "every day 00:00",   # Daily at midnight
}
response = client.create_transfer_config(
    parent="projects/my-project/locations/us",
    transfer_config=transfer_config,
)
# Automated data loading with the Data Transfer Service:
# - Cloud Storage → BigQuery on a schedule (e.g., daily)
# - Amazon S3 → BigQuery via the Amazon S3 connector

BigQuery Monitoring
Query Metrics
from google.cloud import bigquery
# Get query metrics
client = bigquery.Client()
# Query the jobs audit view
query_job = client.query("""
SELECT
  job_id,
  user_email,
  query,
  start_time,
  end_time,
  total_bytes_processed,
  total_slot_ms,
  cache_hit
FROM `region-us.INFORMATION_SCHEMA.JOBS`
WHERE start_time >= TIMESTAMP('2025-01-27')
ORDER BY start_time DESC
LIMIT 100
""")
# Analyze query costs by user (approximate: assumes $5 per TB on-demand)
query_job = client.query("""
SELECT
  user_email,
  SUM(total_bytes_processed) / 1e12 AS tb_processed,
  SUM(total_bytes_processed) / 1e12 * 5.00 AS approx_cost_usd
FROM `region-us.INFORMATION_SCHEMA.JOBS`
WHERE start_time >= TIMESTAMP('2025-01-01')
GROUP BY user_email
ORDER BY approx_cost_usd DESC
""")

BigQuery vs. Alternatives
| Feature | BigQuery | Redshift | Snowflake |
|---|---|---|---|
| Architecture | Serverless | Provisioned | Serverless |
| Pricing | $5/TB | $5/TB + cluster | $2-6/TB |
| Concurrency | High (scales with slots) | Limited by cluster size | High (scales with warehouses) |
| Setup Time | Minutes | Hours | Minutes |
| Auto-scaling | Yes | No | Yes |
| Best For | Variable workloads | Predictable workloads | Multi-cloud |
Best Practices
DO
-- 1. Partition and cluster tables
CREATE TABLE dataset.events (
  event_id INT64,
  event_time TIMESTAMP,
  user_id INT64,
  event_type STRING
)
PARTITION BY DATE(event_time)
CLUSTER BY user_id, event_type;
-- 2. Use SELECT * sparingly; select only the columns you need
SELECT user_id, event_time FROM dataset.events;
-- 3. Use approximate functions for large datasets
SELECT APPROX_COUNT_DISTINCT(user_id) FROM dataset.events;
-- 4. Use materialized views for pre-computed results
CREATE MATERIALIZED VIEW dataset.mv_daily_sales AS
SELECT sale_date, SUM(amount) AS total
FROM dataset.sales
GROUP BY sale_date;
-- 5. Monitor query costs
SELECT user_email, SUM(total_bytes_processed)
FROM `region-us.INFORMATION_SCHEMA.JOBS`
GROUP BY user_email;

DON’T
-- 1. Don't use SELECT * on large tables
-- Scans every column → high cost
-- 2. Don't ignore partition filters
-- Without a partition filter → full table scan
-- 3. Don't query wildcard tables without a _TABLE_SUFFIX filter
-- Scans all matching tables → high cost (LIMIT does not reduce bytes scanned)
-- 4. Don't use streaming inserts for bulk data
-- Use load jobs instead (cheaper)
-- 5. Don't forget partition expiration
-- Old partitions accumulate → higher storage cost

Key Takeaways
- Serverless: No cluster management, auto-scaling
- Performance: Partition and cluster for optimal query performance
- Cost: about $5 per TB scanned on-demand; use partitioning and clustering to reduce bytes scanned
- ML: Build ML models directly in SQL (BigQuery ML)
- Streaming: Streaming inserts for real-time data (billed per volume ingested; use load jobs for bulk)
- Security: Column-level security, row-level security, CMEK
- Monitoring: Information Schema for query metrics and costs
- Use Cases: Variable or unpredictable workloads that benefit from a serverless, pay-per-query warehouse
Back to Module 3