Google BigQuery Guide
Serverless Data Warehouse
Overview
Google BigQuery is a fully managed, serverless, petabyte-scale data warehouse. It enables SQL-based analysis on massive datasets with zero infrastructure management, excellent performance, and seamless integration with the Google Cloud ecosystem.
BigQuery Architecture
Serverless Architecture
Key Components:
- Dremel: Distributed query execution engine
- Colossus: Distributed file system (Google’s internal)
- Capacitor: Columnar storage format
- Jupiter: High-bandwidth network (1 Petabit/s)
Serverless Benefits:
- No cluster provisioning
- Auto-scaling query execution
- No infrastructure management
- Pay per query, with no idle costs (see the sketch below)
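As a quick illustration, running a query from the Python client needs no provisioning step at all. A minimal sketch, assuming the google-cloud-bigquery library is installed and that a dataset.events table exists (the table name is illustrative):

from google.cloud import bigquery

# No cluster to provision: create a client and run SQL directly
client = bigquery.Client()  # Uses the default project and credentials

query = "SELECT event_type, COUNT(*) AS n FROM dataset.events GROUP BY event_type"
for row in client.query(query).result():  # BigQuery allocates slots automatically
    print(row.event_type, row.n)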
BigQuery Table Design
Partitioning
-- 1. Ingestion-time partitioning (automatic)
CREATE TABLE dataset.events (
  event_id INT64,
  user_id INT64,
  event_type STRING,
  payload JSON
)
PARTITION BY _PARTITIONDATE
OPTIONS (
  partition_expiration_days = 30,   -- Auto-delete old partitions
  require_partition_filter = true   -- Require partition filter in queries
);
-- 2. Time-based partitioning (column-based)
CREATE TABLE dataset.sales (
  sale_id INT64,
  customer_id INT64,
  sale_date DATE,
  amount NUMERIC
)
PARTITION BY sale_date
OPTIONS (
  partition_expiration_days = 365
);
-- 3. Integer range partitioning
CREATE TABLE dataset.users (
  user_id INT64,
  username STRING,
  country STRING
)
PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 3000000, 1000000));

Clustering
-- Clustered table for optimized queries
CREATE TABLE dataset.events_clustered (
  event_id INT64,
  event_time TIMESTAMP,
  user_id INT64,
  event_type STRING,
  payload JSON
)
PARTITION BY DATE(event_time)
CLUSTER BY user_id, event_type   -- Co-locate related rows within partitions
OPTIONS (
  partition_expiration_days = 30
);
-- Benefits:
-- - Faster queries (data skipping)
-- - Lower cost (fewer bytes scanned)
-- - Automatic maintenance
-- Best practice: partition by date, cluster by frequently filtered (often high-cardinality) columns

Nested and Repeated Fields
-- Nested and repeated fields (semi-structured data)
CREATE TABLE dataset.user_events (
  user_id INT64,
  events ARRAY<STRUCT<
    event_time TIMESTAMP,
    event_type STRING,
    properties ARRAY<STRUCT<
      key STRING,
      value STRING
    >>
  >>
);
-- Query nested fields
SELECT
  user_id,
  event.event_time,
  event.event_type,
  prop.key,
  prop.value
FROM dataset.user_events,
UNNEST(events) AS event,
UNNEST(event.properties) AS prop
WHERE event.event_type = 'page_view';
-- Benefits:
-- - No need for separate tables
-- - Preserve data relationships
-- - Efficient storage (columnar)

BigQuery Performance Optimization
Query Optimization
-- 1. Use partition pruning
SELECT user_id, COUNT(*) AS event_count
FROM dataset.events_clustered
WHERE event_time >= '2025-01-01'   -- Partition filter
  AND event_time < '2025-02-01'
GROUP BY user_id;
-- 2. Use clustering for data skipping
SELECT user_id, event_type, COUNT(*) AS event_count
FROM dataset.events_clustered
WHERE event_time >= '2025-01-01'
  AND user_id = 12345        -- Clustered column (data skipping)
  AND event_type = 'click'   -- Clustered column (data skipping)
GROUP BY user_id, event_type;
-- 3. Use SELECT * sparingly
-- Bad: scans every column
SELECT * FROM dataset.events_clustered;
-- Good: select only the columns you need
SELECT user_id, event_time, event_type FROM dataset.events_clustered;
-- 4. Use approximate functions for large datasets
SELECT
  event_type,
  COUNT(*) AS event_count,
  COUNT(DISTINCT user_id) AS exact_distinct_users,
  APPROX_COUNT_DISTINCT(user_id) AS approx_distinct_users   -- Faster, small error tolerance
FROM dataset.events_clustered
GROUP BY event_type;
-- 5. Use materialized views for pre-computed results
CREATE MATERIALIZED VIEW dataset.mv_daily_sales AS
SELECT
  sale_date,
  customer_id,
  COUNT(*) AS sales_count,
  SUM(amount) AS total_amount,
  AVG(amount) AS avg_amount
FROM dataset.sales
GROUP BY sale_date, customer_id;
-- Query the materialized view (BigQuery can also rewrite matching queries automatically)
SELECT * FROM dataset.mv_daily_sales
WHERE sale_date >= '2025-01-01';

Execution Details
from google.cloud import bigquery
# Get query execution plan
client = bigquery.Client()
job = client.query("SELECT user_id, COUNT(*) FROM dataset.events GROUP BY user_id")
# Get execution details
job.result()  # Wait for completion
# Print query statistics
print(f"Bytes processed: {job.total_bytes_processed}")
print(f"Bytes billed: {job.total_bytes_billed}")
print(f"Slot time: {job.slot_millis} ms")
print(f"Cache hit: {job.cache_hit}")
# Detailed execution stats
for stage in job.query_plan:
    print(f"Stage: {stage.name}")
    print(f"  Steps: {len(stage.steps)}")
    print(f"  Avg wait ratio: {stage.wait_ratio_avg}")

BigQuery Cost Optimization
Pricing Model

- On-demand: billed per TB of data scanned by each query; storage is billed separately per GB-month
- Capacity-based: reserve slots via commitments for a predictable monthly cost (see Capacity Commitments below)
- Ingestion: streaming inserts are billed per volume ingested, while batch load jobs from Cloud Storage are free
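Because on-demand queries are billed by bytes scanned, it is worth estimating a query before running it. A minimal sketch using the client library's dry-run mode; the table name and the $5/TB rate are illustrative assumptions, not current list prices:

from google.cloud import bigquery

client = bigquery.Client()

# Dry run: BigQuery validates the query and reports the bytes it would scan, without running it
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT user_id, COUNT(*) FROM dataset.events_clustered GROUP BY user_id",
    job_config=job_config,
)

# Estimate on-demand cost (illustrative $5 per TB rate; check current pricing)
tb_scanned = job.total_bytes_processed / 1e12
print(f"Would scan {job.total_bytes_processed} bytes (~${tb_scanned * 5.00:.4f} on-demand)")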
Cost Optimization Strategies
-- 1. Partition and cluster for cost reduction
-- Partitioned and clustered tables scan less data → lower cost
-- 2. Use _PARTITIONTIME to prune ingestion-time partitioned tables
SELECT user_id, COUNT(*) AS event_count
FROM dataset.events
WHERE _PARTITIONTIME >= TIMESTAMP('2025-01-01')   -- Partition filter
  AND _PARTITIONTIME < TIMESTAMP('2025-02-01')
GROUP BY user_id;
-- 3. Use materialized views
-- BigQuery automatically rewrites eligible queries to use materialized views
-- Reduces cost (pre-computed results)
-- 4. Use query caching
-- Repeated identical queries can be served from cache at no cost
SELECT * FROM dataset.events_clustered WHERE event_time >= '2025-01-01';
-- 5. Limit data scanned
-- Bad: scans the entire table
SELECT * FROM dataset.large_table;
-- Good: the partition/date filter limits the data scanned
-- (note: LIMIT by itself does not reduce bytes scanned)
SELECT * FROM dataset.large_table
WHERE date_column >= '2025-01-01'
LIMIT 1000;
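A related guardrail is to cap how much a single query may bill. The sketch below sets maximum_bytes_billed so BigQuery fails the job rather than scanning past the limit; the 1 GB ceiling and table name are illustrative assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Fail the query instead of billing more than ~1 GB (illustrative ceiling)
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
job = client.query(
    "SELECT * FROM dataset.large_table WHERE date_column >= '2025-01-01'",
    job_config=job_config,
)
rows = job.result()  # Raises an error if the query would exceed the byte limit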
Capacity Commitments

from google.cloud import bigquery_reservation_v1 as reservations
# Create a capacity commitment (slots)
client = reservations.ReservationServiceClient()
# Annual commitment (discounted relative to on-demand)
commitment = reservations.CapacityCommitment()
commitment.slot_count = 100
commitment.plan = reservations.CapacityCommitment.CommitmentPlan.ANNUAL
response = client.create_capacity_commitment(
    parent="projects/my-project/locations/us",
    capacity_commitment=commitment,
)
# Benefits:
# - Predictable monthly cost
# - Discounted relative to on-demand for steady, heavy workloads
# - Dedicated slots for your own workloads
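Purchased slots are put to use by creating a reservation and assigning projects to it. A minimal sketch with the bigquery_reservation_v1 client; the project, location, reservation name, and slot counts are illustrative assumptions:

from google.cloud import bigquery_reservation_v1 as reservations

client = reservations.ReservationServiceClient()
parent = "projects/my-project/locations/us"

# Create a reservation that draws on the committed slots
reservation = reservations.Reservation(slot_capacity=100, ignore_idle_slots=False)
reservation = client.create_reservation(
    parent=parent,
    reservation_id="prod-reservation",
    reservation=reservation,
)

# Route a project's query jobs to the reservation
assignment = reservations.Assignment(
    job_type=reservations.Assignment.JobType.QUERY,
    assignee="projects/my-project",
)
client.create_assignment(parent=reservation.name, assignment=assignment)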
BigQuery ML

Machine Learning with SQL
-- 1. Train a model (logistic regression)
CREATE OR REPLACE MODEL `dataset.customer_churn_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churned'],
  max_iterations = 100,
  learn_rate = 0.4,
  l1_reg = 0.1,
  l2_reg = 0.1
) AS
SELECT
  age,
  tenure,
  monthly_spend,
  support_calls,
  churned
FROM `project.training_data`
WHERE training_split = 'train';
-- 2. Evaluate the model
SELECT *
FROM ML.EVALUATE(
  MODEL `dataset.customer_churn_model`,
  (
    SELECT age, tenure, monthly_spend, support_calls, churned
    FROM `project.training_data`
    WHERE training_split = 'test'
  )
);
-- 3. Predict with the model
SELECT
  customer_id,
  predicted_churned,
  predicted_churned_probs
FROM ML.PREDICT(
  MODEL `dataset.customer_churn_model`,
  (
    SELECT customer_id, age, tenure, monthly_spend, support_calls
    FROM `project.customers`
    WHERE tenure < 12
  )
);
-- 4. Explain predictions (per-row feature attributions)
SELECT *
FROM ML.EXPLAIN_PREDICT(
  MODEL `dataset.customer_churn_model`,
  (
    SELECT customer_id, age, tenure, monthly_spend, support_calls
    FROM `project.customers`
    LIMIT 10
  )
);

Supported ML Models
| Model Type | Use Case | model_type Option |
|---|---|---|
| Linear Regression | Forecasting | LINEAR_REG |
| Logistic Regression | Classification | LOGISTIC_REG |
| K-Means | Clustering | KMEANS |
| Matrix Factorization | Recommendations | MATRIX_FACTORIZATION |
| Boosted Trees (XGBoost) | Classification/Regression | BOOSTED_TREE_CLASSIFIER / BOOSTED_TREE_REGRESSOR |
| TensorFlow (imported) | Deep learning | TENSORFLOW |
BigQuery Streaming
Streaming Inserts
from google.cloud import bigquery
# Streaming insert (low latency)
client = bigquery.Client()
table_id = "project.dataset.events"
# Stream data
rows_to_insert = [
    {"event_id": 1, "user_id": 123, "event_type": "click", "event_time": "2025-01-27T10:00:00"},
    {"event_id": 2, "user_id": 456, "event_type": "view", "event_time": "2025-01-27T10:00:01"},
]
errors = client.insert_rows_json(table_id, rows_to_insert)
if errors:
    print(f"Insert errors: {errors}")
# Streaming inserts:
# - Rows are available for query almost immediately
# - Billed per volume ingested (roughly $0.01 per 200 MB for the legacy insertAll API)
# - Subject to throughput quotas (on the order of 100,000 rows per second per table by default)

Best Practices for Streaming
-- 1. Use partitioned tables for streaming
-- Partition by ingestion time for streaming workloads
-- 2. Use INSERT DML instead of streaming inserts when low latency is not required
INSERT INTO `project.dataset.events` (event_id, user_id, event_type, event_time)
VALUES
  (1, 123, 'click', '2025-01-27T10:00:00'),
  (2, 456, 'view', '2025-01-27T10:00:01');
-- 3. Use load jobs for bulk data (cheaper than streaming)
-- Load from Cloud Storage: free (batch load jobs use a shared slot pool)
-- Streaming inserts: billed per volume ingested
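A batch load job from Cloud Storage can replace streaming for bulk ingestion. A minimal sketch with the google-cloud-bigquery client; the bucket path, file format, and destination table are illustrative assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Load Parquet files from Cloud Storage into a table (batch load jobs are free)
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.parquet",   # Illustrative bucket path
    "project.dataset.events",
    job_config=job_config,
)
load_job.result()  # Wait for the load to complete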
BigQuery Security

Access Control
-- 1. Role-based access control
-- Grant dataset access
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `project.dataset`
TO "user:user@example.com";
-- Grant table access
GRANT `roles/bigquery.dataEditor`
ON TABLE `project.dataset.events`
TO "group:data-engineers@example.com";
-- 2. Column-level masking via a view
-- (managed column-level security uses policy tags; see below)
CREATE OR REPLACE VIEW `project.dataset.users_masked` AS
SELECT
  user_id,
  -- Mask email for everyone except an allow-listed user
  CASE
    WHEN SESSION_USER() = 'admin@example.com' THEN email
    ELSE TO_HEX(SHA256(email))
  END AS masked_email,
  phone_number
FROM `source_dataset.users`;
-- 3. Row-level security (row access policy)
CREATE ROW ACCESS POLICY us_sales_filter
ON `project.dataset.sales`
GRANT TO ("group:us-sales@example.com")
FILTER USING (region = 'US');
-- Note: policy tags (Data Catalog) provide managed column-level security
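Managed column-level security works by attaching Data Catalog policy tags to columns and controlling access to the tags with IAM. A minimal sketch that tags a column via a schema update; the taxonomy/policy-tag path and column name are illustrative assumptions:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("project.dataset.users")

# Attach a policy tag to the email column (taxonomy path is illustrative)
policy = bigquery.PolicyTagList(
    names=["projects/my-project/locations/us/taxonomies/123/policyTags/456"]
)
new_schema = []
for field in table.schema:
    if field.name == "email":
        field = bigquery.SchemaField(
            field.name, field.field_type, mode=field.mode, policy_tags=policy
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])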
Encryption

# Customer-managed encryption keys (CMEK)
from google.cloud import bigquery
client = bigquery.Client()
table = client.get_table("project.dataset.events")
# Configure CMEK
table.encryption_configuration = bigquery.EncryptionConfiguration(
    kms_key_name="projects/my-project/locations/us/keyRings/my-keyRing/cryptoKeys/my-key"
)
table = client.update_table(table, ["encryption_configuration"])
# Benefits:
# - Control your own encryption keys
# - Rotate keys as needed
# - Help meet compliance requirements (e.g., SOC 2, HIPAA)

BigQuery Data Transfer
Automated Data Loading
from google.cloud import bigquery_datatransfer_v1 as datatransfer
# Create transfer config
client = datatransfer.DataTransferServiceClient()
transfer_config = {
    "destination_dataset_id": "my_dataset",
    "display_name": "Daily GCS Load",
    "data_source_id": "google_cloud_storage",   # GCS → BigQuery
    "params": {
        "destination_table_name_template": "events_{run_date}",
        "file_format": "PARQUET",
        "write_disposition": "WRITE_TRUNCATE",
    },
    "schedule_options": {
        "disable_auto_scheduling": False,   # Keep the automatic schedule
    },
    "schedule": "every day 00:00",   # Daily at midnight
}
response = client.create_transfer_config(
    parent="projects/my-project/locations/us",
    transfer_config=transfer_config,
)
# Automated data loading with the Data Transfer Service:
# - Cloud Storage → BigQuery on a schedule (e.g., daily)
# - Amazon S3 → BigQuery via the Amazon S3 connector

BigQuery Monitoring
Query Metrics
from google.cloud import bigquery
# Get query metrics
client = bigquery.Client()
# Query the jobs audit view
query_job = client.query("""
SELECT
  job_id,
  user_email,
  query,
  start_time,
  end_time,
  total_bytes_processed,
  total_slot_ms,
  cache_hit
FROM `region-us.INFORMATION_SCHEMA.JOBS`
WHERE start_time >= TIMESTAMP('2025-01-27')
ORDER BY start_time DESC
LIMIT 100
""")
# Analyze query costs by user (approximate: assumes $5 per TB on-demand)
query_job = client.query("""
SELECT
  user_email,
  SUM(total_bytes_processed) / 1e12 AS tb_processed,
  SUM(total_bytes_processed) / 1e12 * 5.00 AS approx_cost_usd
FROM `region-us.INFORMATION_SCHEMA.JOBS`
WHERE start_time >= TIMESTAMP('2025-01-01')
GROUP BY user_email
ORDER BY approx_cost_usd DESC
""")

BigQuery vs. Alternatives
| Feature | BigQuery | Redshift | Snowflake |
|---|---|---|---|
| Architecture | Serverless | Provisioned | Serverless |
| Pricing | $5/TB | $5/TB + cluster | $2-6/TB |
| Concurrency | High (scales with slots) | Limited by cluster size | High (scales with warehouses) |
| Setup Time | Minutes | Hours | Minutes |
| Auto-scaling | Yes | No | Yes |
| Best For | Variable workloads | Predictable workloads | Multi-cloud |
Best Practices
DO
-- 1. Partition and cluster tables
CREATE TABLE dataset.events (
  event_id INT64,
  event_time TIMESTAMP,
  user_id INT64,
  event_type STRING
)
PARTITION BY DATE(event_time)
CLUSTER BY user_id, event_type;
-- 2. Use SELECT * sparingly; select only the columns you need
SELECT user_id, event_time FROM dataset.events;
-- 3. Use approximate functions for large datasets
SELECT APPROX_COUNT_DISTINCT(user_id) FROM dataset.events;
-- 4. Use materialized views for pre-computed results
CREATE MATERIALIZED VIEW dataset.mv_daily_sales AS
SELECT sale_date, SUM(amount) AS total
FROM dataset.sales
GROUP BY sale_date;
-- 5. Monitor query costs
SELECT user_email, SUM(total_bytes_processed)
FROM `region-us.INFORMATION_SCHEMA.JOBS`
GROUP BY user_email;

DON’T
-- 1. Don't use SELECT * on large tables
-- Scans every column → high cost
-- 2. Don't ignore partition filters
-- Without a partition filter → full table scan
-- 3. Don't query wildcard tables without a _TABLE_SUFFIX filter
-- Scans all matching tables → high cost (LIMIT does not reduce bytes scanned)
-- 4. Don't use streaming inserts for bulk data
-- Use load jobs instead (cheaper)
-- 5. Don't forget partition expiration
-- Old partitions accumulate → higher storage cost

Key Takeaways
- Serverless: No cluster management, auto-scaling
- Performance: Partition and cluster for optimal query performance
- Cost: about $5 per TB scanned on-demand; use partitioning and clustering to reduce bytes scanned
- ML: Build ML models directly in SQL (BigQuery ML)
- Streaming: Streaming inserts for real-time data (billed per volume ingested; use load jobs for bulk)
- Security: Column-level security, row-level security, CMEK
- Monitoring: Information Schema for query metrics and costs
- Use Cases: Variable or unpredictable workloads that benefit from a serverless, pay-per-query warehouse
Back to Module 3