
Google BigQuery Guide

Serverless Data Warehouse


Overview

Google BigQuery is a fully managed, serverless, petabyte-scale data warehouse. It enables SQL-based analysis on massive datasets with zero infrastructure management, excellent performance, and seamless integration with the Google Cloud ecosystem.


BigQuery Architecture

Serverless Architecture

Key Components:

  • Dremel: Distributed query execution engine
  • Colossus: Distributed file system (Google’s internal)
  • Capacitor: Columnar storage format
  • Jupiter: High-bandwidth network (1 Petabit/s)

Serverless Benefits:

  • No cluster provisioning
  • Auto-scaling query execution
  • No infrastructure management
  • Pay per query with no idle costs (see the dry-run sketch below)
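
One practical consequence of pay-per-query pricing is that cost is knowable up front: a dry run estimates the bytes a query would scan without executing it. The snippet below is a minimal sketch using the google-cloud-bigquery Python client; the dataset.events table is a placeholder matching the example tables used later in this guide.

from google.cloud import bigquery
# Dry run: validates the query and reports estimated bytes scanned
# without executing it (and without incurring cost)
client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT user_id, COUNT(*) AS c FROM dataset.events GROUP BY user_id",
    job_config=job_config,
)
print(f"Estimated bytes processed: {job.total_bytes_processed}")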

BigQuery Table Design

Partitioning

-- 1. Ingestion-time partitioning (automatic)
CREATE TABLE dataset.events (
event_id BIGINT,
user_id BIGINT,
event_type STRING,
payload JSON
)
PARTITION BY _PARTITIONDATE
OPTIONS (
partition_expiration_days = 30, -- Auto-delete old partitions
require_partition_filter = true -- Require partition filter in queries
);
-- 2. Time-based partitioning (column-based)
CREATE TABLE dataset.sales (
sale_id BIGINT,
customer_id BIGINT,
sale_date DATE,
amount NUMERIC
)
PARTITION BY sale_date
OPTIONS (
partition_expiration_days = 365
);
-- 3. Integer range partitioning
CREATE TABLE dataset.users (
user_id BIGINT,
username STRING,
country STRING
)
PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 3000000, 1000000));

Clustering

-- Clustered table for optimized queries
CREATE TABLE dataset.events_clustered (
event_id BIGINT,
event_time TIMESTAMP,
user_id BIGINT,
event_type STRING,
payload JSON
)
PARTITION BY DATE(event_time)
CLUSTER BY user_id, event_type -- Sort within partitions
OPTIONS (
partition_expiration_days = 30
);
-- Benefits:
-- - Faster queries (data skipping)
-- - Lower cost (fewer bytes scanned)
-- - Automatic maintenance
-- Best practice: Partition by date, cluster by high-cardinality columns

Nested and Repeated Fields

-- Nested and repeated fields (semi-structured data)
CREATE TABLE dataset.user_events (
user_id BIGINT,
events ARRAY<STRUCT<
event_time TIMESTAMP,
event_type STRING,
properties ARRAY<STRUCT<
key STRING,
value STRING
>>
>>
);
-- Query nested fields
SELECT
user_id,
event.event_time,
event.event_type,
prop.key,
prop.value
FROM dataset.user_events,
UNNEST(events) AS event,
UNNEST(event.properties) AS prop
WHERE event.event_type = 'page_view';
-- Benefits:
-- - No need for separate tables
-- - Preserve data relationships
-- - Efficient storage (columnar)

BigQuery Performance Optimization

Query Optimization

-- 1. Use partition pruning
SELECT
user_id,
COUNT(*) as event_count
FROM dataset.events_clustered
WHERE event_time >= '2025-01-01' -- Partition filter
AND event_time < '2025-02-01'
GROUP BY user_id;
-- 2. Use clustering for data skipping
SELECT
user_id,
event_type,
COUNT(*) as event_count
FROM dataset.events_clustered
WHERE event_time >= '2025-01-01'
AND user_id = 12345 -- Clustered column (data skipping)
AND event_type = 'click' -- Clustered column (data skipping)
GROUP BY user_id, event_type;
-- 3. Use SELECT * sparingly
-- Bad: Selects all columns
SELECT * FROM dataset.events;
-- Good: Select only needed columns
SELECT user_id, event_time, event_type FROM dataset.events;
-- 4. Use approximate functions for large datasets
SELECT
COUNT(*) as total_events,
COUNT(DISTINCT user_id) as exact_distinct_users, -- Exact, but memory/shuffle heavy
APPROX_COUNT_DISTINCT(user_id) as approx_distinct_users -- Faster, small approximation error
FROM dataset.events;
-- 5. Use materialized views for pre-computed results
CREATE MATERIALIZED VIEW dataset.mv_daily_sales AS
SELECT
sale_date,
customer_id,
COUNT(*) as sales_count,
SUM(amount) as total_amount,
AVG(amount) as avg_amount
FROM dataset.sales
GROUP BY sale_date, customer_id;
-- Query materialized view (automatic query rewrite)
SELECT * FROM dataset.mv_daily_sales
WHERE sale_date >= '2025-01-01';

Execution Details

from google.cloud import bigquery
# Get query execution details
client = bigquery.Client()
job = client.query("SELECT user_id, COUNT(*) FROM dataset.events GROUP BY user_id")
job.result() # Wait for completion
# Summary statistics
print(f"Bytes processed: {job.total_bytes_processed}")
print(f"Bytes billed: {job.total_bytes_billed}")
print(f"Slot time: {job.slot_millis} ms")
print(f"Cache hit: {job.cache_hit}")
# Per-stage execution plan
for stage in job.query_plan:
    print(f"Stage: {stage.name}")
    print(f"  Steps: {len(stage.steps)}")
    print(f"  Avg wait ratio: {stage.wait_ratio_avg}")

BigQuery Cost Optimization

Pricing Model

  • On-demand: pay per TB of data scanned by queries (the first 1 TB per month is free)
  • Capacity-based: reserve slots (dedicated compute) via commitments for a predictable monthly cost
  • Storage: billed per GB per month, with a lower long-term rate for data untouched for 90 days

Cost Optimization Strategies

-- 1. Partition and cluster for cost reduction
-- Clustered tables scan less data → Lower cost
-- 2. Use _PARTITIONTIME for partitioned tables
SELECT
user_id,
COUNT(*) as event_count
FROM dataset.events
WHERE _PARTITIONTIME >= '2025-01-01' -- Partition filter
AND _PARTITIONTIME < '2025-02-01'
GROUP BY user_id;
-- 3. Use materialized views
-- BigQuery automatically rewrites queries to use materialized views
-- Reduces cost (pre-computed results)
-- 4. Use query caching
-- Repeated queries use cached results (no cost)
SELECT * FROM dataset.events WHERE event_time >= '2025-01-01';
-- 5. Limit data scanned with partition filters
-- Bad: Scans entire table
SELECT * FROM dataset.large_table;
-- Good: Partition filter limits bytes scanned
-- (Note: LIMIT alone does not reduce bytes billed)
SELECT * FROM dataset.large_table WHERE date_column >= '2025-01-01' LIMIT 1000;

Capacity Commitments

from google.cloud import bigquery_reservation_v1 as reservations
# Create capacity commitment (slots)
client = reservations.ReservationServiceClient()
# Annual commitment (roughly 20% cheaper than month-to-month)
commitment = reservations.CapacityCommitment()
commitment.slot_count = 100
commitment.plan = reservations.CapacityCommitment.CommitmentPlan.ANNUAL
response = client.create_capacity_commitment(
    parent="projects/my-project/locations/us",
    capacity_commitment=commitment,
)
# Benefits:
# - Predictable monthly cost
# - 20-50% discount vs. on-demand for steady workloads
# - Dedicated slots (no queuing) once a reservation and assignment
#   route your queries to them (see the sketch below)
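
A commitment only purchases slot capacity; queries use it only after you create a reservation from those slots and assign a project (or folder/organization) to that reservation. The sketch below uses the same reservation client; the reservation and project names are placeholders.

from google.cloud import bigquery_reservation_v1 as reservations
client = reservations.ReservationServiceClient()
parent = "projects/my-project/locations/us"
# 1. Carve a reservation out of the committed slots
reservation = reservations.Reservation(slot_capacity=100, ignore_idle_slots=False)
reservation = client.create_reservation(
    parent=parent,
    reservation_id="prod-reporting",
    reservation=reservation,
)
# 2. Assign a project to the reservation so its queries run on these slots
assignment = reservations.Assignment(
    assignee="projects/my-analytics-project",
    job_type=reservations.Assignment.JobType.QUERY,
)
client.create_assignment(parent=reservation.name, assignment=assignment)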

BigQuery ML

Machine Learning with SQL

-- 1. Train a model (logistic regression)
CREATE OR REPLACE MODEL `dataset.customer_churn_model`
OPTIONS (
model_type = 'LOGISTIC_REG',
input_label_cols = ['churned'],
max_iterations = 100,
learn_rate = 0.4,
l1_reg = 0.1,
l2_reg = 0.1
)
AS
SELECT
age,
tenure,
monthly_spend,
support_calls,
churned
FROM `project.training_data`
WHERE training_split = 'train';
-- 2. Evaluate model
SELECT
*
FROM ML.EVALUATE(
MODEL `dataset.customer_churn_model`,
(
SELECT
age,
tenure,
monthly_spend,
support_calls,
churned
FROM `project.training_data`
WHERE training_split = 'test'
)
);
-- 3. Predict with model
SELECT
customer_id,
predicted_churned,
predicted_churned_probs -- ARRAY of STRUCT(label, prob) for each class
FROM ML.PREDICT(
MODEL `dataset.customer_churn_model`,
(
SELECT
customer_id,
age,
tenure,
monthly_spend,
support_calls
FROM `project.customers`
WHERE tenure < 12
)
);
-- 4. Explain predictions (feature importance)
SELECT
*
FROM ML.EXPLAIN_PREDICT(
MODEL `dataset.customer_churn_model`,
(
SELECT
customer_id,
age,
tenure,
monthly_spend,
support_calls
FROM `project.customers`
LIMIT 10
)
);

Supported ML Models

| Model Type | Use Case | MODEL_TYPE option |
|---|---|---|
| Linear Regression | Forecasting | LINEAR_REG |
| Logistic Regression | Classification | LOGISTIC_REG |
| K-Means | Clustering | KMEANS |
| Matrix Factorization | Recommendations | MATRIX_FACTORIZATION |
| Boosted Trees (XGBoost) | Classification / Regression | BOOSTED_TREE_CLASSIFIER, BOOSTED_TREE_REGRESSOR |
| TensorFlow (imported model) | Deep learning | TENSORFLOW |
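
To connect the table above to the DDL shown earlier, the sketch below trains a boosted tree classifier through the Python client; the model name is new, and the training table and columns are the same placeholders as in the churn example.

from google.cloud import bigquery
# Train a boosted tree (XGBoost-based) classifier via a DDL query job
client = bigquery.Client()
ddl = """
CREATE OR REPLACE MODEL `dataset.customer_churn_boosted`
OPTIONS (
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  input_label_cols = ['churned'],
  max_tree_depth = 6
)
AS
SELECT age, tenure, monthly_spend, support_calls, churned
FROM `project.training_data`
WHERE training_split = 'train'
"""
client.query(ddl).result()  # DDL statements run as query jobs; wait for completion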

BigQuery Streaming

Streaming Inserts

from google.cloud import bigquery
# Streaming insert (low latency)
client = bigquery.Client()
table_id = "project.dataset.events"
# Stream data
rows_to_insert = [
{"event_id": 1, "user_id": 123, "event_type": "click", "event_time": "2025-01-27T10:00:00"},
{"event_id": 2, "user_id": 456, "event_type": "view", "event_time": "2025-01-27T10:00:01"},
]
errors = client.insert_rows_json(table_id, rows_to_insert)
if errors:
    print(f"Encountered errors: {errors}")
# Streaming inserts:
# - Rows are available for query within seconds
# - Cost: roughly $0.01 per 200 MB with the legacy insertAll API (no free tier);
#   the newer Storage Write API is cheaper and includes a monthly free quota
# - Throughput is quota-limited (on the order of 100,000 rows per second per table)

Best Practices for Streaming

-- 1. Use partitioned tables for streaming
-- Partition by ingestion time for streaming workloads
-- 2. Use INSERT DML instead of streaming inserts for bulk data
INSERT INTO `project.dataset.events` (event_id, user_id, event_type, event_time)
VALUES
(1, 123, 'click', '2025-01-27T10:00:00'),
(2, 456, 'view', '2025-01-27T10:00:01');
-- 3. Use load jobs for bulk data (cheaper than streaming)
-- Batch loads from GCS are free (they use a shared pool of slots)
-- Streaming inserts are billed per volume ingested
-- See the Python load-job sketch below
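
As a concrete version of point 3, here is a minimal load-job sketch with the Python client; the GCS path is a placeholder, and the files are assumed to be Parquet.

from google.cloud import bigquery
# Batch load Parquet files from GCS (free load job, unlike streaming inserts)
client = bigquery.Client()
table_id = "project.dataset.events"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/2025-01-27/*.parquet",  # Placeholder path
    table_id,
    job_config=job_config,
)
load_job.result()  # Wait for the load job to finish
print(f"Table now has {client.get_table(table_id).num_rows} rows")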

BigQuery Security

Access Control

-- 1. Role-based access control
-- Grant dataset access
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `project.dataset`
TO "user:user@example.com";
-- Grant table access
GRANT `roles/bigquery.dataEditor`
ON TABLE `project.dataset.events`
TO "group:data-engineers@example.com";
-- 2. Column-level masking via a view
-- (native column-level security uses policy tags)
CREATE OR REPLACE VIEW `project.dataset.users_masked` AS
SELECT
user_id,
CASE
WHEN SESSION_USER() = 'admin@example.com' THEN email -- Admins see the raw value
ELSE TO_HEX(SHA256(email)) -- Everyone else sees a hash
END as masked_email,
phone_number
FROM `source_dataset.users`;
-- 3. Row-level security (row access policy)
CREATE ROW ACCESS POLICY us_sales_only
ON `project.dataset.sales`
GRANT TO ('group:us-sales@example.com')
FILTER USING (region = 'US');
-- Users in the granted group only see rows where region = 'US'

Encryption

# Customer-managed encryption keys (CMEK)
from google.cloud import bigquery
client = bigquery.Client()
table = client.get_table("project.dataset.events")
# Configure CMEK
table.encryption_configuration = bigquery.EncryptionConfiguration(
kms_key_name="projects/my-project/locations/us/keyRings/my-keyRing/cryptoKeys/my-key"
)
table = client.update_table(table, ["encryption_configuration"])
# Benefits:
# - Control your own encryption keys
# - Rotate keys as needed
# - Meet compliance requirements (SOC2, HIPAA)

BigQuery Data Transfer

Automated Data Loading

from google.cloud import bigquery_datatransfer_v1 as datatransfer
# Create a scheduled transfer config (GCS -> BigQuery)
client = datatransfer.DataTransferServiceClient()
transfer_config = datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="Daily GCS Load",
    data_source_id="google_cloud_storage",  # GCS to BigQuery
    params={
        "data_path_template": "gs://my-bucket/events/*.parquet",  # Source files
        "destination_table_name_template": "events_{run_date}",
        "file_format": "PARQUET",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every day 00:00",  # Daily at midnight
    schedule_options=datatransfer.ScheduleOptions(
        disable_auto_scheduling=False,  # Run automatically on the schedule
    ),
)
response = client.create_transfer_config(
    parent="projects/my-project/locations/us",
    transfer_config=transfer_config,
)
# The Data Transfer Service can load on a schedule from:
# - Cloud Storage (GCS) -> BigQuery
# - Amazon S3, Teradata, Amazon Redshift -> BigQuery
# - Google SaaS sources (Google Ads, YouTube, etc.) -> BigQuery

BigQuery Monitoring

Query Metrics

from google.cloud import bigquery
# Get query metrics
client = bigquery.Client()
# Query audit log
query_job = client.query("""
SELECT
job_id,
user_email,
query,
start_time,
end_time,
total_bytes_processed,
total_slot_ms,
cache_hit
FROM `region-us.INFORMATION_SCHEMA.JOBS`
WHERE start_time >= TIMESTAMP('2025-01-27')
ORDER BY start_time DESC
LIMIT 100
""")
# Analyze query costs per user (on-demand pricing, $5/TB as used in this guide)
query_job = client.query("""
SELECT
user_email,
SUM(total_bytes_processed) / 1e12 as TB_processed,
SUM(total_bytes_processed) / 1e12 * 5.00 as cost_usd
FROM `region-us.INFORMATION_SCHEMA.JOBS`
WHERE start_time >= TIMESTAMP('2025-01-01')
GROUP BY user_email
ORDER BY cost_usd DESC
""")
for row in query_job.result():
    print(f"{row.user_email}: {row.TB_processed:.3f} TB -> ${row.cost_usd:.2f}")

BigQuery vs. Alternatives

| Feature | BigQuery | Redshift | Snowflake |
|---|---|---|---|
| Architecture | Serverless | Provisioned | Serverless |
| Pricing | $5/TB | $5/TB + cluster | $2-6/TB |
| Concurrency | Unlimited | Limited by cluster | Unlimited |
| Setup Time | Minutes | Hours | Minutes |
| Auto-scaling | Yes | No | Yes |
| Best For | Variable workloads | Predictable workloads | Multi-cloud |

Best Practices

DO

-- 1. Partition and cluster tables
CREATE TABLE dataset.events (
event_id BIGINT,
event_time TIMESTAMP,
user_id BIGINT,
event_type STRING
)
PARTITION BY DATE(event_time)
CLUSTER BY user_id, event_type;
-- 2. Select only the columns you need (avoid SELECT *)
SELECT user_id, event_time FROM dataset.events;
-- 3. Use approximate functions for large datasets
SELECT APPROX_COUNT_DISTINCT(user_id) FROM dataset.events;
-- 4. Use materialized views for pre-computed results
CREATE MATERIALIZED VIEW dataset.mv_daily_sales AS
SELECT sale_date, SUM(amount) as total
FROM dataset.sales
GROUP BY sale_date;
-- 5. Monitor query costs
SELECT user_email, SUM(total_bytes_processed) FROM `region-us.INFORMATION_SCHEMA.JOBS` GROUP BY user_email;

DON’T

-- 1. Don't use SELECT * for large tables
-- Selects all columns → High cost
-- 2. Don't ignore partition filters
-- Without partition filter → Full table scan
-- 3. Don't query wildcard tables without a _TABLE_SUFFIX filter
-- Scans every matching table → High cost (see the sketch after this list)
-- 4. Don't use streaming inserts for bulk data
-- Use load jobs instead (cheaper)
-- 5. Don't forget partition expiration
-- Old partitions accumulate → Higher storage cost
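
For the wildcard-table point above, the fix is to restrict the scan with a _TABLE_SUFFIX filter. A minimal sketch via the Python client, assuming hypothetical date-sharded tables named events_YYYYMMDD:

from google.cloud import bigquery
# Without the _TABLE_SUFFIX filter, every table matching the wildcard is scanned
client = bigquery.Client()
sql = """
SELECT user_id, COUNT(*) AS event_count
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20250101' AND '20250107'
GROUP BY user_id
"""
for row in client.query(sql).result():
    print(row.user_id, row.event_count)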

Key Takeaways

  1. Serverless: No cluster management, auto-scaling
  2. Performance: Partition and cluster for optimal query performance
  3. Cost: $5/TB scanned, use partitioning and clustering to reduce
  4. ML: Build ML models directly in SQL (BigQuery ML)
  5. Streaming: Streaming inserts or the Storage Write API for real-time data (queryable within seconds)
  6. Security: Column-level security, row-level security, CMEK
  7. Monitoring: Information Schema for query metrics and costs
  8. Use Cases: Variable or bursty analytics workloads that benefit from serverless, pay-per-query operation

Back to Module 3