Partition Pruning
Optimizing Query Performance with Partitioning
Overview
Partition pruning skips entire partitions based on query filters, dramatically reducing I/O and query time. It’s the most effective optimization for large, partitioned tables.
Partition Pruning Architecture
Pruning Process

At planning time, the engine compares the query's filter predicates against partition metadata (partition values in the catalog, or Hive-style key=value paths), eliminates every partition that cannot match, and scans only the survivors. No data files are opened for eliminated partitions.
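The toy sketch below (plain Python, illustrative names only, not a real engine API) shows the core idea: match the filter against Hive-style partition paths and keep only the survivors.

```python
# Illustrative only: real engines prune against catalog metadata,
# not raw paths, but the logic is the same.
partitions = [
    "s3://bucket/events/event_date=2025-01-26/",
    "s3://bucket/events/event_date=2025-01-27/",
    "s3://bucket/events/event_date=2025-01-28/",
]

def prune(paths, column, value):
    """Keep only partitions whose key=value matches the filter."""
    return [p for p in paths if f"{column}={value}" in p]

# WHERE event_date = '2025-01-27' touches one directory out of three
print(prune(partitions, "event_date", "2025-01-27"))
```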
Pruning Benefits:
- I/O reduction: 90-99% less data scanned
- Query speed: 10-100x faster queries
- Cost savings: proportional to the reduction in data scanned (see the cost sketch after this list)
- Metadata filtering: Instant partition elimination
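The cost claim is simple arithmetic for scan-priced engines. A back-of-envelope sketch (the $5/TB figure is illustrative, typical of scan-priced query services):

```python
price_per_tb = 5.00   # illustrative scan price, $/TB
table_tb = 10.0       # total table size
pruning_ratio = 0.99  # 99% of partitions skipped

full_scan = table_tb * price_per_tb
pruned = table_tb * (1 - pruning_ratio) * price_per_tb
print(f"full scan: ${full_scan:.2f}, pruned: ${pruned:.2f}")
# full scan: $50.00, pruned: $0.50
```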
Partitioning Strategies
Strategy Selection

Match the partition scheme to how the data is actually filtered: time-series tables are queried by date, sales tables by geography, catalogs by category. The patterns below cover each case.
Time-Series Partitioning
```sql
-- Time-series partitioning best practices

-- Daily partitions (high volume)
CREATE TABLE events_daily (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id BIGINT,
    event_data STRING
) PARTITIONED BY (DATE(event_time));

-- Monthly partitions (medium volume)
CREATE TABLE transactions_monthly (
    transaction_id BIGINT,
    transaction_time TIMESTAMP,
    customer_id BIGINT,
    amount DECIMAL(10,2)
) PARTITIONED BY (date_trunc('MONTH', transaction_time));

-- Yearly partitions (low volume)
CREATE TABLE archive_yearly (
    record_id BIGINT,
    created_at TIMESTAMP,
    data STRING
) PARTITIONED BY (YEAR(created_at));

-- Query with partition pruning
SELECT *
FROM events_daily
WHERE DATE(event_time) = '2025-01-27';
-- Scans: 1/365 partitions (0.3%)
```

Geographic Partitioning
```sql
-- Geographic partitioning
CREATE TABLE sales_by_region (
    order_id BIGINT,
    order_date DATE,
    country_code STRING,
    region STRING,
    amount DECIMAL(10,2)
) PARTITIONED BY (country_code);

-- Query with partition pruning
SELECT *
FROM sales_by_region
WHERE country_code = 'US';
-- Scans: 1/200 partitions (0.5%)
```

Categorical Partitioning
```sql
-- Categorical partitioning (selective)
CREATE TABLE products_by_category (
    product_id BIGINT,
    category STRING,
    subcategory STRING,
    price DECIMAL(10,2)
) PARTITIONED BY (category);

-- Only partition top categories (avoid too many partitions):
-- Electronics, Clothing, Home, Sports, Books (5 partitions)

-- Query with partition pruning
SELECT *
FROM products_by_category
WHERE category = 'Electronics';
-- Scans: 1/5 partitions (20%)
```

Partition Pruning Implementation
Spark Partitioning
```python
# Partition pruning with Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder \
    .appName("PartitionPruning") \
    .getOrCreate()

# Read the raw event data
df = spark.read.json("s3://bucket/events/*.json")

# Derive a date column and partition by it on write
df.withColumn("event_date", to_date(col("event_time"))) \
    .write \
    .mode("overwrite") \
    .partitionBy("event_date") \
    .parquet("s3://bucket/output/events/")

# Query with partition pruning
result = spark.sql("""
    SELECT *
    FROM parquet.`s3://bucket/output/events/`
    WHERE event_date = '2025-01-27'
""")
# Spark automatically prunes partitions:
# only the directory event_date=2025-01-27 is scanned
```
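To confirm that pruning actually kicked in, inspect the physical plan: for Parquet file sources, Spark reports the pushed predicates in the scan node's `PartitionFilters` entry.

```python
# The scan node of the plan shows which predicates were pushed
# down to partition level; look for a line like:
#   PartitionFilters: [isnotnull(event_date#10), (event_date#10 = 2025-01-27)]
result.explain()
```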
Partition Discovery

```python
# Partition discovery configuration
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Enable partition discovery with an explicit base path so that
# event_date survives as a column
spark.read \
    .format("parquet") \
    .option("basePath", "s3://bucket/output/events/") \
    .load("s3://bucket/output/events/event_date=*/") \
    .createOrReplaceTempView("events")

# Query with pruning
result = spark.sql("""
    SELECT *
    FROM events
    WHERE event_date = '2025-01-27'
""")
```
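Path-based discovery, as above, rescans the directory listing on each read. For metastore-backed tables, new partition directories written outside of Spark must be registered before queries can see (or prune) them. A minimal sketch, assuming a metastore-backed partitioned table such as `events_daily` from the earlier DDL:

```python
# Register partition directories that exist on storage but are
# not yet tracked in the metastore
spark.sql("MSCK REPAIR TABLE events_daily")

# Inspect which partitions the metastore now knows about
spark.sql("SHOW PARTITIONS events_daily").show()
```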
Partition Pruning Patterns

Single-Column Pruning
```sql
-- Single-column pruning (most effective)
CREATE TABLE sales (
    order_id BIGINT,
    order_date DATE,
    customer_id BIGINT,
    amount DECIMAL(10,2)
) PARTITIONED BY (order_date);

-- Query with a single filter
SELECT *
FROM sales
WHERE order_date = '2025-01-27';
-- Pruning: scans 1 partition (100% effective)
```

Multi-Column Pruning
```sql
-- Multi-column pruning
CREATE TABLE sales_multi (
    order_id BIGINT,
    order_date DATE,
    region STRING,
    customer_id BIGINT,
    amount DECIMAL(10,2)
) PARTITIONED BY (order_date, region);

-- Query with multiple filters
SELECT *
FROM sales_multi
WHERE order_date = '2025-01-27'
  AND region = 'US';
-- Pruning: scans 1 partition (highly effective)
```
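On disk, multi-column partitioning nests one Hive-style directory level per column, so a query filtering on both columns descends straight to a single leaf. A sketch of the layout Spark's `partitionBy` would produce for this table, assuming a DataFrame `df` with matching columns (paths illustrative):

```python
# Writing with two partition columns yields nested directories:
#   .../sales_multi/order_date=2025-01-27/region=US/part-*.parquet
#   .../sales_multi/order_date=2025-01-27/region=EU/part-*.parquet
df.write.mode("overwrite") \
    .partitionBy("order_date", "region") \
    .parquet("s3://bucket/output/sales_multi/")
```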
Range Pruning

```sql
-- Range partition pruning
CREATE TABLE events (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id BIGINT,
    event_data STRING
) PARTITIONED BY (DATE(event_time));

-- Range query with pruning
SELECT *
FROM events
WHERE DATE(event_time) BETWEEN '2025-01-01' AND '2025-01-31';
-- Pruning: scans 31 partitions (31/365 = 8.5%)
-- Still a 91.5% reduction
```

Partition Pruning Best Practices
Partition Column Selection
| Criteria | Good | Bad |
|---|---|---|
| Cardinality | Low-medium (< 1000) | High (> 10000) |
| Filter usage | Frequently filtered | Rarely filtered |
| Data distribution | Even | Skewed |
| Time-based | Date/time | Random timestamp |
| Size | Balanced partitions | Variable sizes |
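Before committing to a partition column, it is worth checking it against these criteria directly. A minimal sketch, assuming a DataFrame `df` holding the table and `country_code` as a hypothetical candidate column:

```python
from pyspark.sql import functions as F

candidate = "country_code"  # hypothetical candidate partition column

# Cardinality: roughly how many partitions would this create?
df.select(F.countDistinct(candidate).alias("n_partitions")).show()

# Distribution: heavily skewed counts mean skewed partition sizes
df.groupBy(candidate).count().orderBy(F.desc("count")).show(10)
```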
Anti-Patterns
```sql
-- Don't partition by high-cardinality columns

-- BAD: partition by user_id (millions of values)
CREATE TABLE events_bad (
    event_id BIGINT,
    user_id BIGINT,
    event_time TIMESTAMP
) PARTITIONED BY (user_id);
-- Creates millions of tiny partitions and
-- overwhelms the NameNode/metastore

-- GOOD: skip partitioning and bucket by user_id instead
CREATE TABLE events_good (
    event_id BIGINT,
    user_id BIGINT,
    event_time TIMESTAMP
) CLUSTERED BY (user_id) INTO 32 BUCKETS;
```

Partition Pruning Monitoring
Metrics
```python
# Monitor partition pruning effectiveness
import re

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PruningMetrics") \
    .getOrCreate()


def parse_metric(explain_output: str, name: str) -> int:
    """Pull a count such as 'partitions: 365' out of EXPLAIN text.

    The exact wording varies across Spark versions and data sources,
    so adapt the pattern to what your engine actually prints.
    """
    match = re.search(rf"{name}\D*?(\d+)", explain_output, re.IGNORECASE)
    return int(match.group(1)) if match else 0


def get_pruning_metrics(table: str, filter_sql: str) -> dict:
    """Get partition pruning metrics for a filtered query."""
    # Run EXPLAIN and capture the plan text
    explain_sql = f"EXPLAIN SELECT * FROM {table} WHERE {filter_sql}"
    explain_output = spark.sql(explain_sql).collect()[0][0]

    # Parse partition counts out of the plan
    metrics = {
        'total_partitions': parse_metric(explain_output, "partitions"),
        'scanned_partitions': parse_metric(explain_output, "scanned"),
        'skipped_partitions': parse_metric(explain_output, "skipped"),
        'pruning_ratio': 0.0,
    }

    # Calculate pruning ratio
    if metrics['total_partitions'] > 0:
        metrics['pruning_ratio'] = (
            metrics['skipped_partitions'] / metrics['total_partitions']
        )

    return metrics


# Example usage
metrics = get_pruning_metrics("events", "event_date = '2025-01-27'")
print(f"Pruning ratio: {metrics['pruning_ratio']:.2%}")
print(f"Partitions scanned: {metrics['scanned_partitions']}"
      f"/{metrics['total_partitions']}")
```

Best Practices Checklist
DO
```sql
-- 1. Partition by date/time for time-series data
--    (the most effective pattern)

-- 2. Partition by low-cardinality columns
--    (< 1000 distinct values)

-- 3. Use frequently filtered columns
--    (enables pruning in most queries)

-- 4. Balance partition sizes
--    (aim for 256MB-1GB per partition; see the skew check below)

-- 5. Monitor pruning effectiveness
--    (track the pruning ratio)
```
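Item 4 is easy to check empirically: per-partition row counts are a reasonable proxy for partition size balance. A minimal sketch, assuming the partitioned `events` view from the discovery example:

```python
# Row counts per partition approximate relative partition sizes;
# a heavy skew toward a few partitions signals an unbalanced scheme.
spark.sql("""
    SELECT event_date, COUNT(*) AS row_count
    FROM events
    GROUP BY event_date
    ORDER BY row_count DESC
""").show(10)
```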
DON'T

```sql
-- 1. Don't partition by high-cardinality columns
--    (creates too many partitions)

-- 2. Don't partition by rarely filtered columns
--    (no pruning benefit)

-- 3. Don't create too many partitions
--    (metadata explosion)

-- 4. Don't partition unevenly
--    (creates skewed partitions)

-- 5. Don't ignore partition size
--    (too small = overhead, too large = no benefit)
```

Key Takeaways
- Partition pruning: 90-99% I/O reduction
- Time-series: Daily/Monthly partitions best
- Cardinality: Keep < 1000 partitions
- Filters: Partition by frequently filtered columns
- Balance: Aim for 256MB-1GB per partition
- Monitor: Track pruning ratio, aim for > 90%
- Anti-patterns: High cardinality, rare filters
- Use When: All large tables, time-series data, filter-heavy queries