
Partition Pruning

Optimizing Query Performance with Partitioning


Overview

Partition pruning skips entire partitions based on query filters, dramatically reducing I/O and query time. It’s the most effective optimization for large, partitioned tables.


Partition Pruning Architecture

Pruning Process

The engine evaluates query filters against partition metadata before reading any data; partitions whose values cannot match the filter are eliminated outright, and only the remainder are scanned.

Pruning Benefits:

  • I/O reduction: 90-99% less data scanned
  • Query speed: 10-100x faster queries
  • Cost savings: proportional to the reduction in data scanned
  • Metadata filtering: instant partition elimination (see the sketch below)
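
Conceptually, the engine compares the query's filter against partition metadata (for Hive-style layouts, the key=value directory names) and eliminates non-matching partitions before opening a single data file. A minimal pure-Python sketch of that metadata-only elimination, using hypothetical paths:

# Hypothetical Hive-style partition directories for one month of data
partitions = [
    f"s3://bucket/events/event_date=2025-01-{d:02d}/" for d in range(1, 32)
]

def prune(partitions, column, value):
    """Keep only partitions whose directory name matches the filter.
    No data files are opened; the decision uses path metadata alone."""
    return [p for p in partitions if f"{column}={value}" in p]

selected = prune(partitions, "event_date", "2025-01-27")
print(f"Scanning {len(selected)}/{len(partitions)} partitions")  # 1/31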

Partitioning Strategies

Strategy Selection

Match partition granularity to data volume and filter patterns: daily partitions for high-volume event streams, monthly for medium-volume transactional data, yearly for low-volume archives.

Time-Series Partitioning

-- Time-series partitioning best practices
-- Daily partitions (high volume): partition by a derived date column
CREATE TABLE events_daily (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id BIGINT,
    event_data STRING,
    event_date DATE
) PARTITIONED BY (event_date);

-- Monthly partitions (medium volume): derive 'yyyy-MM' at write time,
-- e.g. date_format(transaction_time, 'yyyy-MM')
CREATE TABLE transactions_monthly (
    transaction_id BIGINT,
    transaction_time TIMESTAMP,
    customer_id BIGINT,
    amount DECIMAL(10,2),
    transaction_month STRING
) PARTITIONED BY (transaction_month);

-- Yearly partitions (low volume)
CREATE TABLE archive_yearly (
    record_id BIGINT,
    created_at TIMESTAMP,
    data STRING,
    created_year INT
) PARTITIONED BY (created_year);

-- Query with partition pruning: filter the raw partition column directly
SELECT *
FROM events_daily
WHERE event_date = '2025-01-27';
-- Scans: 1/365 partitions (0.3%)

Geographic Partitioning

-- Geographic partitioning
CREATE TABLE sales_by_region (
    order_id BIGINT,
    order_date DATE,
    country_code STRING,
    region STRING,
    amount DECIMAL(10,2)
) PARTITIONED BY (country_code);

-- Query with partition pruning
SELECT *
FROM sales_by_region
WHERE country_code = 'US';
-- Scans: 1/200 partitions (0.5%)

Categorical Partitioning

-- Categorical partitioning (selective)
CREATE TABLE products_by_category (
    product_id BIGINT,
    category STRING,
    subcategory STRING,
    price DECIMAL(10,2)
) PARTITIONED BY (category);
-- Only partition on top-level categories to avoid too many partitions:
-- Electronics, Clothing, Home, Sports, Books (5 partitions)

-- Query with partition pruning
SELECT *
FROM products_by_category
WHERE category = 'Electronics';
-- Scans: 1/5 partitions (20%)

Partition Pruning Implementation

Spark Partitioning

# Partition pruning with Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder \
    .appName("PartitionPruning") \
    .getOrCreate()

# Read raw events
df = spark.read.json("s3://bucket/events/*.json")

# Derive a date column and write the data partitioned by it
df.withColumn("event_date", to_date(col("event_time"))) \
    .write \
    .mode("overwrite") \
    .partitionBy("event_date") \
    .parquet("s3://bucket/output/events/")

# Query with partition pruning
result = spark.sql("""
    SELECT *
    FROM parquet.`s3://bucket/output/events/`
    WHERE event_date = '2025-01-27'
""")
# Spark prunes automatically: only the event_date=2025-01-27
# directory is scanned
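
To confirm that pruning actually happened, inspect the physical plan; for Parquet scans, Spark's plan lists the filters it pushed down to partition metadata (the exact wording varies across Spark versions):

# Print the physical plan; the file scan node should list the
# partition filters, e.g.:
#   PartitionFilters: [isnotnull(event_date), (event_date = 2025-01-27)]
result.explain()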

Partition Discovery

# Dynamic partition overwrite: a write replaces only the partitions it
# touches (this setting controls overwrite behavior, not discovery)
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Partition discovery: basePath keeps event_date as a column when
# loading a subset of partition directories
spark.read \
    .format("parquet") \
    .option("basePath", "s3://bucket/output/events/") \
    .load("s3://bucket/output/events/event_date=*/") \
    .createOrReplaceTempView("events")

# Query with pruning
result = spark.sql("""
    SELECT *
    FROM events
    WHERE event_date = '2025-01-27'
""")

Partition Pruning Patterns

Single-Column Pruning

-- Single-column pruning (most effective)
CREATE TABLE sales (
    order_id BIGINT,
    order_date DATE,
    customer_id BIGINT,
    amount DECIMAL(10,2)
) PARTITIONED BY (order_date);

-- Query with a single filter on the partition column
SELECT *
FROM sales
WHERE order_date = '2025-01-27';
-- Pruning: scans exactly 1 partition (maximum effectiveness)

Multi-Column Pruning

-- Multi-column pruning
CREATE TABLE sales_multi (
    order_id BIGINT,
    order_date DATE,
    region STRING,
    customer_id BIGINT,
    amount DECIMAL(10,2)
) PARTITIONED BY (order_date, region);

-- Query filtering on both partition columns
SELECT *
FROM sales_multi
WHERE order_date = '2025-01-27'
  AND region = 'US';
-- Pruning: scans 1 partition (highly effective)

Range Pruning

-- Range partition pruning
CREATE TABLE events (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id BIGINT,
    event_data STRING,
    event_date DATE
) PARTITIONED BY (event_date);

-- Range query with pruning
SELECT *
FROM events
WHERE event_date BETWEEN '2025-01-01' AND '2025-01-31';
-- Pruning: scans 31/365 partitions (8.5%)
-- Still a 91.5% reduction
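
Pruning only works when the engine can evaluate the filter against partition metadata. In many engines, wrapping the partition column in a function can prevent that, so prefer direct comparisons on the raw partition column. A hedged PySpark illustration using the events table above:

# Prunes: the range predicate applies directly to the partition column
spark.sql("""
    SELECT * FROM events
    WHERE event_date BETWEEN '2025-01-01' AND '2025-01-31'
""").explain()

# May scan every partition in engines that cannot push the function
# call down to partition metadata; check the plan before relying on it
spark.sql("""
    SELECT * FROM events
    WHERE date_format(event_date, 'yyyy-MM') = '2025-01'
""").explain()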

Partition Pruning Best Practices

Partition Column Selection

Criteria            Good                    Bad
Cardinality         Low-medium (< 1000)     High (> 10000)
Filter usage        Frequently filtered     Rarely filtered
Data distribution   Even                    Skewed
Time-based          Date/time               Random timestamp
Size                Balanced partitions     Variable sizes
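
Before committing to a partition column, measure its cardinality and skew directly. A hedged PySpark sketch, assuming df holds the candidate data and order_date is the column under consideration:

from pyspark.sql.functions import count, countDistinct

# df: candidate DataFrame (assumed loaded above)
# Cardinality: target low-to-medium (< 1000 distinct values)
df.select(countDistinct("order_date")).show()

# Skew: very uneven row counts per value mean unbalanced partitions
df.groupBy("order_date") \
    .agg(count("*").alias("rows")) \
    .orderBy("rows", ascending=False) \
    .show(10)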

Anti-Patterns

-- Don't partition by high-cardinality columns
-- BAD: partitioning by user_id (millions of values)
CREATE TABLE events_bad (
    event_id BIGINT,
    user_id BIGINT,
    event_time TIMESTAMP
) PARTITIONED BY (user_id);
-- Creates millions of tiny partitions and overwhelms the
-- NameNode/metastore

-- GOOD: bucket by user_id instead of partitioning
CREATE TABLE events_good (
    event_id BIGINT,
    user_id BIGINT,
    event_time TIMESTAMP
) CLUSTERED BY (user_id) INTO 32 BUCKETS;

Partition Pruning Monitoring

Metrics

# Monitor partition pruning effectiveness
# Note: EXPLAIN output is free-form text and varies by engine and
# version; parse_metric below assumes lines like "partitions: 365"
# and should be adapted to your engine's plan format.
import re

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PruningMetrics") \
    .getOrCreate()

def parse_metric(explain_output: str, label: str) -> int:
    """Extract the first integer that follows `label` in the plan text."""
    match = re.search(rf"{label}\D*(\d+)", explain_output)
    return int(match.group(1)) if match else 0

def get_pruning_metrics(table: str, filter_sql: str) -> dict:
    """Get partition pruning metrics from the query plan."""
    explain_sql = f"EXPLAIN SELECT * FROM {table} WHERE {filter_sql}"
    explain_output = spark.sql(explain_sql).collect()[0][0]
    metrics = {
        'total_partitions': parse_metric(explain_output, "partitions"),
        'scanned_partitions': parse_metric(explain_output, "scanned"),
        'skipped_partitions': parse_metric(explain_output, "skipped"),
        'pruning_ratio': 0.0,
    }
    # Pruning ratio = share of partitions eliminated before scanning
    if metrics['total_partitions'] > 0:
        metrics['pruning_ratio'] = (
            metrics['skipped_partitions'] / metrics['total_partitions']
        )
    return metrics

# Example usage
metrics = get_pruning_metrics(
    "events",
    "event_date = '2025-01-27'"
)
print(f"Pruning ratio: {metrics['pruning_ratio']:.2%}")
print(f"Partitions scanned: {metrics['scanned_partitions']}/{metrics['total_partitions']}")

Do's and Don'ts

DO

-- 1. Partition by date/time for time-series
-- Most effective pattern
-- 2. Partition by low-cardinality columns
-- < 1000 distinct values
-- 3. Use frequently filtered columns
-- Enables pruning in most queries
-- 4. Balance partition sizes
-- Aim for 256MB-1GB per partition (see the size-check sketch below)
-- 5. Monitor pruning effectiveness
-- Track pruning ratio
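
A hedged sketch for the size check in item 4, walking partition directories through Hadoop's FileSystem API (the table path is hypothetical, and the _jsc/_jvm handles are private PySpark internals):

# Sum the bytes under each partition directory and flag outliers
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
jvm = spark.sparkContext._jvm
root = jvm.org.apache.hadoop.fs.Path("s3://bucket/output/events/")
fs = root.getFileSystem(hadoop_conf)
for part in fs.listStatus(root):
    if part.isDirectory():
        mb = fs.getContentSummary(part.getPath()).getLength() / 1024**2
        flag = "" if 256 <= mb <= 1024 else "  <- outside 256MB-1GB target"
        print(f"{part.getPath().getName()}: {mb:.0f} MB{flag}")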

DON’T

-- 1. Don't partition by high-cardinality columns
-- Creates too many partitions
-- 2. Don't partition by rarely filtered columns
-- No pruning benefit
-- 3. Don't create too many partitions
-- Metadata explosion
-- 4. Don't partition unevenly
-- Creates skewed partitions
-- 5. Don't ignore partition size
-- Too small = overhead, too large = no benefit

Key Takeaways

  1. Partition pruning: 90-99% I/O reduction
  2. Time-series: Daily/Monthly partitions best
  3. Cardinality: Keep < 1000 partitions
  4. Filters: Partition by frequently filtered columns
  5. Balance: Aim for 256MB-1GB per partition
  6. Monitor: Track pruning ratio, aim for > 90%
  7. Anti-patterns: High cardinality, rare filters
  8. Use When: All large tables, time-series data, filter-heavy queries

Back to Module 7