Data Skipping
Predicate Pushdown for Performance
Overview
Data skipping lets query engines avoid reading irrelevant data files or partitions by consulting statistics, Bloom filters, and indexes before scanning. It is essential for query performance on large datasets.
Data Skipping Architecture
Skipping Levels
Skipping Impact:
- Partition pruning: Skip entire partitions (90-99% reduction)
- File skipping: Skip entire files (50-90% reduction)
- Row group skipping: Skip row groups (20-50% reduction)
- Bloom filters: Skip exact values (30-70% reduction)
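Each level applies the same idea: compare the query predicate against per-unit metadata (partition values, file or row-group min/max, Bloom filters) and drop units that cannot contain matching rows before any data is read. Below is a minimal sketch of the file-level min/max check; the file names and statistics are illustrative, not from a real engine.

```python
# Minimal sketch of file-level min/max skipping (illustrative metadata only).
# The engine compares the predicate range with each file's column statistics
# and skips files whose value range cannot overlap the predicate.
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    min_amount: float
    max_amount: float

files = [
    FileStats("part-0001.parquet", min_amount=1.0, max_amount=95.0),
    FileStats("part-0002.parquet", min_amount=80.0, max_amount=450.0),
    FileStats("part-0003.parquet", min_amount=500.0, max_amount=900.0),
]

# Predicate: amount > 100
files_to_scan = [f for f in files if f.max_amount > 100]
files_skipped = [f for f in files if f.max_amount <= 100]

print([f.path for f in files_to_scan])   # part-0002, part-0003
print([f.path for f in files_skipped])   # part-0001 skipped without being read
```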
Partition Pruning
Static Partitioning
```sql
-- Partition pruning example
-- Table partitioned by date
CREATE TABLE sales (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    amount DECIMAL(10,2)
) PARTITIONED BY (order_date);

-- Query with partition pruning
-- Only scans partitions for Jan 2025
SELECT *
FROM sales
WHERE order_date BETWEEN '2025-01-01' AND '2025-01-31';

-- Explain plan shows partition pruning
EXPLAIN
SELECT * FROM sales
WHERE order_date = '2025-01-15';

-- Output: "Partition filters: order_date = 2025-01-15"
-- Scanned: 1/365 partitions (0.3%)
```

Dynamic Partitioning
```python
# Dynamic partitioning with Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DynamicPartitioning") \
    .enableHiveSupport() \
    .getOrCreate()

# Write with dynamic partitioning
df = spark.read.json("s3://bucket/sales/*.json")

# Partition by date and country
df.write \
    .mode("overwrite") \
    .partitionBy("order_date", "country") \
    .saveAsTable("sales_partitioned")

# Query with partition pruning
result = spark.sql("""
    SELECT *
    FROM sales_partitioned
    WHERE order_date = '2025-01-15'
      AND country = 'US'
""")
# Only scans 1 partition
```

File Skipping
Min/Max Statistics
```python
# File-level statistics with Parquet
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("FileSkipping") \
    .getOrCreate()

# Write with statistics (Parquet stores per-column min/max automatically)
df = spark.read.json("s3://bucket/data/*.json")

df.write \
    .mode("overwrite") \
    .parquet("s3://bucket/output/")

# Query with filter
result = spark.sql("""
    SELECT *
    FROM parquet.`s3://bucket/output/`
    WHERE event_date >= '2025-01-01'
      AND amount > 100
""")

# Parquet min/max statistics skip:
# - files where max(event_date) < '2025-01-01'
# - files where max(amount) <= 100
```

Column Statistics
```sql
-- Collect statistics for file skipping

-- Analyze table (PostgreSQL, Redshift)
ANALYZE sales;

-- BigQuery and Snowflake maintain pruning metadata automatically;
-- no manual statistics collection is required.

-- Compute statistics (Spark)
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS order_date, amount;

-- Now queries can skip files based on statistics
SELECT * FROM sales
WHERE order_date >= '2025-01-01'
  AND amount > 100;
```

Bloom Filters
Bloom Filter Basics
Bloom Filter Properties:
- Space efficient: 10-100x less than original data
- Fast lookup: O(1) hash operations
- False positives: Possible (need verification)
- False negatives: Impossible (100% recall)
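To make these properties concrete, here is a minimal Bloom filter in pure Python. It is a teaching sketch, not the Parquet implementation: a bit array plus k salted hashes, where any unset bit proves a value is absent and a fully set pattern only suggests presence.

```python
# Minimal Bloom filter sketch (illustrative only, not the Parquet implementation).
# Inserts set k bits derived from the value; a lookup that finds any unset bit
# proves absence (no false negatives), while all bits set may be a false positive,
# so the engine still verifies hits by reading the data.
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, value: str):
        # Derive k bit positions from salted digests of the value
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value: str) -> bool:
        # False => definitely absent; True => possibly present
        return all(self.bits[pos] for pos in self._positions(value))

# One filter per data file lets the engine skip files whose filter
# reports "definitely absent" for the queried key.
bf = BloomFilter()
for customer_id in ["12345", "67890"]:
    bf.add(customer_id)

print(bf.might_contain("12345"))  # True (present)
print(bf.might_contain("99999"))  # False => file can be skipped
```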
Bloom Filter Implementation
```python
# Bloom filters with Parquet
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("BloomFilters") \
    .config("spark.sql.parquet.filterPushdown", "true") \
    .getOrCreate()

# Write with bloom filters on the lookup columns
df = spark.read.json("s3://bucket/data/*.json")

df.write \
    .mode("overwrite") \
    .option("parquet.bloom.filter.enabled#customer_id", "true") \
    .option("parquet.bloom.filter.enabled#order_id", "true") \
    .parquet("s3://bucket/output/")

# Query with bloom filter
result = spark.sql("""
    SELECT *
    FROM parquet.`s3://bucket/output/`
    WHERE customer_id = 12345
""")

# Bloom filters skip row groups (and hence files) where customer_id cannot exist
```

Bloom Filter Configuration
```python
# Bloom filter parameters
# fpp: false positive probability (0.01 = 1%)
# ndv: expected number of distinct values

df.write \
    .mode("overwrite") \
    .option("parquet.bloom.filter.enabled#customer_id", "true") \
    .option("parquet.bloom.filter.fpp#customer_id", "0.01") \
    .option("parquet.bloom.filter.expected.ndv#customer_id", "1000000") \
    .parquet("s3://bucket/output/")
```
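These two parameters determine how much memory each filter consumes. A rough estimate follows the standard Bloom filter sizing formula m = -n * ln(p) / (ln 2)^2 bits for n distinct values (ndv) and false positive probability p (fpp). The helper below is a back-of-the-envelope sketch, not the exact size Parquet allocates.

```python
# Estimate Bloom filter size from ndv and fpp using the standard formula
# m = -n * ln(p) / (ln 2)^2 bits, with k = (m / n) * ln 2 hash functions.
# Parquet rounds sizes internally; this is only an order-of-magnitude check.
import math

def bloom_size_bytes(ndv: int, fpp: float) -> tuple[int, int]:
    bits = -ndv * math.log(fpp) / (math.log(2) ** 2)
    hashes = (bits / ndv) * math.log(2)
    return math.ceil(bits / 8), round(hashes)

size_bytes, k = bloom_size_bytes(ndv=1_000_000, fpp=0.01)
print(f"~{size_bytes / 1024 / 1024:.1f} MiB, {k} hash functions")
# ~1.1 MiB and 7 hash functions for 1M values at 1% false positives
```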
Data Skipping Strategies
Strategy Selection
Best Practices
| Pattern | Strategy | Skipping Ratio |
|---|---|---|
| Time-series | Date partition + min/max | 90-99% |
| High cardinality | Bloom filter | 30-70% |
| Range queries | Z-Order + min/max | 50-80% |
| Point lookups | Bloom filter + file stats | 40-90% |
| Multi-column | Composite bloom filters | 20-60% |
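The "Z-Order + min/max" row assumes a table format with clustering support, such as Delta Lake. Below is a minimal sketch; the sales table, the clustered column names, and the use of Delta Lake are assumptions for illustration, not part of the examples above.

```python
# Z-Order clustering with Delta Lake (illustrative table and column names).
# Requires the delta-spark package; OPTIMIZE ... ZORDER BY co-locates related
# values so min/max statistics on the clustered columns prune far more files.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ZOrdering") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Cluster the Delta table on the most common filter columns
spark.sql("OPTIMIZE sales ZORDER BY (customer_id, order_date)")

# Range and point queries on the clustered columns can now skip most files
result = spark.sql("""
    SELECT * FROM sales
    WHERE customer_id BETWEEN 10000 AND 20000
""")
```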
Data Skipping Tools
Tool Support
| Tool | Partition Pruning | File Skipping | Bloom Filters | Z-Order |
|---|---|---|---|---|
| Spark | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Presto/Trino | ✅ Yes | ✅ Yes | ✅ Yes | ⚠️ Limited |
| BigQuery | ✅ Yes | ✅ Yes | ❌ No | ❌ No |
| Snowflake | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| Redshift | ✅ Yes | ✅ Yes | ⚠️ Zone maps | ❌ No |
Data Skipping Monitoring
Metrics
```python
# Monitor data skipping effectiveness
import re

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SkippingMetrics") \
    .getOrCreate()

def parse_metric(explain_output: str, label: str) -> float:
    """Pull the numeric value following a label out of the plan text.
    The exact labels depend on the engine, file format, and Spark version."""
    match = re.search(rf"{re.escape(label)}\D*([\d.]+)", explain_output, re.IGNORECASE)
    return float(match.group(1)) if match else 0.0

def get_skipping_metrics(query: str) -> dict:
    """Get data skipping metrics from the query plan"""
    # Run explain
    explain_df = spark.sql(f"EXPLAIN {query}")
    explain_output = explain_df.collect()[0][0]

    # Parse metrics
    metrics = {
        'scanned_files': parse_metric(explain_output, "scanned files"),
        'skipped_files': parse_metric(explain_output, "skipped files"),
        'scanned_row_groups': parse_metric(explain_output, "scanned row groups"),
        'skipped_row_groups': parse_metric(explain_output, "skipped row groups"),
        'data_read_mb': parse_metric(explain_output, "data read"),
        'skipping_ratio': 0.0
    }

    # Calculate skipping ratio
    total_files = metrics['scanned_files'] + metrics['skipped_files']
    if total_files > 0:
        metrics['skipping_ratio'] = metrics['skipped_files'] / total_files

    return metrics

# Example usage
metrics = get_skipping_metrics("""
    SELECT * FROM sales
    WHERE order_date >= '2025-01-01'
""")

print(f"Skipping ratio: {metrics['skipping_ratio']:.2%}")
print(f"Files scanned: {metrics['scanned_files']}")
print(f"Files skipped: {metrics['skipped_files']}")
```

Data Skipping Best Practices
DO
```python
# 1. Partition by filter columns
#    Enables partition pruning

# 2. Use bloom filters for point lookups
#    Fast exact-match skipping

# 3. Collect column statistics
#    Enables file-level skipping

# 4. Monitor skipping effectiveness
#    Track the skipping ratio

# 5. Combine with Z-Ordering
#    Multi-dimensional skipping
```

DON’T
```python
# 1. Don't partition by high-cardinality columns
#    Creates too many small partitions

# 2. Don't skip statistics collection
#    Essential for file skipping

# 3. Don't ignore bloom filter size
#    Larger filters = more memory

# 4. Don't use bloom filters for range queries
#    Not effective for ranges

# 5. Don't forget column statistics
#    Needed for min/max skipping
```

Key Takeaways
- Partition pruning: Skip entire partitions (90-99% reduction)
- File skipping: Min/max statistics at file level (50-90% reduction)
- Bloom filters: Fast exact match lookup (30-70% reduction)
- Z-Ordering: Multi-dimensional clustering (50-80% reduction)
- Monitoring: Track skipping ratio, aim for > 80%
- Combine strategies: Maximize skipping with multiple techniques
- Tool support: Spark, Presto, BigQuery, Snowflake
- Use When: All query workloads, large datasets, filter-heavy queries
Back to Module 7