
Data Skipping

Predicate Pushdown for Performance


Overview

Data skipping lets query engines avoid reading irrelevant partitions, files, and row groups by consulting statistics, bloom filters, and indexes before touching the data. It is essential for query performance on large datasets.


Data Skipping Architecture

Skipping Levels

Skipping Impact (see the worked example after this list):

  • Partition pruning: Skip entire partitions (90-99% reduction)
  • File skipping: Skip entire files (50-90% reduction)
  • Row group skipping: Skip row groups (20-50% reduction)
  • Bloom filters: Skip exact values (30-70% reduction)
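
These levels compound: each one filters what the next has to consider. The sketch below works the arithmetic through once; the reduction factors are illustrative values drawn from the typical ranges above, not guarantees:

# Rough sketch of how skipping levels compound (illustrative numbers only)
total_bytes = 10 * 1024**4                                      # hypothetical 10 TiB table
after_partition_pruning = total_bytes * (1 - 0.95)              # 95% of partitions pruned
after_file_skipping = after_partition_pruning * (1 - 0.70)      # 70% of remaining files skipped
after_row_group_skipping = after_file_skipping * (1 - 0.30)     # 30% of remaining row groups skipped

print(f"Bytes actually scanned: {after_row_group_skipping / 1024**3:.1f} GiB")
# About 107.5 GiB out of 10 TiB, i.e. roughly 1% of the table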

Partition Pruning

Static Partitioning

-- Partition pruning example
-- Table partitioned by date
CREATE TABLE sales (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(10,2)
) PARTITIONED BY (order_date);

-- Query with partition pruning:
-- only the partitions for January 2025 are scanned
SELECT *
FROM sales
WHERE order_date BETWEEN '2025-01-01' AND '2025-01-31';

-- The explain plan confirms partition pruning
EXPLAIN
SELECT * FROM sales
WHERE order_date = '2025-01-15';
-- Output: "Partition filters: order_date = 2025-01-15"
-- Scanned: 1/365 partitions (~0.3%)
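
One caveat: most engines can only prune partitions when the filter references the partition column directly; wrapping it in a function or expression usually disables pruning. A hedged illustration, written with Spark SQL via PySpark as in the later examples:

# Engine-dependent sketch: keep filters on the raw partition column
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PruningCaveat").getOrCreate()

# May scan ALL partitions, because order_date is wrapped in a function
spark.sql("SELECT * FROM sales WHERE YEAR(order_date) = 2025")

# Prunes as expected, because the filter references order_date directly
spark.sql("""
    SELECT * FROM sales
    WHERE order_date BETWEEN '2025-01-01' AND '2025-12-31'
""")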

Dynamic Partitioning

# Dynamic partitioning with Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DynamicPartitioning") \
    .enableHiveSupport() \
    .getOrCreate()

# Read the raw data
df = spark.read.json("s3://bucket/sales/*.json")

# Write with dynamic partitioning by date and country
df.write \
    .mode("overwrite") \
    .partitionBy("order_date", "country") \
    .saveAsTable("sales_partitioned")

# Query with partition pruning
result = spark.sql("""
    SELECT *
    FROM sales_partitioned
    WHERE order_date = '2025-01-15'
      AND country = 'US'
""")
# Only the single (order_date, country) partition is scanned
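
By default, overwriting a partitioned table replaces all of its partitions. Spark's dynamic partition overwrite mode replaces only the partitions present in the incoming data, which is usually what incremental loads want. A minimal sketch (the DataFrame name incremental_df is hypothetical; spark is the session created above):

# spark.sql.sources.partitionOverwriteMode is a standard Spark setting;
# with "dynamic", INSERT OVERWRITE only replaces partitions that receive rows
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# insertInto matches columns by position against sales_partitioned;
# only the partitions present in incremental_df are overwritten
incremental_df.write \
    .mode("overwrite") \
    .insertInto("sales_partitioned")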

File Skipping

Min/Max Statistics

# File-level statistics with Parquet
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("FileSkipping") \
    .getOrCreate()

# Write as Parquet; min/max statistics are collected automatically
df = spark.read.json("s3://bucket/data/*.json")
df.write \
    .mode("overwrite") \
    .parquet("s3://bucket/output/")

# Query with filters
result = spark.sql("""
    SELECT *
    FROM parquet.`s3://bucket/output/`
    WHERE event_date >= '2025-01-01'
      AND amount > 100
""")
# Parquet min/max statistics let the reader skip:
#   - files/row groups where max(event_date) < '2025-01-01'
#   - files/row groups where max(amount) <= 100
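
To see the statistics that these skipping decisions are based on, you can inspect the Parquet footer directly. A minimal sketch using pyarrow (the file path is illustrative):

# Print per-row-group min/max statistics from a Parquet file's footer
import pyarrow.parquet as pq

pf = pq.ParquetFile("part-00000.parquet")   # hypothetical local file
for rg in range(pf.metadata.num_row_groups):
    for col in range(pf.metadata.num_columns):
        chunk = pf.metadata.row_group(rg).column(col)
        stats = chunk.statistics
        if stats is not None:
            print(rg, chunk.path_in_schema, stats.min, stats.max)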

Column Statistics

-- Collect statistics for file skipping
-- Analyze table (PostgreSQL, Redshift)
ANALYZE sales;

-- Snowflake maintains min/max metadata per micro-partition automatically,
-- so no explicit statistics command is required.

-- Compute column statistics (Spark)
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS order_date, amount;

-- Queries can now skip data based on these statistics
SELECT * FROM sales
WHERE order_date >= '2025-01-01'
  AND amount > 100;
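
In Spark you can check what was collected with DESCRIBE EXTENDED on a column. A minimal sketch (the exact output fields vary by Spark version):

# Show the column-level statistics Spark collected for order_date
# (min, max, distinct count, null count) after ANALYZE ... FOR COLUMNS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ColumnStats").getOrCreate()
spark.sql("DESCRIBE EXTENDED sales order_date").show(truncate=False)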

Bloom Filters

Bloom Filter Basics

Bloom Filter Properties (illustrated by the sketch after this list):

  • Space efficient: 10-100x less than original data
  • Fast lookup: O(1) hash operations
  • False positives: Possible (need verification)
  • False negatives: Impossible (100% recall)
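
The sketch below makes these properties concrete. It is a toy implementation for illustration only (real engines such as Parquet use tuned hash functions and a blocked bit-array layout), but it shows why lookups can return false positives yet never false negatives:

# Minimal bloom filter sketch: k hash functions set/check k bits in a bit array
import hashlib

MASKS = [1, 2, 4, 8, 16, 32, 64, 128]   # bit masks for positions 0-7 within a byte

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from k salted hashes of the item
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= MASKS[pos % 8]

    def might_contain(self, item):
        # False means definitely absent; True means possibly present (verify)
        return all(self.bits[pos // 8] & MASKS[pos % 8] for pos in self._positions(item))

bf = BloomFilter()
bf.add(12345)
print(bf.might_contain(12345))   # True: added values are always found
print(bf.might_contain(99999))   # Usually False; a True here would be a false positive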

Bloom Filter Implementation

# Bloom filters with Parquet
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("BloomFilters") \
    .config("spark.sql.parquet.filterPushdown", "true") \
    .getOrCreate()

# Write with per-column bloom filters; Parquet writer options take a
# "#column_name" suffix to target individual columns
df = spark.read.json("s3://bucket/data/*.json")
df.write \
    .mode("overwrite") \
    .option("parquet.bloom.filter.enabled#customer_id", "true") \
    .option("parquet.bloom.filter.enabled#order_id", "true") \
    .parquet("s3://bucket/output/")

# Point lookup that can use the bloom filters
result = spark.sql("""
    SELECT *
    FROM parquet.`s3://bucket/output/`
    WHERE customer_id = 12345
""")
# Row groups whose bloom filter proves customer_id = 12345 is absent are skipped

Bloom Filter Configuration

# Bloom filter sizing options (per column, using the "#column_name" suffix):
#   fpp          - target false positive probability (0.01 = 1%)
#   expected.ndv - expected number of distinct values
df.write \
    .mode("overwrite") \
    .option("parquet.bloom.filter.enabled#customer_id", "true") \
    .option("parquet.bloom.filter.fpp#customer_id", "0.01") \
    .option("parquet.bloom.filter.expected.ndv#customer_id", "1000000") \
    .parquet("s3://bucket/output/")
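
The ndv and fpp settings determine how large the filter is. The standard bloom filter sizing formula gives a quick estimate (a generic sketch, independent of any particular writer implementation):

# Estimate bloom filter size: m = -n * ln(p) / (ln 2)^2 bits,
# where n is the expected number of distinct values and p is the
# target false positive probability
import math

def bloom_filter_size_bytes(ndv: int, fpp: float) -> float:
    bits = -ndv * math.log(fpp) / (math.log(2) ** 2)
    return bits / 8

print(f"{bloom_filter_size_bytes(1_000_000, 0.01) / 1024**2:.2f} MiB")
# Roughly 1.14 MiB for 1M distinct values at 1% false positives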

Data Skipping Strategies

Strategy Selection

Best Practices

Pattern          | Strategy                   | Skipping Ratio
-----------------|----------------------------|---------------
Time-series      | Date partition + min/max   | 90-99%
High cardinality | Bloom filter               | 30-70%
Range queries    | Z-Order + min/max          | 50-80%
Point lookups    | Bloom filter + file stats  | 40-90%
Multi-column     | Composite bloom filters    | 20-60%
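
Z-Order appears in the table above but is not demonstrated elsewhere in this section. A hedged sketch using Delta Lake's OPTIMIZE ... ZORDER BY, assuming the table is stored as a Delta table and the delta-spark package is installed (other formats use their own clustering commands):

# Sketch (Delta Lake assumed): co-locate rows with similar customer_id and
# order_date values so min/max statistics can skip files on both columns
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = SparkSession.builder \
    .appName("ZOrdering") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.sql("OPTIMIZE sales_partitioned ZORDER BY (customer_id, order_date)")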

Data Skipping Tools

Tool Support

Tool         | Partition Pruning | File Skipping | Bloom Filters | Z-Order
-------------|-------------------|---------------|---------------|-----------
Spark        | ✅ Yes            | ✅ Yes        | ✅ Yes        | ✅ Yes
Presto/Trino | ✅ Yes            | ✅ Yes        | ✅ Yes        | ⚠️ Limited
BigQuery     | ✅ Yes            | ✅ Yes        | ❌ No         | ❌ No
Snowflake    | ✅ Yes            | ✅ Yes        | ✅ Yes        | ❌ No
Redshift     | ✅ Yes            | ✅ Yes        | ⚠️ Zone maps  | ❌ No

Data Skipping Monitoring

Metrics

# Monitor data skipping effectiveness
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SkippingMetrics") \
    .getOrCreate()

def parse_metric(explain_output: str, label: str) -> float:
    """Pull the numeric value that follows `label` in the explain output.

    Returns 0.0 if the label is not present. The exact labels depend on the
    engine and table format, so adjust the patterns to your environment.
    """
    match = re.search(rf"{re.escape(label)}\D*([\d.]+)", explain_output, re.IGNORECASE)
    return float(match.group(1)) if match else 0.0

def get_skipping_metrics(query: str) -> dict:
    """Get data skipping metrics from the explain plan."""
    # Run explain and grab the plan text
    explain_output = spark.sql(f"EXPLAIN {query}").collect()[0][0]
    # Parse metrics
    metrics = {
        'scanned_files': parse_metric(explain_output, "scanned files"),
        'skipped_files': parse_metric(explain_output, "skipped files"),
        'scanned_row_groups': parse_metric(explain_output, "scanned row groups"),
        'skipped_row_groups': parse_metric(explain_output, "skipped row groups"),
        'data_read_mb': parse_metric(explain_output, "data read"),
        'skipping_ratio': 0.0,
    }
    # Calculate skipping ratio
    total_files = metrics['scanned_files'] + metrics['skipped_files']
    if total_files > 0:
        metrics['skipping_ratio'] = metrics['skipped_files'] / total_files
    return metrics

# Example usage
metrics = get_skipping_metrics("""
    SELECT *
    FROM sales
    WHERE order_date >= '2025-01-01'
""")
print(f"Skipping ratio: {metrics['skipping_ratio']:.2%}")
print(f"Files scanned: {metrics['scanned_files']}")
print(f"Files skipped: {metrics['skipped_files']}")

Data Skipping Best Practices

DO

  1. Partition by filter columns: enables partition pruning.
  2. Use bloom filters for point lookups: fast exact-match skipping.
  3. Collect column statistics: enables file-level skipping.
  4. Monitor skipping effectiveness: track the skipping ratio over time.
  5. Combine with Z-Ordering: adds multi-dimensional skipping.

DON’T

  1. Don't partition by high-cardinality columns: it creates too many small partitions.
  2. Don't skip statistics collection: it is essential for file skipping.
  3. Don't ignore bloom filter size: larger filters consume more memory.
  4. Don't use bloom filters for range queries: they only help exact matches.
  5. Don't forget column statistics: they are needed for min/max skipping.

Key Takeaways

  1. Partition pruning: Skip entire partitions (90-99% reduction)
  2. File skipping: Min/max statistics at file level (50-90% reduction)
  3. Bloom filters: Fast exact match lookup (30-70% reduction)
  4. Z-Ordering: Multi-dimensional clustering (50-80% reduction)
  5. Monitoring: Track skipping ratio, aim for > 80%
  6. Combine strategies: Maximize skipping with multiple techniques
  7. Tool support: Spark, Presto, BigQuery, Snowflake
  8. Use when: large datasets and filter-heavy queries benefit most, but data skipping helps nearly all query workloads

Back to Module 7