
The Small Files Problem

The Most Expensive Performance Issue in Data Platforms


Overview

The small files problem is one of the most impactful and expensive issues in data platforms at scale. Having thousands or millions of small files can increase query costs by 10-100x, degrade performance, and overwhelm metadata systems. This document covers detection, mitigation, and prevention strategies.


The Problem Explained

What Are Small Files?

Small file: a file significantly smaller than the optimal size for its format (typically < 128 MB for Parquet/ORC).

Why it matters:

  • Parquet/ORC are columnar formats designed for larger files
  • Metadata overhead per file (NameNode, Hive Metastore)
  • Query planning overhead (list and analyze many files)
  • Inefficient I/O (many small reads vs. fewer large reads)

Impact Illustration

Result: on the order of 10-100x query performance degradation, driven by per-file overhead rather than data volume.
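A rough back-of-envelope sketch of where that overhead comes from, using assumed (not measured) figures for object-store listing latency and per-file footer reads:

# Back-of-envelope: query-planning overhead from file listing and footer reads.
# Assumed figures (illustrative only): object-store LIST returns up to 1,000 keys
# per call at ~100 ms each, ~20 ms per Parquet footer read, 32 planner threads.
import math

LIST_PAGE_SIZE = 1_000
LIST_LATENCY_S = 0.10
FOOTER_READ_S = 0.02
PLANNER_THREADS = 32

def planning_overhead_seconds(file_count):
    list_time = math.ceil(file_count / LIST_PAGE_SIZE) * LIST_LATENCY_S  # paginated listing is sequential
    footer_time = file_count * FOOTER_READ_S / PLANNER_THREADS           # footer reads parallelize
    return list_time + footer_time

for n in (1_000, 1_000_000):
    print(f"{n:>9,} files -> ~{planning_overhead_seconds(n):,.0f}s before any data is scanned")

Under these assumptions, a thousand well-sized files cost about a second of planning and listing, while a million small files cost on the order of minutes before a single row is scanned.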


Root Causes

Cause 1: Streaming Micro-Batches

# Bad: Spark Structured Streaming with default trigger and no file size controls
(df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoint")
    .start("/path/to/delta"))

Problem: Each micro-batch creates its own set of files, often only a few MB to tens of MB each, well below the 128 MB target.

Cause 2: High Partition Cardinality

-- Bad: Partition by high-cardinality columns
PARTITIONED BY (customer_id) -- 10M customers = 10M partitions!
-- Better: Partition by date
PARTITIONED BY (order_date)

Problem: Each of the millions of partition directories holds only one or a few tiny files, so the table as a whole becomes millions of small files.

Cause 3: Over-Partitioning

-- Bad: Too many partition keys
PARTITIONED BY (date, hour, country, product_category)
-- Result: 365 × 24 × 50 × 10 = 4,380,000 partitions!

Cause 4: Frequent Writes

# Bad: Writing every few minutes in small increments
for batch in batches:
    batch.write.format("delta").mode("append").save(path)

Problem: Each append creates at least one new file, so frequent small writes accumulate thousands of small files.


Detection

Metrics to Monitor

# Analyze file size distribution
import numpy as np

def analyze_file_distribution(table_path):
    files = list_files(table_path)  # storage-specific listing helper
    sizes = [file.size for file in files]
    return {
        "file_count": len(files),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "avg_size": sum(sizes) / len(sizes),
        "p50_size": np.percentile(sizes, 50),
        "p95_size": np.percentile(sizes, 95),
        "files_below_128MB": sum(1 for s in sizes if s < 128 * 1024 * 1024),
    }

# Alert if > 50% of files are small (alert() = your notification hook)
result = analyze_file_distribution(table_path)
if result["files_below_128MB"] / result["file_count"] > 0.5:
    alert("Small files problem detected!")

Automated Detection

# AWS S3 example: count objects below 128 MB under a table prefix
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
small_files = []
for page in paginator.paginate(Bucket='my-bucket', Prefix='table/'):
    for obj in page.get('Contents', []):
        if obj['Size'] < 128 * 1024 * 1024:  # < 128 MB
            small_files.append(obj['Key'])
if len(small_files) > 1000:
    print(f"WARNING: {len(small_files)} small files detected")

Solutions

Solution 1: File Size Configuration

Spark Structured Streaming:

# Configure file sizes on both the source and the sink
(spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 1000)      # source side: cap files per micro-batch
    .load("/path/to/source")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoint")
    .option("maxRecordsPerFile", 10000000)   # sink side: cap rows per output file (~10M rows)
    .start("/path/to/delta"))

Spark Batch:

# Control the number of output files directly
# coalesce(n): narrow transformation, avoids a full shuffle but can produce skewed files
df.coalesce(10) \
    .write \
    .format("delta") \
    .mode("overwrite") \
    .save("/path/to/delta")

# repartition(n): full shuffle, more expensive but yields evenly sized files
df.repartition(10) \
    .write \
    .format("delta") \
    .mode("overwrite") \
    .save("/path/to/delta")

Solution 2: Continuous Compaction

Delta Lake OPTIMIZE:

-- Optimize table (merge small files)
OPTIMIZE table_name;
-- Optimize specific partition
OPTIMIZE table_name WHERE date = '2025-01-27';
-- Optimize with Z-Ordering
OPTIMIZE table_name
WHERE date >= '2025-01-01'
ZORDER BY (customer_id, event_type);

Automated Compaction:

# Schedule regular compaction with Airflow
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def compact_table(table_name):
    # assumes a SparkSession named `spark` is available to the worker
    spark.sql(f"OPTIMIZE {table_name}")

with DAG(
    'compaction_dag',
    schedule_interval='0 2 * * *',  # daily at 02:00
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:
    compact_orders = PythonOperator(
        task_id='compact_orders',
        python_callable=compact_table,
        op_kwargs={'table_name': 'orders'},
    )

Solution 3: Adaptive Execution

# Spark adaptive query execution
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
# AQE coalesces small shuffle partitions toward the advisory size, which reduces
# the number of output files whenever the write follows a shuffle
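Because AQE only coalesces shuffle partitions, it helps file sizes when the write is preceded by a shuffle (groupBy, join, repartition). A minimal sketch, with illustrative paths and column names:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")

# The groupBy introduces a shuffle; AQE merges its small partitions toward
# ~128 MB, so the write below emits fewer, larger files.
daily_counts = (
    spark.read.format("delta").load("/path/to/events")   # illustrative input path
    .groupBy("event_date", "event_type")
    .agg(F.count("*").alias("event_count"))
)
daily_counts.write.format("delta").mode("overwrite").save("/path/to/daily_counts")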

Solution 4: Partitioning Strategy

-- Bad: High-cardinality partitioning
CREATE TABLE bad_partitioning (
...
) PARTITIONED BY (customer_id, event_date);
-- Good: Date-based partitioning
CREATE TABLE good_partitioning (
...
) PARTITIONED BY (event_date);
-- Use Z-Ordering for other dimensions
OPTIMIZE good_partitioning
ZORDER BY (customer_id, event_type);

Cost Impact Analysis

Storage vs. Compute Trade-off

Compaction spends extra write compute (and, in Delta, keeps the replaced files around until VACUUM), but total storage is essentially unchanged; the payoff is on the compute side, where every subsequent query lists, opens, and scans far fewer files.

Real-World Cost Example

Scenario: 100TB table, 10,000 queries/day

Strategy        File Count    Avg File Size    Query Cost    Monthly Cost
Small files     1,000,000     100 MB           $10.00        $3,000,000
Optimal         1,000         100 GB           $0.10         $30,000
Savings                                                      $2,970,000/month

Result: over $35M per year saved by fixing the small files problem.
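The monthly figures follow directly from the assumed per-query costs and query volume; a quick sketch of the arithmetic behind the table:

QUERIES_PER_DAY = 10_000
DAYS_PER_MONTH = 30

def monthly_cost(cost_per_query):
    # total monthly spend = cost per query x queries per day x days per month
    return cost_per_query * QUERIES_PER_DAY * DAYS_PER_MONTH

small_files = monthly_cost(10.00)  # $10.00/query across ~1,000,000 small files
optimal = monthly_cost(0.10)       # $0.10/query after compaction
print(f"small files: ${small_files:,.0f}/month")           # $3,000,000
print(f"optimal:     ${optimal:,.0f}/month")                # $30,000
print(f"savings:     ${small_files - optimal:,.0f}/month")  # $2,970,000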


Compaction Strategies

Strategy 1: Time-Based Compaction

# Compact partitions covering the last N hours
from datetime import datetime, timedelta

def compact_recent_partitions(table_name, hours=4):
    """
    Compact partitions from the last N hours.
    """
    cutoff_date = (datetime.now() - timedelta(hours=hours)).date()
    # OPTIMIZE predicates may only reference partition columns (here: event_date)
    spark.sql(f"""
        OPTIMIZE {table_name}
        WHERE event_date >= '{cutoff_date}'
    """)

Strategy 2: Size-Based Compaction

# Compact partitions that contain many small files
def compact_partition(table_name, partition_key, partition_value):
    """
    Check a partition's file count and compact it if needed.
    """
    # input_file_name() exposes each row's source file, so a distinct count
    # approximates the number of data files backing the partition
    file_count = spark.sql(f"""
        SELECT COUNT(DISTINCT input_file_name()) AS file_count
        FROM {table_name}
        WHERE {partition_key} = '{partition_value}'
    """).collect()[0]['file_count']
    if file_count > 100:
        spark.sql(f"""
            OPTIMIZE {table_name}
            WHERE {partition_key} = '{partition_value}'
        """)

Strategy 3: Full Table Compaction

# Compact entire table (run during low-traffic hours)
def compact_table(table_name):
    """
    Compact the entire table with Z-Ordering.
    """
    # Z-ORDER columns should be common query predicates, not partition columns
    spark.sql(f"""
        OPTIMIZE {table_name}
        ZORDER BY (customer_id, event_type)
    """)

Prevention Strategies

Prevention 1: Configure File Size at Write

# Cap rows per output file at write time
df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("maxRecordsPerFile", 10000000) \
    .save(path)
# A target file size can also be set as a Delta table property, e.g. on Databricks:
# ALTER TABLE my_table SET TBLPROPERTIES ('delta.targetFileSize' = '256mb')

Prevention 2: Batch Streaming Writes

# Write micro-batches less frequently and cap records per output file
def write_batch(batch_df, batch_id):
    (batch_df.write
        .format("delta")
        .mode("append")
        .option("maxRecordsPerFile", 10000000)
        .save(path))

(df.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/checkpoint")
    .trigger(processingTime='5 minutes')  # less frequent trigger -> larger batches
    .start())

Prevention 3: Partition Properly

-- Bad: Too many partitions
PARTITIONED BY (date, hour, country)
-- Good: Single partition dimension
PARTITIONED BY (date)
-- Then use Z-Ordering for other dimensions
OPTIMIZE table ZORDER BY (country, customer_type)

Senior Level Considerations

The “Death Spiral”

Small files feed on themselves: queries slow down, so teams add finer partitioning or more frequent writes to compensate, which creates even more small files and slows queries further.

Breaking the cycle (a sketch tying these steps together follows the list):

  1. Stop adding partitions
  2. Compact existing data
  3. Implement file size monitoring
  4. Educate team on proper partitioning
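A minimal remediation sketch for steps 2 and 3, reusing analyze_file_distribution (Detection) and compact_partition (Strategy 2) defined earlier; list_partitions and alert are assumed helpers from your environment:

def remediate_small_files(table_name, table_path, partition_key):
    # Step 2: compact existing data, partition by partition
    for partition_value in list_partitions(table_name, partition_key):  # assumed helper
        compact_partition(table_name, partition_key, partition_value)   # Strategy 2 above

    # Step 3: re-check the file size distribution and alert if still unhealthy
    stats = analyze_file_distribution(table_path)  # defined under "Detection"
    if stats["files_below_128MB"] / stats["file_count"] > 0.5:
        alert(f"{table_name}: still >50% small files after compaction")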

Monitoring Dashboard

# Metrics to track
small_files_dashboard:
  metrics:
    - name: "Files below 128MB"
      query: "SELECT COUNT(*) FROM file_stats WHERE size < 134217728"  # 128 MB in bytes
      alert: "If > 1000"
    - name: "Partition file count"
      query: "SELECT partition, COUNT(*) AS file_count FROM file_stats GROUP BY partition"
      alert: "If any partition > 100 files"
    - name: "Average file size"
      query: "SELECT AVG(size) AS avg_size FROM file_stats"
      alert: "If < 100MB"

Best Practices

DO

# 1. Monitor file size distribution
analyze_file_distribution(table_path)

# 2. Configure optimal file size at write time
.option("maxRecordsPerFile", 10000000)

# 3. Run regular compaction
OPTIMIZE table_name WHERE date >= '2025-01-01'

# 4. Use Z-Ordering instead of over-partitioning
OPTIMIZE table_name ZORDER BY (dimension1, dimension2)

# 5. Alert on small file creation
if new_file_size < 128 * 1024 * 1024:
    alert(f"Small file created: {file_path}")

DON’T

# 1. Don't ignore file size
# Always monitor and optimize
# 2. Don't partition by high-cardinality columns
# Avoid: PARTITIONED BY (customer_id)
# 3. Don't write every few minutes
# Batch writes to create larger files
# 4. Don't use default streaming settings
# Configure maxRecordsPerFile, maxFilesPerTrigger
# 5. Don't forget compaction
# Schedule regular OPTIMIZE jobs

Key Takeaways

  1. Problem: Small files = 10-100x query cost increase
  2. Detection: Monitor file size distribution, alert if > 50% small
  3. Solution 1: Configure file size at write (maxRecordsPerFile)
  4. Solution 2: Continuous compaction (OPTIMIZE)
  5. Solution 3: Proper partitioning (date, not high-cardinality)
  6. Prevention: Configure streaming, batch writes appropriately
  7. Cost impact: savings range from tens of thousands to millions of dollars per month at scale, depending on data and query volume
  8. Automation: Monitor, compact, prevent continuously
