The Small Files Problem
The Most Expensive Performance Issue in Data Platforms
Overview
The small files problem is one of the most impactful and expensive issues in data platforms at scale. Having thousands or millions of small files can increase query costs by 10-100x, degrade performance, and overwhelm metadata systems. This document covers detection, mitigation, and prevention strategies.
The Problem Explained
What Are Small Files?
Small file: File significantly smaller than optimal size (typically < 128MB for Parquet).
Why it matters:
- Parquet/ORC are columnar formats designed for larger files
- Metadata overhead per file (NameNode, Hive Metastore)
- Query planning overhead (list and analyze many files)
- Inefficient I/O (many small reads vs. fewer large reads)
Impact Illustration
Result: 100x query performance degradation.
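Where does the slowdown come from? A back-of-envelope sketch (every number below is an illustrative assumption, not a benchmark): each file adds a fixed cost for listing, opening, and reading its footer before any useful data is scanned, so per-file overhead eventually dwarfs the actual scan.

```python
# Illustrative only: all values are assumptions, not measurements.
FILES_SMALL, FILES_OPTIMAL = 1_000_000, 1_000  # file counts taken from the cost example later in this doc
PER_FILE_OVERHEAD_MS = 2                       # assumed: listing + open + footer/stats read per file
USEFUL_SCAN_MS = 20_000                        # assumed: time spent scanning the data the query actually needs

def query_ms(file_count):
    # fixed per-file cost plus the useful scan time
    return file_count * PER_FILE_OVERHEAD_MS + USEFUL_SCAN_MS

print(query_ms(FILES_SMALL) / query_ms(FILES_OPTIMAL))  # ~92x slower purely from per-file overhead
```

The exact ratio depends on the query, but the shape is always the same: fixed per-file costs grow linearly with the file count while the useful work does not.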
Root Causes
Cause 1: Streaming Micro-Batches
```python
# Bad: Spark Structured Streaming with default settings
(df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoint")
    .start("/path/to/delta"))
```

Problem: each micro-batch commit creates its own small files (often only 10-100 MB each).
Cause 2: High Partition Cardinality
```sql
-- Bad: Partition by a high-cardinality column
PARTITIONED BY (customer_id)  -- 10M customers = 10M partitions!

-- Better: Partition by date
PARTITIONED BY (order_date)
```

Problem: each partition directory holds only a handful of tiny files, so millions of partitions mean millions of small files.
Cause 3: Over-Partitioning
```sql
-- Bad: Too many partition keys
PARTITIONED BY (date, hour, country, product_category)
-- Result: 365 × 24 × 50 × 10 = 4,380,000 partitions!
```

Cause 4: Frequent Writes

```python
# Bad: Writing every few minutes in small appends
for batch in batches:
    batch.write.format("delta").mode("append").save(path)
```

Problem: each write commits its own set of small files.
Detection
Metrics to Monitor
```python
# Analyze the file size distribution of a table
import numpy as np

def analyze_file_distribution(table_path):
    files = list_files(table_path)   # list_files(): your storage listing helper (local FS, S3, ...)
    sizes = [file.size for file in files]
    return {
        "file_count": len(files),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "avg_size": sum(sizes) / len(sizes),
        "p50_size": np.percentile(sizes, 50),
        "p95_size": np.percentile(sizes, 95),
        "files_below_128MB": sum(1 for s in sizes if s < 128 * 1024 * 1024),
    }

# Alert if > 50% of files are small
result = analyze_file_distribution(table_path)
if result["files_below_128MB"] / result["file_count"] > 0.5:
    alert("Small files problem detected!")   # alert(): your notification hook
```

Automated Detection
```python
# AWS S3 example
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

small_files = []
for page in paginator.paginate(Bucket='my-bucket', Prefix='table/'):
    for obj in page.get('Contents', []):
        if obj['Size'] < 128 * 1024 * 1024:  # < 128MB
            small_files.append(obj['Key'])

if len(small_files) > 1000:
    print(f"WARNING: {len(small_files)} small files detected")
```

Solutions
Solution 1: File Size Configuration
Spark Structured Streaming:
```python
# Configure file-size-related options on the streaming write
(df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoint")
    .option("maxFilesPerTrigger", 1000)     # note: usually set on the streaming source (readStream) for file/Delta sources
    .option("maxRecordsPerFile", 10000000)  # ~10M rows per output file
    .start("/path/to/delta"))
```

Spark Batch:
```python
# Control the number of output files with coalesce (no shuffle)
df.coalesce(10) \
    .write \
    .format("delta") \
    .mode("overwrite") \
    .save("/path/to/delta")

# Or use repartition (full shuffle, evenly sized output files)
df.repartition(10) \
    .write \
    .format("delta") \
    .mode("overwrite") \
    .save("/path/to/delta")
```

Prefer `coalesce` when you only need to reduce the file count cheaply; use `repartition` when the data is skewed and you want evenly sized output files.

Solution 2: Continuous Compaction
Delta Lake OPTIMIZE:
```sql
-- Optimize table (merge small files)
OPTIMIZE table_name;

-- Optimize a specific partition
OPTIMIZE table_name WHERE date = '2025-01-27';

-- Optimize with Z-Ordering
OPTIMIZE table_name
WHERE date >= '2025-01-01'
ZORDER BY (customer_id, event_type);
```

Automated Compaction:
```python
# Schedule regular compaction
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def compact_table(table_name):
    spark.sql(f"OPTIMIZE {table_name}")   # assumes a SparkSession named `spark` is available

with DAG('compaction_dag', start_date=datetime(2025, 1, 1),
         schedule_interval='0 2 * * *', catchup=False) as dag:
    compact_orders = PythonOperator(
        task_id='compact_orders',
        python_callable=compact_table,
        op_kwargs={'table_name': 'orders'},
    )
```

Solution 3: Adaptive Execution
```python
# Spark adaptive query execution
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")

# Spark automatically coalesces small shuffle partitions at runtime
```

Note that AQE coalescing reduces the number of small files written after a shuffle; it does not fix small files already on disk, so it complements rather than replaces compaction.

Solution 4: Partitioning Strategy
```sql
-- Bad: High-cardinality partitioning
CREATE TABLE bad_partitioning (
    ...
) PARTITIONED BY (customer_id, event_date);

-- Good: Date-based partitioning
CREATE TABLE good_partitioning (
    ...
) PARTITIONED BY (event_date);

-- Use Z-Ordering for the other dimensions
OPTIMIZE good_partitioning
ZORDER BY (customer_id, event_type);
```

Cost Impact Analysis
Storage vs. Compute Trade-off
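The total bytes stored are roughly the same before and after compaction, so the real trade-off is a one-time compaction compute cost against scan and planning savings on every subsequent query. A minimal break-even sketch (the compaction cost is an illustrative assumption; the per-query costs come from the example below):

```python
# Illustrative only: assumed costs for a single large table.
COMPACTION_COST = 500.0          # assumed: one-off compute cost to compact the table ($)
COST_PER_QUERY_SMALL = 10.00     # per-query cost before compaction (from the example below)
COST_PER_QUERY_OPTIMAL = 0.10    # per-query cost after compaction
QUERIES_PER_DAY = 10_000

savings_per_day = (COST_PER_QUERY_SMALL - COST_PER_QUERY_OPTIMAL) * QUERIES_PER_DAY
breakeven_days = COMPACTION_COST / savings_per_day
print(f"Compaction pays for itself in {breakeven_days:.4f} days (~{breakeven_days * 24 * 60:.0f} minutes)")
```

Under these assumptions the compaction job pays for itself within minutes; even with much more conservative numbers, it typically pays back within a day.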
Real-World Cost Example
Scenario: 100TB table, 10,000 queries/day
| Strategy | File Count | Avg File Size | Cost per Query | Monthly Cost |
|---|---|---|---|---|
| Small files | 1,000,000 | 100 MB | $10.00 | $3,000,000 |
| Optimal | 1,000 | 100 GB | $0.10 | $30,000 |
| Savings | | | | $2,970,000/month |

Result: over $35M saved annually by fixing the small files problem.
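As a sanity check, the monthly figures follow directly from per-query cost times query volume (assuming 30 days per month):

```python
# Reproduce the table's monthly costs from its own inputs.
QUERIES_PER_DAY, DAYS_PER_MONTH = 10_000, 30

def monthly_cost(cost_per_query):
    return cost_per_query * QUERIES_PER_DAY * DAYS_PER_MONTH

small, optimal = monthly_cost(10.00), monthly_cost(0.10)
print(small, optimal, small - optimal)   # 3,000,000   30,000   2,970,000
print((small - optimal) * 12)            # ~35.6M saved per year
```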
Compaction Strategies
Strategy 1: Time-Based Compaction
```python
# Compact partitions written in the last N hours
from datetime import datetime, timedelta

def compact_recent_partitions(table_name, hours=4):
    """Compact partitions from the last N hours."""
    cutoff_time = datetime.now() - timedelta(hours=hours)
    spark.sql(f"""
        OPTIMIZE {table_name}
        WHERE event_time >= '{cutoff_time}'
    """)
```

Strategy 2: Size-Based Compaction
```python
# Compact partitions that contain too many files
def compact_partition(table_name, partition_key, partition_value):
    """Check the partition's file count and compact it if needed."""
    # input_file_name() exposes the underlying data file for each row
    file_count = spark.sql(f"""
        SELECT COUNT(DISTINCT input_file_name()) AS file_count
        FROM {table_name}
        WHERE {partition_key} = '{partition_value}'
    """).collect()[0]['file_count']

    if file_count > 100:
        spark.sql(f"""
            OPTIMIZE {table_name}
            WHERE {partition_key} = '{partition_value}'
        """)
```

Strategy 3: Full Table Compaction
```python
# Compact the entire table (run during low-traffic hours)
def compact_table(table_name):
    """Compact the entire table with Z-Ordering."""
    spark.sql(f"""
        OPTIMIZE {table_name}
        ZORDER BY (date, customer_id, event_type)
    """)
```

Prevention Strategies
Prevention 1: Configure File Size at Write
```python
# Set a target file size at write time
df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("maxRecordsPerFile", 10000000) \
    .option("targetFileSize", "256MB") \
    .save(path)
```

Prevention 2: Batch Streaming Writes
```python
# Write micro-batches less frequently, producing larger files
(df.writeStream
    .foreachBatch(lambda batch_df, batch_id:
        batch_df.write
            .format("delta")
            .mode("append")
            .option("maxRecordsPerFile", 10000000)
            .save(path))
    .trigger(processingTime='5 minutes')   # less frequent triggers = larger batches
    .start())
```

Prevention 3: Partition Properly
```sql
-- Bad: Too many partition keys
PARTITIONED BY (date, hour, country)

-- Good: A single partition dimension
PARTITIONED BY (date)

-- Then use Z-Ordering for the other dimensions
OPTIMIZE table ZORDER BY (country, customer_type)
```

Senior Level Considerations
The “Death Spiral”

Small files tend to compound: queries slow down, teams add more partitions or more frequent writes to compensate, and that creates even more small files and even slower queries.

Breaking the cycle:
- Stop adding partitions
- Compact existing data
- Implement file size monitoring (see the sketch after this list)
- Educate team on proper partitioning
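A minimal sketch tying steps 2 and 3 together, reusing `analyze_file_distribution` from the Detection section; the table list, threshold, and `alert()` hook are assumptions to adapt to your environment:

```python
# Assumes: a SparkSession named `spark`, an alert() hook, and
# analyze_file_distribution() from the Detection section above.
TABLES = {"orders": "/path/to/orders"}   # hypothetical table-name -> storage-path mapping
SMALL_FILE_RATIO_THRESHOLD = 0.5         # assumed threshold: alert/compact when >50% of files are small

def monitor_and_compact():
    for table_name, table_path in TABLES.items():
        stats = analyze_file_distribution(table_path)
        small_ratio = stats["files_below_128MB"] / stats["file_count"]
        if small_ratio > SMALL_FILE_RATIO_THRESHOLD:
            alert(f"{table_name}: {small_ratio:.0%} of files are below 128MB")
            spark.sql(f"OPTIMIZE {table_name}")   # compact before the problem compounds
```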
Monitoring Dashboard
```yaml
# Metrics to track
small_files_dashboard:
  metrics:
    - name: "Files below 128MB"
      query: "SELECT COUNT(*) FROM file_stats WHERE size < 128MB"
      alert: "If > 1000"
    - name: "Partition file count"
      query: "SELECT partition, COUNT(*) AS file_count FROM file_stats GROUP BY partition"
      alert: "If any partition > 100 files"
    - name: "Average file size"
      query: "SELECT AVG(size) AS avg_size FROM file_stats"
      alert: "If < 100MB"
```

Best Practices
DO
```
# 1. Monitor file size distribution
analyze_file_distribution(table_path)

# 2. Configure optimal file size
.option("maxRecordsPerFile", 10000000)

# 3. Run regular compaction
OPTIMIZE table_name WHERE date >= '2025-01-01'

# 4. Use Z-Ordering instead of over-partitioning
OPTIMIZE table_name ZORDER BY (dimension1, dimension2)

# 5. Alert on small file creation
if new_file_size < 128 * 1024 * 1024:
    alert(f"Small file created: {file_path}")
```

DON’T
```
# 1. Don't ignore file size
# Always monitor and optimize

# 2. Don't partition by high-cardinality columns
# Avoid: PARTITIONED BY (customer_id)

# 3. Don't write every few minutes
# Batch writes to create larger files

# 4. Don't use default streaming settings
# Configure maxRecordsPerFile, maxFilesPerTrigger

# 5. Don't forget compaction
# Schedule regular OPTIMIZE jobs
```

Key Takeaways
- Problem: Small files = 10-100x query cost increase
- Detection: Monitor file size distribution, alert if > 50% small
- Solution 1: Configure file size at write (maxRecordsPerFile)
- Solution 2: Continuous compaction (OPTIMIZE)
- Solution 3: Proper partitioning (date, not high-cardinality)
- Prevention: Configure streaming, batch writes appropriately
- Cost impact: can save $10K-$100K+/month at scale, and far more for very large tables (see the cost example above)
- Automation: Monitor, compact, prevent continuously
Next: Compaction Strategies