Compression Codecs
Storage Cost vs. CPU Trade-offs
Overview
Compression reduces storage costs and I/O at the expense of CPU. Selecting the right codec is critical for balancing storage savings, query performance, and cost.
Compression Comparison
Codec Metrics
| Codec | Compression Strength | Decompression Speed | Typical Ratio | Compression Speed | Best For |
|---|---|---|---|---|---|
| Snappy | Low | Very Fast | 2-3x | Very Fast | General purpose |
| ZSTD | Medium | Fast | 3-5x | Fast | Balance |
| GZIP | High | Slow | 4-6x | Slow | Archive, cold data |
| LZ4 | Very Low | Very Fast | 2x | Very Fast | Real-time |
| Brotli | Very High | Medium | 5-8x | Medium | Web, text |
Cost vs. Performance
Faster codecs (LZ4, Snappy) minimize CPU cost but leave more bytes on disk, while heavier codecs (GZIP, Brotli) cut storage and I/O at a higher compute cost; ZSTD sits in between and is tunable.
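As a rough, single-machine illustration, the snippet below compresses one synthetic JSON payload with each codec's Python binding and prints the resulting ratio. It assumes the third-party packages python-snappy, zstandard, lz4 and brotli are installed (gzip ships with the standard library); real ratios depend heavily on your data.

# Rough illustration of the ratios in the table above (not a benchmark).
# Assumes: pip install python-snappy zstandard lz4 brotli
import gzip
import json

import brotli
import lz4.frame
import snappy
import zstandard

# Repetitive JSON-like text, which compresses well
sample = (json.dumps({"id": 123, "event": "click", "ts": "2024-01-01T00:00:00Z"}) * 50_000).encode()

codecs = {
    "snappy": lambda d: snappy.compress(d),
    "zstd-3": lambda d: zstandard.ZstdCompressor(level=3).compress(d),
    "gzip-6": lambda d: gzip.compress(d, compresslevel=6),
    "lz4": lambda d: lz4.frame.compress(d),
    "brotli-6": lambda d: brotli.compress(d, quality=6),
}

for name, compress in codecs.items():
    compressed = compress(sample)
    print(f"{name:>8}: {len(sample) / len(compressed):.1f}x smaller")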
Compression by Format
Parquet Compression
# Parquet compression with Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ParquetCompression") \
    .getOrCreate()

# Write the same dataset with different codecs
df = spark.read.json("s3://bucket/data/*.json")

# Snappy (default, balanced)
df.write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("s3://bucket/output/snappy/")

# ZSTD (recommended): the writer uses level 3 by default;
# level tuning is covered in the ZSTD Levels section below
df.write \
    .mode("overwrite") \
    .option("compression", "zstd") \
    .parquet("s3://bucket/output/zstd/")

# GZIP (archive)
df.write \
    .mode("overwrite") \
    .option("compression", "gzip") \
    .parquet("s3://bucket/output/gzip/")

# LZ4 (fastest)
df.write \
    .mode("overwrite") \
    .option("compression", "lz4") \
    .parquet("s3://bucket/output/lz4/")

Delta Lake Compression
# Delta Lake compression
from delta import *

spark = SparkSession.builder \
    .appName("DeltaCompression") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Delta data files are Parquet, so compression is controlled by the
# Parquet codec setting rather than a Delta write option
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("s3://bucket/delta/customers/")

# Change compression on an existing table
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "s3://bucket/delta/customers/")

# New writes pick up the new codec immediately; existing files keep their
# old codec until rewritten, e.g. by OPTIMIZE compaction, which re-encodes
# the files it rewrites with the current session codec
delta_table.optimize().executeCompaction()

Compression Analysis
Benchmark Comparison
# Compression benchmark
import time

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("CompressionBenchmark") \
    .getOrCreate()

def benchmark_compression(
    df,
    codecs: list = ["snappy", "zstd", "gzip", "lz4"],
) -> pd.DataFrame:
    """Benchmark compression codecs on a DataFrame."""
    # Uncompressed baseline used to compute the compression ratio.
    # get_directory_size(path) is a helper (not shown) that returns the total
    # size in bytes of the files under `path`, e.g. via the cloud provider's
    # SDK or the Hadoop FileSystem API.
    baseline_path = "s3://bucket/temp/uncompressed/"
    df.write \
        .mode("overwrite") \
        .option("compression", "none") \
        .parquet(baseline_path)
    uncompressed_size = get_directory_size(baseline_path)

    results = []

    for codec in codecs:
        # Write with this codec and time it
        path = f"s3://bucket/temp/{codec}/"
        start = time.time()
        df.write \
            .mode("overwrite") \
            .option("compression", codec) \
            .parquet(path)
        write_time = time.time() - start

        # On-disk size with this codec
        file_size = get_directory_size(path)

        # Read back and time a full scan
        start = time.time()
        spark.read.parquet(path).count()
        read_time = time.time() - start

        results.append({
            'codec': codec,
            'write_time': write_time,
            'read_time': read_time,
            'file_size_mb': file_size / (1024 * 1024),
            'compression_ratio': uncompressed_size / file_size,
        })

    return pd.DataFrame(results)
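A possible invocation, assuming df has been loaded as in the Parquet examples above and the temporary bucket paths are writable:

# Run the benchmark and compare the codecs side by side
df = spark.read.json("s3://bucket/data/*.json")
results = benchmark_compression(df)
print(results.sort_values("file_size_mb"))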
# Example results (typical):
# codec    write_time  read_time  file_size_mb  compression_ratio
# snappy   120         45         1024          2.5x
# zstd     150         50         768           3.3x
# gzip     200         80         640           4.0x
# lz4      100         40         1280          2.0x

Compression Strategy
Hot/Warm/Cold Tiering
Match the codec to how often the data is read: hot data favors fast codecs, warm data a balanced codec such as ZSTD, and cold data maximum compression.
Strategy by Use Case
| Use Case | Recommended Codec | Rationale |
|---|---|---|
| Real-time analytics | Snappy, LZ4 | Fast decompression |
| Data warehouse | ZSTD (level 3) | Balance of speed and size |
| Archive | GZIP, ZSTD (level 22) | Maximum compression |
| Machine learning | Snappy | Fast training data loading |
| Log data | ZSTD | Good compression, decent speed |
| JSON/XML | GZIP, Brotli | Text compresses well |
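One way to apply the table above inside a pipeline is a small lookup from use case to writer settings. The sketch below reuses spark and df from the Parquet examples; the use-case keys, CODEC_BY_USE_CASE and write_for_use_case are illustrative names, not a standard API.

# Illustrative only: the use-case keys, names and paths are assumptions.
# ZSTD level tuning is covered in the next section.
CODEC_BY_USE_CASE = {
    "realtime": "snappy",
    "warehouse": "zstd",
    "archive": "gzip",
    "ml_training": "snappy",
    "logs": "zstd",
}

def write_for_use_case(df, path: str, use_case: str):
    """Write `df` as Parquet with the codec suggested for `use_case`."""
    codec = CODEC_BY_USE_CASE[use_case]
    df.write \
        .mode("overwrite") \
        .option("compression", codec) \
        .parquet(path)

# Example
write_for_use_case(df, "s3://bucket/output/warehouse/", "warehouse")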
ZSTD Levels
Level Selection
# ZSTD level tuning
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ZSTDLevels") \
    .getOrCreate()

# ZSTD levels range from 1 to 22:
#   lower  = faster, less compression
#   higher = slower, more compression
zstd_configs = {
    "fast": {"level": 1, "description": "Fastest, lowest compression"},
    "default": {"level": 3, "description": "Default, balanced"},
    "high": {"level": 10, "description": "High compression"},
    "max": {"level": 22, "description": "Maximum compression, slowest"},
}

def write_with_zstd(df, path: str, level: int = 3):
    """Write `df` as ZSTD-compressed Parquet at the given level."""
    # Spark only exposes the codec name directly; the level is read by the
    # Parquet writer from the parquet.compression.codec.zstd.level property
    # (exact behaviour depends on the Spark and Parquet versions in use)
    df.write \
        .mode("overwrite") \
        .option("compression", "zstd") \
        .option("parquet.compression.codec.zstd.level", str(level)) \
        .parquet(path)

# Example: fast compression
write_with_zstd(df, "s3://bucket/output/fast/", level=1)

# Example: maximum compression
write_with_zstd(df, "s3://bucket/output/max/", level=22)

Level Recommendations
| Data Type | Recommended Level | Reason |
|---|---|---|
| Hot data | 1-3 | Fast queries |
| Warm data | 3-7 | Balanced |
| Cold data | 10-22 | Storage savings |
| Streaming | 1-3 | Low latency |
| Batch ETL | 3-7 | Balanced |
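As a sketch of how these levels can be applied over a dataset's lifecycle, the job below rewrites a partition that has aged past a cutoff at a higher ZSTD level. The hot/cold prefixes, the dt=YYYY-MM-DD layout and the 90-day cutoff are assumptions; the level property is passed the same way as in write_with_zstd above.

from datetime import date, timedelta

# Hypothetical layout: one directory per day, e.g. s3://bucket/events/hot/dt=2024-01-01/
HOT_PREFIX = "s3://bucket/events/hot"
COLD_PREFIX = "s3://bucket/events/cold"
COLD_AFTER_DAYS = 90

def recompress_partition_as_cold(dt: str, level: int = 19):
    """Rewrite one day's partition at a higher ZSTD level under the cold prefix."""
    # Write to a separate location: Spark cannot safely overwrite a path
    # it is reading from in the same job
    spark.read.parquet(f"{HOT_PREFIX}/dt={dt}/") \
        .write \
        .mode("overwrite") \
        .option("compression", "zstd") \
        .option("parquet.compression.codec.zstd.level", str(level)) \
        .parquet(f"{COLD_PREFIX}/dt={dt}/")

# Example: re-compress the partition that has just crossed the cutoff
cutoff = date.today() - timedelta(days=COLD_AFTER_DAYS)
recompress_partition_as_cold(cutoff.strftime("%Y-%m-%d"))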
Compression Best Practices
DO
1. Use ZSTD for most data: best balance of speed and size.
2. Use Snappy for hot data: fastest decompression.
3. Use GZIP for cold data: maximum compression.
4. Monitor compression ratios: track storage savings.
5. Benchmark your own data: compression varies by data type.

DON’T
1. Don't use GZIP for hot data: decompression is too slow for interactive queries.
2. Don't ignore CPU cost: compression adds compute overhead on every write and read.
3. Don't use the wrong codec for the format: some formats have preferred codecs.
4. Don't forget decompression speed: queries pay to decompress on every read.
5. Don't compress already-compressed data: images, videos, and archives barely shrink further.

Key Takeaways
- Snappy: Fastest, lowest compression (2-3x)
- ZSTD: Best balance, 3-5x compression, tunable levels
- GZIP: Highest compression (4-6x), slowest
- Hot data: Use Snappy or ZSTD level 1-3
- Cold data: Use GZIP or ZSTD level 10-22
- Parquet: ZSTD recommended
- Delta Lake: ZSTD level 3 default
- Use When: Any data at rest; choose the codec to balance storage cost against compute
Back to Module 7