Compression Codecs

Storage Cost vs. CPU Trade-offs


Overview

Compression reduces storage costs and I/O at the expense of CPU. Selecting the right codec is critical for balancing storage savings, query performance, and compute cost.


Compression Comparison

Codec Metrics

| Codec  | Compression | Decompression | Ratio | Speed     | Best For           |
|--------|-------------|---------------|-------|-----------|--------------------|
| Snappy | Low         | Very Fast     | 2-3x  | Very Fast | General purpose    |
| ZSTD   | Medium      | Fast          | 3-5x  | Fast      | Balance            |
| GZIP   | High        | Slow          | 4-6x  | Slow      | Archive, cold data |
| LZ4    | Very Low    | Very Fast     | 2x    | Very Fast | Real-time          |
| Brotli | Very High   | Medium        | 5-8x  | Medium    | Web, text          |
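
For a rough, hands-on feel for these trade-offs outside Spark, the sketch below compresses one sample payload with each codec's Python binding. This assumes the python-snappy, zstandard, lz4, and brotli packages are installed (gzip is in the standard library); ratios and timings on real data will differ from the table above.

# Standalone codec comparison on a sample payload (illustrative only)
import gzip
import time

import brotli        # brotli package
import lz4.frame     # lz4 package
import snappy        # python-snappy package
import zstandard     # zstandard package

# Repetitive, log-like sample data; real datasets compress differently
data = b"timestamp=2024-01-01T00:00:00Z level=INFO msg=request served bytes=1024\n" * 50_000

codecs = {
    "snappy": snappy.compress,
    "zstd-3": zstandard.ZstdCompressor(level=3).compress,
    "gzip-6": lambda d: gzip.compress(d, compresslevel=6),
    "lz4":    lz4.frame.compress,
    "brotli": brotli.compress,
}

for name, compress in codecs.items():
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name:8s} ratio={len(data) / len(compressed):5.1f}x time={elapsed * 1000:7.1f} ms")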

Cost vs. Performance

Higher-ratio codecs (GZIP, Brotli, high ZSTD levels) save storage and I/O but spend more CPU, while faster codecs (Snappy, LZ4) do the reverse, so the right choice depends on how often the data is read and how latency-sensitive those reads are.


Compression by Format

Parquet Compression

# Parquet compression with Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ParquetCompression") \
    .getOrCreate()

# Source data to write with different codecs
df = spark.read.json("s3://bucket/data/*.json")

# Snappy (default, balanced)
df.write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("s3://bucket/output/snappy/")

# ZSTD (recommended)
df.write \
    .mode("overwrite") \
    .option("compression", "zstd") \
    .option("zstd.level", 3) \
    .parquet("s3://bucket/output/zstd/")

# GZIP (archive)
df.write \
    .mode("overwrite") \
    .option("compression", "gzip") \
    .parquet("s3://bucket/output/gzip/")

# LZ4 (fastest)
df.write \
    .mode("overwrite") \
    .option("compression", "lz4") \
    .parquet("s3://bucket/output/lz4/")

Delta Lake Compression

# Delta Lake compression
from pyspark.sql import SparkSession
from delta import *

spark = SparkSession.builder \
    .appName("DeltaCompression") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Write with compression
df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("compression", "zstd") \
    .save("s3://bucket/delta/customers/")

# Change compression on an existing table: Delta data files are Parquet,
# so set the Parquet codec, then compact so the data files are rewritten with it
from delta.tables import DeltaTable

spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
delta_table = DeltaTable.forPath(spark, "s3://bucket/delta/customers/")

# Rewrite (compact) data files, producing ZSTD-compressed Parquet
delta_table.optimize().executeCompaction()
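
To confirm that a recompression actually shrank the table, Delta's DESCRIBE DETAIL reports the current snapshot's file count and size (a sketch using the example path above). Note that compaction does not delete the old files immediately; object-store usage only drops once VACUUM removes them.

# Inspect the Delta table footprint before/after recompression
detail = spark.sql("DESCRIBE DETAIL delta.`s3://bucket/delta/customers/`")
detail.select("numFiles", "sizeInBytes").show()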

Compression Analysis

Benchmark Comparison

# Compression benchmark
import time
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("CompressionBenchmark") \
    .getOrCreate()

def benchmark_compression(
    df,
    uncompressed_size: int,  # size of the source data in bytes, used for the ratio
    codecs: list = ["snappy", "zstd", "gzip", "lz4"]
) -> pd.DataFrame:
    """Benchmark compression codecs"""
    results = []
    for codec in codecs:
        # Write with codec
        path = f"s3://bucket/temp/{codec}/"
        start = time.time()
        df.write \
            .mode("overwrite") \
            .option("compression", codec) \
            .parquet(path)
        write_time = time.time() - start

        # Get file size (get_directory_size: helper that sums file sizes under the path)
        file_size = get_directory_size(path)

        # Read and measure time
        start = time.time()
        spark.read.parquet(path).count()
        read_time = time.time() - start

        results.append({
            'codec': codec,
            'write_time': write_time,
            'read_time': read_time,
            'file_size_mb': file_size / (1024 * 1024),
            'compression_ratio': uncompressed_size / file_size
        })
    return pd.DataFrame(results)

# Example results (typical)
# codec    write_time  read_time  file_size_mb  compression_ratio
# snappy          120         45          1024               2.5x
# zstd            150         50           768               3.3x
# gzip            200         80           640               4.0x
# lz4             100         40          1280               2.0x
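
The benchmark relies on a get_directory_size helper that is not defined above. One possible implementation for s3:// paths, assuming boto3 is installed and credentials are configured, is sketched below.

# Possible get_directory_size helper for s3:// paths (assumption: boto3 available)
from urllib.parse import urlparse

import boto3

def get_directory_size(s3_path: str) -> int:
    """Sum object sizes (in bytes) under an s3://bucket/prefix path."""
    parsed = urlparse(s3_path)
    bucket, prefix = parsed.netloc, parsed.path.lstrip("/")
    s3 = boto3.client("s3")
    total = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total += obj["Size"]
    return total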

Compression Strategy

Hot/Warm/Cold Tiering

Tier the codec to how often the data is read: hot data favors fast decompression (Snappy, LZ4, low ZSTD levels), warm data a balanced codec (ZSTD around level 3), and cold data maximum compression (GZIP or high ZSTD levels).

Strategy by Use Case

| Use Case            | Recommended Codec     | Rationale                      |
|---------------------|-----------------------|--------------------------------|
| Real-time analytics | Snappy, LZ4           | Fast decompression             |
| Data warehouse      | ZSTD (level 3)        | Balance of speed and size      |
| Archive             | GZIP, ZSTD (level 22) | Maximum compression            |
| Machine learning    | Snappy                | Fast training data loading     |
| Log data            | ZSTD                  | Good compression, decent speed |
| JSON/XML            | GZIP, Brotli          | Text compresses well           |
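
One way to keep these recommendations consistent across jobs is a small lookup helper that mirrors the table; the names and mapping below are hypothetical, not a library API.

# Hypothetical codec lookup mirroring the use-case table above
CODEC_BY_USE_CASE = {
    "real_time":        {"codec": "lz4"},
    "data_warehouse":   {"codec": "zstd", "level": 3},
    "archive":          {"codec": "zstd", "level": 22},
    "machine_learning": {"codec": "snappy"},
    "logs":             {"codec": "zstd", "level": 3},
    "json_xml":         {"codec": "gzip"},
}

def writer_for(df, use_case: str):
    """Return a DataFrameWriter configured with the recommended codec."""
    cfg = CODEC_BY_USE_CASE[use_case]
    writer = df.write.mode("overwrite").option("compression", cfg["codec"])
    if cfg["codec"] == "zstd" and "level" in cfg:
        writer = writer.option("zstd.level", cfg["level"])
    return writer

# Example: data warehouse table written with ZSTD level 3
writer_for(df, "data_warehouse").parquet("s3://bucket/warehouse/orders/")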

ZSTD Levels

Level Selection

# ZSTD level tuning
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ZSTDLevels") \
    .getOrCreate()

# ZSTD levels: 1-22
# Lower  = faster, less compression
# Higher = slower, more compression
zstd_configs = {
    "fast": {"level": 1, "description": "Fastest, lowest compression"},
    "default": {"level": 3, "description": "Default, balanced"},
    "high": {"level": 10, "description": "High compression"},
    "max": {"level": 22, "description": "Maximum compression, slowest"}
}

def write_with_zstd(df, path: str, level: int = 3):
    """Write with ZSTD level"""
    df.write \
        .mode("overwrite") \
        .option("compression", "zstd") \
        .option("zstd.level", level) \
        .parquet(path)

# Example: Fast compression
write_with_zstd(df, "s3://bucket/output/fast/", level=1)

# Example: Maximum compression
write_with_zstd(df, "s3://bucket/output/max/", level=22)

Level Recommendations

| Data Type | Recommended Level | Reason          |
|-----------|-------------------|-----------------|
| Hot data  | 1-3               | Fast queries    |
| Warm data | 3-7               | Balanced        |
| Cold data | 10-22             | Storage savings |
| Streaming | 1-3               | Low latency     |
| Batch ETL | 3-7               | Balanced        |
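
To check how levels behave on your own data before committing to one, a standalone run with the zstandard Python package (an assumption; this is separate from Spark) on a sample payload shows the speed/size trade-off directly.

# Compare ZSTD levels on a sample payload (assumption: zstandard package installed)
import time

import zstandard

sample = b'{"user_id": 12345, "event": "click", "page": "/home"}\n' * 100_000

for level in (1, 3, 10, 22):
    compressor = zstandard.ZstdCompressor(level=level)
    start = time.perf_counter()
    compressed = compressor.compress(sample)
    elapsed = time.perf_counter() - start
    print(f"level={level:2d} ratio={len(sample) / len(compressed):5.1f}x "
          f"time={elapsed * 1000:7.1f} ms")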

Compression Best Practices

DO

  1. Use ZSTD for most data: the best balance of speed and size.
  2. Use Snappy for hot data: the fastest decompression.
  3. Use GZIP for cold data: maximum compression.
  4. Monitor compression ratios to track storage savings (see the sketch below).
  5. Benchmark your own data: compression varies by data type.
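
For point 4, a lightweight check can compare the stored footprint against a known uncompressed baseline and flag datasets that compress worse than expected. A sketch, reusing the hypothetical get_directory_size helper from the benchmark section:

# Flag datasets whose compression ratio falls below a target (illustrative)
def check_compression_ratio(path: str, uncompressed_bytes: int, min_ratio: float = 2.0) -> float:
    """Warn when the stored size implies a ratio below min_ratio."""
    stored = get_directory_size(path)  # hypothetical helper defined earlier
    ratio = uncompressed_bytes / stored if stored else 0.0
    if ratio < min_ratio:
        print(f"WARNING: {path} compresses at {ratio:.1f}x, below the {min_ratio}x target")
    return ratio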

DON’T

  1. Don't use GZIP for hot data: it is too slow for interactive queries.
  2. Don't ignore CPU cost: compression adds compute overhead.
  3. Don't use the wrong codec for a format: some formats have optimal codecs.
  4. Don't forget decompression speed: every query has to decompress what it reads.
  5. Don't compress already-compressed data such as images, videos, or compressed files (see the sketch below).
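
For point 5, a simple pre-check by file extension can keep already-compressed inputs out of the compression path; the extension list below is illustrative, not exhaustive.

# Skip recompression for inputs that are already compressed (illustrative list)
ALREADY_COMPRESSED = {".gz", ".zst", ".snappy", ".br", ".zip", ".jpg", ".jpeg", ".png", ".mp4"}

def is_already_compressed(filename: str) -> bool:
    """True if the extension indicates an already-compressed format."""
    return any(filename.lower().endswith(ext) for ext in ALREADY_COMPRESSED)

print(is_already_compressed("events.json"))  # False - worth compressing
print(is_already_compressed("photo.jpg"))    # True  - store as-is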

Key Takeaways

  1. Snappy: Fastest, lowest compression (2-3x)
  2. ZSTD: Best balance, 3-5x compression, tunable levels
  3. GZIP: Highest compression (4-6x), slowest
  4. Hot data: Use Snappy or ZSTD level 1-3
  5. Cold data: Use GZIP or ZSTD level 10-22
  6. Parquet: ZSTD recommended
  7. Delta Lake: ZSTD level 3 default
  8. Use when: any data stored at rest; pick the codec that balances storage savings against compute cost

Back to Module 7