Storage Formats Deep Dive
Parquet, ORC, Avro - When to Use Each
Overview
Storage format selection is one of the most impactful decisions for data platform cost and performance. This document provides a deep dive into the three major formats: Parquet, ORC, and Avro, including compression algorithms, encoding techniques, and selection criteria.
Quick Comparison
| Format | Type | Best For | Ecosystem | Compression Ratio |
|---|---|---|---|---|
| Parquet | Columnar | Analytics, BI, ML | Universal | 10-100x vs. CSV |
| ORC | Columnar | Hive, Presto/Trino | Hadoop ecosystem | 10-100x vs. CSV |
| Avro | Row-based | Streaming, schema evolution | Kafka, Hadoop | 5-30x vs. CSV |
| JSON/CSV | Row/text | Debugging, interchange | Universal | 1x (baseline) |
The Columnar Advantage
Why Columnar Wins for Analytics
| Scenario | Row-Based | Columnar | Speedup |
|---|---|---|---|
| SELECT * FROM table | Similar | Similar | 1x |
| SELECT col1 FROM table | Read all | Read col1 only | 10-50x |
| WHERE col2 > 100 | Read all | Skip col1, col3 | 10-100x |
| AGG col3 GROUP BY col1 | Read all | Read col1, col3 only | 5-20x |
Result: For typical analytical queries (projection + filtering), columnar formats provide 10-100x performance improvement.
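To see why, compare a full read with a projected read. The sketch below uses pyarrow with a hypothetical file and column name; in the second case only the requested column's chunks are fetched and decoded:

```python
import pyarrow.parquet as pq

# Full read: every column chunk in the file is fetched and decoded
full = pq.read_table("events.parquet")

# Projected read: only col1's chunks are fetched and decoded
projected = pq.read_table("events.parquet", columns=["col1"])

print(full.nbytes, projected.nbytes)  # the projected table is a small fraction of the full one
```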
Apache Parquet
Architecture
Key Concepts
| Concept | Description | Size / Notes |
|---|---|---|
| Row Group | Logical horizontal partition | 128MB-1GB |
| Column Chunk | Single column within row group | Varies |
| Page | Smallest unit of compression | 1MB |
| Statistics | Min/max/null count per column chunk | Enable data skipping |
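These structures live in the file footer and can be inspected directly. A small pyarrow sketch (file name hypothetical) prints the per-row-group statistics that engines use for skipping:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")
print(pf.metadata.num_row_groups, "row groups")

for i in range(pf.metadata.num_row_groups):
    chunk = pf.metadata.row_group(i).column(0)  # first column chunk in each row group
    stats = chunk.statistics                    # may be None if the writer disabled statistics
    if stats is not None:
        print(i, chunk.path_in_schema, stats.min, stats.max, stats.null_count)
```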
Data Skipping in Parquet
Because the footer stores min/max and null-count statistics for every column chunk (and, with page indexes, for individual pages), a reader can compare a query predicate against those ranges and skip entire row groups that cannot possibly match.
Result: 50-90% I/O reduction for selective queries.
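For example, when the predicate is expressed to the reader rather than applied after loading, row groups whose statistics exclude it are never read. A minimal pyarrow sketch (file and column names are hypothetical):

```python
import pyarrow.parquet as pq

# Row groups whose statistics show col2 never exceeds 100 are skipped entirely
table = pq.read_table("events.parquet", filters=[("col2", ">", 100)])
print(table.num_rows)
```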
Parquet Compression
| Codec | Compression Ratio | Speed | Use Case |
|---|---|---|---|
| Uncompressed | 1x | Fastest | Testing |
| Snappy | 5-10x | Fast | Default (balanced) |
| GZIP | 10-20x | Slow | Storage-sensitive |
| ZSTD | 10-15x | Medium | Recommended (best balance) |
| LZO | 5-8x | Fast | Legacy systems |
Recommendation: Use ZSTD for new projects (better compression than Snappy with similar speed).
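At write time the codec is a single setting. A minimal pyarrow sketch (table contents and file names are placeholders; Spark equivalents appear in the gotchas section below):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "status": ["ok", "ok", "err"]})

# Balanced default
pq.write_table(table, "events_snappy.parquet", compression="snappy")

# Better ratio at similar read speed; level 3 is a common starting point
pq.write_table(table, "events_zstd.parquet", compression="zstd", compression_level=3)
```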
Parquet Encoding
| Encoding | Best For | Description |
|---|---|---|
| PLAIN | All | Simple storage |
| Dictionary | Low cardinality (< ~1,000 distinct values) | Replaces values with dictionary indexes; often 10-100x on such columns |
| RLE | Repeated values | Run-length encoding |
| Delta Encoding | Sorted, monotonic | Store differences |
| Bit-Packing | Small integers | Pack multiple values per byte |
Impact: Automatic encoding typically adds 20-50% additional compression on top of the codec.
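Writers pick encodings automatically, but the effect is easy to observe. A synthetic pyarrow sketch comparing dictionary encoding on a low-cardinality column (file names are placeholders):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Low-cardinality column: ~1M rows, only 3 distinct values
table = pa.table({"status": ["ok", "warn", "err"] * 333_334})

pq.write_table(table, "status_dict.parquet", use_dictionary=True)
pq.write_table(table, "status_plain.parquet", use_dictionary=False)

# The dictionary-encoded file is typically a small fraction of the plain-encoded one
print(os.path.getsize("status_dict.parquet"), os.path.getsize("status_plain.parquet"))
```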
When to Use Parquet
Use Parquet when:
- Analytical queries (projection, filtering, aggregation)
- Columnar access patterns
- Need compression + performance
- Ecosystem support (all engines support it)
- ML workloads (read columns into DataFrames)
Avoid Parquet when:
- Row-heavy operations (full table scans with all columns)
- Write-heavy workloads (columnar has write overhead)
- Small files (< 128MB, overhead not worth it)
Optimized Row Columnar (ORC)
Architecture
Key Concepts
| Concept | Description | Size / Notes |
|---|---|---|
| Stripe | Logical horizontal partition | 64MB-256MB |
| Row Group | Index stride within a stripe | 10,000 rows (default) |
| Index Data | Min/max, sum, bloom filters | Enable skipping |
| Row Data | Actual column values | Compressed |
ORC vs. Parquet
| Feature | Parquet | ORC |
|---|---|---|
| Ecosystem | Universal | Hive/Presto focused |
| Compression | Comparable (codec-dependent) | Comparable (ZLIB often slightly smaller) |
| Predicate pushdown | Excellent | Excellent (bloom filters) |
| Nested data | Excellent support | Good support |
| ACID transactions | Via open table formats (OTF: Iceberg, Delta, Hudi) | Native Hive ACID |
| Write speed | Faster | Slower |
| Read speed | Similar | Similar (slightly faster in Hive) |
| Tool support | All tools | Hive, Presto, Spark |
ORC Compression
| Codec | Compression Ratio | Speed | Use Case |
|---|---|---|---|
| None | 1x | Fastest | Testing |
| Snappy | 5-10x | Fast | Default |
| ZLIB | 15-25x | Slow | Storage-sensitive |
| LZO | 5-8x | Fast | Legacy |
Recommendation: Snappy for balance, ZLIB for storage-sensitive workloads.
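A sketch of a PySpark ORC write combining ZLIB with a bloom filter on the lookup column (`df`, the path, and the column name are placeholders; `orc.bloom.filter.columns` is the standard ORC writer property, passed through as a data source option):

```python
# Hive/Trino-oriented table: ZLIB for storage, bloom filter for point lookups
(df.write
   .format("orc")
   .option("compression", "zlib")
   .option("orc.bloom.filter.columns", "user_id")  # lets readers skip stripes on user_id lookups
   .save("/warehouse/events_orc"))
```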
When to Use ORC
Use ORC when:
- Using Hive or Presto/Trino as primary engine
- Need Hive ACID transactions (without OTF)
- Working in Hadoop ecosystem
- Need built-in bloom filters
Avoid ORC when:
- Need broad ecosystem support (Parquet more universal)
- Working with Databricks (optimizes for Parquet)
- Need nested data support (Parquet better)
Apache Avro
Architecture
Key Concepts
| Concept | Description | Benefit |
|---|---|---|
| Schema-first | Schema embedded or separate | Schema evolution without data rewrite |
| Row-based | Store records sequentially | Fast writes, full-row reads |
| Binary encoding | Compact binary format | Efficient storage |
| RPC support | Built-in remote procedure call | Service communication |
Schema Evolution
Avro supports seamless schema evolution:
| Change | Forward Compatible | Backward Compatible | Action |
|---|---|---|---|
| Add field with default | Yes | Yes | Safe |
| Add field without default | Yes | No | Upgrade writers first; readers on the new schema cannot read old data |
| Remove field | No | Yes | Upgrade readers first; readers on the old schema cannot read new data |
| Rename field | No | No | Use aliases |
| Change type | Promotions only | Promotions only | Only Avro type promotions are safe (e.g., int → long → float → double) |
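A minimal sketch of the "add field with default" case using fastavro (record and field names are illustrative): data written with the old schema stays readable under the new schema because the reader fills in the default.

```python
from fastavro import writer, reader, parse_schema

old_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "user", "type": "string"}],
})

new_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "user", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": None},  # added with a default
    ],
})

with open("users.avro", "wb") as fo:
    writer(fo, old_schema, [{"user": "alice"}], codec="deflate")

with open("users.avro", "rb") as fo:
    for record in reader(fo, reader_schema=new_schema):
        print(record)  # {'user': 'alice', 'age': None}
```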
Avro Compression
| Codec | Compression Ratio | Speed | Use Case |
|---|---|---|---|
| None | 1x | Fastest | Testing |
| Snappy | 3-8x | Fast | Default |
| Deflate | 5-15x | Medium | Storage-sensitive |
| Bzip2 | 10-20x | Slow | Maximum compression |
Note: Avro generally compresses less than Parquet/ORC because its row-oriented layout interleaves values from different columns, giving the codec less repetition to exploit.
When to Use Avro
Use Avro when:
- Streaming data (Kafka, Kinesis, Pulsar)
- Need schema evolution without data rewrite
- Row-based access patterns (full row reads)
- RPC between services
- Fast writes are critical
- Schema registry in use (Confluent Schema Registry)
Avoid Avro when:
- Analytical queries (columnar better)
- Need column-level access
- Need high compression ratios
- Large files (columnar wins for analytics)
Format Selection Decision Tree
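The guidance above and below boils down to a few questions. A rough helper capturing that logic (a sketch of this document's recommendations, not an exhaustive rule set):

```python
def pick_format(workload: str, primary_engine: str = "spark") -> str:
    """Rough format selection following the guidance in this document."""
    if workload in ("streaming", "rpc", "cdc"):
        return "avro"     # schema evolution, fast row-oriented writes
    if primary_engine in ("hive", "presto", "trino"):
        return "orc"      # native Hive ACID, built-in bloom filters
    return "parquet"      # default for analytics: universal, columnar, ZSTD

print(pick_format("streaming"))           # avro
print(pick_format("analytics", "hive"))   # orc
print(pick_format("analytics"))           # parquet
```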
Performance Benchmarks
Storage Comparison
| Format | Storage Size vs. CSV | Write Speed | Read Speed (Full) | Read Speed (Selective) |
|---|---|---|---|---|
| CSV | 1x (baseline) | Fast | Fast | Fast |
| JSON | 1.5-2x | Medium | Medium | Medium |
| Avro (Snappy) | 0.2-0.3x | Fast | Fast | Medium (no column pruning) |
| Parquet (Snappy) | 0.1-0.2x | Medium | Medium | Very Fast |
| Parquet (ZSTD) | 0.08-0.15x | Medium | Medium | Very Fast |
| ORC (ZLIB) | 0.05-0.1x | Slow | Medium | Very Fast |
Lower storage = better. Parquet/ZSTD typically best balance.
Query Performance
| Query Type | CSV | Avro | Parquet | ORC |
|---|---|---|---|---|
| SELECT * | 1.0x | 0.9x | 0.8x | 0.8x |
| SELECT 3 cols | 1.0x | 0.9x | 0.3x | 0.3x |
| WHERE col > 100 | 1.0x | 1.0x | 0.1x | 0.1x |
| AGG GROUP BY | 1.0x | 0.9x | 0.2x | 0.2x |
| Point lookup | 1.0x | 0.8x | 0.5x | 0.5x |
Lower = faster. Parquet/ORC dominate for analytical queries.
Cost Implications
Storage Cost Example
Scenario: 100TB raw data
| Format | Storage | Monthly Cost (S3) | Annual Savings |
|---|---|---|---|
| CSV | 100TB | $2,300 | $0 (baseline) |
| JSON | 150TB | $3,450 | -$13,800 |
| Avro (Snappy) | 25TB | $575 | +$20,700 |
| Parquet (Snappy) | 15TB | $345 | +$23,460 |
| Parquet (ZSTD) | 10TB | $230 | +$24,840 |
Result: Parquet/ZSTD saves $24,840 annually per 100TB vs. CSV.
Query Cost Example
Scenario: 10,000 queries/month, 1TB data scanned average
| Format | Scan Size | Cost/Query | Monthly Cost | Annual Savings |
|---|---|---|---|---|
| CSV | 1TB | $5.00 | $50,000 | $0 (baseline) |
| Parquet | 0.1TB (10x skipping) | $0.50 | $5,000 | +$540,000 |
Result: Combined storage + query savings = $560,000+ annually per 100TB.
Best Practices
File Sizing
| File Size | Read Performance | Metadata Overhead | Recommendation |
|---|---|---|---|
| < 64MB | Poor (many small reads) | High | Too small |
| 128MB-512MB | Good | Low | Optimal for most |
| 512MB-1GB | Excellent | Very Low | Good for large scans |
| > 1GB | No additional benefit | Very Low | Acceptable, but limits read parallelism |
Target: Aim for 256MB Parquet files for most workloads.
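Two common knobs in Spark for hitting that target (the record count is a placeholder you would derive from observed bytes per row): cap records per output file, or fix the number of output files.

```python
# Cap records per output file so files land near the 256MB target
(df.write
   .option("maxRecordsPerFile", 5_000_000)  # tune from observed bytes/row
   .parquet("/data/events"))

# Or control the file count directly
df.repartition(64).write.parquet("/data/events")
```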
Compression Selection
| Scenario | Recommended |
|---|---|
| Default | Parquet + ZSTD |
| Max compression | Parquet + ZSTD-19 or ORC + ZLIB |
| Fast writes | Parquet + Snappy |
| Streaming | Avro + Snappy |
| Debug/landing | JSON (convert immediately) |
Schema Design
DO:
- Use appropriate types (INT vs. BIGINT, DECIMAL vs. DOUBLE)
- Leverage dictionary encoding (low cardinality columns)
- Partition high-cardinality, filter-heavy columns
- Use nested types appropriately (Parquet handles well)
DON’T:
- Store everything as STRING
- Use JSON within Parquet strings (parse to struct)
- Over-partition (see partitioning module)
- Store timestamps as strings (use TIMESTAMP type)
Senior Level Gotchas
Gotcha 1: JSON in Parquet Strings
Problem: Storing JSON as string in Parquet column.
```python
# Bad: JSON kept as a raw string column (schema: id INT, data STRING)
# data contains: '{"user": "alice", "age": 30}'
df.write.parquet("/data")
```
Issue: No predicate pushdown, no compression benefit, and individual fields can't be queried.
Solution: Parse JSON to struct:
```python
from pyspark.sql.functions import from_json, col

# Good: parse the JSON into a typed struct before writing
parsed = df.withColumn("data", from_json(col("data"), "user STRING, age INT"))
parsed.write.parquet("/data")
# Can now: SELECT data.user, data.age FROM table WHERE data.age > 25
```
Gotcha 2: Wrong Type Selection
Problem: Using STRING for everything.
Impact:
- 2-5x storage vs. proper types
- No type-specific optimization
- Requires parsing in queries
Solution: Use proper types (see the sketch after this list):
- INT vs. BIGINT (storage savings)
- DECIMAL for money (not DOUBLE)
- TIMESTAMP for dates (not STRING)
- DATE for dates without a time component (4 bytes vs. 8 for TIMESTAMP)
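A minimal PySpark sketch of these fixes (`raw_df` and the column names are illustrative):

```python
from pyspark.sql.functions import col, to_date, to_timestamp

typed = (raw_df
    .withColumn("order_id",   col("order_id").cast("bigint"))
    .withColumn("amount",     col("amount").cast("decimal(12,2)"))  # money: DECIMAL, not DOUBLE
    .withColumn("event_time", to_timestamp("event_time"))           # TIMESTAMP, not STRING
    .withColumn("event_date", to_date("event_time")))               # DATE when time-of-day is not needed
```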
Gotcha 3: Not Configuring Compression
Problem: Default settings are often not optimal.
Example: Spark default is Snappy, but ZSTD is better for most workloads.
Solution: Configure per workload:
```python
# For hot data (frequent access): Snappy
df.write.option("compression", "snappy").parquet("/hot")

# For warm data: ZSTD level 3 (the level is the parquet-mr property, passed through as a writer option)
df.write.option("compression", "zstd").option("parquet.compression.codec.zstd.level", "3").parquet("/warm")

# For cold data: ZSTD level 19
df.write.option("compression", "zstd").option("parquet.compression.codec.zstd.level", "19").parquet("/cold")
```
Gotcha 4: Small Files from Streaming
Problem: Each micro-batch creates small files → thousands of files.
Solution: Configure file size and compaction:
```python
# Row group size is a separate knob (the parquet-mr property parquet.block.size)
# Spark Structured Streaming: fewer, larger files per micro-batch
(df.writeStream
   .format("parquet")
   .option("path", "/data")
   .option("checkpointLocation", "/checkpoint")
   .option("parquet.block.size", 256 * 1024 * 1024)  # 256MB row groups
   .trigger(processingTime="10 minutes")             # longer trigger interval -> larger output files
   .start())
```
Even with a longer trigger interval, plan a periodic compaction job that rewrites accumulated small files into ~256MB files (or rely on a table format's built-in compaction).
Key Takeaways
- Parquet is default choice: Best for analytics, universal support, excellent compression
- Columnar wins for analytics: 10-100x performance for typical queries
- Avro for streaming: Schema evolution, row-based, fast writes
- ORC for Hive/Presto: Good alternative, but less universal
- Compression matters: ZSTD recommended for new projects
- File size critical: Target 256MB-512MB Parquet files
- Type selection matters: Proper types save storage and improve performance
Next: Partitioning Strategies