Storage Formats Deep Dive
Parquet, ORC, Avro - When to Use Each
Overview
Storage format selection is one of the most impactful decisions for data platform cost and performance. This document provides a deep dive into the three major formats: Parquet, ORC, and Avro, including compression algorithms, encoding techniques, and selection criteria.
Quick Comparison
| Format | Type | Best For | Ecosystem | Compression Ratio |
|---|---|---|---|---|
| Parquet | Columnar | Analytics, BI, ML | Universal | 10-100x vs. CSV |
| ORC | Columnar | Hive, Presto/Trino | Hadoop ecosystem | 10-100x vs. CSV |
| Avro | Row-based | Streaming, schema evolution | Kafka, Hadoop | 5-30x vs. CSV |
| JSON/CSV | Row/text | Debugging, interchange | Universal | 1x (baseline) |
The Columnar Advantage
Why Columnar Wins for Analytics
| Scenario | Row-Based | Columnar | Speedup |
|---|---|---|---|
| SELECT * FROM table | Similar | Similar | 1x |
| SELECT col1 FROM table | Read all | Read col1 only | 10-50x |
| WHERE col2 > 100 | Read all | Skip col1, col3 | 10-100x |
| AGG col3 GROUP BY col1 | Read all | Read col1, col3 only | 5-20x |
Result: For typical analytical queries (projection + filtering), columnar formats provide 10-100x performance improvement.
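To see why, compare a full read with a projected read. The sketch below uses pyarrow with a hypothetical file and column name; in the second case only the requested column's chunks are fetched and decoded:

```python
import pyarrow.parquet as pq

# Full read: every column chunk in the file is fetched and decoded
full = pq.read_table("events.parquet")

# Projected read: only col1's chunks are fetched and decoded
projected = pq.read_table("events.parquet", columns=["col1"])

print(full.nbytes, projected.nbytes)  # the projected table is a small fraction of the full one
```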
Apache Parquet
Architecture
Key Concepts
| Concept | Description | Size / Notes |
|---|---|---|
| Row Group | Logical horizontal partition | 128MB-1GB |
| Column Chunk | Single column within row group | Varies |
| Page | Smallest unit of compression | 1MB |
| Statistics | Min/max/null count per column chunk | Enable data skipping |
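These structures live in the file footer and can be inspected directly. A small pyarrow sketch (file name hypothetical) prints the per-row-group statistics that engines use for skipping:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")
print(pf.metadata.num_row_groups, "row groups")

for i in range(pf.metadata.num_row_groups):
    chunk = pf.metadata.row_group(i).column(0)  # first column chunk in each row group
    stats = chunk.statistics                    # may be None if the writer disabled statistics
    if stats is not None:
        print(i, chunk.path_in_schema, stats.min, stats.max, stats.null_count)
```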
Data Skipping in Parquet
Because the footer stores min/max and null-count statistics for every column chunk (and, with page indexes, for individual pages), a reader can compare a query predicate against those ranges and skip entire row groups that cannot possibly match.
Result: 50-90% I/O reduction for selective queries.
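For example, when the predicate is expressed to the reader rather than applied after loading, row groups whose statistics exclude it are never read. A minimal pyarrow sketch (file and column names are hypothetical):

```python
import pyarrow.parquet as pq

# Row groups whose statistics show col2 never exceeds 100 are skipped entirely
table = pq.read_table("events.parquet", filters=[("col2", ">", 100)])
print(table.num_rows)
```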
Parquet Compression
| Codec | Compression Ratio | Speed | Use Case |
|---|---|---|---|
| Uncompressed | 1x | Fastest | Testing |
| Snappy | 5-10x | Fast | Default (balanced) |
| GZIP | 10-20x | Slow | Storage-sensitive |
| ZSTD | 10-15x | Medium | Recommended (best balance) |
| LZO | 5-8x | Fast | Legacy systems |
Recommendation: Use ZSTD for new projects (better compression than Snappy with similar speed).
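At write time the codec is a single setting. A minimal pyarrow sketch (table contents and file names are placeholders; Spark equivalents appear in the gotchas section below):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "status": ["ok", "ok", "err"]})

# Balanced default
pq.write_table(table, "events_snappy.parquet", compression="snappy")

# Better ratio at similar read speed; level 3 is a common starting point
pq.write_table(table, "events_zstd.parquet", compression="zstd", compression_level=3)
```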
Parquet Encoding
| Encoding | Best For | Description |
|---|---|---|
| PLAIN | All | Simple storage |
| Dictionary | Low cardinality (< ~1,000 distinct values) | Replaces values with dictionary indexes; often 10-100x on such columns |
| RLE | Repeated values | Run-length encoding |
| Delta Encoding | Sorted, monotonic | Store differences |
| Bit-Packing | Small integers | Pack multiple values per byte |
Impact: Automatic encoding typically adds 20-50% additional compression on top of the codec.
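Writers pick encodings automatically, but the effect is easy to observe. A synthetic pyarrow sketch comparing dictionary encoding on a low-cardinality column (file names are placeholders):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Low-cardinality column: ~1M rows, only 3 distinct values
table = pa.table({"status": ["ok", "warn", "err"] * 333_334})

pq.write_table(table, "status_dict.parquet", use_dictionary=True)
pq.write_table(table, "status_plain.parquet", use_dictionary=False)

# The dictionary-encoded file is typically a small fraction of the plain-encoded one
print(os.path.getsize("status_dict.parquet"), os.path.getsize("status_plain.parquet"))
```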
When to Use Parquet
Use Parquet when:
- Analytical queries (projection, filtering, aggregation)
- Columnar access patterns
- Need compression + performance
- Ecosystem support (all engines support it)
- ML workloads (read columns into DataFrames)
Avoid Parquet when:
- Row-heavy operations (full table scans with all columns)
- Write-heavy workloads (columnar has write overhead)
- Small files (< 128MB, overhead not worth it)
Optimized Row Columnar (ORC)
Architecture
Key Concepts
| Concept | Description | Size / Notes |
|---|---|---|
| Stripe | Logical horizontal partition | 64MB-256MB |
| Row Group | Index stride within a stripe | 10,000 rows (default) |
| Index Data | Min/max, sum, bloom filters | Enable skipping |
| Row Data | Actual column values | Compressed |
ORC vs. Parquet
| Feature | Parquet | ORC |
|---|---|---|
| Ecosystem | Universal | Hive/Presto focused |
| Compression | Comparable (codec-dependent) | Comparable (ZLIB often slightly smaller) |
| Predicate pushdown | Excellent | Excellent (bloom filters) |
| Nested data | Excellent support | Good support |
| ACID transactions | Via open table formats (OTF: Iceberg, Delta, Hudi) | Native Hive ACID |
| Write speed | Faster | Slower |
| Read speed | Similar | Similar (slightly faster in Hive) |
| Tool support | All tools | Hive, Presto, Spark |
ORC Compression
| Codec | Compression Ratio | Speed | Use Case |
|---|---|---|---|
| None | 1x | Fastest | Testing |
| Snappy | 5-10x | Fast | Default |
| ZLIB | 15-25x | Slow | Storage-sensitive |
| LZO | 5-8x | Fast | Legacy |
Recommendation: Snappy for balance, ZLIB for storage-sensitive workloads.
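A sketch of a PySpark ORC write combining ZLIB with a bloom filter on the lookup column (`df`, the path, and the column name are placeholders; `orc.bloom.filter.columns` is the standard ORC writer property, passed through as a data source option):

```python
# Hive/Trino-oriented table: ZLIB for storage, bloom filter for point lookups
(df.write
   .format("orc")
   .option("compression", "zlib")
   .option("orc.bloom.filter.columns", "user_id")  # lets readers skip stripes on user_id lookups
   .save("/warehouse/events_orc"))
```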
When to Use ORC
Use ORC when:
- Using Hive or Presto/Trino as primary engine
- Need Hive ACID transactions (without OTF)
- Working in Hadoop ecosystem
- Need built-in bloom filters
Avoid ORC when:
- Need broad ecosystem support (Parquet more universal)
- Working with Databricks (optimizes for Parquet)
- Need nested data support (Parquet better)
Apache Avro
Architecture
Key Concepts
| Concept | Description | Benefit |
|---|---|---|
| Schema-first | Schema embedded or separate | Schema evolution without data rewrite |
| Row-based | Store records sequentially | Fast writes, full-row reads |
| Binary encoding | Compact binary format | Efficient storage |
| RPC support | Built-in remote procedure call | Service communication |
Schema Evolution
Avro supports seamless schema evolution:
| Change | Forward Compatible | Backward Compatible | Action |
|---|---|---|---|
| Add field with default | Yes | Yes | Safe |
| Add field without default | Yes | No | Upgrade writers first; readers on the new schema cannot read old data |
| Remove field | No | Yes | Upgrade readers first; readers on the old schema cannot read new data |
| Rename field | No | No | Use aliases |
| Change type | Promotions only | Promotions only | Only Avro type promotions are safe (e.g., int → long → float → double) |
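A minimal sketch of the "add field with default" case using fastavro (record and field names are illustrative): data written with the old schema stays readable under the new schema because the reader fills in the default.

```python
from fastavro import writer, reader, parse_schema

old_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "user", "type": "string"}],
})

new_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "user", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": None},  # added with a default
    ],
})

with open("users.avro", "wb") as fo:
    writer(fo, old_schema, [{"user": "alice"}], codec="deflate")

with open("users.avro", "rb") as fo:
    for record in reader(fo, reader_schema=new_schema):
        print(record)  # {'user': 'alice', 'age': None}
```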
Avro Compression
| Codec | Compression Ratio | Speed | Use Case |
|---|---|---|---|
| None | 1x | Fastest | Testing |
| Snappy | 3-8x | Fast | Default |
| Deflate | 5-15x | Medium | Storage-sensitive |
| Bzip2 | 10-20x | Slow | Maximum compression |
Note: Avro generally compresses less than Parquet/ORC because its row-oriented layout interleaves values from different columns, giving the codec less repetition to exploit.
When to Use Avro
Use Avro when:
- Streaming data (Kafka, Kinesis, Pulsar)
- Need schema evolution without data rewrite
- Row-based access patterns (full row reads)
- RPC between services
- Fast writes are critical
- Schema registry in use (Confluent Schema Registry)
Avoid Avro when:
- Analytical queries (columnar better)
- Need column-level access
- Need high compression ratios
- Large files (columnar wins for analytics)
Format Selection Decision Tree
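The guidance above and below boils down to a few questions. A rough helper capturing that logic (a sketch of this document's recommendations, not an exhaustive rule set):

```python
def pick_format(workload: str, primary_engine: str = "spark") -> str:
    """Rough format selection following the guidance in this document."""
    if workload in ("streaming", "rpc", "cdc"):
        return "avro"     # schema evolution, fast row-oriented writes
    if primary_engine in ("hive", "presto", "trino"):
        return "orc"      # native Hive ACID, built-in bloom filters
    return "parquet"      # default for analytics: universal, columnar, ZSTD

print(pick_format("streaming"))           # avro
print(pick_format("analytics", "hive"))   # orc
print(pick_format("analytics"))           # parquet
```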
Performance Benchmarks
Storage Comparison
| Format | Storage Size vs. CSV | Write Speed | Read Speed (Full) | Read Speed (Selective) |
|---|---|---|---|---|
| CSV | 1x (baseline) | Fast | Fast | Fast |
| JSON | 1.5-2x | Medium | Medium | Medium |
| Avro (Snappy) | 0.2-0.3x | Fast | Fast | Medium (no column pruning) |
| Parquet (Snappy) | 0.1-0.2x | Medium | Medium | Very Fast |
| Parquet (ZSTD) | 0.08-0.15x | Medium | Medium | Very Fast |
| ORC (ZLIB) | 0.05-0.1x | Slow | Medium | Very Fast |
Lower storage = better. Parquet/ZSTD typically best balance.
Query Performance
| Query Type | CSV | Avro | Parquet | ORC |
|---|---|---|---|---|
| SELECT * | 1.0x | 0.9x | 0.8x | 0.8x |
| SELECT 3 cols | 1.0x | 0.9x | 0.3x | 0.3x |
| WHERE col > 100 | 1.0x | 1.0x | 0.1x | 0.1x |
| AGG GROUP BY | 1.0x | 0.9x | 0.2x | 0.2x |
| Point lookup | 1.0x | 0.8x | 0.5x | 0.5x |
Lower = faster. Parquet/ORC dominate for analytical queries.
Cost Implications
Storage Cost Example
Scenario: 100TB raw data
| Format | Storage | Monthly Cost (S3) | Annual Savings |
|---|---|---|---|
| CSV | 100TB | $2,300 | $0 (baseline) |
| JSON | 150TB | $3,450 | -$13,800 |
| Avro (Snappy) | 25TB | $575 | +$20,700 |
| Parquet (Snappy) | 15TB | $345 | +$23,460 |
| Parquet (ZSTD) | 10TB | $230 | +$24,840 |
Result: Parquet/ZSTD saves $24,840 annually per 100TB vs. CSV.
Query Cost Example
Scenario: 10,000 queries/month, 1TB data scanned average
| Format | Scan Size | Cost/Query | Monthly Cost | Annual Savings |
|---|---|---|---|---|
| CSV | 1TB | $5.00 | $50,000 | $0 (baseline) |
| Parquet | 0.1TB (10x skipping) | $0.50 | $5,000 | +$540,000 |
Result: Combined storage + query savings = $560,000+ annually per 100TB.
Best Practices
File Sizing
| File Size | Read Performance | Metadata Overhead | Recommendation |
|---|---|---|---|
| < 64MB | Poor (many small reads) | High | Too small |
| 128MB-512MB | Good | Low | Optimal for most |
| 512MB-1GB | Excellent | Very Low | Good for large scans |
| > 1GB | No additional benefit | Very Low | Acceptable, but limits read parallelism |
Target: Aim for 256MB Parquet files for most workloads.
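Two common knobs in Spark for hitting that target (the record count is a placeholder you would derive from observed bytes per row): cap records per output file, or fix the number of output files.

```python
# Cap records per output file so files land near the 256MB target
(df.write
   .option("maxRecordsPerFile", 5_000_000)  # tune from observed bytes/row
   .parquet("/data/events"))

# Or control the file count directly
df.repartition(64).write.parquet("/data/events")
```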
Compression Selection
| Scenario | Recommended |
|---|---|
| Default | Parquet + ZSTD |
| Max compression | Parquet + ZSTD-19 or ORC + ZLIB |
| Fast writes | Parquet + Snappy |
| Streaming | Avro + Snappy |
| Debug/landing | JSON (convert immediately) |
Schema Design
DO:
- Use appropriate types (INT vs. BIGINT, DECIMAL vs. DOUBLE)
- Leverage dictionary encoding (low cardinality columns)
- Partition high-cardinality, filter-heavy columns
- Use nested types appropriately (Parquet handles well)
DON’T:
- Store everything as STRING
- Use JSON within Parquet strings (parse to struct)
- Over-partition (see partitioning module)
- Store timestamps as strings (use TIMESTAMP type)
Senior Level Gotchas
Gotcha 1: JSON in Parquet Strings
Problem: Storing JSON as string in Parquet column.
```python
# Bad: JSON kept as a raw string column (schema: id INT, data STRING)
# data contains: '{"user": "alice", "age": 30}'
df.write.parquet("/data")
```
Issue: No predicate pushdown, no compression benefit, and individual fields can't be queried.
Solution: Parse JSON to struct:
```python
from pyspark.sql.functions import from_json, col

# Good: parse the JSON into a typed struct before writing
parsed = df.withColumn("data", from_json(col("data"), "user STRING, age INT"))
parsed.write.parquet("/data")
# Can now: SELECT data.user, data.age FROM table WHERE data.age > 25
```
Gotcha 2: Wrong Type Selection
Problem: Using STRING for everything.
Impact:
- 2-5x storage vs. proper types
- No type-specific optimization
- Requires parsing in queries
Solution: Use proper types (see the sketch after this list):
- INT vs. BIGINT (storage savings)
- DECIMAL for money (not DOUBLE)
- TIMESTAMP for dates (not STRING)
- DATE for dates without a time component (4 bytes vs. 8 for TIMESTAMP)
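A minimal PySpark sketch of these fixes (`raw_df` and the column names are illustrative):

```python
from pyspark.sql.functions import col, to_date, to_timestamp

typed = (raw_df
    .withColumn("order_id",   col("order_id").cast("bigint"))
    .withColumn("amount",     col("amount").cast("decimal(12,2)"))  # money: DECIMAL, not DOUBLE
    .withColumn("event_time", to_timestamp("event_time"))           # TIMESTAMP, not STRING
    .withColumn("event_date", to_date("event_time")))               # DATE when time-of-day is not needed
```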
Gotcha 3: Not Configuring Compression
Problem: Default settings are often not optimal.
Example: Spark default is Snappy, but ZSTD is better for most workloads.
Solution: Configure per workload:
```python
# For hot data (frequent access): Snappy
df.write.option("compression", "snappy").parquet("/hot")

# For warm data: ZSTD level 3 (the level is the parquet-mr property, passed through as a writer option)
df.write.option("compression", "zstd").option("parquet.compression.codec.zstd.level", "3").parquet("/warm")

# For cold data: ZSTD level 19
df.write.option("compression", "zstd").option("parquet.compression.codec.zstd.level", "19").parquet("/cold")
```
Gotcha 4: Small Files from Streaming
Problem: Each micro-batch creates small files → thousands of files.
Solution: Configure file size and compaction:
```python
# Row group size is a separate knob (the parquet-mr property parquet.block.size)
# Spark Structured Streaming: fewer, larger files per micro-batch
(df.writeStream
   .format("parquet")
   .option("path", "/data")
   .option("checkpointLocation", "/checkpoint")
   .option("parquet.block.size", 256 * 1024 * 1024)  # 256MB row groups
   .trigger(processingTime="10 minutes")             # longer trigger interval -> larger output files
   .start())
```
Even with a longer trigger interval, plan a periodic compaction job that rewrites accumulated small files into ~256MB files (or rely on a table format's built-in compaction).
Key Takeaways
- Parquet is default choice: Best for analytics, universal support, excellent compression
- Columnar wins for analytics: 10-100x performance for typical queries
- Avro for streaming: Schema evolution, row-based, fast writes
- ORC for Hive/Presto: Good alternative, but less universal
- Compression matters: ZSTD recommended for new projects
- File size critical: Target 256MB-512MB Parquet files
- Type selection matters: Proper types save storage and improve performance
Next: Partitioning Strategies