Storage Formats Deep Dive

Parquet, ORC, Avro - When to Use Each


Overview

Storage format selection is one of the most impactful decisions for data platform cost and performance. This document provides a deep dive into the three major formats: Parquet, ORC, and Avro, including compression algorithms, encoding techniques, and selection criteria.


Quick Comparison

| Format | Type | Best For | Ecosystem | Compression Ratio |
| --- | --- | --- | --- | --- |
| Parquet | Columnar | Analytics, BI, ML | Universal | 10-100x vs. CSV |
| ORC | Columnar | Hive, Presto/Trino | Hadoop ecosystem | 10-100x vs. CSV |
| Avro | Row-based | Streaming, schema evolution | Kafka, Hadoop | 5-30x vs. CSV |
| JSON/CSV | Row/text | Debugging, interchange | Universal | 1x (baseline) |

The Columnar Advantage

Why Columnar Wins for Analytics

| Scenario | Row-Based | Columnar | Speedup |
| --- | --- | --- | --- |
| SELECT * FROM table | Similar | Similar | 1x |
| SELECT col1 FROM table | Read all | Read col1 only | 10-50x |
| WHERE col2 > 100 | Read all | Skip col1, col3 | 10-100x |
| AGG col3 GROUP BY col1 | Read all | Read col1, col3 only | 5-20x |

Result: For typical analytical queries (projection + filtering), columnar formats provide 10-100x performance improvement.
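
As a concrete illustration of the projection benefit, here is a minimal PySpark sketch (paths and column names are hypothetical) running the same single-column query against CSV and Parquet copies of a table; the Parquet read fetches only that column's chunks, while the CSV read must parse every row in full.

# Hypothetical comparison: same projection against CSV vs. Parquet copies of a table
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CSV: every byte of every row is read and parsed, then unused columns are discarded
csv_count = spark.read.csv("/data/events_csv", header=True).select("col1").count()

# Parquet: only the col1 column chunks are read from storage
parquet_count = spark.read.parquet("/data/events_parquet").select("col1").count()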


Apache Parquet

Architecture

Key Concepts

| Concept | Description | Size / Purpose |
| --- | --- | --- |
| Row Group | Logical horizontal partition | 128MB-1GB |
| Column Chunk | Single column within a row group | Varies |
| Page | Smallest unit of compression | 1MB |
| Statistics | Min/max/null count per column chunk | Enable data skipping |

Data Skipping in Parquet

Parquet stores min/max and null-count statistics for each column chunk (and page) in the file footer. When a query filters on a column, the engine compares the predicate against these statistics and skips row groups that cannot contain matching values.

Result: 50-90% I/O reduction for selective queries.
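
To see the statistics that drive skipping, you can inspect a file's footer; a minimal sketch using pyarrow (the file path is hypothetical):

# Print per-row-group min/max/null-count statistics for the first column
import pyarrow.parquet as pq

meta = pq.ParquetFile("/data/events.parquet").metadata
for i in range(meta.num_row_groups):
    col = meta.row_group(i).column(0)      # first column chunk in this row group
    stats = col.statistics
    if stats is not None and stats.has_min_max:
        print(i, col.path_in_schema, stats.min, stats.max, stats.null_count)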

Parquet Compression

| Codec | Compression Ratio | Speed | Use Case |
| --- | --- | --- | --- |
| Uncompressed | 1x | Fastest | Testing |
| Snappy | 5-10x | Fast | Default (balanced) |
| GZIP | 10-20x | Slow | Storage-sensitive |
| ZSTD | 10-15x | Medium | Recommended (best balance) |
| LZO | 5-8x | Fast | Legacy systems |

Recommendation: Use ZSTD for new projects (better compression than Snappy with similar speed).
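
For reference, a small pyarrow sketch writing the same table with Snappy and ZSTD so the resulting sizes can be compared (paths and data are hypothetical); Spark equivalents appear in the gotchas section below.

# Write the same table twice and compare file sizes on disk
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(100_000)), "status": ["ok"] * 100_000})
pq.write_table(table, "/tmp/events_snappy.parquet", compression="snappy")
pq.write_table(table, "/tmp/events_zstd.parquet", compression="zstd", compression_level=3)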

Parquet Encoding

| Encoding | Best For | Description |
| --- | --- | --- |
| PLAIN | All | Simple storage |
| Dictionary | Low cardinality (< 1,000 distinct values) | 10-100x compression |
| RLE | Repeated values | Run-length encoding |
| Delta Encoding | Sorted, monotonic values | Store differences |
| Bit-Packing | Small integers | Pack multiple values per byte |

Impact: Automatic encoding can add 20-50% additional compression on top of codec.
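
Writers pick encodings automatically, but they expose some control; as a sketch, pyarrow lets you restrict dictionary encoding to specific columns (column names are hypothetical):

# Enable dictionary encoding only for the low-cardinality "status" column;
# the high-cardinality "user_id" column stays PLAIN and avoids dictionary overflow
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": list(range(100_000)),             # high cardinality
    "status": ["active", "inactive"] * 50_000,   # low cardinality
})
pq.write_table(table, "/tmp/users.parquet", use_dictionary=["status"])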

When to Use Parquet

Use Parquet when:

  • Analytical queries (projection, filtering, aggregation)
  • Columnar access patterns
  • Need compression + performance
  • Ecosystem support (all engines support it)
  • ML workloads (read columns into DataFrames)

Avoid Parquet when:

  • Row-heavy operations (full table scans with all columns)
  • Write-heavy workloads (columnar has write overhead)
  • Small files (< 128MB, overhead not worth it)

Optimized Row Columnar (ORC)

Architecture

Key Concepts

| Concept | Description | Size / Notes |
| --- | --- | --- |
| Stripe | Logical horizontal partition | 64MB-256MB |
| Row Group | 10,000 rows within a stripe | Fixed |
| Index Data | Min/max, sum, bloom filters | Enables skipping |
| Row Data | Actual column values | Compressed |

ORC vs. Parquet

| Feature | Parquet | ORC |
| --- | --- | --- |
| Ecosystem | Universal | Hive/Presto focused |
| Compression | Slightly better (generally) | Slightly worse |
| Predicate pushdown | Excellent | Excellent (bloom filters) |
| Nested data | Excellent support | Good support |
| ACID transactions | Via OTF (open table formats) | Native Hive ACID |
| Write speed | Faster | Slower |
| Read speed | Similar | Similar (slightly faster in Hive) |
| Tool support | All tools | Hive, Presto, Spark |

ORC Compression

| Codec | Compression Ratio | Speed | Use Case |
| --- | --- | --- | --- |
| None | 1x | Fastest | Testing |
| Snappy | 5-10x | Fast | Default |
| ZLIB | 15-25x | Slow | Storage-sensitive |
| LZO | 5-8x | Fast | Legacy |

Recommendation: Snappy for balance, ZLIB for storage-sensitive workloads.
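
In Spark, the codec (and other ORC writer properties, such as bloom filter columns) can be set per write; a sketch with hypothetical path and column names:

# Write ORC with ZLIB compression and a bloom filter on a frequently filtered column
(df.write
    .format("orc")
    .option("compression", "zlib")
    .option("orc.bloom.filter.columns", "user_id")
    .save("/data/events_orc"))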

When to Use ORC

Use ORC when:

  • Using Hive or Presto/Trino as primary engine
  • Need Hive ACID transactions (without OTF)
  • Working in Hadoop ecosystem
  • Need built-in bloom filters

Avoid ORC when:

  • Need broad ecosystem support (Parquet more universal)
  • Working with Databricks (optimizes for Parquet)
  • Need nested data support (Parquet better)

Apache Avro

Architecture

Key Concepts

| Concept | Description | Benefit |
| --- | --- | --- |
| Schema-first | Schema embedded or separate | Schema evolution without data rewrite |
| Row-based | Stores records sequentially | Fast writes, full-row reads |
| Binary encoding | Compact binary format | Efficient storage |
| RPC support | Built-in remote procedure call | Service communication |

Schema Evolution

Avro supports schema evolution with well-defined compatibility rules:

| Change | Forward Compatible | Backward Compatible | Action |
| --- | --- | --- | --- |
| Add field with default | Yes | Yes | Safe |
| Add field without default | Yes | No | Requires all readers updated |
| Remove field | No | Yes | Requires all writers updated |
| Rename field | No | No | Use aliases |
| Change type | No | No | Safe promotions only |
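
As a minimal sketch of the add-field-with-default case, the following uses the fastavro library (record and field names are illustrative): data written with the old schema resolves cleanly when read with the new one.

# Writer used schema v1; reader supplies schema v2, which adds "email" with a default.
import io
from fastavro import writer, reader, parse_schema

schema_v1 = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "name", "type": "string"}],
})
schema_v2 = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "name", "type": "string"},
               {"name": "email", "type": ["null", "string"], "default": None}],
})

buf = io.BytesIO()
writer(buf, schema_v1, [{"id": 1, "name": "alice"}])
buf.seek(0)
for record in reader(buf, reader_schema=schema_v2):
    print(record)   # {'id': 1, 'name': 'alice', 'email': None}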

Avro Compression

| Codec | Compression Ratio | Speed | Use Case |
| --- | --- | --- | --- |
| None | 1x | Fastest | Testing |
| Snappy | 3-8x | Fast | Default |
| Deflate | 5-15x | Medium | Storage-sensitive |
| Bzip2 | 10-20x | Slow | Maximum compression |

Note: Avro typically compresses less than Parquet/ORC: in a row layout, values from different columns are interleaved, so the codec sees less local repetition and there is no column-specific encoding.
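
With fastavro, the codec is simply a writer argument (Snappy requires the python-snappy package); a short sketch continuing the schema example above:

# Same writer call as above, but with block-level Snappy compression
records = [{"id": 2, "name": "bob", "email": "bob@example.com"}]
with open("/tmp/users.avro", "wb") as out:
    writer(out, schema_v2, records, codec="snappy")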

When to Use Avro

Use Avro when:

  • Streaming data (Kafka, Kinesis, Pulsar)
  • Need schema evolution without data rewrite
  • Row-based access patterns (full row reads)
  • RPC between services
  • Fast writes are critical
  • Schema registry in use (Confluent Schema Registry)

Avoid Avro when:

  • Analytical queries (columnar better)
  • Need column-level access
  • Need high compression ratios
  • Large files (columnar wins for analytics)

Format Selection Decision Tree


Performance Benchmarks

Storage Comparison

| Format | Storage Size vs. CSV | Write Speed | Read Speed (Full) | Read Speed (Selective) |
| --- | --- | --- | --- | --- |
| CSV | 1x (baseline) | Fast | Fast | Fast |
| JSON | 1.5-2x | Medium | Medium | Medium |
| Avro (Snappy) | 0.2-0.3x | Fast | Fast | Fast |
| Parquet (Snappy) | 0.1-0.2x | Medium | Medium | Very Fast |
| Parquet (ZSTD) | 0.08-0.15x | Medium | Medium | Very Fast |
| ORC (ZLIB) | 0.05-0.1x | Slow | Medium | Very Fast |

Lower storage = better. Parquet/ZSTD typically best balance.

Query Performance

| Query Type | CSV | Avro | Parquet | ORC |
| --- | --- | --- | --- | --- |
| SELECT * | 1.0x | 0.9x | 0.8x | 0.8x |
| SELECT 3 cols | 1.0x | 0.9x | 0.3x | 0.3x |
| WHERE col > 100 | 1.0x | 1.0x | 0.1x | 0.1x |
| AGG GROUP BY | 1.0x | 0.9x | 0.2x | 0.2x |
| Point lookup | 1.0x | 0.8x | 0.5x | 0.5x |

Lower = faster. Parquet/ORC dominate for analytical queries.


Cost Implications

Storage Cost Example

Scenario: 100TB raw data

| Format | Storage | Monthly Cost (S3) | Annual Savings |
| --- | --- | --- | --- |
| CSV | 100TB | $2,300 | $0 (baseline) |
| JSON | 150TB | $3,450 | -$13,800 |
| Avro (Snappy) | 25TB | $575 | +$20,700 |
| Parquet (Snappy) | 15TB | $345 | +$23,460 |
| Parquet (ZSTD) | 10TB | $230 | +$24,840 |

Result: Parquet/ZSTD saves $24,840 annually per 100TB vs. CSV.

Query Cost Example

Scenario: 10,000 queries/month, 1TB data scanned average

| Format | Scan Size | Cost/Query | Monthly Cost | Annual Savings |
| --- | --- | --- | --- | --- |
| CSV | 1TB | $5.00 | $50,000 | $0 (baseline) |
| Parquet | 0.1TB (10x skipping) | $0.50 | $5,000 | +$540,000 |

Result: Combined storage + query savings = $560,000+ annually per 100TB.
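
The numbers above can be reproduced back-of-envelope; a quick check, assuming roughly $0.023/GB-month S3 standard storage and $5 per TB scanned (the rates the tables imply):

# Back-of-envelope check of the storage and query savings above
s3_per_tb_month = 0.023 * 1000                          # ~$23 per TB-month
storage_saving = (100 - 10) * s3_per_tb_month * 12      # CSV 100TB -> Parquet/ZSTD 10TB
query_saving = 10_000 * (1.0 - 0.1) * 5 * 12            # 10k queries/month, 1TB -> 0.1TB scanned
print(round(storage_saving), round(query_saving))        # ~24840 and 540000 per year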


Best Practices

File Sizing

| File Size | Read Performance | Metadata Overhead | Recommendation |
| --- | --- | --- | --- |
| < 64MB | Poor (many small reads) | High | Too small |
| 128MB-512MB | Good | Low | Optimal for most |
| 512MB-1GB | Excellent | Very Low | Good for large scans |
| > 1GB | No additional benefit | Very Low | Good, but limits parallelization |

Target: Aim for roughly 256MB-512MB Parquet files for most workloads.
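
A common way to hit that target in Spark is to estimate the output size and repartition to the corresponding file count before writing; a rough sketch with hypothetical figures:

# Aim for ~256MB output files: estimate compressed output size, derive a file count
target_file_bytes = 256 * 1024 * 1024
estimated_output_bytes = 50 * 1024 ** 3        # e.g., ~50GB expected after compression
num_files = max(1, estimated_output_bytes // target_file_bytes)

df.repartition(int(num_files)).write.parquet("/data/events")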

Compression Selection

| Scenario | Recommended |
| --- | --- |
| Default | Parquet + ZSTD |
| Max compression | Parquet + ZSTD-19 or ORC + ZLIB |
| Fast writes | Parquet + Snappy |
| Streaming | Avro + Snappy |
| Debug/landing | JSON (convert immediately) |

Schema Design

DO:

  • Use appropriate types (INT vs. BIGINT, DECIMAL vs. DOUBLE)
  • Leverage dictionary encoding (low cardinality columns)
  • Partition on low-cardinality, filter-heavy columns (e.g., date)
  • Use nested types appropriately (Parquet handles well)

DON’T:

  • Store everything as STRING
  • Use JSON within Parquet strings (parse to struct)
  • Over-partition (see partitioning module)
  • Store timestamps as strings (use TIMESTAMP type)

Senior Level Gotchas

Gotcha 1: JSON in Parquet Strings

Problem: Storing JSON as string in Parquet column.

# Bad: DataFrame schema is (id INT, data STRING), where data holds a raw JSON string
# e.g. data = '{"user": "alice", "age": 30}'
df.write.parquet("/data")

Issue: No predicate pushdown, no compression benefit, can’t query fields.

Solution: Parse JSON to struct:

# Good: parse the JSON string into a typed struct before writing
from pyspark.sql.functions import from_json

parsed = df.withColumn("data", from_json("data", "user STRING, age INT"))
parsed.write.parquet("/data")
# Can now: SELECT data.user, data.age FROM table WHERE data.age > 25

Gotcha 2: Wrong Type Selection

Problem: Using STRING for everything.

Impact:

  • 2-5x storage vs. proper types
  • No type-specific optimization
  • Requires parsing in queries

Solution: Use proper types (see the casting sketch after this list):

  • INT vs. BIGINT (storage savings)
  • DECIMAL for money (not DOUBLE)
  • TIMESTAMP for dates (not STRING)
  • DATE for dates without time (smaller than TIMESTAMP)
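
For example, a minimal PySpark sketch casting string columns to the types above (column names are hypothetical):

# Cast string-typed columns to proper types once, at write time
from pyspark.sql.functions import col, to_date, to_timestamp

typed = (raw_df
    .withColumn("user_id", col("user_id").cast("bigint"))
    .withColumn("amount", col("amount").cast("decimal(18,2)"))
    .withColumn("event_ts", to_timestamp("event_ts"))
    .withColumn("event_date", to_date("event_ts")))
typed.write.parquet("/data/events_typed")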

Gotcha 3: Not Configuring Compression

Problem: Default settings often not optimal.

Example: Spark default is Snappy, but ZSTD is better for most workloads.

Solution: Configure per workload:

# Hot data (frequent access): Snappy
df.write.option("compression", "snappy").parquet("/hot")
# Warm data: ZSTD at the library's default level
df.write.option("compression", "zstd").parquet("/warm")
# Cold data: ZSTD at a higher level. The level is read from the Hadoop
# configuration ("parquet.compression.codec.zstd.level" in parquet-mr);
# exact plumbing varies by engine version, so treat this as a sketch.
spark.sparkContext._jsc.hadoopConfiguration().set("parquet.compression.codec.zstd.level", "19")
df.write.option("compression", "zstd").parquet("/cold")

Gotcha 4: Small Files from Streaming

Problem: Each micro-batch creates small files → thousands of files.

Solution: Use a longer trigger interval so each micro-batch writes larger files, and compact small files periodically (see the compaction sketch after the code):

# Spark Structured Streaming: fewer, larger micro-batches mean fewer, larger files
(df.writeStream
    .format("parquet")
    .option("path", "/data")
    .option("checkpointLocation", "/checkpoint")
    .trigger(processingTime="10 minutes")   # batch up output instead of writing continuously
    .start())
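
Even with a longer trigger, small files accumulate over time, so pair the stream with a periodic batch compaction job; a simple sketch (paths and the partition count are hypothetical, and spark is an existing SparkSession):

# Periodic compaction: read the small files and rewrite them as a few larger ones.
# Writing to a separate path and swapping avoids touching files the stream still appends to.
small = spark.read.parquet("/data")
small.repartition(64).write.mode("overwrite").parquet("/data_compacted")   # 64 is illustrative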

Key Takeaways

  1. Parquet is default choice: Best for analytics, universal support, excellent compression
  2. Columnar wins for analytics: 10-100x performance for typical queries
  3. Avro for streaming: Schema evolution, row-based, fast writes
  4. ORC for Hive/Presto: Good alternative, but less universal
  5. Compression matters: ZSTD recommended for new projects
  6. File size critical: Target 256MB-512MB Parquet files
  7. Type selection matters: Proper types save storage and improve performance

Next: Partitioning Strategies