Skip to content

Streaming Engines Comparison

Apache Flink vs. Spark Structured Streaming


Overview

This document compares the two major streaming engines: Apache Flink and Spark Structured Streaming. Both enable real-time data processing but have different architectures, strengths, and use cases.


Quick Comparison Matrix

FeatureFlinkSpark Structured Streaming
ModelTrue streaming (per-event)Micro-batch
LatencyMilliseconds100ms - seconds
State ManagementExcellent (RocksDB)Good (HDFS/S3)
Exactly-OnceNativeNative (with checkpoints)
WindowingVery flexibleFlexible
SQL SupportExcellentExcellent
EcosystemGrowingMature (Spark ecosystem)
OperationsComplexSimpler (unified with Spark)
DeploymentStandalone, K8s, YARNStandalone, K8s, YARN, Cloud

Architecture Comparison

Key characteristics:

  • Event-at-a-time processing
  • Native state management
  • Checkpointing for exactly-once
  • Watermarks for late data
  • Savepoints for upgrades

Spark Structured Streaming Architecture

Key characteristics:

  • Micro-batch processing
  • DataFrame/Dataset API
  • Checkpointing for fault tolerance
  • Watermarks for event time
  • Unified batch + streaming

Deep Dive by Engine

Strengths:

  • True low-latency streaming (milliseconds)
  • Excellent state management (RocksDB integration)
  • Advanced windowing (session, sliding, tumbling, custom)
  • Native exactly-once semantics
  • Savepoints for stateful upgrades
  • Strong SQL support for streaming

Weaknesses:

  • Steeper learning curve
  • Smaller community than Spark
  • More complex operations
  • Fewer high-level libraries
  • More operational complexity

Best for:

  • Low-latency requirements (< 100ms)
  • Complex stateful operations
  • Event-driven applications
  • Real-time ML inference
  • Time-series analytics

Code Example:

// Flink DataStream API
DataStream<Event> events = env
.addSource(new FlinkKafkaConsumer<>("topic", deserializer, props))
// Keyed stream with state
events.keyBy(Event::getUserId)
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.process(new ProcessWindowFunction<Event, Result, String, TimeWindow>() {
public void process(String key, Context ctx, Iterable<Event> events, Collector<Result> out) {
// Stateful processing
long count = 0;
for (Event e : events) {
count++;
}
out.collect(new Result(key, count));
}
});

Spark Structured Streaming

Strengths:

  • Unified batch + streaming (same API)
  • Mature ecosystem (Spark ML, GraphX, etc.)
  • Simpler operations (micro-batch abstraction)
  • Strong community support
  • Easy to learn for Spark users
  • Good integration with Delta Lake

Weaknesses:

  • Higher latency (micro-batch)
  • Less flexible windowing
  • State management less robust than Flink
  • No native per-event processing
  • Checkpoint overhead can be high

Best for:

  • Near-real-time analytics (100ms - seconds)
  • Lambda architecture (batch + streaming)
  • ETL pipelines
  • Data warehousing workloads
  • Teams already using Spark

Code Example:

# Spark Structured Streaming
from pyspark.sql.functions import col, window
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "topic") \
.load()
# Transform and write
query = df.select(
col("key").cast("string"),
col("value").cast("string"),
col("timestamp")
).writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/checkpoint") \
.start("/delta/events")
query.awaitTermination()

Performance Comparison

Latency

MetricFlinkSpark Streaming
End-to-end latency10-50ms100ms - 5s
Micro-batch overheadNoneBatch interval (typically 1-10s)
p99 latencyPredictableVariable (depends on batch)

Throughput

MetricFlinkSpark Streaming
Max throughput10+ million events/sec5+ million events/sec
ScalingLinearLinear (after initial overhead)

State Management

FeatureFlinkSpark Streaming
State backendRocksDB, MemoryHDFS, S3 (checkpointing)
State sizeCan handle TBs of stateLimited by checkpoint overhead
Recovery timeSeconds (RocksDB)Minutes (S3)

Feature Comparison

Windowing

Window TypeFlinkSpark Streaming
TumblingYesYes
SlidingYesYes
SessionYes (native)Yes (with session window)
GlobalYesYes
CustomYesLimited

Time Semantics

FeatureFlinkSpark Streaming
Event timeNative (watermarks)Native (watermarks)
Processing timeYesYes
Ingestion timeYesYes
Late data handlingExcellent (allowed lateness)Good (watermarks)

Fault Tolerance

FeatureFlinkSpark Streaming
Exactly-onceNative (end-to-end)Native (with checkpoint)
Checkpoint intervalConfigurableConfigurable
RecoveryFrom checkpoint/savepointFrom checkpoint
State upgradeSavepointsRebuild state

Selection Framework

Decision Guide

ScenarioRecommendedRationale
Real-time MLFlinkLow latency, state management
Real-time fraud detectionFlinkMilliseconds latency, complex state
Near-real-time analyticsSpark StreamingSimpler, unified with batch
Lambda architectureSpark StreamingSame API for batch + streaming
Complex event processingFlinkCEP library, low latency
Data warehousingSpark StreamingIntegration with Delta Lake
IoT processingFlinkHigh throughput, state management

Cost Comparison

Infrastructure Costs

Scenario: 1M events/sec, 1KB per event

EngineCluster SizeMonthly CostNotes
Flink20 nodes (64GB, 16 cores)$10,000With spot instances: $3,000
Spark Streaming20 nodes (64GB, 16 cores)$10,000With spot instances: $3,000

Result: Similar infrastructure costs. Flink may require fewer nodes for low-latency workloads.

Operational Costs

TaskFlinkSpark Streaming
SetupMediumLow (if using Spark)
MonitoringMediumLow
State managementMediumLow
UpgradesComplex (savepoints)Simple

Integration Patterns

# Flink Kafka Source
source:
type: kafka
properties:
bootstrap.servers: kafka1:9092,kafka2:9092
group.id: flink-consumer
topic: events
format: json
# Flink Kafka Sink
sink:
type: kafka
properties:
bootstrap.servers: kafka1:9092
topic: results
format: json

Spark Streaming + Delta Lake

# Read from Kafka, write to Delta
df = spark.readStream.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "events") \
.load()
query = df.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/checkpoint") \
.start("/delta/events")

Production Patterns

// Keyed ProcessFunction with state
public class UserFunction extends KeyedProcessFunction<String, Event, Result> {
private ValueState<Long> countState;
@Override
public void open(Configuration parameters) {
countState = getRuntimeContext().getState(
new ValueStateDescriptor<>("count", Long.class)
);
}
@Override
public void processElement(Event event, Context ctx, Collector<Result> out) {
Long count = countState.value();
if (count == null) {
count = 0L;
}
count++;
countState.update(count);
out.collect(new Result(ctx.getCurrentKey(), count));
}
}

Spark Streaming: Aggregation

from pyspark.sql.functions import col, window
# Windowed aggregation
df = spark.readStream.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "events") \
.load()
aggregated = df.groupBy(
window(col("timestamp"), "5 minutes"),
col("key")
).agg(
count("*").alias("count"),
sum("value").alias("total")
)
query = aggregated.writeStream \
.format("delta") \
.outputMode("update") \
.option("checkpointLocation", "/checkpoint") \
.start("/delta/aggregated")

Senior Level Considerations

Operational Complexity

AspectFlinkSpark Streaming
DeploymentStandalone, K8s, YARNStandalone, K8s, YARN, Cloud
MonitoringFlink Web UI, PrometheusSpark UI, Prometheus
DebuggingHarder (distributed)Easier (micro-batch)
BackpressureAutomatic handlingAutomatic handling

Scaling

ScenarioFlinkSpark Streaming
Scale upAdd task managersAdd executors
Scale downRemove task managersRemove executors
RebalanceManual (savepoint)Automatic

Failure Recovery

ScenarioFlinkSpark Streaming
Task failureRestart from checkpointRestart from checkpoint
Node failureReschedule tasksReschedule tasks
Recovery timeSecondsMinutes (S3 checkpoint)

Key Takeaways

  1. Flink: True streaming, millisecond latency, complex state
  2. Spark Streaming: Micro-batch, simpler, unified with batch
  3. Latency: Flink (10-50ms) vs. Spark (100ms - 5s)
  4. State: Flink superior (RocksDB) vs. Spark (checkpointing)
  5. Ecosystem: Spark larger, Flink growing
  6. Selection: Latency requirements drive decision
  7. Cost: Similar infrastructure, Flink may require more ops

Back to Module 2