Streaming Engines Comparison

Apache Flink vs. Spark Structured Streaming

Overview

This document compares the two major streaming engines: Apache Flink and Spark Structured Streaming. Both enable real-time data processing but have different architectures, strengths, and use cases.

Quick Comparison Matrix

Feature	Flink	Spark Structured Streaming
Model	True streaming (per-event)	Micro-batch
Latency	Milliseconds	100ms - seconds
State Management	Excellent (RocksDB)	Good (HDFS/S3)
Exactly-Once	Native	Native (with checkpoints)
Windowing	Very flexible	Flexible
SQL Support	Excellent	Excellent
Ecosystem	Growing	Mature (Spark ecosystem)
Operations	Complex	Simpler (unified with Spark)
Deployment	Standalone, K8s, YARN	Standalone, K8s, YARN, Cloud

Architecture Comparison

Flink Architecture

Key characteristics:

Event-at-a-time processing
Native state management
Checkpointing for exactly-once
Watermarks for late data
Savepoints for upgrades

Spark Structured Streaming Architecture

Key characteristics:

Micro-batch processing
DataFrame/Dataset API
Checkpointing for fault tolerance
Watermarks for event time
Unified batch + streaming

Deep Dive by Engine

Apache Flink

Strengths:

True low-latency streaming (milliseconds)
Excellent state management (RocksDB integration)
Advanced windowing (session, sliding, tumbling, custom)
Native exactly-once semantics
Savepoints for stateful upgrades
Strong SQL support for streaming

Weaknesses:

Steeper learning curve
Smaller community than Spark
More complex operations
Fewer high-level libraries
More operational complexity

Best for:

Low-latency requirements (< 100ms)
Complex stateful operations
Event-driven applications
Real-time ML inference
Time-series analytics

Code Example:

// Flink DataStream API
DataStream<Event> events = env
    .addSource(new FlinkKafkaConsumer<>("topic", deserializer, props))

// Keyed stream with state
events.keyBy(Event::getUserId)
      .window(TumblingEventTimeWindows.of(Time.minutes(5)))
      .process(new ProcessWindowFunction<Event, Result, String, TimeWindow>() {
          public void process(String key, Context ctx, Iterable<Event> events, Collector<Result> out) {
              // Stateful processing
              long count = 0;
              for (Event e : events) {
                  count++;
              }
              out.collect(new Result(key, count));
          }
      });

Spark Structured Streaming

Strengths:

Unified batch + streaming (same API)
Mature ecosystem (Spark ML, GraphX, etc.)
Simpler operations (micro-batch abstraction)
Strong community support
Easy to learn for Spark users
Good integration with Delta Lake

Weaknesses:

Higher latency (micro-batch)
Less flexible windowing
State management less robust than Flink
No native per-event processing
Checkpoint overhead can be high

Best for:

Near-real-time analytics (100ms - seconds)
Lambda architecture (batch + streaming)
ETL pipelines
Data warehousing workloads
Teams already using Spark

Code Example:

# Spark Structured Streaming
from pyspark.sql.functions import col, window

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic") \
    .load()

# Transform and write
query = df.select(
    col("key").cast("string"),
    col("value").cast("string"),
    col("timestamp")
).writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoint") \
    .start("/delta/events")

query.awaitTermination()

Performance Comparison

Latency

Metric	Flink	Spark Streaming
End-to-end latency	10-50ms	100ms - 5s
Micro-batch overhead	None	Batch interval (typically 1-10s)
p99 latency	Predictable	Variable (depends on batch)

Throughput

Metric	Flink	Spark Streaming
Max throughput	10+ million events/sec	5+ million events/sec
Scaling	Linear	Linear (after initial overhead)

State Management

Feature	Flink	Spark Streaming
State backend	RocksDB, Memory	HDFS, S3 (checkpointing)
State size	Can handle TBs of state	Limited by checkpoint overhead
Recovery time	Seconds (RocksDB)	Minutes (S3)

Feature Comparison

Windowing

Window Type	Flink	Spark Streaming
Tumbling	Yes	Yes
Sliding	Yes	Yes
Session	Yes (native)	Yes (with session window)
Global	Yes	Yes
Custom	Yes	Limited

Time Semantics

Feature	Flink	Spark Streaming
Event time	Native (watermarks)	Native (watermarks)
Processing time	Yes	Yes
Ingestion time	Yes	Yes
Late data handling	Excellent (allowed lateness)	Good (watermarks)

Fault Tolerance

Feature	Flink	Spark Streaming
Exactly-once	Native (end-to-end)	Native (with checkpoint)
Checkpoint interval	Configurable	Configurable
Recovery	From checkpoint/savepoint	From checkpoint
State upgrade	Savepoints	Rebuild state

Selection Framework

Decision Guide

Scenario	Recommended	Rationale
Real-time ML	Flink	Low latency, state management
Real-time fraud detection	Flink	Milliseconds latency, complex state
Near-real-time analytics	Spark Streaming	Simpler, unified with batch
Lambda architecture	Spark Streaming	Same API for batch + streaming
Complex event processing	Flink	CEP library, low latency
Data warehousing	Spark Streaming	Integration with Delta Lake
IoT processing	Flink	High throughput, state management

Cost Comparison

Infrastructure Costs

Scenario: 1M events/sec, 1KB per event

Engine	Cluster Size	Monthly Cost	Notes
Flink	20 nodes (64GB, 16 cores)	$10,000	With spot instances: $3,000
Spark Streaming	20 nodes (64GB, 16 cores)	$10,000	With spot instances: $3,000

Result: Similar infrastructure costs. Flink may require fewer nodes for low-latency workloads.

Operational Costs

Task	Flink	Spark Streaming
Setup	Medium	Low (if using Spark)
Monitoring	Medium	Low
State management	Medium	Low
Upgrades	Complex (savepoints)	Simple

Integration Patterns

Flink + Kafka

# Flink Kafka Source
source:
  type: kafka
  properties:
    bootstrap.servers: kafka1:9092,kafka2:9092
    group.id: flink-consumer
  topic: events
  format: json

# Flink Kafka Sink
sink:
  type: kafka
  properties:
    bootstrap.servers: kafka1:9092
  topic: results
  format: json

Spark Streaming + Delta Lake

# Read from Kafka, write to Delta
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()

query = df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoint") \
    .start("/delta/events")

Production Patterns

Flink: Stateful Processing

// Keyed ProcessFunction with state
public class UserFunction extends KeyedProcessFunction<String, Event, Result> {
    private ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) {
        countState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Long.class)
        );
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Result> out) {
        Long count = countState.value();
        if (count == null) {
            count = 0L;
        }
        count++;
        countState.update(count);
        out.collect(new Result(ctx.getCurrentKey(), count));
    }
}

Spark Streaming: Aggregation

from pyspark.sql.functions import col, window

# Windowed aggregation
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()

aggregated = df.groupBy(
    window(col("timestamp"), "5 minutes"),
    col("key")
).agg(
    count("*").alias("count"),
    sum("value").alias("total")
)

query = aggregated.writeStream \
    .format("delta") \
    .outputMode("update") \
    .option("checkpointLocation", "/checkpoint") \
    .start("/delta/aggregated")

Senior Level Considerations

Operational Complexity

Aspect	Flink	Spark Streaming
Deployment	Standalone, K8s, YARN	Standalone, K8s, YARN, Cloud
Monitoring	Flink Web UI, Prometheus	Spark UI, Prometheus
Debugging	Harder (distributed)	Easier (micro-batch)
Backpressure	Automatic handling	Automatic handling

Scaling

Scenario	Flink	Spark Streaming
Scale up	Add task managers	Add executors
Scale down	Remove task managers	Remove executors
Rebalance	Manual (savepoint)	Automatic

Failure Recovery

Scenario	Flink	Spark Streaming
Task failure	Restart from checkpoint	Restart from checkpoint
Node failure	Reschedule tasks	Reschedule tasks
Recovery time	Seconds	Minutes (S3 checkpoint)

Key Takeaways

Flink: True streaming, millisecond latency, complex state
Spark Streaming: Micro-batch, simpler, unified with batch
Latency: Flink (10-50ms) vs. Spark (100ms - 5s)
State: Flink superior (RocksDB) vs. Spark (checkpointing)
Ecosystem: Spark larger, Flink growing
Selection: Latency requirements drive decision
Cost: Similar infrastructure, Flink may require more ops

Back to Module 2