Streaming Engines Comparison
Apache Flink vs. Spark Structured Streaming
Overview
This document compares the two major streaming engines: Apache Flink and Spark Structured Streaming. Both enable real-time data processing but have different architectures, strengths, and use cases.
Quick Comparison Matrix
| Feature | Flink | Spark Structured Streaming |
|---|---|---|
| Model | True streaming (per-event) | Micro-batch |
| Latency | Milliseconds | 100ms - seconds |
| State Management | Excellent (RocksDB) | Good (HDFS/S3) |
| Exactly-Once | Native | Native (with checkpoints) |
| Windowing | Very flexible | Flexible |
| SQL Support | Excellent | Excellent |
| Ecosystem | Growing | Mature (Spark ecosystem) |
| Operations | Complex | Simpler (unified with Spark) |
| Deployment | Standalone, K8s, YARN | Standalone, K8s, YARN, Cloud |
Architecture Comparison
Flink Architecture
Key characteristics:
- Event-at-a-time processing
- Native state management
- Checkpointing for exactly-once
- Watermarks for late data
- Savepoints for upgrades
Spark Structured Streaming Architecture
Key characteristics:
- Micro-batch processing
- DataFrame/Dataset API
- Checkpointing for fault tolerance
- Watermarks for event time
- Unified batch + streaming
Deep Dive by Engine
Apache Flink
Strengths:
- True low-latency streaming (milliseconds)
- Excellent state management (RocksDB integration)
- Advanced windowing (session, sliding, tumbling, custom)
- Native exactly-once semantics
- Savepoints for stateful upgrades
- Strong SQL support for streaming
Weaknesses:
- Steeper learning curve
- Smaller community than Spark
- More complex operations
- Fewer high-level libraries
- More operational complexity
Best for:
- Low-latency requirements (< 100ms)
- Complex stateful operations
- Event-driven applications
- Real-time ML inference
- Time-series analytics
Code Example:
// Flink DataStream APIDataStream<Event> events = env .addSource(new FlinkKafkaConsumer<>("topic", deserializer, props))
// Keyed stream with stateevents.keyBy(Event::getUserId) .window(TumblingEventTimeWindows.of(Time.minutes(5))) .process(new ProcessWindowFunction<Event, Result, String, TimeWindow>() { public void process(String key, Context ctx, Iterable<Event> events, Collector<Result> out) { // Stateful processing long count = 0; for (Event e : events) { count++; } out.collect(new Result(key, count)); } });Spark Structured Streaming
Strengths:
- Unified batch + streaming (same API)
- Mature ecosystem (Spark ML, GraphX, etc.)
- Simpler operations (micro-batch abstraction)
- Strong community support
- Easy to learn for Spark users
- Good integration with Delta Lake
Weaknesses:
- Higher latency (micro-batch)
- Less flexible windowing
- State management less robust than Flink
- No native per-event processing
- Checkpoint overhead can be high
Best for:
- Near-real-time analytics (100ms - seconds)
- Lambda architecture (batch + streaming)
- ETL pipelines
- Data warehousing workloads
- Teams already using Spark
Code Example:
# Spark Structured Streamingfrom pyspark.sql.functions import col, window
df = spark.readStream \ .format("kafka") \ .option("kafka.bootstrap.servers", "localhost:9092") \ .option("subscribe", "topic") \ .load()
# Transform and writequery = df.select( col("key").cast("string"), col("value").cast("string"), col("timestamp")).writeStream \ .format("delta") \ .outputMode("append") \ .option("checkpointLocation", "/checkpoint") \ .start("/delta/events")
query.awaitTermination()Performance Comparison
Latency
| Metric | Flink | Spark Streaming |
|---|---|---|
| End-to-end latency | 10-50ms | 100ms - 5s |
| Micro-batch overhead | None | Batch interval (typically 1-10s) |
| p99 latency | Predictable | Variable (depends on batch) |
Throughput
| Metric | Flink | Spark Streaming |
|---|---|---|
| Max throughput | 10+ million events/sec | 5+ million events/sec |
| Scaling | Linear | Linear (after initial overhead) |
State Management
| Feature | Flink | Spark Streaming |
|---|---|---|
| State backend | RocksDB, Memory | HDFS, S3 (checkpointing) |
| State size | Can handle TBs of state | Limited by checkpoint overhead |
| Recovery time | Seconds (RocksDB) | Minutes (S3) |
Feature Comparison
Windowing
| Window Type | Flink | Spark Streaming |
|---|---|---|
| Tumbling | Yes | Yes |
| Sliding | Yes | Yes |
| Session | Yes (native) | Yes (with session window) |
| Global | Yes | Yes |
| Custom | Yes | Limited |
Time Semantics
| Feature | Flink | Spark Streaming |
|---|---|---|
| Event time | Native (watermarks) | Native (watermarks) |
| Processing time | Yes | Yes |
| Ingestion time | Yes | Yes |
| Late data handling | Excellent (allowed lateness) | Good (watermarks) |
Fault Tolerance
| Feature | Flink | Spark Streaming |
|---|---|---|
| Exactly-once | Native (end-to-end) | Native (with checkpoint) |
| Checkpoint interval | Configurable | Configurable |
| Recovery | From checkpoint/savepoint | From checkpoint |
| State upgrade | Savepoints | Rebuild state |
Selection Framework
Decision Guide
| Scenario | Recommended | Rationale |
|---|---|---|
| Real-time ML | Flink | Low latency, state management |
| Real-time fraud detection | Flink | Milliseconds latency, complex state |
| Near-real-time analytics | Spark Streaming | Simpler, unified with batch |
| Lambda architecture | Spark Streaming | Same API for batch + streaming |
| Complex event processing | Flink | CEP library, low latency |
| Data warehousing | Spark Streaming | Integration with Delta Lake |
| IoT processing | Flink | High throughput, state management |
Cost Comparison
Infrastructure Costs
Scenario: 1M events/sec, 1KB per event
| Engine | Cluster Size | Monthly Cost | Notes |
|---|---|---|---|
| Flink | 20 nodes (64GB, 16 cores) | $10,000 | With spot instances: $3,000 |
| Spark Streaming | 20 nodes (64GB, 16 cores) | $10,000 | With spot instances: $3,000 |
Result: Similar infrastructure costs. Flink may require fewer nodes for low-latency workloads.
Operational Costs
| Task | Flink | Spark Streaming |
|---|---|---|
| Setup | Medium | Low (if using Spark) |
| Monitoring | Medium | Low |
| State management | Medium | Low |
| Upgrades | Complex (savepoints) | Simple |
Integration Patterns
Flink + Kafka
# Flink Kafka Sourcesource: type: kafka properties: bootstrap.servers: kafka1:9092,kafka2:9092 group.id: flink-consumer topic: events format: json
# Flink Kafka Sinksink: type: kafka properties: bootstrap.servers: kafka1:9092 topic: results format: jsonSpark Streaming + Delta Lake
# Read from Kafka, write to Deltadf = spark.readStream.format("kafka") \ .option("kafka.bootstrap.servers", "localhost:9092") \ .option("subscribe", "events") \ .load()
query = df.writeStream \ .format("delta") \ .outputMode("append") \ .option("checkpointLocation", "/checkpoint") \ .start("/delta/events")Production Patterns
Flink: Stateful Processing
// Keyed ProcessFunction with statepublic class UserFunction extends KeyedProcessFunction<String, Event, Result> { private ValueState<Long> countState;
@Override public void open(Configuration parameters) { countState = getRuntimeContext().getState( new ValueStateDescriptor<>("count", Long.class) ); }
@Override public void processElement(Event event, Context ctx, Collector<Result> out) { Long count = countState.value(); if (count == null) { count = 0L; } count++; countState.update(count); out.collect(new Result(ctx.getCurrentKey(), count)); }}Spark Streaming: Aggregation
from pyspark.sql.functions import col, window
# Windowed aggregationdf = spark.readStream.format("kafka") \ .option("kafka.bootstrap.servers", "localhost:9092") \ .option("subscribe", "events") \ .load()
aggregated = df.groupBy( window(col("timestamp"), "5 minutes"), col("key")).agg( count("*").alias("count"), sum("value").alias("total"))
query = aggregated.writeStream \ .format("delta") \ .outputMode("update") \ .option("checkpointLocation", "/checkpoint") \ .start("/delta/aggregated")Senior Level Considerations
Operational Complexity
| Aspect | Flink | Spark Streaming |
|---|---|---|
| Deployment | Standalone, K8s, YARN | Standalone, K8s, YARN, Cloud |
| Monitoring | Flink Web UI, Prometheus | Spark UI, Prometheus |
| Debugging | Harder (distributed) | Easier (micro-batch) |
| Backpressure | Automatic handling | Automatic handling |
Scaling
| Scenario | Flink | Spark Streaming |
|---|---|---|
| Scale up | Add task managers | Add executors |
| Scale down | Remove task managers | Remove executors |
| Rebalance | Manual (savepoint) | Automatic |
Failure Recovery
| Scenario | Flink | Spark Streaming |
|---|---|---|
| Task failure | Restart from checkpoint | Restart from checkpoint |
| Node failure | Reschedule tasks | Reschedule tasks |
| Recovery time | Seconds | Minutes (S3 checkpoint) |
Key Takeaways
- Flink: True streaming, millisecond latency, complex state
- Spark Streaming: Micro-batch, simpler, unified with batch
- Latency: Flink (10-50ms) vs. Spark (100ms - 5s)
- State: Flink superior (RocksDB) vs. Spark (checkpointing)
- Ecosystem: Spark larger, Flink growing
- Selection: Latency requirements drive decision
- Cost: Similar infrastructure, Flink may require more ops
Back to Module 2