Module 2: Computing & Processing
Overview
This module covers data processing patterns, from batch ETL to real-time streaming to high-performance computing with Rust. Understanding when to use batch vs. streaming, and selecting the right processing engine for each workload, is critical for Principal-level architecture decisions.
Module Contents
Batch Processing
| Document | Description | Key Topics |
|---|---|---|
| PySpark Patterns | Common PySpark patterns | Joins, aggregations, UDFs, optimization |
| Pandas Optimization | Pandas at scale | Vectorization, chunking, alternatives |
| DataFrames vs RDD | API comparison | When to use which |
Streaming & Near-Realtime
| Document | Description | Key Topics |
|---|---|---|
| Streaming Platforms Comparison | Kafka vs. Pulsar vs. Kinesis | Architecture, performance, cost |
| Kafka Guide | Apache Kafka deep dive | Producer/consumer patterns, security, cost |
| Pulsar Guide | Apache Pulsar deep dive | Layered architecture, geo-replication, tiering |
| Kinesis Guide | AWS Kinesis deep dive | Managed streaming, Firehose, Analytics |
| Streaming Engines Comparison | Flink vs. Spark Structured Streaming | Latency, state, operations |
| Flink Deep Dive | Apache Flink guide | DataStream API, state backends, savepoints |
| Spark Structured Streaming Deep Dive | Spark Streaming guide | Micro-batch, watermarks, joins |
| Windowing Strategies | Windowing patterns | Tumbling, sliding, session, watermarks |
| State Management | Stateful processing | State backends, checkpointing, TTL |
| Backpressure Handling | Flow control | Detection, mitigation, prevention |
High-Performance Computing
| Document | Description | Key Topics |
|---|---|---|
| Rust Data Engineering | Rust for data pipelines | Performance, memory safety, use cases |
Batch vs. Streaming Decision Matrix
Cost Considerations
| Processing Type | Compute Cost (indicative) | Complexity | Operational Profile |
|---|---|---|---|
| Batch (Spark) | $0.50-2.00/TB | Low | Spot-instance friendly |
| Streaming (Flink) | $1.00-3.00/TB | High | Always running |
| Micro-batch (Spark Streaming) | $0.75-2.50/TB | Medium | Periodic execution |
| Real-time (ksqlDB) | $2.00-5.00/TB | Very High | Complex operations |
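To put these figures in context: at 10 TB/day (roughly 300 TB/month), the table implies about $150-600/month of compute for batch Spark versus $300-900/month for always-on Flink, before counting the operational effort of keeping a streaming job healthy around the clock.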
Key Concepts
Batch Processing
Characteristics:
- High latency (minutes to hours)
- High throughput
- Fault-tolerant (failed jobs can simply be rerun over the same input)
- Cost-effective (tolerates interruption, so spot instances work well; see the sketch below)
Best for:
- Historical data processing
- ETL pipelines
- Data warehousing
- Machine learning training
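To make the batch pattern concrete, here is a minimal PySpark sketch of a partition-at-a-time ETL job. The bucket paths and column names (`user_id`, `amount`, `event_date`) are hypothetical stand-ins, not references to any specific dataset:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-batch-etl").getOrCreate()

# Read one day's raw partition (path and schema are illustrative).
orders = spark.read.parquet("s3://example-bucket/raw/orders/event_date=2024-01-01/")

# Aggregate: total spend and order count per user for that day.
daily_revenue = orders.groupBy("user_id").agg(
    F.sum("amount").alias("total_amount"),
    F.count("*").alias("order_count"),
)

# Overwrite only the partition being processed, so a failed or
# interrupted run can simply be rerun end to end.
daily_revenue.write.mode("overwrite").parquet(
    "s3://example-bucket/curated/daily_revenue/event_date=2024-01-01/"
)
```

Because each run is idempotent over a single input partition, the job tolerates spot-instance interruption: if it dies, rerun it.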
Streaming Processing
Characteristics:
- Low latency (milliseconds to seconds)
- Continuous processing
- Complex state management
- Higher operational cost (illustrated in the sketch below)
Best for:
- Real-time analytics
- Fraud detection
- Live dashboards
- Real-time personalization
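For contrast, a minimal Spark Structured Streaming sketch of the same per-user aggregation over a Kafka topic. It assumes the `spark-sql-kafka` connector is on the classpath; the broker address, topic name, and JSON fields are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-aggregation").getOrCreate()

# Kafka source: each record's value is assumed to be a JSON payload.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "payments")
    .load()
)

events = raw.select(
    F.get_json_object(F.col("value").cast("string"), "$.user_id").alias("user_id"),
    F.get_json_object(F.col("value").cast("string"), "$.amount").cast("double").alias("amount"),
    F.col("timestamp"),
)

# 1-minute tumbling windows; the watermark bounds state by dropping
# events that arrive more than 5 minutes late.
totals = (
    events.withWatermark("timestamp", "5 minutes")
    .groupBy(F.window("timestamp", "1 minute"), "user_id")
    .agg(F.sum("amount").alias("total_amount"))
)

# Checkpointing makes the stateful aggregation recoverable on restart.
query = (
    totals.writeStream.outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/streaming-aggregation")
    .start()
)
query.awaitTermination()
```

The batch version of this aggregation is a one-line `groupBy`; everything the streaming version adds (watermark, checkpoint, an always-running query) is exactly the operational overhead reflected in the cost table above.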
Learning Objectives
After completing this module, you will:
- Choose batch vs. streaming: Make architecture decisions based on latency requirements
- Select streaming platforms: Kafka vs. Pulsar vs. Kinesis for production workloads
- Select streaming engines: Flink vs. Spark Structured Streaming based on use case
- Implement windowing: Tumbling, sliding, session windows with watermarks (see the sketch after this list)
- Manage state: State backends, checkpointing, savepoints, TTL
- Handle backpressure: Detection, mitigation, prevention strategies
- Optimize batch processing: PySpark patterns, Pandas optimization, DataFrame vs RDD
- Understand high-performance computing: When to use Rust for data pipelines
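As a preview of the windowing objective, here are the three window types expressed in PySpark. The DataFrame and its column names are made up for illustration; the same window expressions apply unchanged to streaming DataFrames:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("windowing-demo").getOrCreate()

# Tiny batch DataFrame standing in for an event stream (values are made up).
events = spark.createDataFrame(
    [("u1", "2024-01-01 00:01:00", 5.0),
     ("u1", "2024-01-01 00:07:00", 3.0),
     ("u2", "2024-01-01 00:12:00", 9.0)],
    ["user_id", "ts", "amount"],
).withColumn("ts", F.to_timestamp("ts"))

# Tumbling: fixed, non-overlapping 10-minute buckets.
tumbling = events.groupBy(F.window("ts", "10 minutes"), "user_id").sum("amount")

# Sliding: 10-minute windows advancing every 5 minutes, so one event
# can land in more than one window.
sliding = events.groupBy(F.window("ts", "10 minutes", "5 minutes"), "user_id").sum("amount")

# Session: a window closes after a 5-minute gap of inactivity (Spark 3.2+).
sessions = events.groupBy(F.session_window("ts", "5 minutes"), "user_id").sum("amount")

tumbling.show(truncate=False)
```

On a streaming DataFrame you would add `withWatermark("ts", ...)` before the `groupBy` so the engine knows when a window can be finalized and its state dropped.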
Next Steps
- Study Batch Processing for ETL patterns
- Study Streaming for real-time patterns
- Review Windowing Strategies for streaming aggregations
- Explore High-Performance Computing for specialized use cases
Estimated Time to Complete Module 2: 8-10 hours