Module 2: Computing & Processing


Overview

This module covers data processing patterns, from batch ETL to real-time streaming to high-performance computing with Rust. Understanding when to use batch vs. streaming, and selecting the right processing engine for each workload, is critical for Principal-level architecture decisions.


Module Contents

Batch Processing

| Document | Description | Key Topics |
| --- | --- | --- |
| PySpark Patterns | Common PySpark patterns | Joins, aggregations, UDFs, optimization |
| Pandas Optimization | Pandas at scale | Vectorization, chunking, alternatives |
| DataFrames vs RDD | API comparison | When to use which |

Streaming & Near-Realtime

| Document | Description | Key Topics |
| --- | --- | --- |
| Streaming Platforms Comparison | Platform comparison | Architecture, performance, cost |
| Kafka Guide | Apache Kafka deep dive | Producer/consumer patterns, security, cost |
| Pulsar Guide | Apache Pulsar deep dive | Layered architecture, geo-replication, tiering |
| Kinesis Guide | AWS Kinesis guide | Managed streaming, Firehose, Analytics |
| Streaming Engines Comparison | Engine comparison | Latency, state, operations |
| Flink Deep Dive | Apache Flink guide | DataStream API, state backends, savepoints |
| Spark Structured Streaming Deep Dive | Spark Streaming guide | Micro-batch, watermarks, joins |
| Windowing Strategies | Windowing patterns | Tumbling, sliding, session, watermarks |
| State Management | Stateful processing | State backends, checkpointing, TTL |
| Backpressure Handling | Flow control | Detection, mitigation, prevention |

High-Performance Computing

| Document | Description | Key Topics |
| --- | --- | --- |
| Rust Data Engineering | Rust for data pipelines | Performance, memory safety, use cases |

Batch vs. Streaming Decision Matrix

| Dimension | Batch | Streaming |
| --- | --- | --- |
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high | Continuous |
| Fault tolerance | Reprocess from checkpoint | Checkpointed state recovery |
| Cost profile | Lower; spot-instance friendly | Higher; always running |
| Typical workloads | ETL, data warehousing, ML training | Fraud detection, live dashboards, real-time personalization |

Cost Considerations

| Processing Type | Compute Cost | Complexity | Operations |
| --- | --- | --- | --- |
| Batch (Spark) | $0.50-2.00/TB | Low | Spot instances friendly |
| Streaming (Flink) | $1.00-3.00/TB | High | Always running |
| Micro-batch (Spark Streaming) | $0.75-2.50/TB | Medium | Periodic execution |
| Real-time (ksqlDB) | $2.00-5.00/TB | Very High | Complex ops |

Key Concepts

Batch Processing

Characteristics:

  • High latency (minutes to hours)
  • High throughput
  • Fault-tolerant (reprocess from checkpoint)
  • Cost-effective (use spot instances)

Best for:

  • Historical data processing
  • ETL pipelines
  • Data warehousing
  • Machine learning training
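
To make the pattern concrete, here is a minimal PySpark sketch of a daily batch ETL job. The bucket paths, schema, and column names are hypothetical placeholders, not references to this module's materials.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical daily batch ETL job: read one day of raw events,
# aggregate purchases per user, and write a curated partition.
# Bucket paths and column names are illustrative placeholders.
spark = SparkSession.builder.appName("daily-batch-etl").getOrCreate()

events = spark.read.parquet("s3://example-bucket/raw/events/date=2024-01-01/")

daily_totals = (
    events
    .filter(F.col("event_type") == "purchase")
    .groupBy("user_id")
    .agg(
        F.count("*").alias("purchase_count"),
        F.sum("amount").alias("total_amount"),
    )
)

# Overwriting the output partition makes the job idempotent: if a spot
# instance is reclaimed mid-run, the job is simply rerun from source.
daily_totals.write.mode("overwrite").parquet(
    "s3://example-bucket/curated/daily_totals/date=2024-01-01/"
)
```

Because the job reads immutable input and overwrites its own output partition, a run interrupted by a reclaimed spot instance can simply be restarted from source, which is what makes batch so spot-friendly.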

Streaming Processing

Characteristics:

  • Low latency (milliseconds to seconds)
  • Continuous processing
  • Complex state management
  • Higher operational cost

Best for:

  • Real-time analytics
  • Fraud detection
  • Live dashboards
  • Real-time personalization
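
For contrast, here is a minimal Spark Structured Streaming sketch of a continuously running aggregation. The Kafka broker address, topic name, and JSON field are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical continuously running job: consume events from Kafka and
# maintain a live count per event type. Broker address, topic name, and
# the JSON field are illustrative placeholders.
spark = SparkSession.builder.appName("streaming-counts").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

counts = (
    raw
    .select(
        F.get_json_object(F.col("value").cast("string"), "$.event_type")
        .alias("event_type")
    )
    .groupBy("event_type")
    .count()
)

# The checkpoint location persists streaming state so the query can
# recover after failure: the state-management cost noted above.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/streaming-counts")
    .start()
)
query.awaitTermination()
```

Unlike the batch job, this query never finishes; the cluster stays up continuously, which is the cost and operational profile shown in the table above.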

Learning Objectives

After completing this module, you will:

  1. Choose batch vs. streaming: Make architecture decisions based on latency requirements
  2. Select streaming platforms: Kafka vs. Pulsar vs. Kinesis for production workloads
  3. Select streaming engines: Flink vs. Spark Structured Streaming based on use case
  4. Implement windowing: Tumbling, sliding, session windows with watermarks (see the sketch after this list)
  5. Manage state: State backends, checkpointing, savepoints, TTL
  6. Handle backpressure: Detection, mitigation, prevention strategies
  7. Optimize batch processing: PySpark patterns, Pandas optimization, DataFrames vs RDD
  8. Understand high-performance computing: When to use Rust for data pipelines
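
As a worked instance of objective 4, the sketch below shows a tumbling window with a watermark in Spark Structured Streaming; the `events` streaming DataFrame and its columns are assumptions for illustration.

```python
from pyspark.sql import functions as F

# Hypothetical tumbling-window aggregation: count events per type in
# 5-minute windows. `events` is assumed to be a streaming DataFrame
# with `event_time` (timestamp) and `event_type` columns.
windowed = (
    events
    # Drop records more than 10 minutes behind the latest observed
    # event time; the watermark also bounds how much state is retained.
    .withWatermark("event_time", "10 minutes")
    .groupBy(
        F.window("event_time", "5 minutes"),  # tumbling window
        F.col("event_type"),
    )
    .count()
)

# Variants covered in Windowing Strategies:
#   sliding:              F.window("event_time", "10 minutes", "5 minutes")
#   session (Spark 3.2+): F.session_window("event_time", "30 minutes")
```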

Module Dependencies


Next Steps

  1. Study Batch Processing for ETL patterns
  2. Study Streaming for real-time patterns
  3. Review Windowing Strategies for streaming aggregations
  4. Explore High-Performance Computing for specialized use cases

Estimated Time to Complete Module 2: 8-10 hours