Streaming Platforms Comparison

Kafka vs. Pulsar vs. Kinesis

Overview

This document compares three major streaming platforms: Apache Kafka, Apache Pulsar, and Amazon Kinesis. The choice determines the cost, operational burden, and cloud portability of a real-time data architecture.

Quick Comparison Matrix

Feature	Kafka	Pulsar	Kinesis
Architecture	Log-based	Layered (BookKeeper)	Cloud service
Deployment	Self-hosted or Confluent Cloud	Self-hosted or StreamNative Cloud	AWS only
Scaling	Rebalancing	Independent broker/storage scaling	Auto-scaling (on-demand mode)
Retention	Configurable	Tiered storage	1-365 days
Protocol	Kafka binary protocol over TCP	Pulsar binary protocol over TCP; Kafka and MQTT via protocol handlers	HTTPS API via AWS SDK
Cost	Free (self-hosted)	Free (self-hosted)	Per-GB/Shard
Ecosystem	Largest	Growing	AWS-focused

Architecture Comparison

Kafka Architecture

Key characteristics:

Log-based storage (append-only)
Partitions for parallelism
Consumer groups for scalability
Originally required ZooKeeper; KRaft mode removes that dependency
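
To make the partition and consumer-group bullets concrete, here is a minimal sketch in plain Python with no Kafka client involved. The names `partition_for` and `assign_partitions` are hypothetical, and CRC32 is only a stand-in hash: Kafka's default partitioner uses murmur2, and its real assignors are more elaborate than this round-robin.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition. CRC32 is a stand-in hash for illustration."""
    return zlib.crc32(key) % num_partitions


def assign_partitions(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions round-robin over the members of one consumer group."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


# The same key always maps to the same partition, which preserves per-key order.
assert partition_for(b"user-42", 6) == partition_for(b"user-42", 6)

# Three consumers share six partitions evenly.
print(assign_partitions(6, ["c1", "c2", "c3"]))  # {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}

# With more consumers than partitions, the extra consumer gets nothing to read.
print(assign_partitions(2, ["c1", "c2", "c3"]))  # {'c1': [0], 'c2': [1], 'c3': []}
```

The second call shows why consumer parallelism within one group is capped by the partition count.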

Pulsar Architecture

Key characteristics:

Layered architecture (compute and storage separated)
BookKeeper for durable storage
Built-in geo-replication
Functions for serverless compute

Kinesis Architecture

Key characteristics:

Fully managed AWS service
Shard-based scaling
KCL (Kinesis Client Library) for consumers
Firehose for direct delivery to S3, Redshift, etc.

Deep Dive by Platform

Apache Kafka

Strengths:

Largest ecosystem and community
Proven at extreme scale (trillions of events/day)
Excellent performance (millions of messages/sec)
Strong durability guarantees (replication)
Broad language support

Weaknesses:

Operational complexity (self-hosted)
Rebalancing can be disruptive
No built-in compute (need Kafka Streams/Flink)
ZooKeeper dependency (in deployments not yet on KRaft)

Best for:

Multi-cloud or on-prem deployments
Maximum ecosystem compatibility
Extreme scale requirements
Control over infrastructure

Cost model:

Self-hosted: Infrastructure costs only
Confluent Cloud: $0.50-2.00 per GB + cluster fees

Apache Pulsar

Strengths:

Layered architecture (independent scaling of compute/storage)
Built-in geo-replication
Tiered storage (hot data in BookKeeper → cold data offloaded to object storage such as S3)
Multi-protocol (Kafka compatible, MQTT, JMS)
Serverless functions

Weaknesses:

Smaller community than Kafka
More complex architecture
Less mature tooling
Fewer third-party integrations

Best for:

Geo-distributed deployments
Multi-tenant environments
Need for built-in compute
Cloud-native deployments

Cost model:

Self-hosted: Infrastructure costs only
StreamNative Cloud: Similar to Confluent Cloud

Amazon Kinesis

Strengths:

Fully managed (zero ops)
Auto-scaling (in on-demand capacity mode)
AWS integration (Lambda, Firehose, Analytics)
Simple pricing model
High availability built-in

Weaknesses:

AWS lock-in
Limited retention (1-365 days)
Shard management complexity (in provisioned capacity mode)
Less flexible than Kafka/Pulsar
Higher cost at scale

Best for:

AWS-centric workloads
Simple real-time pipelines
Quick prototyping
Teams wanting managed service

Cost model:

Data Streams: $0.015/GB + $0.012/Shard/hour
Firehose: $0.029/GB + $0.001-0.002/PUT
Typical: $50-200/TB processed

Performance Comparison

Throughput

Metric	Kafka	Pulsar	Kinesis
Producer throughput	100-200 MB/s/broker	50-100 MB/s/broker	1 MB/s/shard
Consumer throughput	200-400 MB/s/broker	100-200 MB/s/broker	2 MB/s/shard
Latency (p99)	10-50ms	20-100ms	50-200ms
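
The per-shard limits in the table translate directly into a shard count. The following is a rough sizing sketch, not a capacity plan. It uses the 1 MB/s write figure from the table plus AWS's per-shard quota of 1,000 records/s, which is an assumption brought in from outside this document. `shards_needed` is a hypothetical helper, and the workload is an illustrative 1 TB/day at 10K messages/sec in decimal units.

```python
import math

SHARD_WRITE_MB_PER_S = 1.0         # from the throughput table above
SHARD_WRITE_RECORDS_PER_S = 1_000  # AWS per-shard write quota (assumption: not stated in this document)


def shards_needed(mb_per_s: float, records_per_s: float) -> int:
    """Return the minimum shard count that satisfies both write limits."""
    by_bytes = math.ceil(mb_per_s / SHARD_WRITE_MB_PER_S)
    by_records = math.ceil(records_per_s / SHARD_WRITE_RECORDS_PER_S)
    return max(by_bytes, by_records, 1)


# 1 TB/day expressed as an average rate (1 TB = 1,000,000 MB; 86,400 s per day)
mb_per_s = 1_000_000 / 86_400  # about 11.6 MB/s

print(shards_needed(mb_per_s, 10_000))  # 12
```

This is an average-rate estimate; bursty traffic needs headroom on top of it.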

Scaling

Scenario	Kafka	Pulsar	Kinesis
Vertical scaling	Limited	Better	Automatic
Horizontal scaling	Manual rebalance	Auto	Auto (on-demand mode) or manual resharding
Geo-replication	MirrorMaker (complex)	Built-in	Cross-region replication

Selection Framework

Decision Guide

Scenario	Recommended	Rationale
AWS-only, simple	Kinesis	Managed, AWS integration
Multi-cloud	Kafka or Pulsar	Cloud-agnostic
Maximum ecosystem	Kafka	Largest community
Geo-replication	Pulsar	Built-in, easy
Serverless compute	Pulsar Functions	Built-in
Cost-sensitive	Kafka (self-hosted)	No premium pricing
Operations-averse	Kinesis or Confluent Cloud	Fully managed
Extreme scale	Kafka	Proven at scale

Cost Comparison

Example: 1TB/day, 10K messages/sec

Platform	Monthly Cost	Notes
Kafka (self-hosted)	$500-1,000	Infrastructure only
Kafka (Confluent Cloud)	$3,000-5,000	Premium pricing
Pulsar (self-hosted)	$500-1,000	Infrastructure only
Pulsar (StreamNative Cloud)	$3,000-5,000	Similar to Confluent
Kinesis	$2,000-4,000	AWS pricing

Note: Self-hosting adds operational overhead (roughly $5K/month in engineering time for a small team), which the infrastructure-only figures above do not include.
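
Folding that engineering overhead into the table changes the ranking. The arithmetic sketch below uses only the figures above. Treating managed services as having no extra engineering cost is a simplifying assumption, and `total_cost` is a hypothetical helper.

```python
OPS_OVERHEAD = 5_000  # USD/month, from the note above

# (low, high) monthly cost in USD, copied from the table above
platforms = {
    "Kafka (self-hosted)": (500, 1_000),
    "Kafka (Confluent Cloud)": (3_000, 5_000),
    "Pulsar (self-hosted)": (500, 1_000),
    "Pulsar (StreamNative Cloud)": (3_000, 5_000),
    "Kinesis": (2_000, 4_000),
}


def total_cost(name: str) -> tuple[int, int]:
    """Add engineering overhead to self-hosted options only (simplifying assumption)."""
    low, high = platforms[name]
    extra = OPS_OVERHEAD if "self-hosted" in name else 0
    return low + extra, high + extra


for name in platforms:
    low, high = total_cost(name)
    print(f"{name}: ${low:,}-{high:,}/month")
```

Under these assumptions the self-hosted options come to $5,500-6,000/month, above every managed option in the table.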

Migration Considerations

Kafka to Pulsar Migration

Options:

Kafka-compatible API: Pulsar can accept Kafka clients through the Kafka-on-Pulsar (KoP) protocol handler
Connector: Use Kafka-Pulsar IO connector
Rewrite: Migrate clients to Pulsar client

Kafka to Kinesis Migration

Challenges:

Different client libraries
Shard vs. partition concepts
Consumer group differences

Approach:

Dual-write during migration
Validate data parity
Switch consumers
Decommission Kafka
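
The first two steps of this approach can be sketched without either client library. `InMemorySink` is a hypothetical stand-in for a real Kafka or Kinesis producer, and the parity check is one possible design rather than a prescribed method. It compares an order-insensitive fingerprint because partitions and shards do not preserve a shared global order.

```python
import hashlib
import json


class InMemorySink:
    """Hypothetical stand-in for a Kafka or Kinesis producer; it only stores payloads."""

    def __init__(self) -> None:
        self.records: list[bytes] = []

    def send(self, payload: bytes) -> None:
        self.records.append(payload)


def dual_write(event: dict, old_sink: InMemorySink, new_sink: InMemorySink) -> None:
    """Step 1: write every event to both systems during the migration window."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    old_sink.send(payload)
    new_sink.send(payload)


def digest(sink: InMemorySink) -> str:
    """Order-insensitive fingerprint of everything a sink received."""
    hashes = sorted(hashlib.sha256(r).hexdigest() for r in sink.records)
    return hashlib.sha256("".join(hashes).encode("ascii")).hexdigest()


def in_parity(old_sink: InMemorySink, new_sink: InMemorySink) -> bool:
    """Step 2: compare counts and content before switching consumers."""
    return (
        len(old_sink.records) == len(new_sink.records)
        and digest(old_sink) == digest(new_sink)
    )


kafka, kinesis = InMemorySink(), InMemorySink()
for i in range(3):
    dual_write({"id": i, "value": i * 10}, kafka, kinesis)

print(in_parity(kafka, kinesis))  # True: both sinks hold the same three events
```

In a real migration the two writes can fail independently, so the parity check is what catches a write that reached only one side.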

Production Patterns

Kafka: Idempotent Producer

from kafka import KafkaProducer
import json

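# Enable idempotence so the broker drops duplicates caused by producer retries.
# This gives exactly-once writes per partition, not end-to-end exactly-once
# (which also needs transactions). Assumes a kafka-python release with
# idempotent producer support; older releases reject this option.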
producer = KafkaProducer(
    bootstrap_servers=['kafka1:9092', 'kafka2:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    # Idempotence settings
    enable_idempotence=True,
    acks='all',
    max_in_flight_requests_per_connection=5,
    retries=3
)

# Send message
producer.send('topic', value={'key': 'value'})
producer.flush()
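
What idempotence buys can be shown with a toy model of the broker side. This is a simplified sketch, not Kafka's implementation: real brokers track sequence numbers per producer ID and partition and also handle producer epochs and out-of-order sequences. `BrokerPartition` is a hypothetical class.

```python
class BrokerPartition:
    """Toy model of one partition log that deduplicates by (producer id, sequence)."""

    def __init__(self) -> None:
        self.log: list[str] = []
        self.last_seq: dict[int, int] = {}  # producer id -> highest sequence appended

    def append(self, producer_id: int, seq: int, value: str) -> bool:
        """Append unless this sequence was already written. Returns True if appended."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate from a retry: do not append again
        self.log.append(value)
        self.last_seq[producer_id] = seq
        return True


partition = BrokerPartition()
partition.append(producer_id=1, seq=0, value="order-created")
# The acknowledgement is lost, so the producer retries the same message.
partition.append(producer_id=1, seq=0, value="order-created")
partition.append(producer_id=1, seq=1, value="order-paid")

print(partition.log)  # ['order-created', 'order-paid']
```

Without the sequence check, the retry would leave two copies of `order-created` in the log.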

Pulsar: Reader API

import pulsar

# Create client
client = pulsar.Client('pulsar://localhost:6650')

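# Create reader (start from the earliest available message)
reader = client.create_reader(
    'topic',
    start_message_id=pulsar.MessageId.earliest,
    reader_name='my-reader'
)

# Read messages. A reader tracks its own position and has no
# acknowledge() method, so messages are not acknowledged here.
try:
    while True:
        msg = reader.read_next()
        print(msg.data())
finally:
    # Release the reader and the connection on exit (e.g. Ctrl+C)
    reader.close()
    client.close()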

Kinesis: Enhanced Fan-out

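import boto3
import time

# Create Kinesis client
kinesis = boto3.client('kinesis')

# Register an enhanced fan-out consumer (dedicated throughput per consumer)
response = kinesis.register_stream_consumer(
    StreamARN='arn:aws:kinesis:...',
    ConsumerName='my-consumer'
)
consumer_arn = response['Consumer']['ConsumerARN']

# Registration is asynchronous: wait until the consumer is ACTIVE
while kinesis.describe_stream_consumer(
    ConsumerARN=consumer_arn
)['ConsumerDescription']['ConsumerStatus'] != 'ACTIVE':
    time.sleep(1)

# Subscribe to a shard. subscribe_to_shard takes the consumer ARN (not the
# stream ARN) and requires a starting position.
subscription = kinesis.subscribe_to_shard(
    ConsumerARN=consumer_arn,
    ShardId='shardId-000000000000',
    StartingPosition={'Type': 'LATEST'}
)

# Records arrive as events on the response stream
for event in subscription['EventStream']:
    for record in event['SubscribeToShardEvent']['Records']:
        print(record['Data'])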

Senior Level Considerations

Scalability Limits

Platform	Max Producers	Max Consumers	Max Throughput
Kafka	Unlimited	Limited by partitions	100+ GB/s per cluster
Pulsar	Unlimited	Unlimited	100+ GB/s per cluster
Kinesis	Limited by shards	Limited by fan-out	1 MB/s write, 2 MB/s read per shard

Operational Complexity

Platform	Ops Complexity	Why
Kafka (self-hosted)	High	Broker management, ZooKeeper or KRaft controllers, rebalancing
Pulsar (self-hosted)	Medium-High	Brokers + BookKeeper + metadata store (ZooKeeper)
Confluent Cloud	Low	Fully managed
Kinesis	Very Low	AWS managed

Monitoring Requirements

All platforms require:

Consumer lag metrics
Throughput monitoring
Error rate tracking
Consumer group health

Tools:

Kafka: Burrow, Kafka Exporter
Pulsar: Pulsar Manager, Prometheus
Kinesis: CloudWatch metrics
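
Consumer lag, the first metric in the list above, is simple arithmetic once the offsets are known. The sketch below uses hypothetical offset values rather than querying a live cluster. `consumer_lag` is an illustrative helper, and treating a partition with no committed offset as starting from 0 is an assumption.

```python
def consumer_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Per-partition lag: newest offset in the log minus the group's committed offset.

    A partition with no committed offset is treated as starting from 0 (assumption).
    """
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}


# Hypothetical snapshot for a three-partition topic
end_offsets = {0: 1_500, 1: 1_480, 2: 2_000}
committed = {0: 1_500, 1: 1_400, 2: 1_200}

lag = consumer_lag(end_offsets, committed)
print(lag)                # {0: 0, 1: 80, 2: 800}
print(sum(lag.values()))  # 880
```

Tracking lag per partition rather than only the total makes a single stuck partition visible, as with partition 2 here. On Kinesis the closest equivalent is the iterator age metric in CloudWatch, which measures lag in time rather than offsets.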

Key Takeaways

Kafka: Ecosystem leader, self-hosted or Confluent Cloud
Pulsar: Geo-replication, layered architecture, multi-protocol
Kinesis: AWS managed, simple, expensive at scale
Cost: Self-hosted has the lowest infrastructure bill; managed services cost roughly 3-5x more before engineering time is counted
Selection: Cloud strategy drives decision more than features
Migration: Plan for dual-write period during migration
Monitoring: All platforms require comprehensive monitoring
