Apache Kafka Guide

Distributed Event Streaming Platform


Overview

Apache Kafka is a distributed event streaming platform capable of handling trillions of events per day. It is the de facto standard for event streaming, proven at extreme scale at companies like LinkedIn (where it originated), Uber, and Netflix, as well as thousands of others.


Core Architecture

Key Concepts

Concept          Description
Producer         Application that publishes events to Kafka
Consumer         Application that reads events from Kafka
Topic            Logical channel for events
Partition        Unit of parallelism within a topic
Broker           A Kafka server in the cluster
Consumer Group   Set of consumers that coordinate consumption
Offset           A consumer's position within a partition
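
Keys drive partition placement: the default partitioner hashes the message key, so every event with the same key lands on the same partition and stays ordered relative to its peers. A minimal sketch of the idea, with crc32 standing in for the client's murmur2 hash:

import zlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in hash; the real default partitioner uses murmur2
    return zlib.crc32(key) % num_partitions

# Every 'user-42' event maps to the same partition, preserving its order
assert pick_partition(b'user-42', 10) == pick_partition(b'user-42', 10)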

Producer Patterns

Producer Configuration

from kafka import KafkaProducer
import json

# Configure producer for reliability
producer = KafkaProducer(
    bootstrap_servers=['kafka1:9092', 'kafka2:9092', 'kafka3:9092'],
    # Delivery semantics
    acks='all',                   # Wait for all in-sync replicas
    # Compression
    compression_type='snappy',
    # Batching
    linger_ms=10,                 # Wait up to 10 ms to fill a batch
    batch_size=32768,             # 32 KB batches
    # Reliability (requires a kafka-python release with idempotence support)
    enable_idempotence=True,
    retries=3,
    max_in_flight_requests_per_connection=5
)

# Send a message (asynchronous; send() returns a future)
def send_event(topic: str, key: str, value: dict):
    producer.send(
        topic,
        key=key.encode('utf-8'),
        value=json.dumps(value).encode('utf-8')
    )

# Flush buffered messages on shutdown
producer.flush()
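
When a write must be confirmed before proceeding, the future returned by send() can be resolved synchronously; a minimal sketch using kafka-python's documented pattern ('events' is a hypothetical topic):

from kafka.errors import KafkaError

future = producer.send('events', key=b'user-42', value=b'{"action": "login"}')
try:
    # Block until the broker acknowledges (or the timeout expires)
    metadata = future.get(timeout=10)
    print(f'Delivered to {metadata.topic}[{metadata.partition}] at offset {metadata.offset}')
except KafkaError as exc:
    # Delivery failed after all retries; log or divert to a dead-letter store
    print(f'Send failed: {exc}')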

Exactly-Once Semantics

# Enable idempotence: exactly-once delivery per partition
# within a single producer session
producer = KafkaProducer(
    bootstrap_servers=['kafka1:9092'],
    enable_idempotence=True,
    acks='all'
)

# With idempotence, retries are safe: if the producer resends after a
# missed ack, the broker deduplicates by sequence number, so the
# retry will not create duplicates
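
Note that idempotence alone covers a single producer session. End-to-end exactly-once across topics (consume-transform-produce) additionally requires the transactions API, which kafka-python has not traditionally exposed; the sketch below uses the confluent-kafka client instead, with an illustrative transactional.id and topic:

# Hedged sketch: transactional producer via confluent-kafka
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'kafka1:9092',
    'transactional.id': 'my-transactional-app',  # enables transactions
    'enable.idempotence': True,
})

producer.init_transactions()
producer.begin_transaction()
producer.produce('events', key=b'user-42', value=b'{"action": "login"}')
producer.commit_transaction()  # or abort_transaction() on failure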

Consumer Patterns

Consumer Group Configuration

from kafka import KafkaConsumer
import json

# Configure consumer
consumer = KafkaConsumer(
    bootstrap_servers=['kafka1:9092', 'kafka2:9092', 'kafka3:9092'],
    group_id='my_consumer_group',
    auto_offset_reset='earliest',   # or 'latest'
    enable_auto_commit=True,
    auto_commit_interval_ms=1000,
    # Performance tuning
    fetch_min_bytes=1024,
    fetch_max_wait_ms=500,
    max_poll_records=500
)

# Subscribe to topic
consumer.subscribe(['my_topic'])

# Consume messages (offsets are committed automatically every second,
# so no explicit commit is needed here)
for message in consumer:
    event = json.loads(message.value.decode('utf-8'))
    process_event(event)
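
Consumers should also shut down cleanly so final offsets are committed and the group rebalances promptly. A minimal sketch using poll() instead of the iterator, with a hypothetical SIGTERM flag:

import json
import signal

running = True

def handle_sigterm(signum, frame):
    global running
    running = False

signal.signal(signal.SIGTERM, handle_sigterm)

try:
    while running:
        # poll() returns {TopicPartition: [messages]} and respects max_poll_records
        records = consumer.poll(timeout_ms=1000)
        for tp, messages in records.items():
            for message in messages:
                process_event(json.loads(message.value.decode('utf-8')))
finally:
    # Commits final offsets (with auto-commit) and leaves the group cleanly
    consumer.close()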

Manual Offset Control

# Disable auto-commit for manual control
consumer = KafkaConsumer(
    bootstrap_servers=['kafka1:9092'],
    group_id='manual_control_group',
    enable_auto_commit=False,       # Manual commits
    auto_offset_reset='earliest'
)

# Process and commit manually (at-least-once: commit only after success)
for message in consumer:
    event = json.loads(message.value.decode('utf-8'))
    if process_event(event):
        # Commit only if processing succeeded
        consumer.commit()
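
For finer control you can commit a specific offset per partition (always the offset of the *next* message to read). A hedged sketch assuming kafka-python's TopicPartition and OffsetAndMetadata types; recent releases add a leader_epoch field to OffsetAndMetadata, so the constructor arity varies by version:

from kafka import TopicPartition
from kafka.structs import OffsetAndMetadata

for message in consumer:
    event = json.loads(message.value.decode('utf-8'))
    if process_event(event):
        tp = TopicPartition(message.topic, message.partition)
        # Commit points at the next offset to consume (older 2-field form)
        consumer.commit({tp: OffsetAndMetadata(message.offset + 1, None)})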

Kafka Configuration

Broker Configuration

server.properties
# Cluster identification (ZooKeeper mode shown; KRaft clusters set
# controller.quorum.voters instead of zookeeper.connect)
broker.id=1
listeners=PLAINTEXT://kafka1:9092
advertised.listeners=PLAINTEXT://kafka1:9092
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
# Log configuration
log.dirs=/var/kafka/data
num.partitions=3
default.replication.factor=3
# Performance
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
# Replication
replica.fetch.max.bytes=10485760
replica.fetch.max.wait.ms=500

Topic Configuration

# Create a topic with specific configuration
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic events \
  --partitions 10 \
  --replication-factor 3 \
  --config max.message.bytes=10485760 \
  --config retention.ms=604800000    # 7 days

# Alter topic configuration
kafka-configs.sh \
  --bootstrap-server localhost:9092 \
  --entity-type topics \
  --entity-name events \
  --alter \
  --add-config retention.ms=2592000000    # 30 days
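
The same operations can be done programmatically; a minimal sketch using kafka-python's admin client, mirroring the CLI example above:

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
admin.create_topics([
    NewTopic(
        name='events',
        num_partitions=10,
        replication_factor=3,
        topic_configs={
            'max.message.bytes': '10485760',
            'retention.ms': '604800000',  # 7 days
        },
    )
])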

Performance Tuning

Producer Tuning

# High-throughput producer
producer = KafkaProducer(
    bootstrap_servers=['kafka1:9092'],
    # Batching
    linger_ms=20,                 # Wait longer to fill bigger batches
    batch_size=65536,             # 64 KB batches
    # Compression
    compression_type='lz4',       # Typically faster than snappy at similar ratios
    # Buffering
    buffer_memory=67108864,       # 64 MB send buffer
    # Acknowledgment
    acks=1,                       # Leader only (faster, less durable)
    max_in_flight_requests_per_connection=5
)

Consumer Tuning

# High-throughput consumer
consumer = KafkaConsumer(
    bootstrap_servers=['kafka1:9092'],
    group_id='high_throughput_group',
    # Fetch optimization
    fetch_min_bytes=102400,       # 100 KB minimum fetch
    fetch_max_bytes=10485760,     # 10 MB maximum fetch
    fetch_max_wait_ms=100,
    # Group liveness
    session_timeout_ms=30000,
    heartbeat_interval_ms=3000,
    # Records per poll
    max_poll_records=1000
)
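
If downstream processing falls behind, the consumer can pause fetching without leaving the group (leaving would trigger a rebalance); a minimal sketch, where drain_backlog() is a hypothetical stand-in for working off buffered events:

# Heartbeats continue while paused, so the group does not rebalance
partitions = consumer.assignment()
consumer.pause(*partitions)
drain_backlog()              # hypothetical: work through the local backlog
consumer.resume(*partitions)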

Monitoring

Critical Metrics

kafka_metrics:
  broker_metrics:
    - name: "Under-replicated partitions"
      metric: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
      alert: "If > 0"
    - name: "Offline partitions count"
      metric: kafka.controller:type=KafkaController,name=OfflinePartitionsCount
      alert: "If > 0"
    - name: "Network processor idle"
      metric: kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent
      alert: "If < 0.3 (network threads saturated)"
  producer_metrics:
    - name: "Producer retry rate"
      metric: producer.record-retry-rate
      alert: "If > 0.05"
    - name: "Request latency"
      metric: producer.request-latency-avg
      alert: "If > 100ms"
  consumer_metrics:
    - name: "Consumer lag"
      metric: consumer.records-lag-max
      alert: "If > 100000"
    - name: "Fetch latency"
      metric: consumer.fetch-latency-avg
      alert: "If > 1000ms"
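
Consumer lag can also be measured client-side as the broker's log-end offset minus the consumer's current position; a minimal sketch with kafka-python:

# Lag per partition = log-end offset on the broker - consumer's position
partitions = consumer.assignment()
end_offsets = consumer.end_offsets(partitions)
for tp in partitions:
    lag = end_offsets[tp] - consumer.position(tp)
    print(f'{tp.topic}[{tp.partition}] lag={lag}')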

Cost Optimization

Cost Optimization Strategies

kafka_cost_optimization:
  storage:
    - "Use appropriate retention (don't keep data forever)"
    - "Use log compaction where only the latest value per key matters"
    - "Delete test data promptly"
  compute:
    - "Use spot instances for stateless consumers"
    - "Right-size the cluster (avoid over-provisioning)"
  network:
    - "Colocate producers and consumers with the cluster (cross-AZ traffic costs)"
    - "Use compression (snappy, lz4, zstd)"
    - "Batch messages to reduce network calls"

Security

SSL/SASL Configuration

# Secure producer (TLS encryption + SASL authentication)
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka1:9093'],   # TLS listener port
    security_protocol='SASL_SSL',        # 'SSL' alone would skip SASL auth
    ssl_cafile='/path/to/ca.crt',
    ssl_certfile='/path/to/client.crt',
    ssl_keyfile='/path/to/client.key',
    sasl_mechanism='PLAIN',
    sasl_plain_username='producer_user',
    sasl_plain_password='producer_pass'
)

ACL (Access Control Lists)

# Add ACL for producer
kafka-acls.sh \
  --bootstrap-server localhost:9092 \
  --add \
  --allow-principal User:producer \
  --operation Write \
  --topic events

# Add ACL for consumer (consumers also need Read on their group)
kafka-acls.sh \
  --bootstrap-server localhost:9092 \
  --add \
  --allow-principal User:consumer \
  --operation Read \
  --topic events \
  --group my_consumer_group

Key Takeaways

  1. Proven at scale: Handles trillions of events per day
  2. Exactly-once: enable_idempotence for safe retries; transactions for end-to-end
  3. Performance tuning: Batch size, linger, compression
  4. Replication: Factor of 3 for durability
  5. Monitoring: Track lag, under-replicated partitions
  6. Security: SSL/SASL for production
  7. Cost: Spot instances for consumers, colocate components
  8. Consumer groups: Enable parallel processing

Next: Apache Pulsar Guide