System Design Framework
Architecture Interviews for Principal-Level Roles
Overview
The system design framework is your blueprint for tackling architecture interview questions. At the Principal level, every design must consider scale (TB/PB), cost optimization, and operational excellence. This framework ensures you cover all critical aspects systematically.
The 7-Step Framework
Step 1: Clarify Requirements
Functional Requirements
Questions to ask:
- What is the primary use case?
- Who are the users (internal, external, both)?
- What are the success metrics?
- What are the key features?
Non-Functional Requirements
Example Requirements:
- Scale: 100TB data, 10K QPS, growing 2x/year
- Latency: Sub-second for dashboard, 24 hours for reports
- Availability: 99.9% uptime (8.76 hours downtime/year)
- Consistency: Eventual acceptable for analytics
- Cost: $50K/month budget
Step 2: Estimate Scale
Back-of-the-Envelope Calculations
Example: Event tracking system
Storage Calculation:
Daily: 1M events × 1KB = 1GB/day
Monthly: 30GB
Annual: 365GB
With compression (10x): 36.5GB/year

Throughput Calculation:
Average: 1M/day / 86,400s = 11.5 events/sec
Peak: 10x = 115 events/sec
Buffer: 100x = 1,150 events/sec
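These numbers are simple enough to script, which makes it easy to re-run the estimate when the interviewer changes an assumption. A minimal Python sketch using the figures above (event size, compression ratio, and peak/buffer factors are the stated assumptions):

```python
# Back-of-the-envelope sizing for the event tracking example.
# Inputs mirror the assumptions in the calculation above.
SECONDS_PER_DAY = 86_400

events_per_day = 1_000_000
bytes_per_event = 1_000        # ~1KB per event
compression_ratio = 10         # assumed 10x with a columnar format
peak_factor = 10               # peak-to-average ratio
buffer_factor = 100            # provisioning headroom

daily_gb = events_per_day * bytes_per_event / 1e9
annual_gb = daily_gb * 365
compressed_gb = annual_gb / compression_ratio

avg_eps = events_per_day / SECONDS_PER_DAY
print(f"Storage: {daily_gb:.1f} GB/day, {annual_gb:.0f} GB/year, "
      f"{compressed_gb:.1f} GB/year compressed")
print(f"Throughput: {avg_eps:.1f} avg, {avg_eps * peak_factor:.0f} peak, "
      f"{avg_eps * buffer_factor:.0f} provisioned events/sec")
```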
Step 3: High-Level Design
Architecture Diagram
Technology Selection:
- Streaming: Kafka (proven at scale)
- Processing: Flink (latency) + Spark (batch)
- Storage: S3 + Delta (cost optimization)
- Serving: Redis (hot) + Trino (analytics)
Step 4: Data Modeling
Schema Design
-- Events table (fact)
CREATE TABLE events (
  event_id        BIGINT,
  user_id         BIGINT,
  event_type      STRING,
  event_timestamp TIMESTAMP,
  properties      MAP<STRING, STRING>
)
PARTITIONED BY (event_date DATE)  -- partition column derived from event_timestamp at write time
STORED AS PARQUET;
-- User profiles (dimension)
-- user_id is the logical primary key (not enforced on Parquet/Hive tables)
CREATE TABLE user_profiles (
  user_id         BIGINT,
  user_properties MAP<STRING, STRING>,
  created_at      TIMESTAMP,
  updated_at      TIMESTAMP
)
STORED AS PARQUET;
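To show how this schema is meant to be queried, here is a hedged PySpark sketch of a typical dashboard aggregation. It assumes a SparkSession with the events table registered, and the date literal is just an example value:

```python
# Daily unique users and event counts per event type.
# Filtering on the partition column (event_date) lets the engine prune
# partitions instead of scanning the full event history.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

daily = (
    spark.table("events")
    .where(F.col("event_date") == "2024-01-15")   # example date
    .groupBy("event_type")
    .agg(F.countDistinct("user_id").alias("unique_users"),
         F.count("*").alias("events"))
)
daily.show()
```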
Partitioning Strategy
partitioning:
  strategy: "Date-based partitioning"
  primary: "event_date"
  granularity: "Daily"
  reasoning: "Time-series data, common query pattern"

z_ordering:
  columns: ["user_id", "event_type"]
  reasoning: "Multi-dimensional queries"
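As a concrete sketch of the write path implied by this strategy, assuming Delta Lake is available on the cluster and that `events_df` is a DataFrame that already carries an event_date column (OPTIMIZE ... ZORDER BY is a Delta Lake command, not generic SQL):

```python
# Write daily partitions, then Z-order files within each partition on the
# multi-dimensional query columns. `events_df` is assumed to exist already.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(events_df.write
    .format("delta")
    .mode("append")
    .partitionBy("event_date")     # daily partitions, per the strategy above
    .saveAsTable("events"))

spark.sql("OPTIMIZE events ZORDER BY (user_id, event_type)")
```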
Step 5: Deep Dive on Components
Component 1: Kafka Cluster
kafka_cluster:
  configuration:
    brokers: 12
    replication_factor: 3
    partitions: 100  # Per topic
  sizing:
    throughput: "1M messages/sec"
    latency: "< 10ms p99"
  cost: "$6,000/month (on-demand)"
  optimization:
    spot_instances: "No (requires durability)"
    compression: "ZSTD"
    retention: "7 days"
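On the producer side, the durability and compression choices above translate into a few client settings. A hedged sketch with the confluent-kafka client; broker addresses and the topic name are placeholders:

```python
# Producer tuned for durability (acks=all) and ZSTD compression,
# matching the cluster settings above.
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker-1:9092,broker-2:9092",  # placeholder brokers
    "compression.type": "zstd",   # smaller payloads at a small CPU cost
    "acks": "all",                # wait for all in-sync replicas
    "linger.ms": 10,              # small batching window for throughput
})

event = {"event_id": 1, "user_id": 42, "event_type": "page_view"}
producer.produce("events", key=str(event["user_id"]), value=json.dumps(event))
producer.flush()
```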
Component 2: Flink Cluster
flink_cluster:
  configuration:
    task_managers: 100
    slots_per_tm: 4
    parallelism: 400
  sizing:
    throughput: "1M events/sec"
    state: "RocksDB state backend"
    checkpointing: "Every 5 minutes"
  cost: "$43,200/month (70% spot)"
  optimization:
    state_backend: "RocksDB (2TB state)"
    checkpointing: "S3 with TTL 30 days"
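A minimal PyFlink sketch of the checkpointing setup described above; the state backend and checkpoint location are normally set in flink-conf.yaml rather than in code, and the bucket name is a placeholder:

```python
# Enable periodic checkpoints matching the 5-minute interval above.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(400)                 # task_managers × slots_per_tm
env.enable_checkpointing(5 * 60 * 1000)  # interval in milliseconds

# Typically paired with cluster-level config (flink-conf.yaml):
#   state.backend: rocksdb
#   state.checkpoints.dir: s3://my-checkpoint-bucket/flink/checkpoints
```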
Step 6: Bottleneck Analysis
Scaling Strategy:
- Kafka: Add partitions (horizontal scale; sized in the sketch after this list)
- Flink: Add TaskManagers (horizontal scale)
- S3: Essentially unlimited, but watch egress costs
- Redis: Add shards (horizontal scale)
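A rough sketch of the partition sizing behind the Kafka bullet above; the per-partition write rate is an assumption you should state explicitly, since real limits depend on message size, replication, and hardware:

```python
# Rough Kafka partition sizing from peak throughput.
peak_mb_per_sec = 1_000_000 * 1_000 / 1e6   # 1M msgs/sec × ~1KB ≈ 1,000 MB/s
per_partition_mb_per_sec = 10               # assumed safe write rate per partition
headroom = 1.5                              # slack for rebalancing and spikes

partitions = int(peak_mb_per_sec / per_partition_mb_per_sec * headroom)
print(f"Suggested partitions: {partitions}")  # 150 with these assumptions
```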
Step 7: Cost Optimization (The Cost Pass)
CRITICAL: Always include cost estimates in system design answers.
Storage Cost
100TB data
- Hot (S3 Standard): 100TB × $23/TB = $2,300/month
- Warm (S3 IA): 80TB × $12/TB = $960/month
- Cold (Glacier): 20TB × $4/TB = $80/month

Total: $3,340/month
Optimization: Lifecycle policies (save 30%)
Compute Cost
Flink: 100 nodes × $0.30/hour × 730 hours = $21,900/month
Spark: 200 nodes × $0.30/hour × 200 hours = $12,000/month
Kafka: 12 nodes × $0.50/hour × 730 hours = $4,380/month
Redis: 50 nodes × $0.20/hour × 730 hours = $7,300/month
Total: $45,580/month
Optimization: Spot for Flink/Spark (save 60%)
Network Cost
Data transfer: 10TB egress/month
- S3 egress: 10TB (10,000GB) × $0.09/GB = $900/month
Optimization: Colocation, minimize egress
Total Monthly Cost
Storage: $3,340
Compute: $45,580
Network: $900
Monitoring: $500
Support: $2,000
---
TOTAL: $52,320/month
ANNUAL: $627,840
Optimization: Spot instances + lifecycle = $35,000/month (33% savings)
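The whole cost pass fits in a few lines of Python, which is handy when the interviewer changes a number; unit prices and utilization figures are the same assumptions used in the calculations above:

```python
# Monthly cost model using the assumptions from the calculations above.
storage = 100 * 23 + 80 * 12 + 20 * 4       # hot + warm + cold tiers ($/month)
compute = (100 * 0.30 * 730 +               # Flink, always on
           200 * 0.30 * 200 +               # Spark, ~200 batch hours/month
           12 * 0.50 * 730 +                # Kafka
           50 * 0.20 * 730)                 # Redis
network = 10_000 * 0.09                     # 10TB egress at $0.09/GB
monitoring, support = 500, 2_000

total = storage + compute + network + monitoring + support
print(f"Monthly: ${total:,.0f}   Annual: ${total * 12:,.0f}")
# Monthly: $52,320   Annual: $627,840
```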
Trade-offs Discussion
| Decision | Option A | Option B | Trade-off |
|---|---|---|---|
| Storage | S3 Standard | S3 IA + Glacier | Cost vs. latency |
| Compute | On-demand | Spot | Reliability vs. cost |
| Consistency | Strong | Eventual | Complexity vs. usability |
| Latency | Real-time | Batch | Cost vs. user experience |
Principal-Level Answer: “I would start with S3 Standard for simplicity, then implement lifecycle policies to move older data to IA and Glacier, reducing storage costs by 30%. For compute, I’d use 70% spot instances for fault-tolerant batch workloads, with 30% on-demand for critical streaming workloads, balancing cost optimization with reliability requirements.”
Anti-Patterns to Avoid
Anti-Pattern 1: Ignoring Scale
Interviewee: “We’ll use Postgres for everything.”
Principal: “What happens when we hit 10TB of data? A single Postgres instance won’t keep up with analytical scans at that scale. I’d use a data lakehouse approach with S3 + Delta Lake + Trino for analytics at scale.”
Anti-Pattern 2: No Cost Discussion
Interviewee: [Never mentions cost]
Principal: Always discuss cost. “This design would cost approximately $50K/month. To optimize, I’d use spot instances (60% savings), lifecycle policies (30% storage savings), and rightsizing (20% compute savings), bringing total to $25K/month.”
Anti-Pattern 3: No Failure Modes
Interviewee: “This system will work perfectly.”
Principal: Always discuss what can go wrong. “If Kafka fails, we have a 7-day buffer. If Flink fails, we replay from checkpoints. If S3 has issues, we serve from Redis cache.”
Practice Framework
Mock Interview Flow
Time Allocation (45 minutes)
| Phase | Time |
|---|---|
| Clarification | 5 minutes |
| Scale estimation | 5 minutes |
| Architecture | 10 minutes |
| Deep dive | 15 minutes |
| Cost/trade-offs | 5 minutes |
| Questions | 5 minutes |
Key Takeaways
- 7 steps: Clarify, scale, design, model, deep dive, bottlenecks, cost
- Always estimate: Storage, compute, network costs
- Discuss trade-offs: Every decision has trade-offs
- Handle scale: TB/PB scale drives architecture
- Cost optimization: 30-50% savings with proper strategies
- Failure modes: What can go wrong and mitigation
- Practice: 20+ mock interviews before the real interview
Common Questions
Question Types
- Design a real-time data platform
- Design a data warehouse at scale
- Design a fraud detection system
- Design a recommendation system
- Migrate from on-prem to cloud
- Optimize a $100K/month bill to $50K
For Each Question
- Clarify: Ask questions before designing
- Think out loud: Show your thought process
- Diagram: Draw architecture on whiteboard
- Cost pass: Always estimate monthly cost
- Trade-offs: Discuss pros and cons of decisions