System Design Framework

Architecture Interviews for Principal-Level Roles


Overview

The system design framework is your blueprint for tackling architecture interview questions. At the Principal level, every design must consider scale (TB/PB), cost optimization, and operational excellence. This framework ensures you cover all critical aspects systematically.


The 7-Step Framework


Step 1: Clarify Requirements

Functional Requirements

Questions to ask:

  • What is the primary use case?
  • Who are the users (internal, external, both)?
  • What are the success metrics?
  • What are the key features?

Non-Functional Requirements

Example Requirements:

  • Scale: 100TB data, 10K QPS, growing 2x/year
  • Latency: Sub-second for dashboard, 24 hours for reports
  • Availability: 99.9% uptime (8.76 hours downtime/year)
  • Consistency: Eventual acceptable for analytics
  • Cost: $50K/month budget

Step 2: Estimate Scale

Back-of-the-Envelope Calculations

Example: Event tracking system

Storage Calculation:

Daily: 1M events × 1KB = 1GB/day
Monthly: 30GB
Annual: 365GB
With compression (10x): 36.5GB/year

Throughput Calculation:

Average: 1M events/day ÷ 86,400s ≈ 11.5 events/sec
Peak: 10× average ≈ 115 events/sec
Design buffer: 100× average ≈ 1,150 events/sec
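
The same arithmetic as a quick Python sketch (the event volume, size, and headroom factors are the illustrative assumptions above, not measured numbers):

# Back-of-the-envelope estimate for the event tracking example above.
EVENTS_PER_DAY = 1_000_000
EVENT_SIZE_KB = 1
COMPRESSION_RATIO = 10      # assumed columnar/ZSTD compression
PEAK_FACTOR = 10            # peak traffic vs. average
BUFFER_FACTOR = 100         # design headroom vs. average

daily_gb = EVENTS_PER_DAY * EVENT_SIZE_KB / 1_000_000     # ~1 GB/day
annual_gb = daily_gb * 365                                # ~365 GB/year
compressed_gb = annual_gb / COMPRESSION_RATIO             # ~36.5 GB/year

avg_eps = EVENTS_PER_DAY / 86_400                         # ~11.5 events/sec
peak_eps = avg_eps * PEAK_FACTOR                          # ~115 events/sec
design_eps = avg_eps * BUFFER_FACTOR                      # ~1,150 events/sec

print(f"storage: {daily_gb:.1f} GB/day, {compressed_gb:.1f} GB/year compressed")
print(f"throughput: {avg_eps:.1f} avg, {peak_eps:.0f} peak, {design_eps:.0f} design")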

Step 3: High-Level Design

Architecture Diagram

Technology Selection:

  • Streaming: Kafka (proven at scale)
  • Processing: Flink (low-latency streaming) + Spark (batch)
  • Storage: S3 + Delta Lake (cost-optimized lakehouse storage)
  • Serving: Redis (hot path) + Trino (ad-hoc analytics)

Step 4: Data Modeling

Schema Design

-- Events table (fact)
CREATE TABLE events (
  event_id        BIGINT,
  user_id         BIGINT,
  event_type      STRING,
  event_timestamp TIMESTAMP,
  properties      MAP<STRING, STRING>
)
PARTITIONED BY (event_date DATE)   -- partition column derived from event_timestamp at write time
STORED AS PARQUET;

-- User profiles (dimension)
CREATE TABLE user_profiles (
  user_id         BIGINT,          -- natural primary key (not enforced by Parquet tables)
  user_properties MAP<STRING, STRING>,
  created_at      TIMESTAMP,
  updated_at      TIMESTAMP
)
STORED AS PARQUET;
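
As a quick illustration of the query pattern this schema is designed for, here is a minimal PySpark sketch; it assumes the tables above are registered in a Spark catalog, and the date literal is arbitrary:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events-example").getOrCreate()

# Partition pruning: filtering on event_date scans only one daily partition.
daily_counts = (
    spark.table("events")
    .where(F.col("event_date") == "2024-01-15")   # arbitrary example date
    .groupBy("event_type")
    .count()
)
daily_counts.show()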

Partitioning Strategy

partitioning:
  strategy: "Date-based partitioning"
  primary: "event_date"
  granularity: "Daily"
  reasoning: "Time-series data, common query pattern"

z_ordering:
  columns: ["user_id", "event_type"]
  reasoning: "Multi-dimensional queries"
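
If the events table is stored as a Delta table (an assumption; the DDL above uses plain Parquet), the Z-ordering could be applied roughly like this with the delta-spark Python API (≥ 2.0):

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # assumes delta-spark is configured

# Compact small files and cluster rows by the Z-order columns so that
# filters on user_id / event_type can skip files.
DeltaTable.forName(spark, "events").optimize().executeZOrderBy("user_id", "event_type")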

Step 5: Deep Dive on Components

Component 1: Kafka Cluster

kafka_cluster:
  configuration:
    brokers: 12
    replication_factor: 3
    partitions: 100  # Per topic
  sizing:
    throughput: "1M messages/sec"
    latency: "< 10ms p99"
    cost: "$6,000/month (on-demand)"
  optimization:
    spot_instances: "No (requires durability)"
    compression: "ZSTD"
    retention: "7 days"
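
On the client side, these durability and compression choices would show up in producer settings. A hedged sketch with the confluent-kafka Python client (broker addresses and the topic name are placeholders):

from confluent_kafka import Producer

# Producer settings aligned with the cluster config above:
# acks=all alongside replication factor 3 for durability, ZSTD to cut network and storage.
producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092",  # placeholder
    "acks": "all",
    "compression.type": "zstd",
    "enable.idempotence": True,
    "linger.ms": 20,          # small batching window to improve throughput
})

producer.produce("events", key="user-123", value=b'{"event_type": "click"}')
producer.flush()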

Component 2: Flink Cluster

flink_cluster:
  configuration:
    task_managers: 100
    slots_per_tm: 4
    parallelism: 400
  sizing:
    throughput: "1M events/sec"
    state: "RocksDB state backend"
    checkpointing: "Every 5 minutes"
    cost: "$43,200/month (70% spot)"
  optimization:
    state_backend: "RocksDB (2TB state)"
    checkpointing: "S3 with TTL 30 days"
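
A rough sketch of how this sizing translates into job-level settings with PyFlink (cluster-level settings such as TaskManager count, slots, and the RocksDB backend would normally live in flink-conf.yaml rather than in code):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(400)                  # matches 100 TaskManagers × 4 slots
env.enable_checkpointing(5 * 60 * 1000)   # checkpoint every 5 minutes (interval in ms)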

Step 6: Bottleneck Analysis

Scaling Strategy:

  • Kafka: Add partitions (horizontal scale; see the admin sketch after this list)
  • Flink: Add TaskManagers (horizontal scale)
  • S3: Essentially unlimited, but watch egress costs
  • Redis: Add shards (horizontal scale)
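
For the Kafka point above, adding partitions can be done through the admin API; a hedged sketch with the confluent-kafka Python client (topic name and counts are illustrative, and note that partition counts can only grow and keyed ordering changes when they do):

from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({"bootstrap.servers": "broker1:9092"})  # placeholder address

# Grow the events topic from 100 to 200 partitions to raise consumer parallelism.
futures = admin.create_partitions([NewPartitions("events", 200)])
for topic, future in futures.items():
    future.result()  # raises if the broker rejected the request
    print(f"{topic}: partition count increased")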

Step 7: Cost Optimization (The Cost Pass)

CRITICAL: Always include cost estimates in system design answers.

Storage Cost

Data across storage tiers (200TB total):
- Hot (S3 Standard): 100TB × $23/TB = $2,300/month
- Warm (S3 IA): 80TB × $12/TB = $960/month
- Cold (Glacier): 20TB × $4/TB = $80/month
Total: $3,340/month
Optimization: Lifecycle policies (save ~30% by automatically tiering aging data)
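
The lifecycle-policy optimization could be expressed roughly as follows with boto3 (bucket name, prefix, and transition thresholds are illustrative assumptions, not recommendations):

import boto3

s3 = boto3.client("s3")

# Transition event data to cheaper tiers as it ages; tune the day thresholds
# to actual access patterns before applying anything like this.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-events-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-event-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "events/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)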

Compute Cost

Flink: 100 nodes × $0.30/hour × 730 hours = $21,900/month
Spark: 200 nodes × $0.30/hour × 200 hours = $12,000/month
Kafka: 12 nodes × $0.50/hour × 730 hours = $4,380/month
Redis: 50 nodes × $0.20/hour × 730 hours = $7,300/month
Total: $45,580/month
Optimization: Spot for Flink/Spark (save 60%)

Network Cost

Data transfer: 10TB egress/month
- S3 egress: 10,000GB × $0.09/GB = $900/month
Optimization: Colocate compute with storage and minimize cross-region/internet egress

Total Monthly Cost

Storage: $3,340
Compute: $45,580
Network: $900
Monitoring: $500
Support: $2,000
---
TOTAL: $52,320/month
ANNUAL: $627,840
Optimization: Spot instances + lifecycle = $35,000/month (33% savings)
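
The whole cost pass is easy to keep honest with a few lines of arithmetic; a sketch using the illustrative unit prices above:

# Monthly cost roll-up using the illustrative unit prices above.
storage = 100 * 23 + 80 * 12 + 20 * 4     # S3 Standard + IA + Glacier -> $3,340
compute = (
    100 * 0.30 * 730      # Flink (on-demand baseline)
    + 200 * 0.30 * 200    # Spark (200 batch hours/month)
    + 12 * 0.50 * 730     # Kafka
    + 50 * 0.20 * 730     # Redis
)                                          # -> $45,580
network = 10_000 * 0.09                    # 10TB egress at $0.09/GB -> $900
fixed = 500 + 2_000                        # monitoring + support

total = storage + compute + network + fixed
print(f"monthly: ${total:,.0f}  annual: ${total * 12:,.0f}")   # ~$52,320 / ~$627,840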

Trade-offs Discussion

Decision      Option A       Option B           Trade-off
Storage       S3 Standard    S3 IA + Glacier    Cost vs. latency
Compute       On-demand      Spot               Reliability vs. cost
Consistency   Strong         Eventual           Complexity vs. usability
Latency       Real-time      Batch              Cost vs. user experience

Principal-Level Answer: “I would start with S3 Standard for simplicity, then implement lifecycle policies to move older data to IA and Glacier, reducing storage costs by 30%. For compute, I’d use 70% spot instances for fault-tolerant batch workloads, with 30% on-demand for critical streaming workloads, balancing cost optimization with reliability requirements.”


Anti-Patterns to Avoid

Anti-Pattern 1: Ignoring Scale

Interviewee: “We’ll use Postgres for everything.”

Principal: “What happens when we hit 10TB of data? A single Postgres instance won’t handle analytical scans at that scale. I’d use a data lakehouse approach with S3 + Delta Lake + Trino for analytics at scale.”

Anti-Pattern 2: No Cost Discussion

Interviewee: [Never mentions cost]

Principal: Always discuss cost. “This design would cost approximately $50K/month. To optimize, I’d use spot instances (60% savings), lifecycle policies (30% storage savings), and rightsizing (20% compute savings), bringing total to $25K/month.”

Anti-Pattern 3: No Failure Modes

Interviewee: “This system will work perfectly.”

Principal: Always discuss what can go wrong. “If Kafka fails, we have a 7-day buffer. If Flink fails, we replay from checkpoints. If S3 has issues, we serve from Redis cache.”


Practice Framework

Mock Interview Flow

Time Allocation (45 minutes)

Phase               Time
Clarification       5 minutes
Scale estimation    5 minutes
Architecture        10 minutes
Deep dive           15 minutes
Cost/trade-offs     5 minutes
Questions           5 minutes

Key Takeaways

  1. 7 steps: Clarify, scale, design, model, deep dive, bottlenecks, cost
  2. Always estimate: Storage, compute, network costs
  3. Discuss trade-offs: Every decision has trade-offs
  4. Handle scale: TB/PB scale drives architecture
  5. Cost optimization: 30-50% savings with proper strategies
  6. Failure modes: What can go wrong and mitigation
  7. Practice: 20+ mock interviews before real interview

Common Questions

Question Types

  1. Design a real-time data platform
  2. Design a data warehouse at scale
  3. Design a fraud detection system
  4. Design a recommendation system
  5. Migrate from on-prem to cloud
  6. Optimize a $100K/month bill to $50K

For Each Question

  • Clarify: Ask questions before designing
  • Think out loud: Show your thought process
  • Diagram: Draw architecture on whiteboard
  • Cost pass: Always estimate monthly cost
  • Trade-offs: Discuss pros and cons of decisions
