System Design Framework
Architecture Interviews for Principal-Level Roles
Overview
The system design framework is your blueprint for tackling architecture interview questions. At the Principal level, every design must consider scale (TB/PB), cost optimization, and operational excellence. This framework ensures you cover all critical aspects systematically.
The 7-Step Framework
Step 1: Clarify Requirements
Functional Requirements
Questions to ask:
- What is the primary use case?
- Who are the users (internal, external, both)?
- What are the success metrics?
- What are the key features?
Non-Functional Requirements
Example Requirements:
- Scale: 100TB data, 10K QPS, growing 2x/year
- Latency: Sub-second for dashboard, 24 hours for reports
- Availability: 99.9% uptime (8.76 hours downtime/year)
- Consistency: Eventual acceptable for analytics
- Cost: $50K/month budget
Step 2: Estimate Scale
Back-of-the-Envelope Calculations
Example: Event tracking system
Storage Calculation:
Daily: 1M events × 1KB = 1GB/day
Monthly: 30GB
Annual: 365GB
With compression (10x): 36.5GB/year

Throughput Calculation:
Average: 1M/day / 86,400s = 11.5 events/sec
Peak: 10x = 115 events/sec
Buffer: 100x = 1,150 events/sec
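These numbers are simple enough to script, which makes it easy to re-run the estimate when the interviewer changes an assumption. A minimal Python sketch using the figures above (event size, compression ratio, and peak/buffer factors are the stated assumptions):

```python
# Back-of-the-envelope sizing for the event tracking example.
# Inputs mirror the assumptions in the calculation above.
SECONDS_PER_DAY = 86_400

events_per_day = 1_000_000
bytes_per_event = 1_000        # ~1KB per event
compression_ratio = 10         # assumed 10x with a columnar format
peak_factor = 10               # peak-to-average ratio
buffer_factor = 100            # provisioning headroom

daily_gb = events_per_day * bytes_per_event / 1e9
annual_gb = daily_gb * 365
compressed_gb = annual_gb / compression_ratio

avg_eps = events_per_day / SECONDS_PER_DAY
print(f"Storage: {daily_gb:.1f} GB/day, {annual_gb:.0f} GB/year, "
      f"{compressed_gb:.1f} GB/year compressed")
print(f"Throughput: {avg_eps:.1f} avg, {avg_eps * peak_factor:.0f} peak, "
      f"{avg_eps * buffer_factor:.0f} provisioned events/sec")
```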
Step 3: High-Level Design
Architecture Diagram
Technology Selection:
- Streaming: Kafka (proven at scale)
- Processing: Flink (latency) + Spark (batch)
- Storage: S3 + Delta (cost optimization)
- Serving: Redis (hot) + Trino (analytics)
Step 4: Data Modeling
Schema Design
-- Events table (fact)
CREATE TABLE events (
  event_id        BIGINT,
  user_id         BIGINT,
  event_type      STRING,
  event_timestamp TIMESTAMP,
  properties      MAP<STRING, STRING>
)
PARTITIONED BY (event_date DATE)  -- partition column derived from event_timestamp at write time
STORED AS PARQUET;
-- User profiles (dimension)
-- user_id is the logical primary key (not enforced on Parquet/Hive tables)
CREATE TABLE user_profiles (
  user_id         BIGINT,
  user_properties MAP<STRING, STRING>,
  created_at      TIMESTAMP,
  updated_at      TIMESTAMP
)
STORED AS PARQUET;
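To show how this schema is meant to be queried, here is a hedged PySpark sketch of a typical dashboard aggregation. It assumes a SparkSession with the events table registered, and the date literal is just an example value:

```python
# Daily unique users and event counts per event type.
# Filtering on the partition column (event_date) lets the engine prune
# partitions instead of scanning the full event history.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

daily = (
    spark.table("events")
    .where(F.col("event_date") == "2024-01-15")   # example date
    .groupBy("event_type")
    .agg(F.countDistinct("user_id").alias("unique_users"),
         F.count("*").alias("events"))
)
daily.show()
```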
Partitioning Strategy
partitioning:
  strategy: "Date-based partitioning"
  primary: "event_date"
  granularity: "Daily"
  reasoning: "Time-series data, common query pattern"

z_ordering:
  columns: ["user_id", "event_type"]
  reasoning: "Multi-dimensional queries"
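As a concrete sketch of the write path implied by this strategy, assuming Delta Lake is available on the cluster and that `events_df` is a DataFrame that already carries an event_date column (OPTIMIZE ... ZORDER BY is a Delta Lake command, not generic SQL):

```python
# Write daily partitions, then Z-order files within each partition on the
# multi-dimensional query columns. `events_df` is assumed to exist already.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(events_df.write
    .format("delta")
    .mode("append")
    .partitionBy("event_date")     # daily partitions, per the strategy above
    .saveAsTable("events"))

spark.sql("OPTIMIZE events ZORDER BY (user_id, event_type)")
```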
Step 5: Deep Dive on Components
Component 1: Kafka Cluster
kafka_cluster:
  configuration:
    brokers: 12
    replication_factor: 3
    partitions: 100  # Per topic
  sizing:
    throughput: "1M messages/sec"
    latency: "< 10ms p99"
  cost: "$6,000/month (on-demand)"
  optimization:
    spot_instances: "No (requires durability)"
    compression: "ZSTD"
    retention: "7 days"
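On the producer side, the durability and compression choices above translate into a few client settings. A hedged sketch with the confluent-kafka client; broker addresses and the topic name are placeholders:

```python
# Producer tuned for durability (acks=all) and ZSTD compression,
# matching the cluster settings above.
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker-1:9092,broker-2:9092",  # placeholder brokers
    "compression.type": "zstd",   # smaller payloads at a small CPU cost
    "acks": "all",                # wait for all in-sync replicas
    "linger.ms": 10,              # small batching window for throughput
})

event = {"event_id": 1, "user_id": 42, "event_type": "page_view"}
producer.produce("events", key=str(event["user_id"]), value=json.dumps(event))
producer.flush()
```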
Component 2: Flink Cluster
flink_cluster:
  configuration:
    task_managers: 100
    slots_per_tm: 4
    parallelism: 400
  sizing:
    throughput: "1M events/sec"
    state: "RocksDB state backend"
    checkpointing: "Every 5 minutes"
  cost: "$43,200/month (70% spot)"
  optimization:
    state_backend: "RocksDB (2TB state)"
    checkpointing: "S3 with TTL 30 days"
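A minimal PyFlink sketch of the checkpointing setup described above; the state backend and checkpoint location are normally set in flink-conf.yaml rather than in code, and the bucket name is a placeholder:

```python
# Enable periodic checkpoints matching the 5-minute interval above.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(400)                 # task_managers × slots_per_tm
env.enable_checkpointing(5 * 60 * 1000)  # interval in milliseconds

# Typically paired with cluster-level config (flink-conf.yaml):
#   state.backend: rocksdb
#   state.checkpoints.dir: s3://my-checkpoint-bucket/flink/checkpoints
```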
Step 6: Bottleneck Analysis
Scaling Strategy:
- Kafka: Add partitions (horizontal scale; sized in the sketch after this list)
- Flink: Add TaskManagers (horizontal scale)
- S3: Essentially unlimited, but watch egress costs
- Redis: Add shards (horizontal scale)
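A rough sketch of the partition sizing behind the Kafka bullet above; the per-partition write rate is an assumption you should state explicitly, since real limits depend on message size, replication, and hardware:

```python
# Rough Kafka partition sizing from peak throughput.
peak_mb_per_sec = 1_000_000 * 1_000 / 1e6   # 1M msgs/sec × ~1KB ≈ 1,000 MB/s
per_partition_mb_per_sec = 10               # assumed safe write rate per partition
headroom = 1.5                              # slack for rebalancing and spikes

partitions = int(peak_mb_per_sec / per_partition_mb_per_sec * headroom)
print(f"Suggested partitions: {partitions}")  # 150 with these assumptions
```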
Step 7: Cost Optimization (The Cost Pass)
CRITICAL: Always include cost estimates in system design answers.
Storage Cost
100TB data
- Hot (S3 Standard): 100TB × $23/TB = $2,300/month
- Warm (S3 IA): 80TB × $12/TB = $960/month
- Cold (Glacier): 20TB × $4/TB = $80/month

Total: $3,340/month
Optimization: Lifecycle policies (save 30%)
Compute Cost
Flink: 100 nodes × $0.30/hour × 730 hours = $21,900/month
Spark: 200 nodes × $0.30/hour × 200 hours = $12,000/month
Kafka: 12 nodes × $0.50/hour × 730 hours = $4,380/month
Redis: 50 nodes × $0.20/hour × 730 hours = $7,300/month
Total: $45,580/month
Optimization: Spot for Flink/Spark (save 60%)
Network Cost
Data transfer: 10TB egress/month
- S3 egress: 10TB (10,000GB) × $0.09/GB = $900/month
Optimization: Colocation, minimize egress
Total Monthly Cost
Storage: $3,340
Compute: $45,580
Network: $900
Monitoring: $500
Support: $2,000
---
TOTAL: $52,320/month
ANNUAL: $627,840
Optimization: Spot instances + lifecycle = $35,000/month (33% savings)
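The whole cost pass fits in a few lines of Python, which is handy when the interviewer changes a number; unit prices and utilization figures are the same assumptions used in the calculations above:

```python
# Monthly cost model using the assumptions from the calculations above.
storage = 100 * 23 + 80 * 12 + 20 * 4       # hot + warm + cold tiers ($/month)
compute = (100 * 0.30 * 730 +               # Flink, always on
           200 * 0.30 * 200 +               # Spark, ~200 batch hours/month
           12 * 0.50 * 730 +                # Kafka
           50 * 0.20 * 730)                 # Redis
network = 10_000 * 0.09                     # 10TB egress at $0.09/GB
monitoring, support = 500, 2_000

total = storage + compute + network + monitoring + support
print(f"Monthly: ${total:,.0f}   Annual: ${total * 12:,.0f}")
# Monthly: $52,320   Annual: $627,840
```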
Trade-offs Discussion
| Decision | Option A | Option B | Trade-off |
|---|---|---|---|
| Storage | S3 Standard | S3 IA + Glacier | Cost vs. latency |
| Compute | On-demand | Spot | Reliability vs. cost |
| Consistency | Strong | Eventual | Complexity vs. usability |
| Latency | Real-time | Batch | Cost vs. user experience |
Principal-Level Answer: “I would start with S3 Standard for simplicity, then implement lifecycle policies to move older data to IA and Glacier, reducing storage costs by 30%. For compute, I’d use 70% spot instances for fault-tolerant batch workloads, with 30% on-demand for critical streaming workloads, balancing cost optimization with reliability requirements.”
Anti-Patterns to Avoid
Anti-Pattern 1: Ignoring Scale
Interviewee: “We’ll use Postgres for everything.”
Principal: “What happens when we hit 10TB of data? A single Postgres instance won’t keep up with analytical scans at that scale. I’d use a data lakehouse approach with S3 + Delta Lake + Trino for analytics at scale.”
Anti-Pattern 2: No Cost Discussion
Interviewee: [Never mentions cost]
Principal: Always discuss cost. “This design would cost approximately $50K/month. To optimize, I’d use spot instances (60% savings), lifecycle policies (30% storage savings), and rightsizing (20% compute savings), bringing total to $25K/month.”
Anti-Pattern 3: No Failure Modes
Interviewee: “This system will work perfectly.”
Principal: Always discuss what can go wrong. “If Kafka fails, we have a 7-day buffer. If Flink fails, we replay from checkpoints. If S3 has issues, we serve from Redis cache.”
Practice Framework
Mock Interview Flow
Time Allocation (45 minutes)
| Phase | Time |
|---|---|
| Clarification | 5 minutes |
| Scale estimation | 5 minutes |
| Architecture | 10 minutes |
| Deep dive | 15 minutes |
| Cost/trade-offs | 5 minutes |
| Questions | 5 minutes |
Key Takeaways
- 7 steps: Clarify, scale, design, model, deep dive, bottlenecks, cost
- Always estimate: Storage, compute, network costs
- Discuss trade-offs: Every decision has trade-offs
- Handle scale: TB/PB scale drives architecture
- Cost optimization: 30-50% savings with proper strategies
- Failure modes: What can go wrong and mitigation
- Practice: 20+ mock interviews before the real interview
Common Questions
Question Types
- Design a real-time data platform
- Design a data warehouse at scale
- Design a fraud detection system
- Design a recommendation system
- Migrate from on-prem to cloud
- Optimize a $100K/month bill to $50K
For Each Question
- Clarify: Ask questions before designing
- Think out loud: Show your thought process
- Diagram: Draw architecture on whiteboard
- Cost pass: Always estimate monthly cost
- Trade-offs: Discuss pros and cons of decisions