WELCOME · YOUR INTERVIEW GUIDE

System Design Interview Guide

A comprehensive, language-agnostic guide to system design interviews — covering scalability, databases, caching, messaging, distributed systems, and fintech patterns.

All Levels Welcome · Fintech Patterns · Language Agnostic · 30-Day Roadmap · 15 Canonical Problems

20 Core Concepts · 15 Design Problems · 30 Days Intensive · 8 Study Modules

💡

How to use this guide: Work through modules in sidebar order. Do the 30-Day Roadmap first to set your plan. Then dive deep into each concept. Finish each week with the Quiz. The Cheat Sheet is your interview-day companion.

🎯 What interviewers test

Interviewers don't ask basic coding questions. They ask: "Design a fraud detection system" and see if you can break down ambiguity, make trade-offs, and communicate reasoning under pressure.

🏦 Fintech priority

Financial companies specifically probe ACID vs BASE trade-offs, payment idempotency, event sourcing, and distributed ledger concepts — all covered in the Fintech Focus module.

⚡ 30-day intensive plan

Week 1: Foundations → Week 2: Beginner problems → Week 3: Intermediate problems → Week 4: Advanced & Fintech problems + mock interviews.

THE GOLDEN RULE

A confident explanation beats a perfect diagram

A candidate who says "I'd start with PostgreSQL because transactions are critical, shard by user_id if writes become a bottleneck, use Redis to reduce read pressure, and decouple notifications via Kafka" scores higher than someone drawing 20 boxes without reasoning through trade-offs.

MODULE 1 · STUDY PLAN

Your 30-Day Intensive Roadmap

4 weeks × focused themes — from zero to confident senior-level system design in 30 days.

W1

Week 1 — Core Foundations (Days 1–7)

Day 1–2: Scalability — vertical vs horizontal, load balancers, stateless apps
Day 3: Databases — SQL vs NoSQL, ACID vs BASE, when to use what
Day 4: Caching — Cache-Aside, Write-Through, Redis internals
Day 5: Messaging — Kafka vs RabbitMQ, async processing, event-driven patterns
Day 6–7: Distributed Systems — CAP theorem, replication, sharding, consistency models

Scalability · Databases · Caching · Messaging · CAP Theorem

W2

Week 2 — Beginner Problems (Days 8–14)

Day 8–9: Design URL Shortener (TinyURL) — hashing, redirection, analytics
Day 10: Design Pastebin — blob storage, TTL, deduplication
Day 11–12: Design Rate Limiter — token bucket, sliding window, Redis Lua scripts
Day 13–14: Design Notification Service — fan-out, push vs pull, delivery guarantees

URL Shortener · Rate Limiter · Notifications

W3

Week 3 — Intermediate Problems (Days 15–21)

Day 15–16: Design Chat System (WhatsApp) — WebSockets, message storage, delivery states
Day 17–18: Design News Feed — fan-out on write vs read, ranking, Redis sorted sets
Day 19–20: Design BookMyShow — seat inventory, concurrency, optimistic locking
Day 21: Design API Gateway — auth, rate limiting, circuit breaker, routing

WhatsApp · News Feed · BookMyShow · API Gateway

W4

Week 4 — Advanced & Fintech (Days 22–30)

Day 22–23: Design Payment Gateway — idempotency, 2-phase commit, saga pattern
Day 24–25: Design Banking Ledger — event sourcing, CQRS, double-entry accounting
Day 26–27: Design Fraud Detection Platform — stream processing, ML model serving, alerting
Day 28: Design Distributed Cache — consistent hashing, eviction policies
Day 29–30: Mock Interviews + Review weak spots + cheat sheet drill

Payment Gateway · Banking Ledger · Fraud Detection · Mock Interviews

⚠️

Daily commitment: Target 90–120 minutes per day. Read one concept → draw the architecture by hand → explain it out loud as if in an interview. This tri-modal learning (read, draw, speak) is the fastest way to internalize system design.

MODULE 2 · FOUNDATIONS

Scalability

Understanding how systems grow to handle millions of users — the bedrock of every system design interview.

Vertical vs Horizontal Scaling

VERTICAL SCALING (Scale Up)

Small Server (2 CPU · 4GB RAM) → upgrade → Medium Server (8 CPU · 32GB RAM) → upgrade → Big Server (32 CPU · 256GB RAM)

⚠ Has a ceiling. SPOF. Downtime to upgrade.

HORIZONTAL SCALING (Scale Out)

Load Balancer → App 1 · App 2 · App 3 · … · App N

✅ Scales by adding nodes. No single app server is a SPOF (the load balancer itself still needs redundancy). Rolling deploys allow zero downtime.

Standard Load Balancer Architecture

Client → DNS → Load Balancer (Nginx / ALB) → App Servers (stateless) → DB / Cache (shared state)

Load Balancing Algorithms

Algorithm	How it works	Best for
Round Robin	Sends requests in circular order	Homogeneous servers, stateless apps
Least Connections	Routes to server with fewest active connections	Long-lived connections, heterogeneous load
IP Hash	Same client always hits same server	Session stickiness (avoid if possible)
Weighted	Higher weight = more requests	Servers with different capacities
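The four algorithms in the table can be sketched in a few lines each. This is a minimal illustration, not how Nginx or an ALB implements them; the `Server` class and its fields are hypothetical, and the weighted variant is the naive "repeat by weight" form rather than a smooth one.

```python
import hashlib
import itertools
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    weight: int = 1              # relative capacity (hypothetical field)
    active_connections: int = 0  # updated by the caller as requests start/finish


def round_robin(servers):
    """Yield servers in circular order, ignoring weight and load."""
    return itertools.cycle(servers)


def weighted_round_robin(servers):
    """Yield each server `weight` times per cycle (naive, non-smooth variant)."""
    expanded = [s for s in servers for _ in range(s.weight)]
    return itertools.cycle(expanded)


def least_connections(servers):
    """Pick the server with the fewest active connections right now."""
    return min(servers, key=lambda s: s.active_connections)


def ip_hash(servers, client_ip):
    """Map a client IP to a fixed server (sticky only while the pool is unchanged)."""
    digest = hashlib.md5(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]
```

Note that `ip_hash` remaps most clients whenever the pool size changes, which is one reason the table says to avoid session stickiness where possible.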

Stateless vs Stateful Apps — Why it matters

Horizontal scaling requires stateless app servers. If your app stores user session in local memory, a different server can't serve the next request. Instead:

Store sessions in Redis (shared, fast)
Use JWT tokens (state lives in the token itself)
Externalize all user state to the database layer

💡

Interview line: "I'll make app servers stateless by externalizing session state to Redis, which allows any server to handle any request — enabling true horizontal scaling."

CDN — Content Delivery Network

A CDN caches static assets (images, CSS, JS, videos) at edge nodes globally, so users receive content from the nearest server rather than your origin.

Cache hit: User (Mumbai) → CDN Edge (Mumbai) ✅

Cache miss: User (Mumbai) → CDN Edge ❌ → Origin Server (US East) → CDN caches it

Tools: Cloudflare, AWS CloudFront, Akamai. Mention CDN whenever the problem involves serving media at scale (YouTube, Netflix, Instagram).

🎙️

Interview script: "I'll start with a single server, then extract the DB layer. Once traffic grows, I'll add a load balancer with multiple stateless app instances. For global users, a CDN handles static assets, and Redis handles session state."

MODULE 3 · FOUNDATIONS

Databases

The most contested topic in system design — knowing exactly when to use SQL vs NoSQL, and how to scale either.

SQL vs NoSQL — Decision Tree

Need ACID transactions? (payments, banking, inventory)

YES → SQL Database (PostgreSQL / MySQL)

NO → choose by data shape:
Flexible schema (user profiles, catalogs) → MongoDB (document)
Time-series → Cassandra (wide-column)
Key-value → DynamoDB (key-value)

🐘 PostgreSQL / MySQL

ACID · Transactions · Joins

Use when: Financial transactions, inventory, user accounts, any domain needing consistency guarantees.
Weakness: Schema migrations are painful; harder to scale writes horizontally.

🍃 MongoDB

Document · Flexible Schema

Use when: Catalogs, user profiles, CMS, data with varying structure.
Weakness: Multi-document ACID transactions exist (since v4.0) but cost more than single-document writes; not ideal for complex joins.

🔥 Cassandra

Wide-Column · Massive Scale

Use when: Time-series data, IoT, activity feeds, Netflix-scale writes.
Weakness: Eventual consistency by default (tunable per query); no joins; queries must be designed around access patterns.

Database Sharding — How to scale writes

Sharding splits data across multiple database nodes (shards) so each node holds only a subset of the data.

Sharding by User ID (Hash-based)

user_id = 1001 → Shard Router (user_id % 4) → Shard 1

Shard 0: uid % 4 = 0
Shard 1: uid % 4 = 1
Shard 2: uid % 4 = 2
Shard 3: uid % 4 = 3

⚠️

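Hot shard problem: If one user generates 90% of traffic (celebrity user), that shard becomes a bottleneck. Solution: add a random suffix to the sharding key for celebrity accounts so their data spreads across shards (reads must then fan in across the suffixes). Consistent hashing alone does not fix a single hot key; it mainly limits how much data moves when shards are added or removed.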
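A rough sketch of the routing above, plus the random-suffix trick for hot accounts. `NUM_SHARDS` matches the diagram; `HOT_KEY_FANOUT`, the `#` separator, and all function names are assumptions made for illustration.

```python
import hashlib
import random

NUM_SHARDS = 4
HOT_KEY_FANOUT = 8  # assumed number of sub-keys a hot account is split into


def shard_for_user(user_id: int) -> int:
    """Hash-based routing as in the diagram: user_id % 4."""
    return user_id % NUM_SHARDS


def write_key(user_id: int, hot_users: set, rng=random) -> str:
    """Hot accounts get a random suffix so their writes spread across shards."""
    if user_id in hot_users:
        return f"{user_id}#{rng.randrange(HOT_KEY_FANOUT)}"
    return str(user_id)


def read_keys(user_id: int, hot_users: set) -> list:
    """Reads for a hot account must query every suffix and merge the results."""
    if user_id in hot_users:
        return [f"{user_id}#{i}" for i in range(HOT_KEY_FANOUT)]
    return [str(user_id)]


def shard_for_key(key: str) -> int:
    """Route a (possibly suffixed) string key using a stable hash."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

The trade-off is visible in `read_keys`: writes get cheaper for the hot account, reads get more expensive.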

Read Replicas — Scale reads separately

Writes → Primary DB → replicates to Replica 1 · Replica 2 · Replica 3 (each serves reads)

Replication lag means replicas may serve slightly stale data — acceptable for most reads, not acceptable for financial reads (always read from primary after a write in banking).

ACID vs BASE — Critical for payment system interviews

Property	ACID (SQL)	BASE (NoSQL)
Consistency	Strong (immediate)	Eventual (may lag)
Availability	Can sacrifice for consistency	Highly available
Best for	Payments, banking, orders	Social feeds, analytics, metrics
Example DB	PostgreSQL, MySQL	Cassandra, DynamoDB, MongoDB

🏦

Rule of thumb: Any money movement (debits, credits, transfers) must use ACID. Analytics and reporting can use BASE. Never let an interviewer catch you putting payment data in Cassandra without explaining compensating transactions.

MODULE 4 · CORE CONCEPTS

Caching

The single highest-leverage optimization in distributed systems. Know your patterns cold.

Cache Aside (Lazy Loading) — Most Common Pattern

App → Redis (check cache first)

Cache HIT ✅ → return cached data instantly

Cache MISS ❌ → query database → write to Redis (with TTL) → return to app

Cache Aside

App manages cache manually. On miss: read DB → store in cache → return. Best for read-heavy workloads. Data may be stale if DB is updated without invalidating cache.

Write Through

Every write goes to cache AND DB simultaneously. Cache is always fresh. Downside: higher write latency; cache fills with data that may never be read.

Write Back (Behind)

Write to cache only, async-flush to DB later. Extremely fast writes but risk of data loss if cache crashes before flush. Use for non-critical high-frequency writes.
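The cache-aside read path, and the invalidate-on-write step that limits staleness, might look like this. It is a sketch under stated assumptions: `TTLCache` is an in-process stand-in for Redis GET / SET with expiry, `db` is a plain dict standing in for the database, and the function names are hypothetical.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:  # expired: behave like a miss
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def delete(self, key):
        self._data.pop(key, None)


def get_user(user_id, cache, db, ttl_seconds=300):
    """Cache-aside read: check cache, fall back to the DB, then populate the cache."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    row = db.get(user_id)
    if row is not None:
        cache.set(key, row, ttl_seconds)
    return row


def update_user(user_id, row, cache, db):
    """Write path: update the DB, then invalidate so the next read reloads."""
    db[user_id] = row
    cache.delete(f"user:{user_id}")
```

If a write bypasses `update_user`, readers keep seeing the old value until the TTL runs out, which is the staleness risk described above.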

Redis Data Structures — Know these for interviews

Structure	Use Case	Example
String	Simple cache, counters, sessions	user:1001:session → JWT
Hash	Object with fields	user:1001 → {name, email, age}
List	Queues, recent activity	notifications:user:1001 (latest 20)
Sorted Set	Leaderboards, ranking, news feed	feed:user:1001 → posts sorted by score
Set	Unique membership, tags	online_users → {uid1, uid2, ...}

💡

Interview win: When asked about News Feed, mention Redis Sorted Sets (ZADD/ZRANGE) for ranking posts by timestamp or engagement score — shows you know the tool deeply.

Cache Eviction Policies

Policy	What it evicts	Best for
LRU	Least recently used items	General purpose — most common
LFU	Least frequently used items	Skewed access patterns (Zipf distribution)
TTL	Items past expiry time	Session data, OTPs, rate limits
Random	Random item	When all items equally likely to be accessed
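LRU, the most common policy in the table, is small enough to write out. A minimal sketch using the standard library's `OrderedDict`; the class name and methods are illustrative, and real caches such as Redis use approximations of LRU rather than an exact ordering.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used key once capacity is exceeded."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest access first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a read counts as a use
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used
```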

⚠️

Cache Stampede / Thundering Herd: When a popular cache key expires, thousands of requests simultaneously miss and hammer the DB. Solutions: (1) Mutex locking — only one request rebuilds, others wait. (2) Background refresh — proactively refresh before TTL expires. (3) Probabilistic early expiry.

MODULE 5 · CORE CONCEPTS

Messaging & Async Systems

Decouple services, absorb traffic spikes, and build fault-tolerant pipelines with message queues and event streams.

Synchronous vs Asynchronous Processing

SYNCHRONOUS — Tight coupling

User Request → (waits) → App Server → (waits) → Email Service → (waits) → SMS Service → Response, 3000ms later

ASYNC WITH KAFKA — Loose coupling

User Request → (instant) → App Server → publish event → Kafka Topic (user.registered) → Email Worker · SMS Worker · Analytics

Kafka — Deep Dive for Interviews

Key Concepts

Topic: Named stream of events
Partition: Ordered, immutable log (parallelism unit)
Consumer Group: Consumers that split a topic's partitions among themselves; each partition is read by only one consumer in the group
Offset: Position of a message in a partition
Retention: Messages kept for N hours/days

When Kafka shines

High-throughput event streaming
Multiple consumers per event
Replay capability needed
Audit log / event sourcing
Fraud detection pipelines
Real-time analytics

🎙️

Interview line: "When the user registers, I'll publish a 'user.registered' event to Kafka. Multiple consumers can independently handle email, SMS, and analytics — completely decoupled, with replay capability if any consumer fails."

Kafka vs RabbitMQ — When to use which

Factor	Kafka	RabbitMQ
Model	Log-based (pull)	Queue-based (push)
Throughput	Millions/sec	Thousands/sec
Message replay	✅ Yes (retention)	❌ No (consumed = gone)
Complex routing	Basic (topic-based)	✅ Rich (exchanges, bindings)
Best for	Event streaming, audit logs, payment pipelines	Task queues, RPC, job workers
Ecosystem	spring-kafka / confluent-kafka-python / sarama	spring-amqp / pika / amqp-client

Exactly-Once, At-Least-Once, At-Most-Once Delivery

Guarantee	Meaning	Risk	Use
At-Most-Once	Message may be lost, never duplicated	Data loss	Metrics, logs (loss OK)
At-Least-Once	Delivered ≥1 times, may duplicate	Duplicate processing	Most systems (idempotent consumers)
Exactly-Once	Delivered exactly once	Performance cost	Payments, ledger (critical)

🏦

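Payment critical: Payment flows need exactly-once effects. In Kafka, idempotent producers plus the transactional API give exactly-once processing for read-process-write pipelines that stay inside Kafka; side effects in external systems (a card network, a ledger DB) still need idempotent handling. Always mention idempotency keys for payment APIs.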
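At-least-once delivery plus an idempotent consumer is how most systems get exactly-once effects. The sketch below assumes every message carries a unique `id`; the class, the in-memory `processed_ids` set, and the message shape are all hypothetical. In a real system the dedup record would live in a durable store and be written in the same transaction as the side effect.

```python
class IdempotentConsumer:
    """Applies each message's effect once, even if the broker redelivers it."""

    def __init__(self):
        self.processed_ids = set()  # stand-in for a durable dedup table
        self.balance = 0

    def handle(self, message):
        """Return True if the message was applied, False if it was a duplicate."""
        msg_id = message["id"]
        if msg_id in self.processed_ids:
            return False  # already applied: acknowledge and skip
        self.balance += message["amount"]
        self.processed_ids.add(msg_id)
        return True
```

Redelivering the same message leaves `balance` unchanged, so the broker is free to retry as often as it needs.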

MODULE 6 · CORE CONCEPTS

Distributed Systems

The deep theory that separates senior engineers from mid-level ones. Master CAP, consistency, and failure modes.

CAP Theorem — You can only choose 2 of 3

Network partitions will happen in any distributed system.
So the real choice is: CP (consistent but may be unavailable) vs AP (available but may be stale).

Consistency Models — Spectrum

Strong (linearizable) → Sequential (ordered globally) → Causal (cause before effect) → Eventual (will converge)

Payments need linearizable consistency (the strongest). User profile updates can be eventual. Chat message ordering can be causal.

Consistent Hashing — How distributed caches & databases route data

Regular hashing (key % N) breaks when you add/remove servers — everything remaps. Consistent hashing places both servers and keys on a ring so only K/N keys migrate when a server changes (K = keys, N = servers).

Used by: Cassandra, DynamoDB, and many Memcached client libraries. Redis Cluster takes a related approach with 16,384 fixed hash slots rather than a ring. Virtual nodes improve distribution uniformity.
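A hash ring with virtual nodes fits in about thirty lines. This is an illustrative sketch, not the implementation used by any of the systems named above; MD5 is chosen only because it is stable across runs, and `vnodes=100` is an arbitrary assumption.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._positions = []  # sorted ring positions
        self._owners = {}     # position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `vnodes` virtual points for this node on the ring."""
        for i in range(self.vnodes):
            pos = _hash(f"{node}#{i}")
            self._owners[pos] = node
            bisect.insort(self._positions, pos)

    def remove(self, node):
        self._positions = [p for p in self._positions if self._owners[p] != node]
        self._owners = {p: n for p, n in self._owners.items() if n != node}

    def get(self, key):
        """Walk clockwise from the key's position to the next virtual node."""
        if not self._positions:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._positions, _hash(key)) % len(self._positions)
        return self._owners[self._positions[idx]]
```

Adding a node only takes keys from its clockwise neighbours, so the keys that move all land on the new node and everything else stays put.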

Common Failure Modes — Senior engineers must know these

Failure	Description	Mitigation
Single Point of Failure	One component takes down the whole system	Redundancy, active-active setup
Cascading Failure	One service failure overloads dependents	Circuit breaker, bulkhead pattern
Hot Partition	One shard gets disproportionate traffic	Better shard key, random suffix, celebrity handling
Split Brain	Two nodes both think they're the primary	Consensus protocol (Raft/Paxos), odd number of nodes
Network Partition	Nodes can't communicate	CAP trade-off choice, timeout + retry

MODULE 7 · STRATEGY

The Interview Method

A repeatable 5-step framework that works for every system design question — memorise and rehearse until it's automatic.

1

Clarify Requirements (2–3 min)

Never start drawing. Ask: Who uses this? What scale? Which features are MVP vs nice-to-have? What are the SLA requirements (latency, availability)? Interviewers love this — it shows senior thinking. Example: "Before I start, let me clarify — are we designing for global users or India-only? Do we need real-time delivery receipts or is eventual OK?"

2

Estimate Scale (3–5 min)

Order-of-magnitude math. 100M users × 10 req/day = 1B req/day ÷ 86,400 sec ≈ 11,500 RPS. Storage: 100M users × 1KB profile = 100GB. This tells you: do you need caching? Sharding? CDN? Always walk the interviewer through your estimates — the process matters more than precision.

// Quick scale estimation template
DAU = 10M users
Reads/user/day = 20 → 200M reads/day = 2,300 RPS
Writes/user/day = 2 → 20M writes/day = 230 RPS
Avg object size = 1KB → storage growth = 20GB/day
// Peak = 3-5x average → design for 10,000 RPS peak
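The same template as a small calculator, handy for checking arithmetic while practising. The function name, the `peak_factor` default of 4, and the use of 1 KB = 1,000 bytes are assumptions for illustration.

```python
SECONDS_PER_DAY = 86_400


def estimate(dau, reads_per_user, writes_per_user, object_bytes, peak_factor=4):
    """Return average/peak request rates and daily storage growth."""
    reads_per_day = dau * reads_per_user
    writes_per_day = dau * writes_per_user
    avg_read_rps = reads_per_day / SECONDS_PER_DAY
    avg_write_rps = writes_per_day / SECONDS_PER_DAY
    return {
        "avg_read_rps": round(avg_read_rps),
        "avg_write_rps": round(avg_write_rps),
        "peak_read_rps": round(avg_read_rps * peak_factor),
        "storage_gb_per_day": writes_per_day * object_bytes / 1e9,
    }


# Numbers from the template: 10M DAU, 20 reads and 2 writes per user, 1 KB objects
result = estimate(dau=10_000_000, reads_per_user=20, writes_per_user=2, object_bytes=1_000)
print(result)
```

The averages come out close to the rounded 2,300 and 230 RPS in the template, and a 4x peak factor lands just under the 10,000 RPS design target.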
        

3

High-Level Design (5–8 min)

Draw the skeleton: Client → API Gateway → Services → Cache → Database. Start simple. Don't jump to microservices immediately — start monolith, then extract services if scale demands. Label every arrow (HTTP, WebSocket, gRPC, Kafka).

Standard High-Level Template

Client (Mobile/Web) → HTTPS → API Gateway (Auth · Rate Limit) → gRPC → Services (stateless) → Redis Cache · PostgreSQL · Kafka

4

Deep Dive a Component (10–15 min)

The interviewer will direct you — follow their lead. Common deep-dives: database schema design, API contract, caching strategy, failure handling, consistency guarantees. This is where you show seniority. Discuss trade-offs explicitly: "I chose X over Y because of Z, and the trade-off is W."

5

Identify Bottlenecks & Improvements (5 min)

Proactively identify where the system will break: "The DB write path is the likely bottleneck at 10,000 RPS. I'd shard by user_id. The read path can be cached in Redis with a 5-min TTL. For global users, add CDN and regional read replicas." This is the separator between mid-level and senior candidates.

LANGUAGE THAT SIGNALS SENIORITY

✅ Say This

"The trade-off here is..."
"I'd start simple with X, then scale to Y when..."
"Failure mode here would be... mitigated by..."
"Consistency requirement for this is [strong/eventual] because..."
"I'd instrument this with [Prometheus/Grafana] to detect..."

❌ Avoid This

Drawing 20 boxes without explanation
"I'll use microservices" (without justifying)
Jumping to Cassandra when PostgreSQL suffices
"I'll use Kubernetes" as a solution to everything
Not asking clarifying questions at the start

MODULE 8 · PRACTICE

15 Canonical Design Problems

Ordered by difficulty. Work through each — read the breakdown, then cover it and explain it out loud in 30 minutes.

1. URL Shortener (TinyURL)

BEGINNER

Key Questions to Clarify

Custom short codes or random?
Analytics needed (click counts)?
Expiry / TTL on URLs?
Read:Write ratio? (Typically 100:1 — very read-heavy)

Architecture

User → API Server → Redis Cache (shortcode→URL) → on miss → PostgreSQL (url_mappings)

Short Code Generation

Option A: MD5/SHA256 hash of long URL → take first 7 chars. Collision rate is low but handle with retry. Option B: Auto-increment ID → Base62 encode (a-z, A-Z, 0-9). 7 chars of Base62 = 62^7 = 3.5 trillion URLs. Option B is preferred — no collision, predictable space.
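Option B is short enough to show in full. A minimal sketch, assuming the ID comes from an auto-increment column; the alphabet order (digits, lowercase, uppercase) is an arbitrary choice, and any fixed ordering of the 62 symbols works.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols


def encode_base62(n: int) -> str:
    """Turn a non-negative database ID into a short code."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(code: str) -> int:
    """Recover the database ID from a short code."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Because IDs are sequential, the resulting codes are guessable; if that matters, a common follow-up is to scramble the ID before encoding.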

Key Trade-offs

Redis caches hot URLs — 80% traffic served from cache
302 redirect (temporary) vs 301 (permanent) — 301 offloads server, but loses analytics
Shard by short code hash if scale demands it

2. Rate Limiter

BEGINNER

Algorithms

Algorithm	Pros	Cons
Token Bucket	Burst allowed, smooth	Race condition on distributed
Sliding Window Counter	Accurate, fair	Memory per user
Fixed Window Counter	Simple	Boundary burst problem
Leaky Bucket	Consistent output rate	Drops bursts
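To make the first row concrete, here is a single-process token bucket. It is a sketch only: the class name and parameters are illustrative, the clock is injectable so the behaviour can be tested, and a distributed version would need the refill-and-take step to run atomically in a shared store.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def allow(self, cost=1):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Add tokens for the time that has passed, capped at capacity
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```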

Redis Implementation

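// Sliding window log with a Redis sorted set (pseudocode)
// Run these steps atomically in a Lua script to avoid check-then-add races
key = "ratelimit:user:1001"
now = current_timestamp_ms
window = 60000 // 1 minute, in ms
ZREMRANGEBYSCORE key 0 (now - window)  // drop entries older than the window
count = ZCARD key
if count < limit:
    ZADD key now (now + ":" + request_id)  // unique member so same-ms requests don't collide
    PEXPIRE key window                     // TTL in ms so idle keys clean themselves up
    allow()
else: reject()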
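The same sliding-window logic, written as runnable single-process Python for practice. This is an in-memory stand-in, not a Redis client: a `deque` of timestamps per key replaces the sorted set, and the class name and parameters are hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most `limit` requests per key in any `window_seconds` span."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps, oldest first

    def allow(self, key):
        now = self._clock()
        hits = self._hits[key]
        cutoff = now - self.window_seconds
        while hits and hits[0] <= cutoff:  # same role as ZREMRANGEBYSCORE
            hits.popleft()
        if len(hits) >= self.limit:        # same role as ZCARD
            return False
        hits.append(now)                   # same role as ZADD
        return True
```

Memory grows with the number of requests inside the window, which is the "memory per user" cost listed in the algorithms table.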
          

Distributed Rate Limiter

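A single Redis instance is a SPOF; use replication with failover (Sentinel or Redis Cluster) for availability. Use Lua scripts so the check-and-update runs atomically. For global rate limiting, either route each client to a consistent region (sticky routing) or keep the counters in one shared Redis deployment.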

3. Notification Service

BEGINNER

Architecture

Any Microservice (Payment, Auth, Order...) → publish event → Kafka Topic: notifications

Consumers: Email Worker (SendGrid) · SMS Worker (Twilio) · Push Worker (FCM/APNs) · In-App Worker (WebSocket)

Key Design Points

Fan-out: one event → multiple channels
Idempotency: deduplicate with notification_id to avoid double-send
User preferences: respect opt-out per channel per category
Retry with exponential backoff for failed deliveries
Dead letter queue (DLQ) for permanently failed messages

4. Chat System (WhatsApp)

INTERMEDIATE

Connection Layer

Use WebSockets (persistent bidirectional) for real-time messaging. HTTP long-polling is a fallback. Each user connects to a chat server — need a routing layer to find which server a user is on.

User A → WebSocket → Chat Server 1 → Message Queue (Kafka) → Chat Server 2 → WebSocket → User B

Message Storage

Cassandra: ideal for chat — time-series, high write throughput, partition by conversation_id
Schema: (conv_id, message_id, sender_id, content, timestamp, status)
Message IDs: use Snowflake IDs (time-sortable, unique across nodes)
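A Snowflake-style ID packs a timestamp, a worker number, and a per-millisecond sequence into one integer, which is why the IDs sort by time. The sketch below assumes a 41/10/12 bit split and an arbitrary custom epoch; both are illustrative choices, not a specification.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch (assumption)


class SnowflakeGenerator:
    """64-bit IDs: 41-bit ms timestamp | 10-bit worker id | 12-bit sequence."""

    def __init__(self, worker_id, clock_ms=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self._clock_ms = clock_ms
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4096 IDs used in this millisecond: wait for the next one
                    while now <= self._last_ms:
                        now = self._clock_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self._sequence
```

Two chat servers with different `worker_id` values never collide, and sorting by ID approximates sorting by send time.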

Online Presence

Heartbeat every 30s → update Redis key "online:user_id" with TTL 60s. If key expires, user is offline. For "last seen": store timestamp in Redis, batch-persist to DB every 5 min.

5. BookMyShow (Seat Booking)

INTERMEDIATE

The Core Challenge

Concurrency: 10,000 users trying to book the last 5 seats simultaneously. You need to prevent double-booking without sacrificing performance.

Approaches

Approach	Mechanism	Trade-off
Pessimistic Locking	SELECT FOR UPDATE on seat row	Serialized — safe but slow under load
Optimistic Locking	version column — retry on conflict	Fast but retry storms under high contention
Redis Distributed Lock	SET seat_id owner NX EX ttl	Fast, handles contention, but Redis SPOF risk
Queue + Reserve	Virtual queue, 10-min hold TTL	Best UX — user gets time to pay

Recommended: Temporary Seat Reservation

User selects seat → Redis: SET seat:show_1:seat_A5 user_1001 NX EX 600 (10-min hold; NX makes the command fail if another user already holds the seat)
Payment completes → mark seat as BOOKED in PostgreSQL, remove Redis key
Timeout → seat auto-released back to available pool
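The hold, confirm, and timeout steps can be modelled without Redis to check the logic. In this sketch `SeatHolds` is an in-memory stand-in: `hold` behaves like a set-if-absent with expiry, and the `_booked` dict stands in for the seat rows in PostgreSQL. All names are hypothetical.

```python
import time


class SeatHolds:
    def __init__(self, clock=time.monotonic):
        self._holds = {}   # seat_key -> (user_id, expires_at)
        self._booked = {}  # seat_key -> user_id (stand-in for the SQL table)
        self._clock = clock

    def hold(self, seat_key, user_id, ttl_seconds=600):
        """Reserve the seat only if nobody else holds or owns it."""
        now = self._clock()
        if seat_key in self._booked:
            return False
        current = self._holds.get(seat_key)
        if current is not None and current[1] > now:
            return False  # someone else's hold is still live
        self._holds[seat_key] = (user_id, now + ttl_seconds)
        return True

    def confirm(self, seat_key, user_id):
        """After payment: book the seat if this user's hold is still valid."""
        current = self._holds.get(seat_key)
        if current is None or current[0] != user_id or current[1] <= self._clock():
            return False
        self._booked[seat_key] = user_id
        del self._holds[seat_key]
        return True
```

A payment that completes after the hold expires fails `confirm`, so that case needs a refund path in the real design.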

6. News Feed (Facebook/Instagram)

INTERMEDIATE

Fan-out Strategies

Strategy	How	Best for
Fan-out on Write	On post, push to all followers' feeds immediately	Users with small follower counts
Fan-out on Read	On feed load, fetch and merge followed users' posts	Celebrity users (millions of followers)
Hybrid	Fan-out on write for normal users, on read for celebrities	Production (Facebook, Instagram)

Feed Storage

Redis Sorted Set per user: key = "feed:user:1001", score = timestamp. Store post_ids, not full content. Fetch top 20 post_ids → batch fetch post content from cache/DB.

7. YouTube / Netflix

ADVANCED

Video Upload Pipeline

Creator uploads raw video → Object Storage (S3), raw video → triggers Kafka: video.uploaded event

Consumers: Transcoding Workers (360p, 720p, 1080p, 4K) · Thumbnail Generator · Metadata Indexer

Output → CDN (CloudFront), served globally

Adaptive Bitrate Streaming (ABR)

Store video in HLS/DASH chunks. Player requests manifest file → selects quality based on current bandwidth → streams chunks. Redis caches popular video metadata. CDN serves video chunks from edge nodes nearest to user.

📚

The remaining 8 problems (Uber, Twitter Search, Distributed Cache, Payment Gateway, Banking Ledger, Fraud Detection, Distributed Lock, API Gateway) are covered in detail in the Fintech Focus module and through the 30-Day Roadmap exercises.

MODULE 9 · FINTECH FOCUS

Fintech & Payment System Interviews

The design problems and concepts that financial companies specifically probe. This module is your edge.

🏦

Fintech Interview Reality: Banks and payment companies care obsessively about: data integrity, idempotency, audit trails, regulatory compliance (PCI-DSS), failure recovery, and "what happens if this crashes mid-transaction?" Practice answering every design question through this lens.

💳 Design a Payment Gateway

EXPERT

The Core Challenge: Exactly-Once Payments

If a payment request times out, did it go through? The user retries — you must never charge them twice. Solution: Idempotency Keys.

POST /payments
{
  "idempotency_key": "usr-1001-order-9876",
  "amount": 1000.00,
  "currency": "INR",
  "source": "card_abc123"
}
// Client: reuse the SAME idempotency_key on every retry of this payment
// Server: check idempotency_key in DB
// If exists → return cached response (same result)
// If not → process payment → store key + result
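The server-side check can be sketched as follows. Everything here is illustrative: `PaymentService`, the `charge_fn` callback standing in for the card processor, and the in-memory dict standing in for a database table with a unique constraint on the key. Amounts are integers in minor units (paise), an assumption made to avoid float rounding.

```python
class PaymentService:
    def __init__(self, charge_fn):
        self._charge = charge_fn  # hypothetical call to the card processor
        self._responses = {}      # idempotency_key -> stored request + response

    def create_payment(self, idempotency_key, amount_minor, currency, source):
        fingerprint = (amount_minor, currency, source)
        stored = self._responses.get(idempotency_key)
        if stored is not None:
            if stored["fingerprint"] != fingerprint:
                raise ValueError("idempotency key reused with different parameters")
            return stored["response"]  # retry: same result, no second charge
        response = self._charge(amount_minor, currency, source)
        self._responses[idempotency_key] = {
            "fingerprint": fingerprint,
            "response": response,
        }
        return response
```

A production version also has to handle two identical requests arriving at the same moment, typically by inserting the key first and letting the unique constraint reject the second request.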
          

Saga Pattern — Distributed Transactions

Across microservices, you can't use a single DB transaction. Use the Saga pattern:

1. Debit user account → publish PaymentDebited
2. Reserve inventory → publish InventoryReserved
3. Create order → publish OrderCreated

If step 3 fails, compensate in reverse order: release inventory → credit account back
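The compensation flow is easiest to see as an orchestrator that remembers which steps finished. This is a simplified in-process sketch: `run_saga` and the step tuples are hypothetical, and real sagas run across services with durable state and retryable compensations.

```python
def run_saga(steps):
    """Run (name, action, compensate) steps in order.

    On the first failure, undo the completed steps in reverse order.
    Returns (succeeded, log).
    """
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((done_name := name, compensate))
    return True, log
```

If creating the order fails after the debit and the reservation succeed, the log shows the reservation released first and the account credited back second.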

Database Schema (Simplified)

accounts: (id, user_id, balance, version, updated_at)
transactions: (id, from_account, to_account, amount, status, 
              idempotency_key, created_at)
payment_events: (id, payment_id, event_type, payload, timestamp)
-- version column for optimistic locking
-- payment_events for audit trail and event sourcing
          

What Fintech Interviewers Probe

"What if the network fails after debit but before credit?" → Saga compensating transactions
"How do you prevent duplicate charges?" → Idempotency keys
"How do you ensure your ledger always balances?" → Double-entry accounting, event sourcing
"How do you handle PCI compliance?" → Tokenization, never store raw card data

📒 Design a Banking Ledger (Event Sourcing + CQRS)

EXPERT

Event Sourcing

Instead of storing current balance (mutable state), store every event that changed it. Current balance = replay all events.

// Events stored — immutable, append-only
{ event: "DEPOSIT",  amount: 5000, timestamp: ... }
{ event: "WITHDRAWAL", amount: 200,  timestamp: ... }
{ event: "TRANSFER_IN", amount: 1500, timestamp: ... }

// Current balance = signed sum of all events (withdrawals subtract)
// Audit trail is FREE — you have the full history
// Can replay to any point in time for debugging
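Replaying the three events above gives the balance, and replaying up to a timestamp gives the balance at that moment. A minimal sketch: the integer timestamps are made up for the example, and `TRANSFER_OUT` is an assumed event type that does not appear in the list above.

```python
# +1 adds to the balance, -1 subtracts from it
SIGN = {"DEPOSIT": +1, "TRANSFER_IN": +1, "WITHDRAWAL": -1, "TRANSFER_OUT": -1}


def balance_at(events, up_to=None):
    """Fold the append-only event log into a balance, optionally as of `up_to`."""
    total = 0
    for event in events:
        if up_to is not None and event["timestamp"] > up_to:
            continue
        total += SIGN[event["event"]] * event["amount"]
    return total


events = [
    {"event": "DEPOSIT", "amount": 5000, "timestamp": 1},      # illustrative timestamps
    {"event": "WITHDRAWAL", "amount": 200, "timestamp": 2},
    {"event": "TRANSFER_IN", "amount": 1500, "timestamp": 3},
]
print(balance_at(events))           # current balance
print(balance_at(events, up_to=2))  # balance before the transfer arrived
```

Replaying a long history on every read is slow, which is why real systems keep periodic snapshots or a projection of the current balance.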
          

CQRS (Command Query Responsibility Segregation)

Separate write model (commands: debit, credit) from read model (queries: current balance, transaction history). Write to event store → async project to read-optimized views in Redis/PostgreSQL.

Debit Command → Command Handler (validates + writes event) → Event Store (append-only) → Projections (update read views)

Double-Entry Accounting

Every financial transaction debits one account and credits another. The sum of all debits must always equal sum of all credits — this is how you verify ledger integrity. Never violate this rule in a banking system design.
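The invariant is simple to enforce if the only way to write is a function that posts both sides together. A small sketch with hypothetical account names and amounts in integer minor units; a real ledger would store these entries in an ACID database and post both rows in one transaction.

```python
class Ledger:
    """Append-only journal where every posting has a debit and a credit side."""

    def __init__(self):
        self.entries = []

    def post(self, debit_account, credit_account, amount_minor):
        if amount_minor <= 0:
            raise ValueError("amount must be positive")
        if debit_account == credit_account:
            raise ValueError("debit and credit accounts must differ")
        # Both sides are appended together, so the totals can never diverge
        self.entries.append({"account": debit_account, "debit": amount_minor, "credit": 0})
        self.entries.append({"account": credit_account, "debit": 0, "credit": amount_minor})

    def is_balanced(self):
        """Integrity check: total debits must equal total credits."""
        total_debits = sum(e["debit"] for e in self.entries)
        total_credits = sum(e["credit"] for e in self.entries)
        return total_debits == total_credits

    def net_debits(self, account):
        """Debits minus credits for one account."""
        return sum(e["debit"] - e["credit"] for e in self.entries if e["account"] == account)
```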

🚨 Design a Fraud Detection Platform

EXPERT

Architecture

Transaction Event → Kafka: payments.raw

Consumers: Rule Engine ("velocity > 5/min") · Stream Processor (Flink / Spark) · ML Model Service (risk scoring)

→ Decision Service → APPROVE / DECLINE / REVIEW

→ Alert Ops Team · Block Transaction · Audit Log (Kafka)

Features to store in feature store (Redis)

Transaction velocity: number of transactions in last 1min / 5min / 1hr
Geolocation anomaly: user usually pays in Pune, now paying in London
Merchant category risk score
Device fingerprint change
Time-of-day anomaly

Latency requirement

Fraud decision must happen in <100ms for real-time payment approval. Pre-compute features in Redis. Serve ML model via low-latency inference endpoint (not batch). Rule engine executes in memory.

MODULE 10 · QUICK REFERENCE

Interview Cheat Sheet

Pin this. Review the night before every interview. 2-minute scan before walking into the room.

🗄️

Database Cheatsheet

Payments/Banking → PostgreSQL (ACID)
User profiles/Catalog → MongoDB
Time-series/Chat → Cassandra
Session/Cache → Redis
Search → Elasticsearch
Graph relationships → Neo4j
Analytics/OLAP → Redshift/BigQuery

⚡

Caching Cheatsheet

Read-heavy → Cache-Aside (Redis)
Write-through for fresh data
Write-back for high-freq non-critical
Feed ranking → Sorted Set (ZADD)
Session storage → String + TTL
Rate limiting → INCR + EXPIRE
Distributed lock → SET key value NX EX ttl

📨

Messaging Cheatsheet

Event streaming / audit → Kafka
Task queue / RPC → RabbitMQ
Payments → Exactly-once (Kafka tx)
Notifications → Kafka fan-out
Dead letter queue for failures
Idempotency key for dedup
Retry + exponential backoff

🌐

Scale Numbers

1M users → single DB fine
10M users → add read replicas
100M users → sharding needed
1B users → multi-region
10K RPS → Redis mandatory
100K RPS → Kafka + sharding
1M RPS → CDN + edge compute

🏦

Fintech Must-Knows

Idempotency key on every payment
Saga pattern for distributed tx
Event sourcing for audit trail
CQRS for read/write separation
Double-entry accounting always
Optimistic lock for balance updates
Never store raw card data (PCI)

🛡️

Reliability Patterns

Circuit Breaker (Resilience4j)
Bulkhead — isolate failures
Retry with backoff + jitter
Timeout on every external call
Health checks + readiness probes
Graceful degradation
Blue-green / canary deploys

📐

Interview Steps

1. Clarify requirements (2-3 min)
2. Estimate scale (3-5 min)
3. High-level design (5-8 min)
4. Deep dive (10-15 min)
5. Bottlenecks (5 min)
Always state trade-offs
Start simple, evolve design

🔢

Back-of-Envelope

1 day = 86,400 sec (~100K sec)
1M req/day = ~12 RPS
10M req/day = ~120 RPS
1B req/day = ~11,500 RPS
1 ASCII char = 1 byte (UTF-8: up to 4 bytes)
1 tweet = ~300 bytes
1 photo = ~300 KB avg

TECHNOLOGY QUICK MAP

API & COMMUNICATION

REST — CRUD over HTTP
gRPC — internal microservices
GraphQL — flexible client queries
WebSocket — real-time bidirectional
SSE — server push (notifications)

STORAGE

S3 — object/blob storage
HDFS — distributed file system
EBS — block storage
Elasticsearch — full-text search
Neo4j — graph data

POPULAR FRAMEWORKS

Spring Boot — microservices (Java)
Spring Kafka — Kafka integration
Spring Data JPA — ORM / SQL
Resilience4j — circuit breaker
Spring Security — JWT / OAuth2

MODULE 11 · SELF ASSESSMENT

Knowledge Quiz

10 questions covering all modules. Answer without looking at the guide first. Track your score.

QUESTION 01 / 10

You're designing a payment system. A user's payment request times out at the network level before receiving a response. What is the primary mechanism to prevent double-charging on retry?

Use a unique transaction UUID stored in the database

Idempotency key — client sends a unique key per operation; server returns cached result on duplicate

Use optimistic locking with a version column

Implement a distributed lock using Redis SETNX

QUESTION 02 / 10

Your news feed has 500 million users. Posting to a user with 10 million followers — which strategy prevents the fan-out write from being catastrophically slow?

Fan-out on write for all users — precompute all feeds

Store everything in Cassandra, read on demand

Hybrid approach — fan-out on write for normal users, fan-out on read for celebrity users

Use GraphQL subscriptions to push updates

QUESTION 03 / 10

CAP Theorem: Cassandra chooses AP (Available + Partition Tolerant). What does this mean for a fintech application?

Cassandra is ideal for payment processing — high availability is critical for banks

Cassandra may return stale data — not suitable for balance reads without additional consistency configuration; use PostgreSQL (CP) for money

Cassandra's AP means it will always show consistent balances

AP means atomic partitioning — good for banking transactions

QUESTION 04 / 10

10,000 requests simultaneously hit your server for the same cache key the moment it expires. What is this problem called and what is the best mitigation?

Cache miss storm — increase TTL to reduce expiry frequency

Cache stampede / Thundering herd — mitigate with mutex locking or background cache refresh before TTL expires

Cache invalidation problem — use write-through to prevent misses

Hot partition problem — shard your cache keys

QUESTION 05 / 10

When designing a chat system like WhatsApp, which database is most suitable for storing message history and why?

PostgreSQL — ACID transactions ensure no message is lost

MongoDB — flexible document schema handles attachments well

Cassandra — optimised for time-series append-heavy writes, partition by conversation_id for fast range reads

Redis — in-memory for real-time message delivery

QUESTION 06 / 10

Which load balancing algorithm would you choose for a system where servers have different hardware capacities (some 8-core, some 32-core)?

Round Robin — simple and fair

IP Hash — ensures session stickiness

Weighted Round Robin — assign higher weights to more powerful servers proportional to capacity

Random — statistically uniform distribution

QUESTION 07 / 10

What is the Saga pattern and when would you use it over a traditional database transaction?

A caching pattern where data is stored in sagas (time-ordered segments) for fast retrieval

A distributed consensus algorithm similar to Raft

A sequence of local transactions across microservices, each publishing events. If one fails, compensating transactions undo previous steps. Used when a single 2-phase commit across services is impractical.

An event replay pattern for recreating state from an audit log

QUESTION 08 / 10

Kafka vs RabbitMQ: A fraud detection system needs to consume each payment event across 4 different services (alerting, ML scoring, audit, analytics) — which tool fits best?

Kafka — multiple consumer groups can independently read the same event; supports replay if a consumer is down; high throughput for real-time stream processing

RabbitMQ — rich routing with exchanges ensures each service gets its message

Either works — the choice is purely based on team familiarity

Redis Pub/Sub — lowest latency for real-time event delivery

QUESTION 09 / 10

In Event Sourcing, how do you get the current balance of a bank account?

Read the 'balance' column from the accounts table

Replay all events (deposits, withdrawals, transfers) for that account from the event store, or read from a pre-computed projection

Read the latest snapshot from a time-series database

Query the CQRS command model for the current state

QUESTION 10 / 10

A senior engineer is asked "Design BookMyShow." After 2 minutes of drawing boxes, the interviewer interrupts to ask about concurrent seat booking. What's the red flag in this engineer's approach?

Using microservices instead of a monolith for a ticketing system

Not including a CDN in the architecture

Skipped the clarification and estimation phase — jumped to drawing without understanding scale, feature scope, or identifying the core challenge (concurrency). Always clarify first.

Not mentioning Kafka for async seat confirmation
