WELCOME · YOUR INTERVIEW GUIDE

System Design Interview Guide

A comprehensive, language-agnostic guide to system design interviews — covering scalability, databases, caching, messaging, distributed systems, and fintech patterns.

All Levels Welcome · Fintech Patterns · Language Agnostic · 30-Day Roadmap · 15 Canonical Problems
20 Core Concepts · 15 Design Problems · 30 Days Intensive · 8 Study Modules
💡
How to use this guide: Work through modules in sidebar order. Do the 30-Day Roadmap first to set your plan. Then dive deep into each concept. Finish each week with the Quiz. The Cheat Sheet is your interview-day companion.
🎯 What interviewers test
Interviewers don't ask basic coding questions. They ask: "Design a fraud detection system" and see if you can break down ambiguity, make trade-offs, and communicate reasoning under pressure.
🏦 Fintech priority
Financial companies specifically probe ACID vs BASE trade-offs, payment idempotency, event sourcing, and distributed ledger concepts — all covered in the Fintech Focus module.
⚡ 30-day intensive plan
Week 1: Foundations → Week 2: Beginner problems → Week 3: Intermediate problems → Week 4: Advanced & Fintech problems + mock interviews.
THE GOLDEN RULE
A confident explanation beats a perfect diagram
A candidate who says "I'd start with PostgreSQL because transactions are critical, shard by user_id if writes become a bottleneck, use Redis to reduce read pressure, and decouple notifications via Kafka" scores higher than someone drawing 20 boxes without reasoning through trade-offs.
MODULE 1 · STUDY PLAN
Your 30-Day Intensive Roadmap
4 weeks × focused themes — from zero to confident senior-level system design in 30 days.
W1

Week 1 — Core Foundations (Days 1–7)

  • Day 1–2: Scalability — vertical vs horizontal, load balancers, stateless apps
  • Day 3: Databases — SQL vs NoSQL, ACID vs BASE, when to use what
  • Day 4: Caching — Cache-Aside, Write-Through, Redis internals
  • Day 5: Messaging — Kafka vs RabbitMQ, async processing, event-driven patterns
  • Day 6–7: Distributed Systems — CAP theorem, replication, sharding, consistency models
Scalability · Databases · Caching · Messaging · CAP Theorem
W2

Week 2 — Beginner Problems (Days 8–14)

  • Day 8–9: Design URL Shortener (TinyURL) — hashing, redirection, analytics
  • Day 10: Design Pastebin — blob storage, TTL, deduplication
  • Day 11–12: Design Rate Limiter — token bucket, sliding window, Redis Lua scripts
  • Day 13–14: Design Notification Service — fan-out, push vs pull, delivery guarantees
URL Shortener · Rate Limiter · Notifications
W3

Week 3 — Intermediate Problems (Days 15–21)

  • Day 15–16: Design Chat System (WhatsApp) — WebSockets, message storage, delivery states
  • Day 17–18: Design News Feed — fan-out on write vs read, ranking, Redis sorted sets
  • Day 19–20: Design BookMyShow — seat inventory, concurrency, optimistic locking
  • Day 21: Design API Gateway — auth, rate limiting, circuit breaker, routing
WhatsApp · News Feed · BookMyShow · API Gateway
W4

Week 4 — Advanced & Fintech (Days 22–30)

  • Day 22–23: Design Payment Gateway — idempotency, 2-phase commit, saga pattern
  • Day 24–25: Design Banking Ledger — event sourcing, CQRS, double-entry accounting
  • Day 26–27: Design Fraud Detection Platform — stream processing, ML model serving, alerting
  • Day 28: Design Distributed Cache — consistent hashing, eviction policies
  • Day 29–30: Mock Interviews + Review weak spots + cheat sheet drill
Payment Gateway · Banking Ledger · Fraud Detection · Mock Interviews
⚠️
Daily commitment: Target 90–120 minutes per day. Read one concept → draw the architecture by hand → explain it out loud as if in an interview. This tri-modal learning (read, draw, speak) is the fastest way to internalize system design.
MODULE 2 · FOUNDATIONS
Scalability
Understanding how systems grow to handle millions of users — the bedrock of every system design interview.

Vertical vs Horizontal Scaling

VERTICAL SCALING (Scale Up)
Small Server
2 CPU · 4GB RAM
⬇ Upgrade
Medium Server
8 CPU · 32GB RAM
⬇ Upgrade
Big Server
32 CPU · 256GB RAM
⚠ Has a ceiling. SPOF. Downtime to upgrade.
HORIZONTAL SCALING (Scale Out)
Load Balancer
App 1
App 2
App 3
App N
✅ Scales out by adding nodes. No single app server is a SPOF (the load balancer needs its own redundancy). Rolling, zero-downtime deploys.

Standard Load Balancer Architecture

Client
DNS
Load Balancer
(Nginx / ALB)
App Servers
Stateless
DB / Cache
Shared State
Load Balancing Algorithms
Algorithm | How it works | Best for
Round Robin | Sends requests in circular order | Homogeneous servers, stateless apps
Least Connections | Routes to server with fewest active connections | Long-lived connections, heterogeneous load
IP Hash | Same client always hits same server | Session stickiness (avoid if possible)
Weighted | Higher weight = more requests | Servers with different capacities
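
A minimal sketch of two of the algorithms above, assuming an in-memory view of the server pool. The names `WeightedRoundRobin` and `least_connections` are illustrative and not taken from Nginx, ALB, or any other load balancer.

```python
import itertools


class WeightedRoundRobin:
    """Cycle through servers, repeating each one `weight` times per round."""

    def __init__(self, weights):
        # weights: {"app-1": 1, "app-2": 3} means app-2 receives 3x the requests.
        # With every weight set to 1 this is plain round robin.
        expanded = [name for name, w in weights.items() for _ in range(w)]
        self._cycle = itertools.cycle(expanded)

    def pick(self):
        return next(self._cycle)


def least_connections(active):
    """Return the server with the fewest active connections."""
    return min(active, key=active.get)


lb = WeightedRoundRobin({"app-1": 1, "app-2": 3})
picks = [lb.pick() for _ in range(8)]
```

Real load balancers interleave weighted picks more smoothly than this expanded list does; the sketch only shows the proportions.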
Stateless vs Stateful Apps — Why it matters

Horizontal scaling requires stateless app servers. If your app stores a user's session in local memory, a different server can't serve that user's next request. Instead:

  • Store sessions in Redis (shared, fast)
  • Use JWTs (state lives in the token itself)
  • Externalize all user state to the database layer
💡
Interview line: "I'll make app servers stateless by externalizing session state to Redis, which allows any server to handle any request — enabling true horizontal scaling."
CDN — Content Delivery Network

A CDN caches static assets (images, CSS, JS, videos) at edge nodes globally, so users receive content from the nearest server rather than your origin.

User (Mumbai)
CDN Edge (Mumbai)
Cache Hit ✅
Cache Miss scenario:
User (Mumbai)
CDN Edge
Miss ❌
Origin Server
(US East)
CDN caches it

Tools: Cloudflare, AWS CloudFront, Akamai. Mention CDN whenever the problem involves serving media at scale (YouTube, Netflix, Instagram).

🎙️
Interview script: "I'll start with a single server, then extract the DB layer. Once traffic grows, I'll add a load balancer with multiple stateless app instances. For global users, a CDN handles static assets, and Redis handles session state."
MODULE 3 · FOUNDATIONS
Databases
The most contested topic in system design — knowing exactly when to use SQL vs NoSQL, and how to scale either.

SQL vs NoSQL — Decision Tree

Need ACID transactions?
Payments, banking, inventory
YES ↓
SQL Database
PostgreSQL / MySQL
NO ↓
Flexible schema?
User profiles, catalogs
YES
MongoDB
Document
Time-series
Cassandra
Wide-column
K-V
DynamoDB
Key-Value
🐘 PostgreSQL / MySQL
ACID Transactions Joins
Use when: Financial transactions, inventory, user accounts, any domain needing consistency guarantees.
Weakness: Schema migrations are painful; harder to scale writes horizontally.
🍃 MongoDB
Document Flexible Schema
Use when: Catalogs, user profiles, CMS, data with varying structure.
Weakness: Multi-document ACID transactions arrived only in v4.0 (v4.2 for sharded clusters) and add overhead; not ideal for complex joins.
🔥 Cassandra
Wide-Column Massive Scale
Use when: Time-series data, IoT, activity feeds, Netflix-scale writes.
Weakness: Eventual consistency; no joins; queries must be designed around access patterns.
Database Sharding — How to scale writes

Sharding splits data across multiple database nodes (shards) so each node holds only a subset of the data.

Sharding by User ID (Hash-based)

user_id = 1001
Shard Router
user_id % 4
Shard 0
uid%4=0
Shard 1
uid%4=1
Shard 2
uid%4=2
Shard 3
uid%4=3
⚠️
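Hot shard problem: If one user generates 90% of traffic (celebrity user), that user's shard becomes a bottleneck. Solution: add a random suffix to the sharding key for celebrity accounts so their data spreads across shards (reads must then gather from every suffix). Consistent hashing eases rebalancing when shards are added, but does not by itself cool a single hot key.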
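
The modulo routing and the random-suffix idea can be sketched as below. This is an illustration under assumptions: the helper names, the `#` separator, and the bucket count of 8 are hypothetical choices, and MD5 is used only as a stable hash.

```python
import hashlib
import random

NUM_SHARDS = 4


def _stable_hash(key):
    # Python's built-in hash() is randomized per process, so use a digest.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


def shard_for(user_id):
    """Hash-based routing: the same user always lands on the same shard."""
    return user_id % NUM_SHARDS


def salted_shard_for_write(user_id, salt_buckets=8):
    """Spread a hot user's writes by appending a random suffix to the key."""
    suffix = random.randrange(salt_buckets)
    return _stable_hash(f"{user_id}#{suffix}") % NUM_SHARDS


def salted_shards_for_read(user_id, salt_buckets=8):
    """Every shard a salted key may live on; a read has to query all of them."""
    return {_stable_hash(f"{user_id}#{s}") % NUM_SHARDS for s in range(salt_buckets)}
```

The cost of salting shows up on the read path: one lookup becomes a scatter-gather across the returned shards, which is why it is reserved for celebrity accounts.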
Read Replicas — Scale reads separately
Writes
Primary DB
Replica 1
Reads
Replica 2
Reads
Replica 3
Reads

Replication lag means replicas may serve slightly stale data — acceptable for most reads, not acceptable for financial reads (always read from primary after a write in banking).

ACID vs BASE — Critical for payment system interviews
Property | ACID (SQL) | BASE (NoSQL)
Consistency | Strong (immediate) | Eventual (may lag)
Availability | Can sacrifice for consistency | Highly available
Best for | Payments, banking, orders | Social feeds, analytics, metrics
Example DB | PostgreSQL, MySQL | Cassandra, DynamoDB, MongoDB
🏦
Rule of thumb: Any money movement (debits, credits, transfers) must use ACID. Analytics and reporting can use BASE. Never let an interviewer catch you putting payment data in Cassandra without explaining compensating transactions.
MODULE 4 · CORE CONCEPTS
Caching
The single highest-leverage optimization in distributed systems. Know your patterns cold.

Cache Aside (Lazy Loading) — Most Common Pattern

App
Redis
Check cache first
Cache HIT ✅
Return cached data
instantly
Cache MISS ❌
Query Database
Write to Redis (TTL)
Return to App
Cache Aside
App manages cache manually. On miss: read DB → store in cache → return. Best for read-heavy workloads. Data may be stale if DB is updated without invalidating cache.
Write Through
Every write goes to cache AND DB simultaneously. Cache is always fresh. Downside: higher write latency; cache fills with data that may never be read.
Write Back (Behind)
Write to cache only, async-flush to DB later. Extremely fast writes but risk of data loss if cache crashes before flush. Use for non-critical high-frequency writes.
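
The cache-aside read path, plus invalidation on write, might look like the sketch below. `TTLCache` is a hypothetical in-memory stand-in for Redis, and a plain dict plays the database; neither is a real client API.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for Redis GET / SET-with-expiry / DEL."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or time.monotonic() >= entry[1]:
            self._data.pop(key, None)  # treat expired entries as a miss
            return None
        return entry[0]

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def delete(self, key):
        self._data.pop(key, None)


def get_user(user_id, cache, db, ttl_seconds=300):
    """Cache-aside read: check the cache, fall back to the DB, then populate."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                      # cache HIT
    row = db[user_id]                      # cache MISS: query the database
    cache.set(key, row, ttl_seconds)       # populate for later readers
    return row


def update_user(user_id, row, cache, db):
    """Write path: update the DB, then invalidate so the next read reloads."""
    db[user_id] = row
    cache.delete(f"user:{user_id}")
```

Invalidating on write (rather than updating the cached value) keeps the write path simple and avoids caching a value that a concurrent writer has already replaced.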
Redis Data Structures — Know these for interviews
Structure | Use Case | Example
String | Simple cache, counters, sessions | user:1001:session → JWT
Hash | Object with fields | user:1001 → {name, email, age}
List | Queues, recent activity | notifications:user:1001 (latest 20)
Sorted Set | Leaderboards, ranking, news feed | feed:user:1001 → posts sorted by score
Set | Unique membership, tags | online_users → {uid1, uid2, ...}
💡
Interview win: When asked about News Feed, mention Redis Sorted Sets (ZADD/ZRANGE) for ranking posts by timestamp or engagement score — shows you know the tool deeply.
Cache Eviction Policies
Policy | What it evicts | Best for
LRU | Least recently used items | General purpose — most common
LFU | Least frequently used items | Skewed access patterns (Zipf distribution)
TTL | Items past expiry time | Session data, OTPs, rate limits
Random | Random item | When all items equally likely to be accessed
⚠️
Cache Stampede / Thundering Herd: When a popular cache key expires, thousands of requests simultaneously miss and hammer the DB. Solutions: (1) Mutex locking — only one request rebuilds, others wait. (2) Background refresh — proactively refresh before TTL expires. (3) Probabilistic early expiry.
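
Mitigation (1), mutex locking, can be sketched in a single process as below. This assumes threads inside one app server; across servers the per-key lock would have to live in a shared store such as Redis. The function and variable names are illustrative.

```python
import threading

_cache = {}
_locks = {}
_locks_guard = threading.Lock()


def get_or_rebuild(key, rebuild):
    """Return the cached value; on a miss let exactly one thread rebuild it."""
    value = _cache.get(key)
    if value is not None:
        return value
    with _locks_guard:                       # create one lock object per key
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        value = _cache.get(key)              # re-check: another thread may have filled it
        if value is None:
            value = rebuild()                # the single expensive DB call
            _cache[key] = value
        return value
```

The second lookup inside the lock is what prevents the herd: waiting threads wake up, find the value already cached, and never touch the database.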
MODULE 5 · CORE CONCEPTS
Messaging & Async Systems
Decouple services, absorb traffic spikes, and build fault-tolerant pipelines with message queues and event streams.

Synchronous vs Asynchronous Processing

SYNCHRONOUS — Tight coupling
User Request
↓ waits...
App Server
↓ waits...
Email Service
↓ waits...
SMS Service
↓ 3000ms later
Response
ASYNC WITH KAFKA — Loose coupling
User Request
↓ instant
App Server
↓ publish event
Kafka Topic
user.registered
Email Worker
SMS Worker
Analytics
Kafka — Deep Dive for Interviews
Key Concepts
  • Topic: Named stream of events
  • Partition: Ordered, immutable log (parallelism unit)
  • Consumer Group: Consumers that split a topic's partitions among themselves; each partition is read by only one consumer in the group
  • Offset: Position of a message in a partition
  • Retention: Messages kept for N hours/days
When Kafka shines
  • High-throughput event streaming
  • Multiple consumers per event
  • Replay capability needed
  • Audit log / event sourcing
  • Fraud detection pipelines
  • Real-time analytics
🎙️
Interview line: "When the user registers, I'll publish a 'user.registered' event to Kafka. Multiple consumers can independently handle email, SMS, and analytics — completely decoupled, with replay capability if any consumer fails."
Kafka vs RabbitMQ — When to use which
Factor | Kafka | RabbitMQ
Model | Log-based (pull) | Queue-based (push)
Throughput | Millions/sec | Thousands/sec
Message replay | ✅ Yes (retention) | ❌ No (consumed = gone)
Complex routing | Basic (topic-based) | ✅ Rich (exchanges, bindings)
Best for | Event streaming, audit logs, payment pipelines | Task queues, RPC, job workers
Ecosystem | spring-kafka / confluent-kafka-python / sarama | spring-amqp / pika / amqp-client
Exactly-Once, At-Least-Once, At-Most-Once Delivery
Guarantee | Meaning | Risk | Use
At-Most-Once | Message may be lost, never duplicated | Data loss | Metrics, logs (loss OK)
At-Least-Once | Delivered ≥1 times, may duplicate | Duplicate processing | Most systems (idempotent consumers)
Exactly-Once | Delivered exactly once | Performance cost | Payments, ledger (critical)
🏦
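Payment critical: Payment flows need exactly-once effects. In Kafka, idempotent producers plus the transactional API give exactly-once processing within Kafka (consume → process → produce); side effects outside Kafka, such as charging a card, still need idempotency keys. Always mention idempotency keys for payment APIs.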
MODULE 6 · CORE CONCEPTS
Distributed Systems
The deep theory that separates senior engineers from mid-level ones. Master CAP, consistency, and failure modes.

CAP Theorem — You can only choose 2 of 3

[Diagram: CAP triangle with corners Consistency (same data everywhere), Availability (always responds), and Partition Tolerance. Example systems shown: SQL databases; HBase, Zookeeper; Cassandra, DynamoDB.]
Network partitions will happen in any distributed system.
So the real choice is: CP (consistent but may be unavailable) vs AP (available but may be stale).
Consistency Models — Spectrum
Strong
Linearizable
Sequential
Ordered globally
Causal
Cause before effect
Eventual
Will converge

Payments need linearizable consistency (the strongest). User profile updates can be eventual. Chat message ordering can be causal.

Consistent Hashing — How distributed caches & databases route data

Regular hashing (key % N) breaks when you add/remove servers — everything remaps. Consistent hashing places both servers and keys on a ring so only K/N keys migrate when a server changes (K = keys, N = servers).

[Diagram: hash ring with servers S1 (Redis-1), S2 (Redis-2), S3 (Redis-3) and keys k1, k2, k3 placed around it]

Used by: Cassandra, DynamoDB, and Memcached client libraries. Redis Cluster uses a related scheme of 16,384 fixed hash slots rather than a ring. Virtual nodes improve distribution uniformity.
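
A minimal hash ring with virtual nodes, as a sketch rather than any database's actual implementation. The class name `HashRing`, the `node#i` virtual-node naming, and the default of 100 virtual nodes are assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each server appears at many positions so load spreads evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get(self, key):
        """Walk clockwise to the first virtual node at or after the key."""
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring
```

Adding a fourth server to a three-server ring moves only the keys that now fall just before the new server's virtual nodes, roughly a quarter of them, and every moved key lands on the new server.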

Common Failure Modes — Senior engineers must know these
Failure | Description | Mitigation
Single Point of Failure | One component takes down the whole system | Redundancy, active-active setup
Cascading Failure | One service failure overloads dependents | Circuit breaker, bulkhead pattern
Hot Partition | One shard gets disproportionate traffic | Better shard key, random suffix, celebrity handling
Split Brain | Two nodes both think they're the primary | Consensus protocol (Raft/Paxos), odd number of nodes
Network Partition | Nodes can't communicate | CAP trade-off choice, timeout + retry
MODULE 7 · STRATEGY
The Interview Method
A repeatable 5-step framework that works for every system design question — memorise and rehearse until it's automatic.
1

Clarify Requirements (2–3 min)

Never start drawing. Ask: Who uses this? What scale? Which features are MVP vs nice-to-have? What are the SLA requirements (latency, availability)? Interviewers love this — it shows senior thinking. Example: "Before I start, let me clarify — are we designing for global users or India-only? Do we need real-time delivery receipts or is eventual OK?"

2

Estimate Scale (3–5 min)

Order-of-magnitude math. 100M users × 10 req/day = 1B req/day ÷ 86,400 sec ≈ 11,500 RPS. Storage: 100M users × 1KB profile = 100GB. This tells you: do you need caching? Sharding? CDN? Always walk the interviewer through your estimates — the process matters more than precision.

// Quick scale estimation template
DAU = 10M users
Reads/user/day = 20 → 200M reads/day = 2,300 RPS
Writes/user/day = 2 → 20M writes/day = 230 RPS
Avg object size = 1KB → storage growth = 20GB/day
// Peak = 3-5x average → design for 10,000 RPS peak
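
The same arithmetic as a small function, handy for checking your mental math while practising. The function name, parameter names, and the default peak factor of 4 (the midpoint of the 3-5x range) are assumptions.

```python
SECONDS_PER_DAY = 86_400


def estimate(dau, reads_per_user, writes_per_user, object_bytes, peak_factor=4):
    """Back-of-envelope load and storage figures from daily active users."""
    reads_rps = dau * reads_per_user / SECONDS_PER_DAY
    writes_rps = dau * writes_per_user / SECONDS_PER_DAY
    return {
        "reads_rps": round(reads_rps),
        "writes_rps": round(writes_rps),
        "storage_gb_per_day": dau * writes_per_user * object_bytes / 1e9,
        "peak_rps": round((reads_rps + writes_rps) * peak_factor),
    }


plan = estimate(dau=10_000_000, reads_per_user=20, writes_per_user=2, object_bytes=1_000)
```

For the template's inputs this gives about 2,315 read RPS, 231 write RPS, 20 GB/day of new storage, and a peak near 10,000 RPS, which matches the rounded figures above.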
3

High-Level Design (5–8 min)

Draw the skeleton: Client → API Gateway → Services → Cache → Database. Start simple. Don't jump to microservices immediately — start monolith, then extract services if scale demands. Label every arrow (HTTP, WebSocket, gRPC, Kafka).

Standard High-Level Template

Client
Mobile/Web
HTTPS
API Gateway
Auth · Rate Limit
gRPC
Services
Stateless
Redis Cache
PostgreSQL
Kafka
4

Deep Dive a Component (10–15 min)

The interviewer will direct you — follow their lead. Common deep-dives: database schema design, API contract, caching strategy, failure handling, consistency guarantees. This is where you show seniority. Discuss trade-offs explicitly: "I chose X over Y because of Z, and the trade-off is W."

5

Identify Bottlenecks & Improvements (5 min)

Proactively identify where the system will break: "The DB write path is the likely bottleneck at 10,000 RPS. I'd shard by user_id. The read path can be cached in Redis with a 5-min TTL. For global users, add CDN and regional read replicas." This is the separator between mid-level and senior candidates.

LANGUAGE THAT SIGNALS SENIORITY
✅ Say This
"The trade-off here is..."
"I'd start simple with X, then scale to Y when..."
"Failure mode here would be... mitigated by..."
"Consistency requirement for this is [strong/eventual] because..."
"I'd instrument this with [Prometheus/Grafana] to detect..."
❌ Avoid This
Drawing 20 boxes without explanation
"I'll use microservices" (without justifying)
Jumping to Cassandra when PostgreSQL suffices
"I'll use Kubernetes" as a solution to everything
Not asking clarifying questions at the start
MODULE 8 · PRACTICE
15 Canonical Design Problems
Ordered by difficulty. Work through each — read the breakdown, then cover it and explain it out loud in 30 minutes.

1. URL Shortener (TinyURL)

BEGINNER

Key Questions to Clarify

  • Custom short codes or random?
  • Analytics needed (click counts)?
  • Expiry / TTL on URLs?
  • Read:Write ratio? (Typically 100:1 — very read-heavy)

Architecture

User
API Server
Redis Cache
shortcode→URL
→ miss →
PostgreSQL
url_mappings

Short Code Generation

Option A: Hash the long URL (MD5/SHA-256) and take the first 7 characters. Collisions are rare but possible, so check for an existing code and retry on conflict.
Option B: Auto-increment ID → Base62 encode (a-z, A-Z, 0-9). 7 chars of Base62 = 62^7 ≈ 3.5 trillion URLs.
Option B is preferred: no collisions and a predictable key space. The trade-off is that sequential IDs make short codes guessable.
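
Option B can be sketched as follows. The alphabet order (digits, then lowercase, then uppercase) is an arbitrary choice; any fixed ordering of the 62 symbols works as long as encode and decode agree.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols


def encode_base62(n):
    """Turn a non-negative auto-increment ID into a short code."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(code):
    """Recover the numeric ID from a short code."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

The largest ID that still fits in 7 characters is 62^7 - 1 = 3,521,614,606,207.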

Key Trade-offs

  • Redis caches hot URLs — 80% traffic served from cache
  • 302 redirect (temporary) vs 301 (permanent): browsers cache a 301, which offloads the server but hides repeat clicks from analytics
  • Shard by short code hash if scale demands it

2. Rate Limiter

BEGINNER

Algorithms

Algorithm | Pros | Cons
Token Bucket | Burst allowed, smooth | Race conditions when distributed
Sliding Window Counter | Accurate, fair | Memory per user
Fixed Window Counter | Simple | Boundary burst problem
Leaky Bucket | Consistent output rate | Drops bursts

Redis Implementation

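// Sliding window with Redis sorted sets
key    = "ratelimit:user:1001"
now    = current_timestamp_ms
window = 60000                           // 1 minute, in ms
ZREMRANGEBYSCORE key 0 (now - window)    // remove entries older than the window
count = ZCARD key
if count < limit:
    ZADD key now unique_member           // score = now; member must be unique per request (e.g. now plus a random suffix)
    EXPIRE key 60                        // seconds, so idle keys clean themselves up
    allow()
else:
    reject()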

Distributed Rate Limiter

A single Redis instance is a SPOF; use Redis Cluster or a replicated setup for availability. Wrap the check-and-add steps in a Lua script so they run atomically. For global rate limiting, use a distributed counter with sticky sessions or a global Redis.
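
The same sliding-window logic in a single process, as a sketch for reasoning about the algorithm. It assumes one app server; the class name and the injectable `clock` parameter (used to make the behaviour testable) are illustrative.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # user_id -> request timestamps, oldest first

    def allow(self, user_id):
        now = self.clock()
        hits = self._hits[user_id]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Memory grows with `limit` per active user, which is the "memory per user" cost listed in the table; the counter-based variants trade some accuracy for a fixed footprint.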

3. Notification Service

BEGINNER

Architecture

Any Microservice (Payment, Auth, Order...)
↓ publish event
Kafka Topic: notifications
Email Worker
SendGrid
SMS Worker
Twilio
Push Worker
FCM/APNs
In-App Worker
WebSocket

Key Design Points

  • Fan-out: one event → multiple channels
  • Idempotency: deduplicate with notification_id to avoid double-send
  • User preferences: respect opt-out per channel per category
  • Retry with exponential backoff for failed deliveries
  • Dead letter queue (DLQ) for permanently failed messages

4. Chat System (WhatsApp)

INTERMEDIATE

Connection Layer

Use WebSockets (persistent bidirectional) for real-time messaging. HTTP long-polling is a fallback. Each user connects to a chat server — need a routing layer to find which server a user is on.

User A
WebSocket
Chat Server 1
Message Queue
Kafka
Chat Server 2
WebSocket
User B

Message Storage

  • Cassandra: ideal for chat — time-series, high write throughput, partition by conversation_id
  • Schema: (conv_id, message_id, sender_id, content, timestamp, status)
  • Message IDs: use Snowflake IDs (time-sortable, unique across nodes)

Online Presence

Heartbeat every 30s → update Redis key "online:user_id" with TTL 60s. If key expires, user is offline. For "last seen": store timestamp in Redis, batch-persist to DB every 5 min.

5. BookMyShow (Seat Booking)

INTERMEDIATE

The Core Challenge

Concurrency: 10,000 users trying to book the last 5 seats simultaneously. You need to prevent double-booking without sacrificing performance.

Approaches

Approach | Mechanism | Trade-off
Pessimistic Locking | SELECT FOR UPDATE on seat row | Serialized — safe but slow under load
Optimistic Locking | version column — retry on conflict | Fast but retry storms under high contention
Redis Distributed Lock | SETNX seat_id with TTL | Fast, handles contention, but Redis SPOF risk
Queue + Reserve | Virtual queue, 10-min hold TTL | Best UX — user gets time to pay

Recommended: Temporary Seat Reservation

  • User selects seat → Redis: SET seat:show_1:seat_A5 user_1001 NX EX 600 (NX: succeeds only if no one holds the seat; EX 600: 10-min hold)
  • Payment completes → mark seat as BOOKED in PostgreSQL, remove Redis key
  • Timeout → seat auto-released back to available pool
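
The hold, confirm, and timeout steps above can be sketched with an in-memory stand-in for Redis. `SeatHolds` and its methods are hypothetical names; the point is the set-if-absent-with-expiry semantics, not a real client API.

```python
import time


class SeatHolds:
    """In-memory stand-in for set-if-absent with expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._holds = {}  # seat_key -> (user_id, expires_at)

    def try_hold(self, seat_key, user_id, ttl_seconds=600):
        now = self._clock()
        current = self._holds.get(seat_key)
        if current is not None and current[1] > now:
            return False                       # a live reservation already exists
        self._holds[seat_key] = (user_id, now + ttl_seconds)
        return True

    def confirm(self, seat_key, user_id):
        """Call after payment: succeeds only if this user's hold is still live."""
        current = self._holds.get(seat_key)
        if current is None or current[0] != user_id or current[1] <= self._clock():
            return False
        del self._holds[seat_key]              # caller now marks the seat BOOKED in the database
        return True
```

Checking ownership and expiry again in `confirm` covers the case where payment finishes after the hold timed out and another user grabbed the seat.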

6. News Feed (Facebook/Instagram)

INTERMEDIATE

Fan-out Strategies

Strategy | How | Best for
Fan-out on Write | On post, push to all followers' feeds immediately | Users with small follower counts
Fan-out on Read | On feed load, fetch and merge followed users' posts | Celebrity users (millions of followers)
Hybrid | Fan-out on write for normal users, on read for celebrities | Production (Facebook, Instagram)

Feed Storage

Redis Sorted Set per user: key = "feed:user:1001", score = timestamp. Store post_ids, not full content. Fetch top 20 post_ids → batch fetch post content from cache/DB.

7. YouTube / Netflix

ADVANCED

Video Upload Pipeline

Creator uploads raw video
Object Storage (S3) — raw video
↓ triggers
Kafka: video.uploaded event
Transcoding Workers
360p 720p 1080p 4K
Thumbnail Generator
Metadata Indexer
CDN (CloudFront) — serve globally

Adaptive Bitrate Streaming (ABR)

Store video in HLS/DASH chunks. Player requests manifest file → selects quality based on current bandwidth → streams chunks. Redis caches popular video metadata. CDN serves video chunks from edge nodes nearest to user.

📚
The remaining 8 problems (Uber, Twitter Search, Distributed Cache, Payment Gateway, Banking Ledger, Fraud Detection, Distributed Lock, API Gateway) are covered in detail in the Fintech Focus module and through the 30-Day Roadmap exercises.
MODULE 9 · FINTECH FOCUS
Fintech & Payment System Interviews
The design problems and concepts that financial companies specifically probe. This module is your edge.
🏦
Fintech Interview Reality: Banks and payment companies care obsessively about: data integrity, idempotency, audit trails, regulatory compliance (PCI-DSS), failure recovery, and "what happens if this crashes mid-transaction?" Practice answering every design question through this lens.

💳 Design a Payment Gateway

EXPERT

The Core Challenge: Exactly-Once Payments

If a payment request times out, did it go through? The user retries — you must never charge them twice. Solution: Idempotency Keys.

POST /payments
{
  "idempotency_key": "usr-1001-order-9876-attempt-1",
  "amount": 1000.00,
  "currency": "INR",
  "source": "card_abc123"
}
// Client: resend the same idempotency_key on every retry of this request
// Server: check idempotency_key in DB
// If exists → return cached response (same result)
// If not → process payment → store key + result
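
A minimal sketch of the server-side check, assuming an in-memory dict in place of the database. `PaymentService`, the `charges` list, and the response shape are hypothetical.

```python
class PaymentService:
    def __init__(self):
        self._responses = {}  # idempotency_key -> stored response
        self.charges = []     # stand-in for calls to the card network

    def create_payment(self, idempotency_key, amount, currency, source):
        if idempotency_key in self._responses:
            # Retry of a request we already handled: replay the first result.
            return self._responses[idempotency_key]
        self.charges.append((source, amount, currency))  # the real side effect, once
        response = {"payment_id": f"pay_{len(self.charges)}", "status": "SUCCEEDED"}
        self._responses[idempotency_key] = response
        return response
```

In a real system the key lookup and the stored result need a unique constraint and a transaction around them, so that two concurrent retries cannot both pass the "not seen yet" check.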

Saga Pattern — Distributed Transactions

Across microservices, you can't use a single DB transaction. Use the Saga pattern:

1. Debit user account → publish PaymentDebited
2. Reserve inventory → publish InventoryReserved
3. Create order → publish OrderCreated
↓ If step 3 fails
Compensate: Release inventory → Credit account back
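
The flow above can be sketched as an orchestrated saga: each step is paired with the action that undoes it. This is a simplified, in-process illustration; `run_saga` is a hypothetical helper, and a production saga would persist its progress so it can resume after a crash.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; undo completed steps on failure."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Compensate in reverse order: the last successful step is undone first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

With steps for debit, reserve, and create order, a failure in the third step triggers "release inventory" and then "credit account back", matching the compensation order shown above.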

Database Schema (Simplified)

accounts:       (id, user_id, balance, version, updated_at)
transactions:   (id, from_account, to_account, amount, status, idempotency_key, created_at)
payment_events: (id, payment_id, event_type, payload, timestamp)
-- version column for optimistic locking
-- payment_events for audit trail and event sourcing

What Fintech Interviewers Probe

  • "What if the network fails after debit but before credit?" → Saga compensating transactions
  • "How do you prevent duplicate charges?" → Idempotency keys
  • "How do you ensure your ledger always balances?" → Double-entry accounting, event sourcing
  • "How do you handle PCI compliance?" → Tokenization, never store raw card data

📒 Design a Banking Ledger (Event Sourcing + CQRS)

EXPERT

Event Sourcing

Instead of storing current balance (mutable state), store every event that changed it. Current balance = replay all events.

// Events stored — immutable, append-only
{ event: "DEPOSIT", amount: 5000, timestamp: ... }
{ event: "WITHDRAWAL", amount: 200, timestamp: ... }
{ event: "TRANSFER_IN", amount: 1500, timestamp: ... }
// Current balance = sum of signed event amounts (5000 - 200 + 1500 = 6300)
// Audit trail is FREE — you have the full history
// Can replay to any point in time for debugging
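
Replaying the event log to a balance, including "as of" a past point in time, can be sketched as below. The integer timestamps and the `TRANSFER_OUT` event type are assumptions added for illustration; amounts are treated as integer minor units.

```python
# Sign applied to each event type when folding the log into a balance.
SIGN = {"DEPOSIT": 1, "TRANSFER_IN": 1, "WITHDRAWAL": -1, "TRANSFER_OUT": -1}


def balance(events, up_to=None):
    """Fold the append-only log into a balance, optionally as of a timestamp."""
    total = 0
    for event in events:  # events are assumed to be ordered by timestamp
        if up_to is not None and event["timestamp"] > up_to:
            break
        total += SIGN[event["event"]] * event["amount"]
    return total


events = [
    {"event": "DEPOSIT", "amount": 5000, "timestamp": 1},
    {"event": "WITHDRAWAL", "amount": 200, "timestamp": 2},
    {"event": "TRANSFER_IN", "amount": 1500, "timestamp": 3},
]
```

Replaying every event on each read gets slow for long-lived accounts, which is why real systems keep periodic snapshots or a projection and replay only the events since then.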

CQRS (Command Query Responsibility Segregation)

Separate write model (commands: debit, credit) from read model (queries: current balance, transaction history). Write to event store → async project to read-optimized views in Redis/PostgreSQL.

Debit Command
Command Handler
validates + writes event
Event Store
append-only
Projections
update read views

Double-Entry Accounting

Every financial transaction debits one account and credits another. The sum of all debits must always equal sum of all credits — this is how you verify ledger integrity. Never violate this rule in a banking system design.
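
A minimal sketch of the integrity check, assuming a ledger stored as a list of entries. The entry shape and the helper names `post` and `is_balanced` are hypothetical.

```python
def post(ledger, txn_id, debit_account, credit_account, amount):
    """Append one balanced pair of entries for a single transaction."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    ledger.append({"txn": txn_id, "account": debit_account, "debit": amount, "credit": 0})
    ledger.append({"txn": txn_id, "account": credit_account, "debit": 0, "credit": amount})


def is_balanced(ledger):
    """Ledger integrity check: total debits must equal total credits."""
    return sum(e["debit"] for e in ledger) == sum(e["credit"] for e in ledger)
```

Because `post` is the only way entries are written and always writes both sides, any ledger where `is_balanced` returns False points to a bug or tampering rather than a normal state.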

🚨 Design a Fraud Detection Platform

EXPERT

Architecture

Transaction Event → Kafka: payments.raw
Rule Engine
"velocity > 5/min"
Stream Processor
Flink / Spark
ML Model Service
Risk scoring
Decision Service → APPROVE / DECLINE / REVIEW
Alert Ops Team
Block Transaction
Audit Log (Kafka)

Features to store in feature store (Redis)

  • Transaction velocity: number of transactions in last 1min / 5min / 1hr
  • Geolocation anomaly: user usually pays in Pune, now paying in London
  • Merchant category risk score
  • Device fingerprint change
  • Time-of-day anomaly

Latency requirement

Fraud decision must happen in <100ms for real-time payment approval. Pre-compute features in Redis. Serve ML model via low-latency inference endpoint (not batch). Rule engine executes in memory.

MODULE 10 · QUICK REFERENCE
Interview Cheat Sheet
Pin this. Review the night before every interview. 2-minute scan before walking into the room.
🗄️

Database Cheatsheet

  • Payments/Banking → PostgreSQL (ACID)
  • User profiles/Catalog → MongoDB
  • Time-series/Chat → Cassandra
  • Session/Cache → Redis
  • Search → Elasticsearch
  • Graph relationships → Neo4j
  • Analytics/OLAP → Redshift/BigQuery

Caching Cheatsheet

  • Read-heavy → Cache-Aside (Redis)
  • Write-through for fresh data
  • Write-back for high-freq non-critical
  • Feed ranking → Sorted Set (ZADD)
  • Session storage → String + TTL
  • Rate limiting → INCR + EXPIRE
  • Distributed lock → SET key value NX EX ttl
📨

Messaging Cheatsheet

  • Event streaming / audit → Kafka
  • Task queue / RPC → RabbitMQ
  • Payments → Exactly-once (Kafka tx)
  • Notifications → Kafka fan-out
  • Dead letter queue for failures
  • Idempotency key for dedup
  • Retry + exponential backoff
🌐

Scale Numbers

  • 1M users → single DB fine
  • 10M users → add read replicas
  • 100M users → sharding needed
  • 1B users → multi-region
  • 10K RPS → Redis mandatory
  • 100K RPS → Kafka + sharding
  • 1M RPS → CDN + edge compute
🏦

Fintech Must-Knows

  • Idempotency key on every payment
  • Saga pattern for distributed tx
  • Event sourcing for audit trail
  • CQRS for read/write separation
  • Double-entry accounting always
  • Optimistic lock for balance updates
  • Never store raw card data (PCI)
🛡️

Reliability Patterns

  • Circuit Breaker (Resilience4j)
  • Bulkhead — isolate failures
  • Retry with backoff + jitter
  • Timeout on every external call
  • Health checks + readiness probes
  • Graceful degradation
  • Blue-green / canary deploys
📐

Interview Steps

  • 1. Clarify requirements (2-3 min)
  • 2. Estimate scale (3-5 min)
  • 3. High-level design (5-8 min)
  • 4. Deep dive (10-15 min)
  • 5. Bottlenecks (5 min)
  • Always state trade-offs
  • Start simple, evolve design
🔢

Back-of-Envelope

  • 1 day = 86,400 sec (~100K sec)
  • 1M req/day = ~12 RPS
  • 10M req/day = ~120 RPS
  • 1B req/day = ~11,500 RPS
  • 1 char = 1 byte (ASCII)
  • 1 tweet = ~300 bytes
  • 1 photo = ~300 KB avg
TECHNOLOGY QUICK MAP
API & COMMUNICATION
  • REST — CRUD over HTTP
  • gRPC — internal microservices
  • GraphQL — flexible client queries
  • WebSocket — real-time bidirectional
  • SSE — server push (notifications)
STORAGE
  • S3 — object/blob storage
  • HDFS — distributed file system
  • EBS — block storage
  • Elasticsearch — full-text search
  • Neo4j — graph data
POPULAR FRAMEWORKS
  • Spring Boot — microservices (Java)
  • Spring Kafka — Kafka integration
  • Spring Data JPA — ORM / SQL
  • Resilience4j — circuit breaker
  • Spring Security — JWT / OAuth2
MODULE 11 · SELF ASSESSMENT
Knowledge Quiz
10 questions covering all modules. Answer without looking at the book first. Track your score.
QUESTION 01 / 10
You're designing a payment system. A user's payment request times out at the network level before receiving a response. What is the primary mechanism to prevent double-charging on retry?
Use a unique transaction UUID stored in the database
Idempotency key — client sends a unique key per operation; server returns cached result on duplicate
Use optimistic locking with a version column
Implement a distributed lock using Redis SETNX
QUESTION 02 / 10
Your news feed has 500 million users. Posting to a user with 10 million followers — which strategy prevents the fan-out write from being catastrophically slow?
Fan-out on write for all users — precompute all feeds
Store everything in Cassandra, read on demand
Hybrid approach — fan-out on write for normal users, fan-out on read for celebrity users
Use GraphQL subscriptions to push updates
QUESTION 03 / 10
CAP Theorem: Cassandra chooses AP (Available + Partition Tolerant). What does this mean for a fintech application?
Cassandra is ideal for payment processing — high availability is critical for banks
Cassandra may return stale data — not suitable for balance reads without additional consistency configuration; use PostgreSQL (CP) for money
Cassandra's AP means it will always show consistent balances
AP means atomic partitioning — good for banking transactions
QUESTION 04 / 10
10,000 requests simultaneously hit your server for the same cache key the moment it expires. What is this problem called and what is the best mitigation?
Cache miss storm — increase TTL to reduce expiry frequency
Cache stampede / Thundering herd — mitigate with mutex locking or background cache refresh before TTL expires
Cache invalidation problem — use write-through to prevent misses
Hot partition problem — shard your cache keys
QUESTION 05 / 10
When designing a chat system like WhatsApp, which database is most suitable for storing message history and why?
PostgreSQL — ACID transactions ensure no message is lost
MongoDB — flexible document schema handles attachments well
Cassandra — optimised for time-series append-heavy writes, partition by conversation_id for fast range reads
Redis — in-memory for real-time message delivery
QUESTION 06 / 10
Which load balancing algorithm would you choose for a system where servers have different hardware capacities (some 8-core, some 32-core)?
Round Robin — simple and fair
IP Hash — ensures session stickiness
Weighted Round Robin — assign higher weights to more powerful servers proportional to capacity
Random — statistically uniform distribution
QUESTION 07 / 10
What is the Saga pattern and when would you use it over a traditional database transaction?
A caching pattern where data is stored in sagas (time-ordered segments) for fast retrieval
A distributed consensus algorithm similar to Raft
A sequence of local transactions across microservices, each publishing events. If one fails, compensating transactions undo previous steps. Used when a single 2-phase commit across services is impractical.
An event replay pattern for recreating state from an audit log
QUESTION 08 / 10
Kafka vs RabbitMQ: A fraud detection system needs to consume each payment event across 4 different services (alerting, ML scoring, audit, analytics) — which tool fits best?
Kafka — multiple consumer groups can independently read the same event; supports replay if a consumer is down; high throughput for real-time stream processing
RabbitMQ — rich routing with exchanges ensures each service gets its message
Either works — the choice is purely based on team familiarity
Redis Pub/Sub — lowest latency for real-time event delivery
QUESTION 09 / 10
In Event Sourcing, how do you get the current balance of a bank account?
Read the 'balance' column from the accounts table
Replay all events (deposits, withdrawals, transfers) for that account from the event store, or read from a pre-computed projection
Read the latest snapshot from a time-series database
Query the CQRS command model for the current state
QUESTION 10 / 10
A senior engineer is asked "Design BookMyShow." After 2 minutes of drawing boxes, the interviewer interrupts to ask about concurrent seat booking. What's the red flag in this engineer's approach?
Using microservices instead of a monolith for a ticketing system
Not including a CDN in the architecture
Skipped the clarification and estimation phase — jumped to drawing without understanding scale, feature scope, or identifying the core challenge (concurrency). Always clarify first.
Not mentioning Kafka for async seat confirmation