System Design Cheatsheet (Expert-Level, 80/20 Coverage)

If you internalize this cheatsheet, you’ll cover the majority of real-world system design discussions and interview expectations.

This is not “what is a load balancer” material. This is the stuff that decides whether your system survives its first incident.

0) The One-Minute Mental Model

System design is trade-offs under constraints.

In every design, you are balancing:

Correctness (consistency, integrity, invariants)
Latency (p50 vs p99/p99.9)
Throughput (QPS, write amplification)
Availability (degradation vs outage)
Cost (compute, storage, bandwidth, operational load)
Operability (deploy, debug, rollback, migrate)

If you can answer these 8 questions, you’re already “senior” in system design:

What’s the SLO (p99 latency, error rate, availability)?
What’s the peak (traffic, fanout, burstiness) and growth curve?
What’s the data model (entities + access patterns + invariants)?
Where do we need strong consistency vs eventual?
What are the hot keys / skew / supernodes?
What is the blast radius of a dependency failure?
What’s the backpressure strategy?
How do we migrate without downtime?

1) First 10 Minutes: Frame the Problem Like an Architect

Workload profile (write this down)

Read/write ratio (e.g., 90/10)
Data size and growth (GB/day, records/day)
Access patterns (by user, by time, by geo)
Latency target (p99, not average)
Consistency requirements (what must never be wrong?)
Fanout patterns (celebrity problem, multi-tenant whales)

“Constraints-first” checklist

Regulatory: PII, retention, deletion, audit
Geo: single-region vs multi-region active/active
Failure domains: zone, region, provider
Budget: infra cost and headcount (operations is expensive)

2) Core Building Blocks (The Standard Toolkit)

The canonical shape of modern systems

flowchart LR
	Client --> Edge[CDN/WAF/API Gateway]
	Edge --> Svc[App Services]
	Svc --> Cache[(Cache)]
	Svc --> DB[(Primary DB)]
	Svc --> Search[(Search/Index)]
	Svc --> Queue[(Queue/Stream)]
	Queue --> Worker[Async Workers]
	Worker --> DB
	Worker --> Blob[(Object Storage)]
	Svc --> Obs[Logs/Metrics/Tracing]

A useful classification

Serving path: must meet p99 latency (API + cache + DB reads)
Write path: must preserve invariants (validation + idempotency + durability)
Async path: absorbs burst, isolates failures (queues/streams + workers)
Control plane: config, deploys, migrations, feature flags

3) Data: Model First, Then Pick Storage

Data model cheatsheet

Write down:

Entities: User, Post, Order, Payment, Session
Relationships: 1:1, 1:N, N:N
Queries: “get timeline”, “search”, “recent orders”, “by tenant”, “by time range”
Invariants: “no double charge”, “unique username”, “inventory never negative”

Storage selection (fast heuristics)

War story rule

If you don’t know your access patterns, don’t design your sharding key.

4) Consistency: Decide What Must Be True

Think in invariants

Examples:

Payments: “charge exactly once” (or “at most once + reconciliation”)
Inventory: “stock never negative”
Usernames: “unique globally”
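One way to enforce the "stock never negative" invariant is to put the check inside the write itself, so there is no window between reading the stock and updating it. The sketch below uses SQLite purely for illustration; the `inventory` table, the `sku` and `stock` columns, and the `reserve_stock` helper are hypothetical names, not part of any particular system.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table with a CHECK constraint as a second line of defense."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory ("
        " sku TEXT PRIMARY KEY,"
        " stock INTEGER NOT NULL CHECK (stock >= 0))"
    )
    conn.execute("INSERT INTO inventory VALUES ('widget', 3)")
    conn.commit()
    return conn


def reserve_stock(conn: sqlite3.Connection, sku: str, qty: int) -> bool:
    """Decrement stock only if enough remains; return True if the reservation succeeded."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            # The WHERE clause carries the invariant: no row matches if stock is too low.
            "UPDATE inventory SET stock = stock - ? WHERE sku = ? AND stock >= ?",
            (qty, sku, qty),
        )
    return cur.rowcount == 1
```

The conditional `UPDATE` makes the check and the decrement a single atomic statement, which is usually simpler than a read followed by a write under a lock.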

CAP/PACELC (use it correctly)

Under a partition (P): choose Consistency (CP) or Availability (AP)
Else (E), in normal operation: choose Latency (EL) or Consistency (EC)

Consistency patterns you actually use

Default stance: strong consistency only where invariants demand it; eventual consistency everywhere else.

5) Caching: The Fast Path and the Failure Path

What caching is really for

Reduce read load
Reduce tail latency
Provide graceful degradation when DB is struggling

Common cache patterns

Cache failure playbook

Stampede: jitter TTL + request coalescing
Cold start: warm top keys, or degrade to stale
Stale tolerance: stale-while-revalidate (serve stale, refresh async)
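The stampede mitigation above (jittered TTL plus request coalescing) can be sketched as a small in-process cache. This is a minimal illustration under assumptions: `CoalescingCache`, `loader`, `base_ttl`, and `jitter` are hypothetical names, and a real deployment would more likely coalesce in front of a shared cache rather than inside one process.

```python
import random
import threading
import time
from typing import Any, Callable, Dict, Tuple


class CoalescingCache:
    """In-process cache: one loader call per missing key, TTLs spread by random jitter."""

    def __init__(self, loader: Callable[[str], Any], base_ttl: float = 60.0, jitter: float = 0.2):
        self._loader = loader
        self._base_ttl = base_ttl
        self._jitter = jitter
        self._lock = threading.Lock()
        self._entries: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expires_at)
        self._inflight: Dict[str, threading.Event] = {}   # key -> signal set when load ends

    def _ttl(self) -> float:
        # Jitter keeps keys written together from all expiring at the same instant.
        return self._base_ttl * random.uniform(1 - self._jitter, 1 + self._jitter)

    def get(self, key: str) -> Any:
        while True:
            with self._lock:
                entry = self._entries.get(key)
                if entry is not None and entry[1] > time.monotonic():
                    return entry[0]
                waiter = self._inflight.get(key)
                if waiter is None:
                    # No load in flight: this caller becomes the leader for the key.
                    self._inflight[key] = threading.Event()
                    break
            # Follower: wait for the leader to finish, then re-check the cache.
            waiter.wait()
        try:
            value = self._loader(key)
            with self._lock:
                self._entries[key] = (value, time.monotonic() + self._ttl())
            return value
        finally:
            with self._lock:
                self._inflight.pop(key).set()
```

If the leader's load fails, followers wake up, find no entry, and one of them retries as the new leader, so a single failure does not poison the key.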

6) Scale: Partitioning, Sharding, and Hot Keys

Sharding key rules

Your sharding key must:

Match the dominant access pattern
Avoid hot partitions (skew)
Be stable over time (or have a re-sharding plan)
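A quick way to test a candidate key against these rules is to hash a sample of real keys and look at the busiest shard. The sketch below is illustrative only: `shard_for` and `skew_ratio` are hypothetical helpers, and plain modulo placement is assumed for brevity (changing the shard count under modulo remaps most keys, which is exactly why a re-sharding plan matters).

```python
import hashlib
from collections import Counter
from typing import Iterable


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a hash that is stable across processes and restarts."""
    # Python's built-in hash() is randomized per process, so it is unsuitable here.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def skew_ratio(keys: Iterable[str], num_shards: int) -> float:
    """Busiest shard's load divided by the mean load; 1.0 means perfectly even."""
    counts = Counter(shard_for(k, num_shards) for k in keys)
    total = sum(counts.values())
    return max(counts.values()) / (total / num_shards)
```

Feeding it one entry per request (not per distinct key) exposes hot keys: a single celebrity key that receives most of the traffic pushes the ratio far above 1.0 no matter how many shards exist.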

Patterns that fix real problems

The most common scaling lie

“We’ll just add more shards.”

If your partition key is wrong, you don’t need more shards. You need a new key (and a migration plan).

7) Messaging: Queues vs Streams (Pick the Right Weapon)

Decision guide

Reliability patterns

Retries with backoff (bounded)
DLQ (dead-letter queue) for poison pills
Idempotent consumers (must)
Deduplication (idempotency key or content hash)
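The four reliability patterns above fit together in one consumer loop. The following is a minimal single-process sketch, assuming each message is a dict carrying a hypothetical `idempotency_key` field; `consume`, `TransientError`, and the in-memory `processed_keys` and `dead_letters` collections stand in for a real broker and a durable dedup store.

```python
import time
from typing import Callable, Dict, Iterable, List, Set


class TransientError(Exception):
    """Raised by a handler for failures that are worth retrying."""


def consume(
    messages: Iterable[Dict],
    handler: Callable[[Dict], None],
    processed_keys: Set[str],
    dead_letters: List[Dict],
    max_attempts: int = 3,
    base_delay: float = 0.01,
) -> None:
    for msg in messages:
        key = msg["idempotency_key"]
        if key in processed_keys:
            continue  # duplicate delivery: the work was already done
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
                # In a real system, record the key atomically with the side effect.
                processed_keys.add(key)
                break
            except TransientError:
                if attempt == max_attempts:
                    dead_letters.append(msg)  # retries exhausted
                else:
                    time.sleep(base_delay * 2 ** (attempt - 1))  # bounded backoff
            except Exception:
                dead_letters.append(msg)  # poison pill: retrying will not help
                break
```

Routing non-retryable failures straight to the dead-letter list keeps one bad message from blocking everything behind it.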

8) Resilience: Design for Partial Failure

The Big 6 resilience patterns

Timeouts (always) — no unbounded waits
Retries (carefully) — with jitter + budgets
Circuit breaker — stop cascading failures
Bulkheads — isolate tenants/endpoints/dependencies
Load shedding — degrade gracefully, protect core
Backpressure — push work back when overloaded
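As one concrete instance of the list above, a circuit breaker can be sketched in a few lines. This is a deliberately simple, single-threaded illustration; `CircuitBreaker`, `failure_threshold`, `reset_after`, and the injectable `clock` are hypothetical names, and production breakers usually add locking, rolling windows, and metrics.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the circuit is open is what stops a slow dependency from tying up the caller's threads and spreading the outage upstream.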

Retry math (why people get this wrong)

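Retries increase traffic during incidents, and they multiply across layers: with 3 attempts at each of 3 layers, one user request can become up to 3 × 3 × 3 = 27 calls at the bottom of the stack. If you retry everything, you can DDoS your own dependencies.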

Rules:

Retry only idempotent operations or with idempotency keys
Retry only on transient failures (timeouts, 429, 503)
Use retry budgets per service
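These rules can be combined in a small retry helper. The sketch below is one possible shape, not a reference implementation: `RetryBudget`, `call_with_retries`, and `TransientError` are hypothetical names, the budget is a simple lifetime ratio rather than a sliding window, and the wrapped operation is assumed to be idempotent.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stands in for retryable failures such as timeouts, HTTP 429, or HTTP 503."""


class RetryBudget:
    """Permit retries only while they stay below a fixed fraction of requests seen."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False


def call_with_retries(
    fn: Callable[[], T],
    budget: RetryBudget,
    max_attempts: int = 3,
    base_delay: float = 0.05,
    max_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    budget.record_request()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            # Give up when attempts are exhausted or the service-wide budget is spent.
            if attempt == max_attempts or not budget.try_spend():
                raise
            # Full jitter: sleep a random amount up to the capped exponential delay.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1))))
    raise AssertionError("unreachable: the loop always returns or raises")
```

Non-transient exceptions propagate immediately because only `TransientError` is caught, and the shared budget caps how much extra load retries can add during an incident.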

9) Latency: p99 Is the Product

Tail latency compounds

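Every dependency a request touches, whether called sequentially or in parallel, is another chance to hit a tail event, so the probability that at least one call is slow rises quickly with N. Sequential calls also add their latencies together.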
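A short worked calculation shows how fast this grows. It assumes, as a simplification, that dependencies are independent and that each one exceeds its p99 latency with probability p = 0.01.

```latex
P(\text{at least one tail event}) = 1 - (1 - p)^{N}

% With p = 0.01:
N = 10:  \quad 1 - 0.99^{10}  \approx 0.096 \quad (\text{about } 9.6\% \text{ of requests})
N = 100: \quad 1 - 0.99^{100} \approx 0.634 \quad (\text{about } 63\% \text{ of requests})
```

In other words, with 100 dependencies the "rare" p99 case of each dependency becomes the common case for the request as a whole.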

Practical fixes:

Parallelize fanout calls
Use hedged requests for critical reads
Add fallbacks (partial UI/data)
Precompute when it’s cheaper than computing on read
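A hedged request, one of the fixes listed above, can be sketched with `asyncio`: send the call, and if it has not answered within a short delay, send a second copy and take whichever succeeds first. The helper name `hedged` and the `hedge_after` parameter are hypothetical, and the sketch assumes the call is a safe, idempotent read.

```python
import asyncio
from typing import Any, Callable, Coroutine, Optional, TypeVar

T = TypeVar("T")


async def hedged(make_call: Callable[[], Coroutine[Any, Any, T]], hedge_after: float) -> T:
    """Run make_call(); if it is still pending after hedge_after seconds, race a second copy."""
    primary = asyncio.create_task(make_call())
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        return primary.result()  # fast path: no hedge needed

    backup = asyncio.create_task(make_call())
    pending = {primary, backup}
    last_exc: Optional[BaseException] = None
    try:
        while pending:
            done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                exc = task.exception()
                if exc is None:
                    return task.result()  # first successful reply wins
                last_exc = exc
        assert last_exc is not None
        raise last_exc  # both attempts failed
    finally:
        # Cancel whichever attempt is still running; a no-op for finished tasks.
        primary.cancel()
        backup.cancel()
```

Setting `hedge_after` near the dependency's observed p95 keeps the extra load small (roughly 5% more calls) while cutting off most of the tail.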

Network bottlenecks people discover too late

Connection churn (ephemeral ports, TIME_WAIT)
TLS handshakes (terminate smartly)
Cross-zone chatter (expensive + slow)

10) Observability: Debuggability Is a Feature

The “triad”

Metrics: low-cardinality aggregates (SLOs)
Logs: high-cardinality details
Traces: causality across services

Golden signals (start here)

Latency (p50/p95/p99)
Traffic (QPS)
Errors (rate + types)
Saturation (CPU, memory, queue depth, thread pools)

Cardinality warning

Never put unbounded IDs (user_id, request_id, email) into metric labels.
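To make the warning concrete, the sketch below counts requests by a bounded label set (route template and status class) instead of per-user values. It uses a plain `Counter` as a stand-in for a metrics client; `request_count`, `status_class`, and `record_request` are hypothetical names.

```python
from collections import Counter
from typing import Tuple

# Stand-in for a metrics registry: one time series per distinct label tuple.
request_count: Counter = Counter()


def status_class(status: int) -> str:
    """Collapse HTTP status codes into a handful of classes such as '2xx' or '5xx'."""
    return f"{status // 100}xx"


def record_request(route_template: str, status: int) -> None:
    # Labels come from small, fixed sets. Per-request values such as user_id or
    # request_id belong in logs or trace attributes, not in metric labels.
    labels: Tuple[str, str] = (route_template, status_class(status))
    request_count[labels] += 1
```

With this labeling the number of series is bounded by routes times status classes, regardless of how many users or requests the system sees.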

11) Security & Abuse (Often the Real Bottleneck)

Minimum viable security for “internet-facing” systems:

AuthN/AuthZ (token validation, scopes)
Rate limiting + WAF
Input validation + payload limits
Secrets management
Audit logs for sensitive actions

Abuse patterns to plan for:

Credential stuffing
Scraping and bot traffic
Multi-tenant noisy neighbor attacks

12) Migrations: The Part That Separates Theory from Reality

Safe schema changes (boring but crucial)

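Expand → dual-write → backfill → switch reads → stop old writes → contract (new writes must reach the new schema before the backfill starts, or rows written during the backfill are missed)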
Dual writes only with reconciliation and end-to-end idempotency

Data migration playbook

Backfill in small batches
Verify with checksums / sampling
Feature-flag the cutover
Keep a rollback path
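The first two steps of the playbook (small batches, then verification) might look like the sketch below. It uses SQLite for illustration only; the `users` table, the `full_name` and `display_name` columns, and both helper functions are hypothetical.

```python
import hashlib
import sqlite3


def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Copy users.full_name into users.display_name in keyset-paginated batches."""
    last_id, total = 0, 0
    while True:
        rows = conn.execute(
            "SELECT id FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return total
        first_id, last_id = rows[0][0], rows[-1][0]
        with conn:  # one short transaction per batch keeps lock time low
            cur = conn.execute(
                "UPDATE users SET display_name = full_name "
                "WHERE id BETWEEN ? AND ? AND display_name IS NULL",
                (first_id, last_id),
            )
        total += cur.rowcount


def column_checksum(conn: sqlite3.Connection, column: str) -> str:
    """Hash one column in id order; pass trusted column names only (no user input)."""
    h = hashlib.sha256()
    for (value,) in conn.execute(f"SELECT {column} FROM users ORDER BY id"):
        h.update(repr(value).encode("utf-8"))
    return h.hexdigest()
```

The `display_name IS NULL` guard makes the backfill safe to re-run after a crash, and comparing checksums of the old and new columns is a cheap check to run before flipping the cutover flag.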

13) A Compact “Design Interview” Script (Repeatable)

Requirements + SLOs
APIs + core entities
High-level architecture
Data model + storage choice
Consistency + failure modes
Scaling strategy (shards, caches, async)
Observability + ops + migrations

If you can do this smoothly, you will look like someone who has shipped production systems.

14) The 8 Red Flags (Expert Smell Test)

“We’ll add retries” (without idempotency)
“We’ll shard later” (no plan)
“We store files in the DB” (without a reason)
“We use exactly-once” (without defining it)
“We’ll use microservices for scale” (without a latency plan)
“We’ll do active-active” (without conflict resolution)
“Metrics per user” (cardinality explosion)
“No backpressure needed” (death spiral incoming)

Optional: Your Personal Mastery Loop

To go from “knows patterns” to “expert”:

Pick a system (feed, payments, chat, search).
Write its invariants.
List failure modes.
Add mitigations (timeouts, idempotency, backpressure).
Design migrations.

That’s how you build real instincts.