Staff-Level Feedback & Practice Plan (Distributed Cache)
This note turns a real interview-style evaluation into a checklist + drills you can reuse.
1) The Feedback (verbatim, condensed)
Summary: Strong Senior. For Staff, show more proactive system-wide bottleneck detection, tighter terminology, and deeper node-level internals.
Key gaps called out:
- Back-of-envelope first: Capacity/scale math (e.g., 50TB → hundreds of nodes) should drive the architecture up front.
- Concurrency internals: Strict LRU + doubly-linked list lock contention is a classic trap.
- Terminology precision: Don’t say leaderless while describing a primary/replica write path.
- Stampede mechanics: Refresh-ahead alone isn’t the most robust protection; request coalescing (singleflight) should show up early.
2) What “Staff Ownership” Looks Like in System Design
Use this as an internal script during the interview:
A. Start with the forcing functions (numbers)
You should be able to say within the first 5–10 minutes:
- Data size: $\text{Total bytes} = \text{items} \times \text{avg item size}$ (+ overhead)
- Working set: what fraction must be “hot” in cache?
- Node count (rough): $\text{nodes} \approx \text{total bytes} \times \text{replication factor} \,/\, \text{usable RAM per node}$
- QPS per node: $\text{QPS per node} = \text{total QPS} \,/\, \text{node count}$ (sanity-check against one node's realistic ceiling)
Staff signal: you do this before you pick LB vs smart-client vs proxy.
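A quick worked pass on the 50TB example from the feedback (the replication factor, RAM per node, and total QPS here are illustrative assumptions, not numbers from the evaluation): $\text{nodes} \approx 50\,\text{TB} \times 2 \,/\, 256\,\text{GB} \approx 400$, and at an assumed 10M total QPS, $\text{QPS per node} \approx 10\text{M} / 400 = 25\text{k}$ — well within a single node's realistic ceiling, which is exactly the sanity check you say out loud.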
B. Name the top bottlenecks proactively
For a distributed cache, you can usually call these early:
- Coordinator bottleneck (if routing/membership goes through a single tier)
- Hot keys / skew (p99 latency + CPU saturation)
- Eviction/TTL overhead (metadata contention at high concurrency)
- Rebalance blast radius (network + DB load spikes)
- Stampede/dogpile (downstream collapse)
C. Keep a “control plane vs data plane” split
- Data plane: GET/SET path, latency budget, tail behavior.
- Control plane: membership, topology epochs, rebalancing, tooling.
This split prevents confused claims like “ZooKeeper is on the request path”.
3) Terminology Cheat Sheet (use exact words)
Primary/Replica (Redis-style)
- Write path: client writes primary; primary replicates to replicas.
- Consistency: depends on sync/async replication; reads may be stale.
- Failover: elect/promote a new primary.
Leaderless / Quorum (Dynamo-style)
- Write path: client writes to $W$ replicas; read from $R$ replicas.
- Consistency: tunable via $R/W/N$; handles partitions differently.
- Conflict resolution: versioning/vector clocks/last-write-wins (LWW), plus read-repair.
Rule: Don’t say “leaderless” unless you mean quorum writes/reads.
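A one-line worked example of the quorum overlap condition (values illustrative): with $N = 3$, $W = 2$, $R = 2$, we get $R + W = 4 > N$, so every read quorum intersects the last successful write quorum and sees at least one up-to-date replica; with $R = W = 1$, $R + W \le N$ and a read can miss the latest write entirely.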
4) Node Internals: The “LRU Lock” Trap and Better Defaults
The trap
A strict LRU often implies:
- A hash-map lookup, followed by
- A move-to-front mutation of a shared doubly-linked list
At high concurrency, every read becomes a write to shared metadata; the lock guarding the list effectively becomes global and can dominate CPU.
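A minimal Go sketch of the trap, for concreteness (illustrative, not a recommendation — note that even `Get` takes the exclusive lock, because reads mutate the recency list):

```go
package cache

import (
	"container/list"
	"sync"
)

// StrictLRU is the textbook structure: one map, one doubly-linked list,
// one lock. Every Get mutates the list, so under high concurrency the
// mutex serializes all readers.
type StrictLRU struct {
	mu    sync.Mutex
	cap   int
	ll    *list.List               // front = most recently used
	items map[string]*list.Element // key -> list node
}

type entry struct{ key, value string }

func NewStrictLRU(capacity int) *StrictLRU {
	return &StrictLRU{cap: capacity, ll: list.New(), items: make(map[string]*list.Element)}
}

func (c *StrictLRU) Get(key string) (string, bool) {
	c.mu.Lock() // even reads take the exclusive lock: list mutation below
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.ll.MoveToFront(el) // this write is why an RWMutex doesn't help
	return el.Value.(*entry).value, true
}

func (c *StrictLRU) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.ll.MoveToFront(el)
		return
	}
	c.items[key] = c.ll.PushFront(&entry{key, value})
	if c.ll.Len() > c.cap {
		oldest := c.ll.Back()
		c.ll.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}
```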
Better defaults to propose
- Sharded cache: partition key space; each shard has single-writer eviction structures.
- Buffered recency: record access in a per-core buffer; batch apply.
- CLOCK / second-chance: approximate LRU with lower contention.
- Sampling eviction: evict the best candidate from a small random sample of keys (how Redis approximates LRU in practice).
Staff signal: call out “strict LRU is expensive; we’ll use an approximation” unprompted.
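A hedged sketch of the sharding default above in Go; the shard count, the hash choice, and what lives in the per-shard eviction metadata are all assumptions:

```go
package cache

import (
	"hash/fnv"
	"sync"
)

// shardCount is illustrative; pick a power of two comfortably above core count.
const shardCount = 64

type shard struct {
	mu    sync.Mutex
	items map[string]string
	// Per-shard eviction metadata (CLOCK hand, recency buffer, etc.) lives
	// here, so contention is confined to 1/shardCount of the key space
	// instead of one global LRU list.
}

type ShardedCache struct {
	shards [shardCount]*shard
}

func NewShardedCache() *ShardedCache {
	c := &ShardedCache{}
	for i := range c.shards {
		c.shards[i] = &shard{items: make(map[string]string)}
	}
	return c
}

func (c *ShardedCache) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return c.shards[h.Sum32()%shardCount]
}

func (c *ShardedCache) Get(key string) (string, bool) {
	s := c.shardFor(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.items[key]
	return v, ok
}

func (c *ShardedCache) Set(key, value string) {
	s := c.shardFor(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[key] = value
}
```

Each lock now guards 1/64th of the key space, so recency updates stop serializing the whole cache; CLOCK or sampled eviction can then run independently per shard.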
5) Stampede / Thundering Herd: What to Always Mention
Your “stampede toolkit” list should include:
- Request coalescing (singleflight) per key
- TTL jitter to avoid synchronized expirations
- Stale-while-revalidate (serve stale briefly while refreshing)
- Probabilistic early refresh (each hit may trigger a refresh before expiry, with probability rising as the TTL nears)
- Negative caching (cache misses for short TTL)
Staff signal: coalescing is the “backend protection” anchor.
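A sketch of the first two toolkit items in Go, using the real golang.org/x/sync/singleflight package; `cacheGet`, `cacheSet`, and `loadFromDB` are hypothetical stand-ins for your cache client and database loader:

```go
package cache

import (
	"math/rand"
	"time"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// jitteredTTL spreads expirations over [0.9*base, 1.1*base) so keys written
// together don't expire together (the ±10% band is a tunable assumption).
func jitteredTTL(base time.Duration) time.Duration {
	return base - base/10 + time.Duration(rand.Int63n(int64(base/5)))
}

// GetCoalesced serves a key, collapsing concurrent misses into one backend call.
func GetCoalesced(key string) (string, error) {
	if v, ok := cacheGet(key); ok {
		return v, nil
	}
	// All concurrent misses for this key share a single loadFromDB call;
	// singleflight deduplicates in-flight work per key.
	v, err, _ := group.Do(key, func() (interface{}, error) {
		val, err := loadFromDB(key)
		if err != nil {
			return nil, err
		}
		cacheSet(key, val, jitteredTTL(5*time.Minute))
		return val, nil
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}

// Hypothetical stubs standing in for a real cache client and DB loader.
func cacheGet(key string) (string, bool)            { return "", false }
func cacheSet(key, value string, ttl time.Duration) {}
func loadFromDB(key string) (string, error)         { return "value-for-" + key, nil }
```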
6) Practice Drills (repeatable)
Drill A — 5-minute math opener
Pick a scenario and do:
1) data size, replication factor, node count
2) QPS per node
3) bandwidth sanity check (peak QPS × avg value size for node egress; miss rate × QPS for backend load)
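A tiny repeatable calculator for this drill (every constant is an illustrative assumption; substitute your scenario's numbers):

```go
package main

import "fmt"

func main() {
	// All inputs are assumptions for one practice scenario.
	const (
		totalBytes       = 50e12 // 50 TB of cached data
		replication      = 2.0   // copies of each item
		usableRAMPerNode = 256e9 // ~256 GB usable per node
		totalQPS         = 10e6  // 10M reads/sec at peak
		avgValueBytes    = 2e3   // 2 KB average value
		missRate         = 0.05  // 5% of reads fall through to the backend
	)

	nodes := totalBytes * replication / usableRAMPerNode
	qpsPerNode := totalQPS / nodes
	egressPerNode := qpsPerNode * avgValueBytes // bytes/sec served per node
	backendQPS := totalQPS * missRate           // miss traffic hitting the DB

	fmt.Printf("nodes        ≈ %.0f\n", nodes)                  // ≈ 391
	fmt.Printf("QPS per node ≈ %.0f\n", qpsPerNode)             // ≈ 25.6k
	fmt.Printf("egress/node  ≈ %.1f MB/s\n", egressPerNode/1e6) // ≈ 51.2
	fmt.Printf("backend load ≈ %.0f QPS\n", backendQPS)         // 500k
}
```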
Drill B — 3 failure modes, 3 mitigations
For each:
- “What breaks first?”
- “What will you see in metrics?”
- “How do we degrade safely?”
Suggested failures:
- hot key, stampede, partial partition, rebalancing storm
Drill C — Write path consistency
Practice saying clearly:
- “Primary/replica async replication → stale reads possible.”
- “Leaderless quorum → write/read quorum trade-offs and conflict resolution.”
7) Where This Maps Into This Series
- Node internals + eviction trade-offs: see 02-mvp-architecture.md
- Massive scale membership/topology control plane: see 03-scaling-challenges.md
- Production patterns (stampede/runbooks/security): see 06-production-mastery.md