Article 10: Battle-Tested Patterns & Production Mastery
The difference between a system that works on a whiteboard and one that survives Black Friday.
You have designed the system. Now comes the hardest part: Day 2 Operations.
This article covers the “unknown unknowns”—the reusable patterns for edge cases, the security layers for compliance, and the observability stack you need to sleep at night.
1. Five Timeless Patterns
These are not vendor features; they are logical patterns you implement in your cache client, cache service, or surrounding infrastructure.
Pattern 1: Stampede Prevention (A Toolkit, Not One Trick)
The Problem: 1 million keys expire at exactly 12:00:00. Your database receives 1 million requests at 12:00:01. This is a coordination failure: too many callers rebuild the same key at once.
The Fix: combine two or three of the techniques below.
1) TTL jitter: never use a fixed TTL.
```python
import random

# Bad: a fixed TTL makes a whole generation of keys expire at the same instant
cache.set(key, value, ttl=3600)

# Good (jitter): spread expirations across a ±5-minute window
jitter = random.randint(-300, 300)
cache.set(key, value, ttl=3600 + jitter)
```
Result: Expirations are spread over 10 minutes. Database load is smoothed.
2) Request coalescing (singleflight): one in-flight rebuild per key.
- On cache miss, acquire a per-key lock/lease so only one thread recomputes.
- Other concurrent misses wait briefly, or serve stale data if available.
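A minimal in-process sketch of coalescing, assuming a threaded client (the `SingleFlight` name and API are illustrative, and error propagation to followers is omitted for brevity; a production version also needs lock/lease timeouts):

```python
import threading

class _Call:
    """One in-progress rebuild: followers wait on `done` and reuse `result`."""
    def __init__(self):
        self.done = threading.Event()
        self.result = None

class SingleFlight:
    """Allow only one in-flight rebuild per key; concurrent misses share it."""
    def __init__(self):
        self._mu = threading.Lock()
        self._calls = {}  # key -> _Call

    def do(self, key, fn):
        with self._mu:
            call = self._calls.get(key)
            if call is not None:
                leader = False        # someone else is already rebuilding
            else:
                call = _Call()
                self._calls[key] = call
                leader = True
        if not leader:
            call.done.wait()          # follower: wait for the leader's result
            return call.result
        try:
            call.result = fn()        # leader: hit the database once
        finally:
            call.done.set()
            with self._mu:
                self._calls.pop(key, None)
        return call.result
```

Once the leader finishes, the key is released, so the next miss starts a fresh rebuild rather than reusing a stale result.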
3) Serve stale while revalidating: hide refresh latency.
- Keep a soft TTL + hard TTL.
- After soft TTL: serve stale and refresh asynchronously.
- After hard TTL: block and rebuild (or fail).
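The soft/hard TTL bookkeeping can be sketched as a small entry wrapper (names are illustrative, and the async refresh machinery itself is omitted):

```python
import time

class SwrEntry:
    """A cached value with a soft deadline (refresh) and hard deadline (unusable)."""
    def __init__(self, value, soft_ttl, hard_ttl, now=None):
        now = time.monotonic() if now is None else now
        self.value = value
        self.soft_deadline = now + soft_ttl
        self.hard_deadline = now + hard_ttl

    def state(self, now=None):
        now = time.monotonic() if now is None else now
        if now < self.soft_deadline:
            return "fresh"    # serve directly
        if now < self.hard_deadline:
            return "stale"    # serve, but trigger an async refresh
        return "expired"      # block and rebuild (or fail)
```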
Pattern 2: Circuit Breaker
The Problem: A cache node fails. The client retries. The retry fails. The client retries again. The retry storm takes down the network.
The Fix: Stop asking.
- State Closed (Normal): Requests flow.
- State Open (Broken): After 5 failures, block all requests for 10 seconds. Return “Cache Unavailable” instantly.
- State Half-Open: After 10 seconds, let 1 request through. If it works, close the circuit.
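A minimal sketch of the three states, using the thresholds above (the class name and the injectable clock are illustrative):

```python
import time

class CircuitBreaker:
    """Closed -> Open after N failures; Half-Open probe after a cooldown."""
    def __init__(self, failure_threshold=5, reset_timeout=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"   # cooldown over: let one probe through
                return True
            return False                   # fail fast: "Cache Unavailable"
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```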
Upgrade: serve stale on error
If your cache tier is flaky, it’s often better to serve slightly stale data than to fail hard. This is the same concept as HTTP caches’ stale-if-error.
Pattern 3: Hot Key Replication
The Problem: The “Justin Bieber” profile gets 100x more traffic than “User 1”. One shard melts.
The Fix: Local caching or read replicas.
- Detect: key `user:bieber` is receiving > 1000 QPS.
- Action: cache `user:bieber` on the application server (L1 cache) for 5 seconds.
- Result: 0 network requests for the hottest key.
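A hypothetical per-process hot-key cache with a short TTL might look like this (the QPS detection that decides which keys qualify is omitted):

```python
import time

class HotKeyL1:
    """Short-lived in-process cache for keys detected as hot (e.g. > 1000 QPS)."""
    def __init__(self, ttl=5.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]   # expired: fall through to the shared tier
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)
```

The 5-second TTL bounds staleness: the hottest key is served from local memory, and the shared tier is consulted at most once per process per window.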
Pattern 4: Multi-Tier Caching
The Problem: Network latency (1ms) is too slow for reading configuration flags 1000 times/sec.
The Fix:
- L1: In-process memory - ~100ns.
- L2: Distributed in-memory cache tier - ~1ms.
- L3: Database - ~20ms.
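The read path through the tiers can be sketched as a fall-through lookup with backfill (the tier objects here are plain dict-likes for illustration; a real L2 client would also need network error handling):

```python
def tiered_get(key, l1, l2, load_from_db):
    """Check tiers fastest-first; backfill faster tiers on a lower-tier hit."""
    value = l1.get(key)            # ~100ns: in-process memory
    if value is not None:
        return value
    value = l2.get(key)            # ~1ms: distributed cache over the network
    if value is not None:
        l1[key] = value            # backfill L1 for the next read
        return value
    value = load_from_db(key)      # ~20ms: source of truth
    if value is not None:
        l2[key] = value
        l1[key] = value
    return value
```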
Pattern 5: Cache Warming
The Problem: You restart your cache cluster. It is empty. 100% of requests hit the database. The database dies.
The Fix: Before switching traffic, replay the last hour of hot-key access logs against the new cache, or preload from a snapshot/backup mechanism your system supports.
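A hedged sketch of log-replay warming, assuming you can extract recently accessed keys from an access log (function and parameter names are illustrative):

```python
from collections import Counter

def warm_cache(access_log, cache, load_from_db, top_n=1000):
    """Preload the most-accessed keys from a recent access log before cutover."""
    hot_keys = [key for key, _count in Counter(access_log).most_common(top_n)]
    for key in hot_keys:
        value = load_from_db(key)
        if value is not None:
            cache[key] = value
    return hot_keys
```

Warming by popularity matters: preloading the top N keys protects the database from the bulk of the post-restart miss storm without replaying the entire log.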
2. Security: Defense in Depth
Your cache contains user sessions, PII, and financial data. It is a high-value target.
Authentication & Authorization
- VPC Isolation: The first line of defense. Cache servers should never have public IPs.
- ACLs (Access Control Lists): Service A can read/write `session:*`; Service B can only read `config:*`.
- Build enforcement into your cache front door (API gateway, sidecar, or the cache service itself).
Encryption
- In-Transit (TLS): Mandatory. Without TLS, a packet sniffer can read a `session_token` off the wire.
- At-Rest (Disk): If you use persistence (RDB/AOF), encrypt the disk (LUKS or AWS KMS).
Compliance (GDPR/PCI)
- Right to be Forgotten: When a user deletes their account, you must delete their cached data. Don’t wait for TTL.
- Card Data: Never cache PAN (Primary Account Numbers).
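Right-to-be-forgotten deletion can be sketched as a targeted purge, assuming cache keys are namespaced per user (the prefixes below are illustrative; on a real distributed cache you would maintain a per-user key index rather than scanning every key):

```python
def forget_user(cache, user_id):
    """Delete every cached entry for a user immediately; don't wait for TTL.

    Assumes keys are namespaced like 'session:<user_id>:...' or
    'user:<user_id>:...'. The `cache` here is a dict-like for illustration.
    """
    prefixes = (f"session:{user_id}:", f"user:{user_id}:")
    doomed = [k for k in cache if k.startswith(prefixes)]
    for k in doomed:
        del cache[k]
    return doomed
```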
Abuse & DoS Edge Cases
- Hash collision / HashDoS: if an attacker can force many keys into the same hash buckets, latency can degrade dramatically. Use hardened hash tables and consider adding a small amount of randomness in admission/eviction decisions so the cache can’t be gamed deterministically.
3. Production Readiness: The Stack
The Four Golden Signals
If you only monitor four things, monitor these:
- Latency: p99 and p50. (Is it slow?)
- Traffic: Request rate. (Is it spiking?)
- Errors: Connection refused, timeouts. (Is it broken?)
- Saturation: Memory usage, CPU usage. (Is it full?)
Distributed Tracing
Critically important and often forgotten. When a request takes 500ms, where was the time spent?
- App logic?
- Network?
- Cache queue?
- Serialization?
Tooling: OpenTelemetry (Jaeger/Zipkin).
Implementation: Pass a Trace-ID with every cache request.
Visual: a waterfall chart showing App -> Redis (SET) -> 2ms.
SLOs (Service Level Objectives)
Define success mathematically.
- Availability: 99.9% (max 43m downtime/month).
- Latency: 99% of requests < 5ms.
- Hit Ratio: > 90%.
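The availability number translates directly into an error budget; a quick sanity check (assuming a 30-day month):

```python
def downtime_budget_minutes(availability, days=30):
    """Minutes of allowed downtime per month implied by an availability SLO."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return (1 - availability) * total_minutes
```

`downtime_budget_minutes(0.999)` is about 43.2 minutes, which is where the 43m figure above comes from.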
4. Operational Runbooks
When the pager goes off at 3 AM, don’t think. Read.
Scenario: High Latency
- Check Saturation: Is CPU > 80%? If yes, shed load or add read replicas.
- Check Slow Operations: Look for expensive commands (e.g., full key scans, large payload serialization, synchronous eviction work).
- Check Network: Is packet loss high?
Scenario: Low Hit Ratio
- Check TTLs: Did someone deploy code that sets `TTL=0`?
- Check Eviction: Is memory full? If `used_memory_human` is near `maxmemory`, you are evicting keys to make room. Scale up RAM.
Scenario: Cascading Failure
- Enable Circuit Breakers: Stop the bleeding.
- Disable Heavy Features: Turn off “Recommendations” or “Search”.
- Restart with Backoff: Don’t restart everything at once.
Scenario: Users See Stale Data After Writes
- Check invalidation ordering: commit to the source of truth before invalidating.
- Check for invalidation races: reads between commit and delete can repopulate stale data.
- Mitigate: add lease tokens (reject cache population if invalidated since lease issued) or move to event-driven invalidation for correctness-sensitive data.
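The lease-token idea can be sketched with a monotonic sequence number (class and method names are illustrative; real leases also carry expiry times):

```python
class LeasedCache:
    """Reject cache fills whose lease predates a later invalidation."""
    def __init__(self):
        self._data = {}
        self._seq = 0              # monotonically increasing event counter
        self._invalidated_at = {}  # key -> seq of last invalidation

    def get_or_lease(self, key):
        if key in self._data:
            return self._data[key], None
        self._seq += 1
        return None, self._seq     # miss: caller rebuilds under this lease

    def invalidate(self, key):
        self._data.pop(key, None)
        self._seq += 1
        self._invalidated_at[key] = self._seq

    def put_with_lease(self, key, value, lease):
        if self._invalidated_at.get(key, 0) > lease:
            return False           # a write landed after our read: drop it
        self._data[key] = value
        return True
```

The race this closes: a reader loads a pre-write value from the database, a writer commits and invalidates, and then the reader's stale value arrives at the cache last. The lease check detects the intervening invalidation and refuses the fill.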
5. Final Interview Wisdom
You are now ready.
- Design: You know how to start simple (MVP) and scale to Billions (Sharding).
- Choose: You know why to pick Memcached (Speed), Redis (Features), or Hazelcast (Consistency).
- Deep Dive: You know about Flash storage (Aerospike) and L1 caching (Caffeine).
- Survive: You know about Circuit Breakers and Runbooks.
Go build something independent of the database.
References / Further Reading
- Request coalescing (“one lock per cache key”): https://github.com/ZiggyCreatures/FusionCache/discussions/263
- Cache stampede background + mitigation options: https://en.wikipedia.org/wiki/Cache_stampede
- HTTP stale content directives (`stale-while-revalidate`, `stale-if-error`): https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Cache-Control
- HashDoS / admission jitter discussion in a production cache design: https://github.com/ben-manes/caffeine/wiki/Design
- Commit-before-invalidate + lease tokens: https://www.systemoverflow.com/learn/caching/cache-invalidation/write-path-patterns-write-through-write-behind-and-cache-aside-with-delete-on-write