Article 13: Production Readiness (SRE)
The Sleep-Well Architecture
You can have the best architecture on paper, but if it wakes you up at 3 AM every night, it’s a failure. “Production Readiness” is about Observability, Alerting, and Disaster Recovery.
1. Defining SLAs, SLOs, and SLIs
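For a URL Shortener, not all requests are created equal. An SLI (Indicator) is the metric we measure, such as the fraction of redirects that succeed; an SLO (Objective) is the internal target we set for that metric; an SLA (Agreement) is the contractual promise made to customers on top of it.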
The Redirect Service (Critical)
If this is down, Twitter users who click a short link get an error instead of the destination page. The company loses money.
- SLO (Objective): 99.99% Availability.
- Latency Target: P99 < 50ms (Redirects must be instant).
The Analytics Service (Non-Critical)
If this is down, the dashboard is stale. Users probably won’t notice for an hour.
- SLO (Objective): 99.9% Availability.
- Latency Target: P99 < 5 seconds (Dashboards can be slow).
Key Takeaway: We alert aggressively on Redirect latency, but we sleep through an Analytics delay.
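To make the gap between the two objectives concrete, here is a small sketch that converts an availability SLO into an error budget (the minutes of downtime it tolerates). The 30-day window and the function name are assumptions for illustration, not something the series prescribes.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


# Compare the two services from this section over an assumed 30-day window.
for name, slo in [("redirect", 99.99), ("analytics", 99.9)]:
    print(f"{name}: {error_budget_minutes(slo):.2f} min per 30 days")
```

Under these assumptions the Redirect Service gets roughly 4.3 minutes of downtime per month and the Analytics Service roughly 43 minutes, which is why only the former justifies a 3 AM page.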
2. The 4 Golden Signals
We monitor four things:
- Latency: “How long does a redirect take?”
  - Alert: If P99 > 200ms for 5 minutes.
- Traffic: “How many requests/sec?”
  - Alert: If Traffic drops to 0 (Global Outage).
  - Alert: If Traffic spikes 10x (DDoS or Viral).
- Errors: “How many 500s?”
  - Alert: If > 1% of requests fail.
- Saturation: “How full is the CPU/RAM?”
  - Alert: If Redis Memory > 80%.
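The thresholds above can be sketched as a single evaluation function. This is a minimal illustration, not a real monitoring configuration: the `Snapshot` type, its field names, and the idea of a `baseline_rps` to define "10x" are all assumptions, and the 5-minute window is assumed to be applied upstream when the P99 value is computed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical aggregated metrics for one evaluation interval."""
    p99_latency_ms: float     # redirect P99, already sustained over 5 minutes
    requests_per_sec: float   # current traffic
    baseline_rps: float       # assumed "normal" traffic, used to detect a 10x spike
    error_rate: float         # fraction of failed requests, 0.0 to 1.0
    redis_memory_used: float  # fraction of Redis memory in use, 0.0 to 1.0


def firing_alerts(s: Snapshot) -> list[str]:
    """Return the names of the golden-signal alerts that should fire."""
    alerts = []
    if s.p99_latency_ms > 200:
        alerts.append("latency")
    if s.requests_per_sec == 0:
        alerts.append("traffic_zero")
    elif s.baseline_rps > 0 and s.requests_per_sec >= 10 * s.baseline_rps:
        alerts.append("traffic_spike")
    if s.error_rate > 0.01:
        alerts.append("errors")
    if s.redis_memory_used > 0.80:
        alerts.append("saturation")
    return alerts


healthy = Snapshot(40, 1000, 1000, 0.001, 0.5)
print(firing_alerts(healthy))  # prints []
```

In practice these rules would live in the alerting system rather than in application code; the function only shows how the four signals map to independent conditions.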
3. Deployment Strategy (Canary)
We never deploy to 100% of servers at once.
The Canary Rollout:
- Deploy new code to 1 Server (The Canary).
- Route 1% of traffic to Canary.
- Wait 5 minutes.
- Check: Did Error Rate spike? Did Latency increase?
- If Green: Deploy to remaining 99 servers.
- If Red: Auto-Rollback.
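The Green/Red check at the end of the rollout might look like the sketch below. The article does not say how large a change counts as a "spike", so the tolerances (`max_error_increase`, `max_latency_ratio`) and the `Metrics` type are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical metrics collected over the 5-minute canary window."""
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


def canary_verdict(
    baseline: Metrics,
    canary: Metrics,
    max_error_increase: float = 0.005,  # assumed tolerance: +0.5 percentage points
    max_latency_ratio: float = 1.2,     # assumed tolerance: +20% on P99
) -> str:
    """Compare the canary against the rest of the fleet and decide the next step."""
    error_spiked = canary.error_rate > baseline.error_rate + max_error_increase
    latency_up = canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio
    return "rollback" if (error_spiked or latency_up) else "promote"


fleet = Metrics(error_rate=0.001, p99_latency_ms=40)
print(canary_verdict(fleet, Metrics(error_rate=0.001, p99_latency_ms=42)))  # promote
print(canary_verdict(fleet, Metrics(error_rate=0.05, p99_latency_ms=42)))   # rollback
```

Comparing the canary to the live fleet, rather than to fixed thresholds, keeps the check meaningful when overall traffic conditions change during the rollout.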
4. Disaster Recovery (The “Oh No” Button)
Scenario 1: Redis Crash
- Impact: Redirects become slow (DB hit), but they still work.
- Plan: Auto-restart Redis. The system absorbs the load using the Postgres Read Replicas (Article 5).
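Redirects only survive a Redis crash if the lookup path treats the cache as optional. A minimal sketch of that fallback, assuming hypothetical `cache_get` and `db_get` callables and a hypothetical `CacheUnavailable` exception (a real Redis client raises its own connection error types):

```python
from typing import Callable, Optional


class CacheUnavailable(Exception):
    """Hypothetical error raised by the cache client when Redis is unreachable."""


def resolve(
    short_code: str,
    cache_get: Callable[[str], Optional[str]],
    db_get: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Look up the long URL, falling back to the database if the cache fails."""
    try:
        url = cache_get(short_code)
        if url is not None:
            return url  # cache hit: the fast path
    except CacheUnavailable:
        pass  # Redis is down: degrade to the slower database path
    return db_get(short_code)  # cache miss or cache outage


def broken_cache(code: str) -> Optional[str]:
    raise CacheUnavailable("redis unreachable")


links = {"abc123": "https://example.com/long/path"}
print(resolve("abc123", broken_cache, links.get))  # https://example.com/long/path
```

The important detail is that a cache failure is caught and handled the same way as a cache miss, so the outage shows up as higher latency rather than failed redirects.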
Scenario 2: Primary DB Crash
- Impact: Writes (Creating links) fail. Reads (Redirects) still work!
- Plan: Auto-promote Read Replica to Primary. (30 seconds downtime for writes, 0 downtime for reads).
Scenario 3: Region Outage (us-east-1 goes down)
- Impact: Everything in US East is dead.
- Plan: Route DNS to eu-west-1 (Global Tables ensure data is there).
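The failover decision itself is simple: send traffic to the first healthy region in a preference order. The sketch below only models that choice; the `pick_region` function and the health map are hypothetical, and in a real setup the health checks and record updates would be handled by the DNS provider's failover routing.

```python
from typing import Dict, List, Optional


def pick_region(preference: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first region in preference order that passes its health check."""
    for region in preference:
        if healthy.get(region, False):  # unknown regions are treated as unhealthy
            return region
    return None  # no healthy region: nothing to route to


order = ["us-east-1", "eu-west-1"]
print(pick_region(order, {"us-east-1": True, "eu-west-1": True}))   # us-east-1
print(pick_region(order, {"us-east-1": False, "eu-west-1": True}))  # eu-west-1
```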
Summary
A Senior Engineer doesn’t just build features; they build safety nets. By splitting our SLOs (Critical vs Non-Critical) and automating our Rollouts, we ensure reliability without burnout.