Part 6: Security & Production Readiness

1. Security: Trust No One

Authentication & Authorization

Building a wall around the API isn’t enough; we need internal checkpoints.

  • The Problem: “Service A” calls “Tag Service”. Should we trust it?
  • The Zero Trust Solution: Even internal services must present a token (JWT) signed by the Gateway.
  • The Permission Check: Before adding a tag, the Tag Service asks: “Does this user actually own this Jira ticket?” This check is slow, so we cache the “Yes/No” results for 5 minutes.

Threat Models

  • The “Tag Spammer”: A malicious script adds 10,000 tags to a competitor’s repo.
    • Defense: Rate Limiting. We give each user a “bucket” of tokens. If they empty the bucket (too many writes), they get a 429 error. We also implement a hard cap: Max 100 tags per item.
  • XSS via Tag Names: A user names a tag <script>alert('hacked')</script>.
    • Defense: Sanitization. We strictly whitelist tag characters (alphanumeric and hyphens only).

2. Observability: Flying Instrument Rules

You can’t fix what you can’t see. In production, we rely on the “Three Pillars”:

Metrics (The Dashboard)

  • tag_write_latency: If this goes over 200ms, wake up the on-call engineer.
  • cache_hit_ratio: If this drops below 80%, the DB is about to die.
  • kafka_consumer_lag: “How far behind is the trending tags list?”

Logging (The Black Box)

  • Structured JSON logs are non-negotiable.
  • Bad: Log.info("Tag added") (Useless).
  • Good: {"event":"tag_added", "user_id":"u1", "latency_ms": 12, "trace_id":"xyz"}.

Tracing (The X-Ray)

  • With Jaeger/OpenTelemetry, we can follow a single request as it jumps from Gateway -> Service -> DB -> Cache. This is the only way to debug “Why was that one request slow?”

3. Capacity Planning: The Math

Assumptions:

  • 100M DAU.
  • User adds 1 tag/day on average -> 100M writes/day.
  • Write QPS = 100M / 86400 ≈ 1,150 (Avg) -> 50k (Peak Burst).
  • Storage: 100M rows/day * 100 bytes = 10GB/day = 3.6 TB/year.

Hardware Sizing:

  • DB: 3.6TB fits on modern SSDs, but IOPS is the bottleneck. Sharding solves the IOPS/CPU limit. 10 shards start.
  • Cache: 20% of hot data. 200GB Redis cluster (RAM is cheap).