Part 6: Security & Production Readiness

1. Security: Trust No One

Building a wall around the API isn’t enough; we need internal checkpoints.

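The Problem: “Service A” calls the “Tag Service”. Should the Tag Service trust it just because the call comes from inside the network?
The Zero Trust Solution: No. Even internal services must present a token (JWT) signed by the Gateway, and the Tag Service verifies that signature on every call.
The Permission Check: Before adding a tag, the Tag Service asks: “Does this user actually own this Jira ticket?” This check is slow, so we cache each “Yes/No” answer per (user, ticket) pair for 5 minutes. The trade-off is that a revoked permission can keep working for up to 5 minutes.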
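Here is a minimal sketch of both checkpoints using only the Python standard library. It makes several assumptions that the text does not state. The Gateway is assumed to sign with HS256 and a shared secret, although many setups use RS256/ES256 so that services hold only a public key. Tokens are assumed to carry an "exp" claim. The names issue_token, verify_token, PermissionCache and GATEWAY_SECRET are hypothetical. A production service would use a maintained JWT library instead of hand-rolled verification.

```python
import base64
import hashlib
import hmac
import json
import time

# ASSUMPTION: HS256 with a shared secret; the text only says "signed by the Gateway".
GATEWAY_SECRET = b"demo-secret"  # hypothetical; load from a secret store in practice


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def b64url_decode(segment: str) -> bytes:
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _sign(signing_input: str) -> bytes:
    return hmac.new(GATEWAY_SECRET, signing_input.encode(), hashlib.sha256).digest()


def issue_token(claims: dict) -> str:
    """Gateway side (hypothetical helper): mint an HS256 JWT."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{b64url_encode(_sign(signing_input))}"


def verify_token(token: str) -> dict:
    """Tag Service side: accept only unexpired tokens that the Gateway signed."""
    parts = token.split(".")
    if len(parts) != 3:
        raise PermissionError("malformed token")
    header, payload, signature = parts
    if json.loads(b64url_decode(header)).get("alg") != "HS256":
        raise PermissionError("unexpected algorithm")  # rejects "alg": "none" tricks
    if not hmac.compare_digest(_sign(f"{header}.{payload}"), b64url_decode(signature)):
        raise PermissionError("bad signature")
    claims = json.loads(b64url_decode(payload))
    if claims.get("exp", 0) < time.time():
        raise PermissionError("token expired")
    return claims


class PermissionCache:
    """Caches slow ownership answers (both Yes and No) for a fixed TTL."""

    def __init__(self, check, ttl_seconds=300, clock=time.monotonic):
        self._check = check        # the slow "does user own ticket?" lookup
        self._ttl = ttl_seconds    # 300 s = the 5 minutes from the text
        self._clock = clock
        self._entries = {}         # (user_id, ticket_id) -> (allowed, expires_at)

    def allowed(self, user_id, ticket_id):
        key = (user_id, ticket_id)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]          # fresh cached answer, skip the slow check
        result = self._check(user_id, ticket_id)
        self._entries[key] = (result, now + self._ttl)
        return result


# Usage: verify the caller first, then consult the cached ownership check.
token = issue_token({"sub": "u1", "exp": time.time() + 60})
claims = verify_token(token)
owners = PermissionCache(lambda user, ticket: ticket == "JIRA-1")  # stand-in lookup
print(owners.allowed(claims["sub"], "JIRA-1"))  # True
```

The cache stores "No" answers as well as "Yes" answers. That way, a flood of requests that are going to fail anyway never reaches the slow ownership lookup.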

The “Tag Spammer”: A malicious script adds 10,000 tags to a competitor’s repo.
- Defense: Rate limiting with a token bucket. Each user gets a “bucket” of tokens that refills at a fixed rate, and every write spends one token. When the bucket is empty (too many writes), the request is rejected with HTTP 429 Too Many Requests. We also enforce a hard cap of 100 tags per item.

XSS via Tag Names: A user names a tag <script>alert('hacked')</script>.
- Defense: Input validation. We strictly allowlist tag characters (alphanumerics and hyphens only), so characters such as <, > and quotes are rejected before a tag is ever stored.
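The two defenses above could combine in the write path roughly as follows. This is a sketch, not the author's implementation. Several details are assumptions: the per-user limits (bursts of 10, refilling at 1 write per second), the choice of HTTP 409 for the tag cap, restricting "alphanumeric" to ASCII, and the hypothetical names TokenBucket, add_tag and bucket_for.

```python
import re
import time

# Allowlist from the text: letters, digits and hyphens only (ASCII is an assumption).
TAG_PATTERN = re.compile(r"[A-Za-z0-9-]+")
MAX_TAGS_PER_ITEM = 100  # hard cap from the text


class TokenBucket:
    """Holds up to `capacity` tokens and refills at `rate` tokens per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def try_consume(self):
        now = self._clock()
        # Refill in proportion to the time elapsed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def add_tag(bucket, item_tags, tag):
    """Return (HTTP status, message) for one add-tag write."""
    if not bucket.try_consume():
        return 429, "Too Many Requests"
    if not TAG_PATTERN.fullmatch(tag):
        return 400, "invalid tag name"   # e.g. anything containing <, > or quotes
    if tag not in item_tags and len(item_tags) >= MAX_TAGS_PER_ITEM:
        return 409, "tag limit reached"  # status code is an assumption
    item_tags.add(tag)
    return 201, "created"


buckets = {}  # user_id -> TokenBucket


def bucket_for(user_id):
    # Hypothetical limits: bursts of 10 writes, refilling at 1 write per second.
    if user_id not in buckets:
        buckets[user_id] = TokenBucket(capacity=10, rate=1.0)
    return buckets[user_id]


tags = set()
print(add_tag(bucket_for("u1"), tags, "backend"))                          # (201, 'created')
print(add_tag(bucket_for("u1"), tags, "<script>alert('hacked')</script>"))  # (400, 'invalid tag name')
```

The rate-limit check runs first, so even a stream of invalid writes drains the spammer's bucket.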

2. Observability: You Can’t Fix What You Can’t See

In production, we rely on the “Three Pillars” of observability: logs, metrics, and traces.

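Logs: Structured JSON logs are non-negotiable, because they can be filtered and aggregated by field (every line for user u1, every line for one trace_id).
Bad: Log.info("Tag added") (useless: no user, no timing, no way to tie it to a request).
Good: {"event":"tag_added", "user_id":"u1", "latency_ms": 12, "trace_id":"xyz"}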
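One way to emit the “Good” line with Python's standard logging module is sketched below. JsonFormatter is our own hypothetical subclass, not a stdlib class. Passing fields via extra={"fields": {...}} is a convention of this sketch. In practice you might reach for an existing JSON logging library instead.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {"event": record.getMessage()}
        # Structured fields arrive as record.fields via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("tag_service")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False  # avoid duplicate, unformatted lines from the root logger

log.info("tag_added", extra={"fields": {"user_id": "u1", "latency_ms": 12, "trace_id": "xyz"}})
# prints: {"event": "tag_added", "user_id": "u1", "latency_ms": 12, "trace_id": "xyz"}
```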

Traces: With OpenTelemetry instrumentation and Jaeger as the trace backend, we can follow a single request as it hops from Gateway -> Service -> DB -> Cache, with a timed span for each hop. This is the most direct way to answer “Why was that one request slow?”
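The sketch below illustrates the core idea behind distributed tracing. It is deliberately not the OpenTelemetry API, only a hypothetical stand-in built on the standard library. Each hop shares one trace_id, gets its own span_id, and records its parent, so a backend can rebuild the request tree and find where the time went. The header follows the W3C Trace Context traceparent format (version-traceid-parentid-flags), which OpenTelemetry propagates by default. The names span, new_traceparent, slowest_span and SPANS are our own.

```python
import secrets
import time
from contextlib import contextmanager

SPANS = []  # finished spans; a real tracer would export these to Jaeger


def new_traceparent():
    """Start a trace at the edge: 00-<32 hex trace-id>-<16 hex parent-id>-01 (sampled)."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


@contextmanager
def span(name, traceparent):
    version, trace_id, parent_id, flags = traceparent.split("-")
    span_id = secrets.token_hex(8)
    start = time.perf_counter()
    try:
        # Downstream calls get the same trace_id, with this span as their parent.
        yield f"{version}-{trace_id}-{span_id}-{flags}"
    finally:
        SPANS.append({
            "name": name,
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": parent_id,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


def slowest_span(trace_id):
    """Return the span with the most self time (own duration minus its children's)."""
    spans = [s for s in SPANS if s["trace_id"] == trace_id]
    child_ms = {}
    for s in spans:
        child_ms[s["parent_id"]] = child_ms.get(s["parent_id"], 0.0) + s["duration_ms"]
    return max(spans, key=lambda s: s["duration_ms"] - child_ms.get(s["span_id"], 0.0))


root = new_traceparent()
with span("gateway", root) as gw:
    with span("tag-service", gw) as svc:
        with span("db", svc):
            time.sleep(0.05)   # simulated slow query
        with span("cache", svc):
            time.sleep(0.001)
print(slowest_span(root.split("-")[1])["name"])  # db
```

The sketch ranks spans by self time rather than raw duration. The gateway span always encloses every other span, so its raw duration alone would never point at the real culprit.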

Assumptions:

Hardware Sizing:

DB: 3.6 TB fits on a single modern SSD, so capacity is not the problem; IOPS is the bottleneck. Sharding spreads both the IOPS and the CPU load across machines. We start with 10 shards, about 360 GB each.
Cache: a 200 GB Redis cluster, sized to hold 20% of the hot data (RAM is cheap).
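The sizing arithmetic and one possible way to route rows to the 10 shards are sketched below. Routing by a CRC32 hash of item_id is an assumption, since the text does not name a shard key or routing scheme. The synthetic ids are purely illustrative.

```python
import zlib
from collections import Counter

TOTAL_DB_GB = 3600   # 3.6 TB from the text (decimal units)
NUM_SHARDS = 10      # starting shard count from the text

per_shard_gb = TOTAL_DB_GB / NUM_SHARDS
print(f"{per_shard_gb:.0f} GB per shard")  # 360 GB per shard


def shard_for(item_id: str, num_shards: int = NUM_SHARDS) -> int:
    # ASSUMPTION: hash-based routing on item_id; the text does not name a shard key.
    # crc32 is stable across processes, unlike Python's salted built-in hash() for str.
    return zlib.crc32(item_id.encode()) % num_shards


# Rough balance check over synthetic ids (hypothetical workload).
load = Counter(shard_for(f"item-{i}") for i in range(10_000))
print(sorted(load.items()))
```

Plain modulo routing reshuffles most keys when the shard count changes (say, from 10 to 12). Consistent hashing or a shard lookup table avoids that, which matters if 10 shards is only the starting point.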