Production Readiness: Operating YouTube at Scale

You’ve built a beautiful system. Real-time analytics at 1M QPS. Efficient transcoding saving billions in bandwidth. Metadata consistency across PostgreSQL, Elasticsearch, Redis. The architecture is sound.

Then 2 AM rolls around. Elasticsearch cluster in APAC goes down. Kafka broker crashes. CDN sees unexpected traffic surge. Creator uploads a file with invalid codec. An alert fires. Then another. Then twenty more.

This part is about operating YouTube at scale: SLOs (Service Level Objectives), multi-region failover, incident response, security, compliance, cost optimization. The systems that separate “technically correct” from “actually works in production.”


Service Level Objectives (SLOs): The Contract with Users

SLO = the promise you make about availability and performance.

YouTube’s SLOs

Service SLO Downtime/Year Target P95 Latency Notes
Video Playback 99.9% 8.76 hours <5s startup Users forgive video quality drops; not buffering
Metadata API 99.99% 52 minutes <1s Video info, comments, likes critical path
Upload Service 99.5% 43.8 hours <30s per chunk Uploads can retry; lower SLA acceptable
Search Service 99.9% 8.76 hours <200ms Can fallback to stale index
Authentication 99.99% 52 minutes <100ms Every request depends on this

Why different SLOs?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Playback: 99.9%
  ├─ CDN handles 90% of requests from cache
  ├─ If origin fails, users get cached version (degraded but working)
  └─ Temporary buffering annoying, but not critical (users retry)

Metadata API: 99.99%
  ├─ No cache fallback (users always expect fresh info)
  ├─ Every video page load hits this
  └─ Lower SLA = 99.9% = 8.76 hours downtime/year
     1000 creators × 8.76 hours = 8760 creator-hours lost per year (unacceptable)

Upload: 99.5%
  ├─ Upload failures can retry from checkpoint
  ├─ Users expect some unreliability (network conditions)
  └─ Higher tolerance acceptable

Calculating SLO Compliance

Metric: Uptime percentage = (Total Time - Downtime) / Total Time

1
2
3
4
5
6
7
8
9
10
11
Month: Jan 2024 (744 hours)
Service: Metadata API
Target: 99.99%
Budget: 744 hours × (1 - 0.9999) = 0.0744 hours = 4.46 minutes

Incidents in Jan:
  1. Database failover: 2 minutes (2024-01-05 14:32-14:34)
  2. Cache corruption: 1 minute (2024-01-18 08:15-08:16)
  3. Network issue: 1.46 minutes (2024-01-25 22:10-22:11:27)
  
Total downtime: 4.46 minutes ✓ Met SLO

If downtime exceeded 4.46 minutes, SLO is breached. Customer refunds may apply (SLA = contractual, penalty-backed version of SLO).


Multi-Region Deployment: Surviving Disasters

One data center fails. Millions of users see service degradation. The economic impact: $1M+ per minute of downtime. Solution: Multi-region active-active deployment.

Topology

graph TB
    subgraph UserTraffic["🌍 Global User Traffic"]
        DNSGeo["DNS Geolocation<br/>Route to nearest region"]
    end
    
    DNSGeo -->|"US-East users"| USEast
    DNSGeo -->|"EU users"| EU
    DNSGeo -->|"APAC users"| APAC
    DNSGeo -->|"South America users"| SAM
    
    subgraph USEast["🗽 US-East-1 Region (Primary)"]
        USEastAPI["API Servers<br/>100 instances<br/>Autoscaling"]
        USEastPG["PostgreSQL Primary<br/>Primary writes<br/>Accepts all updates"]
        USEastES["Elasticsearch Cluster<br/>3 nodes<br/>Full index"]
        USEastRedis["Redis Cluster<br/>100 shards<br/>10 nodes"]
        USEastKafka["Kafka Cluster<br/>10 brokers<br/>Event backbone"]
        USEastCDN["CDN Edges<br/>50 nodes<br/>90% cache hit"]
        USEastMonitor["Monitoring<br/>Prometheus, ELK, Jaeger"]
    end
    
    subgraph EU["🇪🇺 EU-West-1 Region"]
        EUAPI["API Servers<br/>80 instances"]
        EUPG["PostgreSQL Read Replica<br/>Async replication<br/>2-5s lag"]
        EUES["Elasticsearch Replica<br/>Cold standby"]
        EURedis["Redis Cache<br/>Local copy<br/>hot keys"]
        EUKafka["Kafka Consumer<br/>Event replication"]
        EUCDN["CDN Edges<br/>40 nodes"]
        EUMonitor["Monitoring"]
    end
    
    subgraph APAC["🌏 AP-Southeast-1 Region"]
        APACAPI["API Servers<br/>60 instances"]
        APACPG["PostgreSQL Read Replica<br/>Async replication"]
        APACES["Elasticsearch Replica"]
        APACRedis["Redis Cache"]
        APACKafka["Kafka Consumer"]
        APACCDN["CDN Edges<br/>30 nodes"]
    end
    
    subgraph SAM["🇧🇷 SA-East-1 Region"]
        SAMAPI["API Servers<br/>30 instances"]
        SAMCDN["CDN Edges<br/>20 nodes"]
        SAMCache["Cache + Fallback"]
    end
    
    USEastPG -->|"Async replicate<br/>every change"| EUPG
    USEastPG -->|"Async replicate"| APACPG
    USEastKafka -->|"Stream events"| EUKafka
    USEastKafka -->|"Stream events"| APACKafka
    
    USEastES -->|"Daily batch<br/>push index"| EUES
    USEastES -->|"Daily batch"| APACES
    
    USEastAPI -->|"Serve US traffic"| USEastCDN
    EUAPI -->|"Serve EU traffic"| EUCDN
    APACAPI -->|"Serve APAC traffic"| APACCDN
    SAMAPI -->|"Serve SAM traffic"| SAMCDN
    
    style USEast fill:#bbdefb
    style EU fill:#c8e6c9
    style APAC fill:#ffe0b2
    style SAM fill:#f8bbd0
    style USEastPG fill:#81c784
    style EUPG fill:#64b5f6
    style APACPG fill:#64b5f6

Failover Scenarios

Scenario 1: US-East primary database fails

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
T=0:00  Primary PostgreSQL crashes (hardware failure)
        ├─ Automated health check fails
        └─ Replication monitor detects lag > 5 seconds

T=0:15  EU-West read replica promoted to primary
        ├─ DNS updated: postgres-primary now points to EU instance
        ├─ Data loss: Last 10-30 seconds (replication lag)
        ├─ EU becomes new write master
        └─ EU writers start going to EU-West-1 PostgreSQL

Impact:
  ├─ US-East: Read-only mode (users see cached data)
  ├─ EU: Full write access restored
  ├─ APAC: Read-only mode (data from stale replica)
  
Recovery:
  ├─ US-East: Rebuild primary from backups (1-2 hours)
  ├─ Promote back to US-East after verification
  └─ Total outage: 15 min (failover) + 1-2 hours (rebuild)

RTO: 15 minutes (to failover) + 1-2 hours (to rebuild primary)
RPO: 10-30 seconds (data loss)

Why data loss?

PostgreSQL replication is asynchronous by default. Replication lag is typically 100ms-5 seconds. If primary crashes, the last few seconds of writes are not yet replicated. They’re lost.

To reduce RPO:

1
2
3
4
5
6
7
8
-- Synchronous replication (stronger guarantee, slower writes)
ALTER SYSTEM SET synchronous_commit = remote_apply;

-- Cost: Write latency increases from ~50ms to ~200ms
--       Because write must be acknowledged by replica before returning

-- Trade-off: RPO = ~1ms (writes wait for replication)
--            But P99 latency increases 4x

YouTube probably uses a hybrid approach: Synchronous for critical metadata, asynchronous for less critical (views, likes).


Circuit Breakers: Graceful Degradation

When Elasticsearch goes down, don’t crash the entire API. Degrade gracefully.

Pattern: Circuit Breaker

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
class CircuitBreaker {
  constructor(name, threshold = 5, resetTimeout = 60000) {
    this.name = name;
    this.failureCount = 0;
    this.threshold = threshold;  // Fail after 5 consecutive errors
    this.resetTimeout = resetTimeout;  // Try again after 60 seconds
    this.state = 'CLOSED';  // 'CLOSED' = working, 'OPEN' = failing, 'HALF_OPEN' = testing
    this.lastFailureTime = null;
  }
  
  async execute(fn) {
    if (this.state === 'OPEN') {
      // Check if timeout expired (try recovery)
      if (Date.now() - this.lastFailureTime > this.resetTimeout) {
        this.state = 'HALF_OPEN';
        this.failureCount = 0;
      } else {
        throw new Error(`Circuit breaker ${this.name} is OPEN`);
      }
    }
    
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }
  
  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }
  
  onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      logger.error(`Circuit breaker ${this.name} opened after ${this.failureCount} failures`);
    }
  }
}

// Usage
const esBreakerbreaker = new CircuitBreaker('elasticsearch', 5, 60000);

async function searchVideos(query) {
  try {
    // If Elasticsearch down, circuit breaker opens after 5 failures
    return await breaker.execute(async () => {
      return await elasticsearch.search({
        index: 'videos',
        q: query
      });
    });
  } catch (err) {
    // Fallback: search database directly (slower, but works)
    logger.warn('Elasticsearch unavailable, falling back to database search');
    
    return await postgres.query(
      'SELECT * FROM videos WHERE title ILIKE $1 OR description ILIKE $1 LIMIT 20',
      [`%${query}%`]
    );
  }
}

State machine:

1
2
3
4
5
6
7
CLOSED (working)
  ↓ (5 failures)
OPEN (failing, reject all requests)
  ↓ (60 sec timeout)
HALF_OPEN (testing recovery)
  ├─ (success) → CLOSED
  └─ (failure) → OPEN

Benefits:

  • Prevents cascading failures (one service down doesn’t take down others)
  • Automatic recovery (circuit breaker retries after timeout)
  • User experience: See fallback result instead of error
  • Reduces load on failing service (stops sending requests)

Observability: Metrics, Logs, Traces

You can’t fix what you don’t see. At 1M QPS, observability is the difference between “bug goes unnoticed for days” and “bug detected in seconds.”

graph TB
    subgraph DataSources["📊 Data Sources"]
        Apps["Application Servers<br/>API, Transcoding, CDN<br/>2000+ Prometheus metrics"]
        Databases["Databases<br/>PostgreSQL, Redis, ES<br/>Query latency, throughput"]
        Infra["Infrastructure<br/>Kubernetes, VMs<br/>CPU, memory, disk<br/>Network I/O"]
    end
    
    subgraph Collection["🔄 Collection Layer"]
        Prometheus["Prometheus<br/>Scrape metrics<br/>15s intervals<br/>1-year retention"]
        Filebeat["Filebeat<br/>Ship logs from<br/>application servers"]
        Jaeger["Jaeger Agent<br/>Collect traces<br/>Distributed tracing"]
    end
    
    subgraph Storage["💾 Storage Layer"]
        TimeseriesDB["Time-Series DB<br/>Store metrics<br/>100s of metrics/sec"]
        Elasticsearch["Elasticsearch<br/>Store logs<br/>1000s of logs/sec"]
        JaegerBackend["Jaeger Backend<br/>Store traces<br/>100s of traces/sec"]
    end
    
    subgraph Visualization["📈 Visualization & Analysis"]
        Grafana["Grafana<br/>Metrics dashboards<br/>Real-time graphs<br/>Alert rules"]
        Kibana["Kibana<br/>Log search & analysis<br/>Error patterns<br/>Trend analysis"]
        JaegerUI["Jaeger UI<br/>Request flow<br/>Latency breakdown<br/>Service dependencies"]
    end
    
    subgraph Alerting["🚨 Alerting"]
        AlertManager["Alert Manager<br/>Deduplicate alerts<br/>Routing rules"]
        Slack["Slack<br/>Send to channels<br/>On-call routing"]
        PagerDuty["PagerDuty<br/>Incident escalation<br/>Page on-call"]
    end
    
    Apps --> Prometheus
    Databases --> Prometheus
    Infra --> Prometheus
    
    Apps --> Filebeat
    
    Apps --> Jaeger
    
    Prometheus --> TimeseriesDB
    Filebeat --> Elasticsearch
    Jaeger --> JaegerBackend
    
    TimeseriesDB --> Grafana
    Elasticsearch --> Kibana
    JaegerBackend --> JaegerUI
    
    Grafana -->|"Alert if P95 > 5s"| AlertManager
    Grafana -->|"Alert if error rate > 1%"| AlertManager
    Grafana -->|"Alert if shard lag > 5min"| AlertManager
    
    AlertManager --> Slack
    AlertManager --> PagerDuty
    
    Slack -->|"Alert team"| OnCall["🚑 On-call Engineer"]
    PagerDuty -->|"Page escalation"| OnCall
    
    OnClick -->|"Click link in alert"| JaegerUI
    OnCall -->|"Debug with traces"| JaegerUI
    OnCall -->|"Search error logs"| Kibana
    OnCall -->|"Check metrics"| Grafana
    
    style DataSources fill:#ffe0b2
    style Collection fill:#81c784
    style Storage fill:#64b5f6
    style Visualization fill:#a1887f
    style Alerting fill:#ef5350
    style OnCall fill:#f91e63

Critical Metrics to Monitor:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# Request latency (P95, P99)
http_request_duration_seconds{endpoint="/videos/{id}"}
  Alert if P95 > 5s for video startup
  Alert if P95 > 1s for metadata API

# Error rate
http_requests_total{status="5xx"}
  Alert if error rate > 1% for 2 minutes

# Redis shard distribution
redis_operations_total{shard="X"}
  Alert if any shard > 15K QPS (capacity is 10K)

# Kafka consumer lag
kafka_consumer_lag_offsets{group="elasticsearch-consumer"}
  Alert if lag > 5 minutes worth of events

# Transcoding job status
transcode_jobs_total{status="failed"}
  Alert if >2% failure rate

Incident Response: 3 AM Pager

It’s 3 AM. Alert fires: “Metadata API P95 latency > 5s”. On-call engineer wakes up.

Incident Response Playbook

Phase 1: Detect & Alert (automated)

1
2
3
4
5
6
Monitoring system detects problem:
  └─ P95 latency > 5s for 5 minutes
  └─ Triggers alert
  └─ Sends Slack: "CRITICAL: Metadata API latency spike"
  └─ Pages PagerDuty: phone call to on-call
  └─ Creates incident ticket (auto-populates with metrics, logs)

Phase 2: Triage (human, 5 minutes)

Engineer wakes up, opens incident dashboard:

1
2
3
4
5
6
7
8
9
10
Time:           2024-01-18 03:32:45 UTC
Service:        Metadata API
Metric:         P95 latency = 6.2s (target: 1s)
Error Rate:     0.5% (normal: 0.01%)
Throughput:     45K QPS (normal: 50K QPS, down 10%)

Last deployment: 1 hour ago (API version 2.15.3)
Recent changes:  Added video recommendation cache

Hypothesis: New code introduced a bottleneck

Phase 3: Quick Mitigation (5-10 minutes)

Options:

  1. Rollback: Revert the 1-hour-old deployment
  2. Scale up: Add more API instances
  3. Debug: Deep-dive into logs (takes time, higher risk)

Engineer chooses rollback (safest, fastest):

1
2
3
4
5
6
# Rollback to previous version (2.15.2)
kubectl rollout undo deployment/metadata-api --to-revision=1

# Monitor latency
# T+2 min: Latency drops to 800ms (normal)
# All alerts clear

Impact: ~10 minutes of degradation (slow search, slow video info pages). Acceptable.

Phase 4: Post-Incident (next day)

1
2
3
4
5
6
7
8
9
10
11
12
13
Incident Report:
  Title: Metadata API Latency Spike on 2024-01-18
  Duration: 10 minutes
  Root Cause: Recommendation cache implementation introduced O(n^2) loop
  Fix: Optimize cache logic, add unit tests
  SLO Impact: Missed 99.99% SLO (1 minute downtime)
  Action Items:
    1. Reimplement cache with O(n) complexity
    2. Add load testing for cache layer (catch before production)
    3. Improve code review process for performance-critical paths
    4. Create runbook for rollback procedures

Blameless culture: No individual blame. Focus on systems and processes.

SLOs & Incident Severity

Incidents are classified by impact:

Severity Definition Response Time Example
CRITICAL Service down or severely degraded Page immediately Metadata API latency > 5s, playback buffering >10% users
HIGH Significant degradation Alert within 5 min Elasticsearch unavailable, 50% error rate
MEDIUM Noticeable issues Alert within 15 min Transcode queue growing, 1% error rate
LOW Minor issues, can wait Ticket created Slow dashboard load time, monitoring system lag

Cost Optimization: 60% Reserved + 40% Spot

YouTube costs are dominated by compute and bandwidth. Optimization is critical.

Reserved vs Spot vs On-Demand

Instance Type Price Commitment Best For
On-Demand $X/hour None, pay as you go Unpredictable workloads, testing
Reserved $0.30X/hour (70% off) 1-3 year commitment Baseline workload (60%)
Spot $0.10X/hour (90% off) Can be revoked with 2min notice Batch jobs, fault-tolerant (40%)

Strategy:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Metadata API: Always needed, even during quiet hours
  ├─ Baseline: 100 instances needed always
  ├─ Buy reserved: 100 instances × $0.30X = $30M/year
  ├─ Peak surge: Need 200 instances sometimes
  ├─ Buy on-demand: 100 instances × $X = $100M/year
  └─ Total: $130M/year

Transcoding: Batch processing, flexible timing
  ├─ Peak: 10,000 GPU cores
  ├─ Buy reserved: 6,000 cores (60%) × $0.30 = $1.8M/month
  ├─ Buy spot: 4,000 cores (40%) × $0.10 = $0.4M/month
  └─ Total: $2.2M/month vs $3M/month (27% savings)

Batch analytics: Off-peak only
  ├─ Buy spot only: Run during cheap hours (night, weekends)
  ├─ Cost: Minimal
  └─ Trade-off: Jobs may take 2-3x longer (revoke = retry elsewhere)

Annual savings from optimization:

1
2
3
4
5
6
7
8
9
10
11
Without optimization:
  Metadata API (on-demand): $150M/year
  Transcoding (on-demand): $25M/year
  Total: $175M/year

With optimization (reserved + spot):
  Metadata API (70% reserved, 30% on-demand): $45M + $45M = $90M/year
  Transcoding (60% reserved, 40% spot): $12M + $4M = $16M/year
  Total: $106M/year

Savings: $175M - $106M = $69M/year (40% reduction)

Digital Rights Management (DRM)

Prevent users from downloading and redistributing videos.

How DRM works:

1
2
3
4
5
6
Video player (browser, app) receives encrypted video stream.
├─ Encrypted with Widevine/PlayReady key
├─ Key is stored securely on device
├─ User's device decrypts in memory
├─ Decrypted stream never written to disk
└─ DRM license manager ensures only authorized playback

Certificate pinning:

1
2
3
4
5
Client verifies server certificate using public key pinning.
├─ Hard-code YouTube's public key in app
├─ When connecting, verify cert matches pinned key
├─ Prevents man-in-the-middle attacks
└─ Ensures videos only decrypt on official YouTube, not pirated copies

Cost: Licenses from Widevine/PlayReady: $1-10M/year

Geo-Blocking

Restrict video access by country (licensing, laws).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
async function checkGeoRestrictions(videoId, userCountry) {
  const video = await getVideoMetadata(videoId);
  const restrictions = video.geo_restrictions;
  
  if (restrictions.type === 'ALLOW') {
    // Whitelist: only these countries can watch
    if (!restrictions.countries.includes(userCountry)) {
      throw new Error('Video unavailable in your country');
    }
  } else if (restrictions.type === 'BLOCK') {
    // Blacklist: these countries cannot watch
    if (restrictions.countries.includes(userCountry)) {
      throw new Error('Video unavailable in your country');
    }
  }
  
  return true;
}

Why? Music/sports videos have licensing deals that vary by country.

YouTube’s content ID system automatically detects copyrighted content.

1
2
3
4
5
6
7
8
9
10
11
12
13
Upload: video_xyz (1 hour)
  ↓
Content ID scan: Compare against 100M+ copyrighted works
  ├─ Match found: "This is Taylor Swift's song"
  ├─ Action options:
  │  ├─ Monetize: Revenue goes to copyright holder
  │  ├─ Block: Remove video
  │  └─ Track: Log views for rights holder
  └─ Creator notified of match

Benefit: Creators can legally use music if they give revenue share
Benefit: Rights holders get paid automatically
Cost: Expensive fingerprinting infrastructure (~$500M/year)

Compliance: GDPR, COPPA, DMCA

GDPR (General Data Protection Regulation)

Users have rights to their data.

Requirements:

  • Data must be stored in EU if user is in EU
  • User can request data export (JSON of all videos, comments, watch history)
  • User can request deletion (“right to be forgotten”)
  • Data breaches must be reported within 72 hours

Implementation:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
User in France requests data export:
  1. System queries:
     ├─ All videos uploaded
     ├─ All watch history
     ├─ All comments posted
     └─ All likes/preferences
  2. Compile into JSON
  3. Encrypt with user's password
  4. Email download link
  5. User downloads 50GB of personal data

User requests deletion:
  1. Soft-delete all personal data
  2. But keep video history for abuse detection (anonymized)
  3. Comply with 30-day deadline

Cost: Legal/compliance team: $10M+/year

COPPA (Children’s Online Privacy Protection)

Protect users under 13 from data collection.

Requirements:

  • No personalized ads for users <13
  • No tracking cookies
  • Parental consent for features like commenting

Implementation:

1
2
3
4
5
User indicates age < 13:
  ├─ Disable recommendations (personalization requires tracking)
  ├─ Disable comments (prevent social features)
  ├─ No personalized ads (show generic ads instead)
  └─ Require parental consent for any account actions

If creator uploads copyrighted content, rights holder can request takedown.

Takedown flow:

1
2
3
4
5
6
7
8
9
10
Rights holder files DMCA notice:
  ├─ Identifies infringing video
  ├─ Provides evidence of copyright ownership
  └─ YouTube removes video within 24 hours

Creator can counter-notice:
  ├─ Claims fair use or ownership
  └─ Video is restored pending dispute resolution

Process is automated at scale (thousands of takedowns/day)

Scaling Scenarios: World Cup Traffic Spike

Every 4 years, World Cup final gets 1B+ concurrent viewers. Load spikes 10x normal.

Prediction & Preparation

1
2
3
4
5
6
7
8
9
10
11
6 months before: Engineering begins stress testing
3 months before: Reserve extra capacity (compute, CDN bandwidth, database)
1 month before: Load tests at 5x capacity
1 week before: Final preparations, on-call teams briefed

Expected metrics (World Cup final):
  ├─ Concurrent viewers: 100M → 1B (10x)
  ├─ QPS: 50K → 500K metadata (10x)
  ├─ View events: 1M → 10M per second (10x)
  ├─ Bandwidth: 500 Tbps → 5 Pbps (10x)
  └─ Duration: 120 minutes (plus pre-game, post-game)

Autoscaling

Kubernetes automatically scales:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: metadata-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: metadata-api
  minReplicas: 100
  maxReplicas: 1000
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 15
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15

What this means:

  • Minimum 100 instances always running
  • Scale up to 1000 if CPU > 70% or memory > 80%
  • Scale up fast (100% increase every 15 seconds during spike)
  • Scale down slowly (50% decrease every 300 seconds after spike)

Knowledge Sharing & Blameless Culture

When incidents happen, the goal is learning, not blame.

Postmortems

After every critical incident:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Meeting: Incident postmortem
Attendees: All engineers involved + manager
Duration: 1 hour
Goal: Understand what happened and prevent recurrence

Format:
  1. Timeline: What happened step-by-step
  2. Root cause: Why did it happen? (not "who messed up")
  3. Impact: How many users affected? How long? Revenue loss?
  4. Action items: Prevent recurrence (code changes, monitoring, training)
  5. Retrospective: What went well? What went poorly?

Example:
  Root cause: Code change introduced O(n^2) algorithm
  Action items:
    - Add Big-O complexity tests to CI/CD
    - Improve code review checklist for performance
    - Add load testing for cache layer
  What went well: Alert fired quickly, rollback was smooth
  What went poorly: Initial debugging was slow, runbook wasn't clear

On-Call Runbooks

Document every common issue and how to fix it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
Runbook: Elasticsearch Latency Spike

Symptoms:
  ├─ Search latency P95 > 500ms
  ├─ Elasticsearch CPU > 80%
  └─ Kibana slow to load

Steps:
  1. Check shard distribution: `curl localhost:9200/_cat/shards`
     └─ If unbalanced, rebalance (see below)
  2. Check slow queries: `curl localhost:9200/_stats/search`
     └─ If queries > 1s, they're slow
  3. Check JVM heap: `curl localhost:9200/_nodes/stats/jvm`
     └─ If > 90%, increase heap size (requires restart)
  4. Disable expensive aggregations:
     kubectl set env deployment/elasticsearch ES_EXPENSIVE_AGGS_ENABLED=false
  5. Monitor latency: Watch Prometheus dashboard
  6. Once recovered, post-incident review

Estimated time to resolve: 5-15 minutes
Point of contact: On-call SRE
Escalation: Page SRE manager if still unresolved after 15 min

Knowledge Base

Every engineer can look up:

  • How to deploy a service
  • How to debug common issues
  • How to perform maintenance
  • Example incidents and resolutions

Conclusion: From Architecture to Operations

You’ve designed a system. Now operate it. That means:

  1. SLOs: Define the contract (99.9% availability = 8.76 hours downtime/year acceptable)
  2. Multi-region: Survive any single failure
  3. Circuit breakers: Degrade gracefully instead of cascading failure
  4. Observability: Metrics, logs, traces enable detection and debugging
  5. Incident response: Playbooks and automation get MTTR (mean time to repair) down
  6. Cost optimization: 60% reserved + 40% spot saves $69M/year
  7. Security: DRM, geo-blocking, copyright enforcement
  8. Compliance: GDPR, COPPA, DMCA don’t go away
  9. Scaling: Auto-scale, expect 10x spikes (World Cup), test at 5x
  10. Culture: Blameless postmortems, runbooks, knowledge sharing

The system we designed in Parts 1-4:

  • Part 1: Architecture for 1M concurrent viewers, 500K uploads/day
  • Part 2: Real-time analytics at 1M QPS (sharded counters, Kafka, stream processing)
  • Part 3: Efficient encoding (H.264 + VP9 + AV1, saves $4.56B/year bandwidth)
  • Part 4: Metadata consistency (CDC, Outbox pattern, eventual consistency)

This part tied it together: Operating reliably at planetary scale.

YouTube is not just architecture. It’s a living system, breathing millions of requests per second, spanning the globe, tolerating failures, serving billions of creators and viewers. This series has walked through the blueprint: why each decision matters, how each piece fits, why “good enough” isn’t enough at scale.


Key Takeaways:

  • SLOs define acceptable downtime: 99.9% = 8.76 hours/year, 99.99% = 52 minutes/year
  • Multi-region deployment with automated failover: RTO <15min, RPO <30sec
  • Circuit breakers prevent cascading failures: graceful degradation when dependencies fail
  • Observability (metrics, logs, traces) enables MTTR <5 min for critical incidents
  • Cost optimization: 60% reserved + 40% spot instances saves 40% ($69M/year)
  • Security (DRM, geo-blocking, copyright) is non-negotiable
  • Compliance (GDPR, COPPA, DMCA) is legal requirement
  • Autoscaling handles 10x traffic spikes (World Cup) automatically
  • Blameless postmortems build learning culture
  • Runbooks and knowledge sharing reduce time to recovery