Designing a Web Crawler: Production Readiness (SLOs, Observability, Reliability)

SLOs + error budgets

Based on the requirements, the key SLO targets are:

  • Availability: 99.5% monthly
  • Crawl latency: p95 of 500 ms per URL
  • Freshness: 90% of content less than 7 days old

From the monitoring and observability plan:

  • 99.5% availability implies an error budget of ~3.6 hours/month.
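
The budget math is simple enough to script as a sanity check; a minimal sketch in Python (the downtime figure is purely illustrative):

ALLOWED_UNAVAILABILITY = 1 - 0.995       # 0.5% from the availability SLO
HOURS_PER_MONTH = 30 * 24                # ~720 hours in a 30-day month

error_budget_hours = ALLOWED_UNAVAILABILITY * HOURS_PER_MONTH   # ~3.6 hours

downtime_hours_so_far = 1.2              # illustrative figure from incident tracking
budget_burned = downtime_hours_so_far / error_budget_hours      # fraction consumed

print(f"budget: {error_budget_hours:.1f}h, burned: {budget_burned:.0%}")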

Practical interpretation:

  • If we burn the error budget quickly (major outage, persistent backlog), we pause feature work and prioritize reliability fixes.

Observability (metrics/logs/traces; golden signals)

From the monitoring and observability plan, we track the four golden signals:

1) Latency

  • URL processing p50/p95/p99
  • Search query p50/p95/p99

2) Traffic

  • Crawl rate (URLs/sec) by type (new vs recrawl vs retries)
  • Queue depth / consumer lag (frontier health)

3) Errors

  • HTTP 4xx/5xx rates
  • Timeout rates
  • Parse failures
  • Dedup DB error rate

4) Saturation

  • CPU/memory for crawler nodes
  • DB CPU/IOPS
  • Kafka broker disk/network
  • Elasticsearch heap/GC
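
A minimal sketch of how a crawler node might expose these signals, assuming the Python prometheus_client library; metric names, labels, buckets, and the port are illustrative choices, not part of the design:

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Golden-signal metrics for one crawler node; names, labels, and buckets are assumptions.
FETCH_LATENCY = Histogram(
    "crawler_fetch_duration_seconds", "URL processing latency",
    buckets=(0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30))
FETCHES = Counter("crawler_fetch_total", "URLs processed", ["kind"])                 # new / recrawl / retry
FETCH_ERRORS = Counter("crawler_fetch_errors_total", "Failed fetches", ["reason"])   # 4xx / 5xx / timeout / parse
FRONTIER_LAG = Gauge("crawler_frontier_lag", "Frontier consumer lag", ["partition"])

start_http_server(9100)  # expose /metrics for Prometheus to scrape; port is arbitrary

def record_fetch(kind, duration_s, error_reason=None):
    # Called once per processed URL by the fetch loop (hypothetical hook).
    FETCHES.labels(kind=kind).inc()
    FETCH_LATENCY.observe(duration_s)
    if error_reason:
        FETCH_ERRORS.labels(reason=error_reason).inc()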

Logging:

  • Structured logs for crawl events (url_id, domain, http_status, duration_ms)
  • Sampling for high-volume paths
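
A sketch of a structured, sampled crawl-event log using only the Python standard library; the sample rate and field set are assumptions:

import json
import logging
import random
import time

log = logging.getLogger("crawler")
logging.basicConfig(level=logging.INFO, format="%(message)s")

SUCCESS_SAMPLE_RATE = 0.01   # sample the high-volume happy path; rate is an assumption

def log_crawl_event(url_id, domain, http_status, duration_ms):
    # Errors are always logged; successful fetches are sampled to control volume.
    if http_status < 400 and random.random() > SUCCESS_SAMPLE_RATE:
        return
    log.info(json.dumps({
        "event": "crawl",
        "url_id": url_id,
        "domain": domain,
        "http_status": http_status,
        "duration_ms": duration_ms,
        "ts": time.time(),
    }))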

Tracing:

  • Trace a “URL job” across: dequeue → fetch → parse → dedup → store → index
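
One way this could be wired up, assuming the OpenTelemetry Python API (SDK and exporter setup omitted); the stage functions are stubs standing in for the real pipeline, and the dequeue span would typically be opened by the frontier consumer before this runs:

from opentelemetry import trace   # opentelemetry-api; SDK/exporter setup omitted

tracer = trace.get_tracer("crawler")

# Stub stage functions standing in for the real pipeline (hypothetical).
def fetch(url): return "<html></html>"
def parse(page): return {"text": page}, []
def is_duplicate(doc): return False
def store(doc): pass
def index_document(doc): pass

def process_url(url_id, domain, url):
    # One trace per URL job; each stage becomes a child span of url_job.
    with tracer.start_as_current_span("url_job") as root:
        root.set_attribute("url_id", url_id)
        root.set_attribute("domain", domain)
        with tracer.start_as_current_span("fetch"):
            page = fetch(url)
        with tracer.start_as_current_span("parse"):
            doc, links = parse(page)
        with tracer.start_as_current_span("dedup"):
            dup = is_duplicate(doc)
        if not dup:
            with tracer.start_as_current_span("store"):
                store(doc)
            with tracer.start_as_current_span("index"):
                index_document(doc)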

Reliability patterns (timeouts, retries, circuit breakers, bulkheads)

From the crawler node design:

  • Timeouts: connect 10s, read 30s
  • Retries: bounded (e.g., 3)
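
A minimal fetch wrapper reflecting those numbers, assuming the requests library; the backoff schedule is an illustrative choice:

import time

import requests

CONNECT_TIMEOUT_S = 10
READ_TIMEOUT_S = 30
MAX_ATTEMPTS = 3

def fetch(url):
    # Bounded retries with exponential backoff; the timeouts match the node design.
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return requests.get(url, timeout=(CONNECT_TIMEOUT_S, READ_TIMEOUT_S))
        except requests.RequestException:
            if attempt == MAX_ATTEMPTS:
                raise
            time.sleep(2 ** attempt)   # 2s then 4s between attempts (illustrative)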

Add the standard reliability toolkit:

  • Circuit breakers per domain (stop hammering failing sites)
  • Bulkheads (limit concurrency per domain and per crawler)
  • Backpressure (reduce dequeue when downstreams are saturated)
  • Idempotency for at-least-once delivery (see Part 3)
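
A sketch of the per-domain circuit breaker and bulkhead, layered over the fetch wrapper above; the thresholds, cooldown, and concurrency limit are assumptions, and a production version would need atomic state updates:

import threading
import time
from collections import defaultdict

FAILURE_THRESHOLD = 5      # consecutive failures before the breaker opens (assumption)
COOL_DOWN_S = 60           # how long to fail fast before retrying the domain (assumption)
PER_DOMAIN_CONCURRENCY = 2

domain_slots = defaultdict(lambda: threading.BoundedSemaphore(PER_DOMAIN_CONCURRENCY))
breaker = defaultdict(lambda: {"failures": 0, "open_until": 0.0})

def guarded_fetch(domain, url, fetch):
    state = breaker[domain]
    if time.time() < state["open_until"]:
        raise RuntimeError(f"circuit open for {domain}, failing fast")
    with domain_slots[domain]:            # bulkhead: bounded concurrency per domain
        try:
            response = fetch(url)         # e.g., the timeout/retry wrapper above
            state["failures"] = 0         # any success closes the breaker
            return response
        except Exception:
            state["failures"] += 1
            if state["failures"] >= FAILURE_THRESHOLD:
                state["open_until"] = time.time() + COOL_DOWN_S
            raise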

Data safety (backups, DR, RTO/RPO)

From the storage design:

  • Metadata is in sharded Postgres; losing it breaks scheduling and dedup references.
  • Raw HTML lives in tiered object storage and can be replicated across regions.

Recommended baseline:

  • Postgres PITR + regular snapshots
  • Cross-region replication for the most critical metadata
  • Clear RTO/RPO targets (in an interview, pick something realistic, e.g., an RTO of hours and an RPO of minutes to hours)
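
Whatever targets are chosen, replica lag is the number to watch against the RPO. A sketch using psycopg2 against the standby; the DSN and threshold are illustrative, and note that replay lag also grows when the primary is simply idle:

import psycopg2   # connection details and the threshold below are illustrative

RPO_TARGET_S = 300   # e.g., 5 minutes

def replication_lag_seconds(dsn="host=pg-replica dbname=crawler"):
    # On a Postgres standby, replay lag approximates the data lost if we fail over now.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
        )
        (lag,) = cur.fetchone()
        return lag or 0.0

if replication_lag_seconds() > RPO_TARGET_S:
    print("ALERT: replica lag exceeds the RPO target")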

Deployments (canary/blue-green, rollbacks, runbooks)

From the crawler node execution model, crawlers are good candidates for Kubernetes-based rollouts.

Deployment strategy:

  • canary new crawler versions (1–5% of nodes)
  • monitor error rate and latency before full rollout
  • quick rollback path (image rollback)
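
The metrics gate can be a small script against the Prometheus query API. The sketch below compares canary and stable error ratios over a window, reusing the metric names from the instrumentation sketch earlier; the "track" label, window, and pass threshold are all assumptions:

import requests

PROM_URL = "http://prometheus:9090"   # illustrative address

def query(expr):
    r = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def canary_gate():
    # Compare canary vs stable error ratios over the last 15 minutes.
    canary = query('sum(rate(crawler_fetch_errors_total{track="canary"}[15m]))'
                   ' / sum(rate(crawler_fetch_total{track="canary"}[15m]))')
    stable = query('sum(rate(crawler_fetch_errors_total{track="stable"}[15m]))'
                   ' / sum(rate(crawler_fetch_total{track="stable"}[15m]))')
    return canary <= stable * 1.5 + 0.01   # pass unless the canary is clearly worse

print("promote" if canary_gate() else "rollback")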

Runbooks (grounded in the monitoring and observability plan):

  • queue backup / consumer lag rising
  • crawl error rate spike (DNS failures, timeouts, 429 bursts)
  • indexing lag (ES pressure)
  • DB shard saturation

First diagnostic step for each (keep it boring and fast):

  • Consumer lag rising: check lag by partition and identify the top N partitions by lag (often a few hot domains).
  • Error spike: break down errors by domain + status class (2xx/3xx/4xx/5xx/timeouts) to distinguish a systemic issue from a single domain in distress.
  • Indexing lag: check indexer queue depth + Elasticsearch ingestion latency/heap pressure before changing crawl rate.
  • DB shard saturation: identify the hottest shard and the top write queries (URL state updates, link inserts, hash inserts) before scaling blindly.
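
For the first diagnostic, the per-partition lag check can be scripted with kafka-python; the topic, group, and broker address below are assumptions:

from kafka import KafkaConsumer, TopicPartition   # kafka-python

TOPIC = "frontier"             # assumed frontier topic name
GROUP = "crawler-fetchers"     # assumed crawler consumer group

consumer = KafkaConsumer(bootstrap_servers="kafka:9092", group_id=GROUP,
                         enable_auto_commit=False)
tps = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end = consumer.end_offsets(tps)

# Lag per partition = latest offset minus the group's committed offset.
lag = {tp.partition: end[tp] - (consumer.committed(tp) or 0) for tp in tps}
for partition, n in sorted(lag.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"partition {partition}: lag {n}")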

Observability stack

flowchart TB
	crawlers["Crawler Nodes"] --> prom["Prometheus"]
	prom --> graf["Grafana"]
	prom --> alert["Alertmanager"]
	alert --> oncall["On-Call"]
	alert --> slack["Chat Alerts"]

	crawlers --> logstash["Log Collector"]
	logstash --> es_logs["Log Index (Elasticsearch)"]
	es_logs --> kibana["Kibana"]

	crawlers --> tracing["Tracing Agent"]
	tracing --> jaeger["Jaeger"]

Figure 1: Observability stack (metrics, logs, traces, alerting).

Resilience patterns (fetch path)

flowchart TD
	request["Fetch Request"] --> timeouts["Timeouts"]
	timeouts --> retries["Bounded Retries"]
	retries --> breaker["Circuit Breaker"]
	breaker --> bulkhead["Bulkhead Limits"]
	bulkhead --> success["Success or Fail Fast"]

Figure 2: Resilience toolkit on the fetch path.

Deployment strategy (canary)

flowchart LR
	build["New Crawler Release"] --> canary["Canary Nodes (1-5%)"]
	canary --> gate["Metrics Gate"]
	gate -->|"Pass"| rollout["Full Rollout"]
	gate -->|"Fail"| rollback["Rollback"]

Figure 3: Canary rollout with a metrics gate and rollback.

DR topology (simplified)

flowchart LR
	subgraph primary["Primary Region"]
		pg_p["PostgreSQL Primary"]
		kafka_p["Kafka Cluster"]
	end

	subgraph secondary["Secondary Region"]
		pg_r["PostgreSQL Replica"]
		kafka_r["Kafka Standby"]
	end

	store["Multi-Region Object Storage"]
	failover["DNS Failover"]

	pg_p -->|"Replication"| pg_r
	kafka_p -->|"Mirror"| kafka_r
	primary --> store
	secondary --> store
	failover --> primary
	failover --> secondary

Figure 4: Simplified multi-region DR topology.

Series recap:

  • Parts 1–2 established the foundations and scaling approach.
  • Parts 3–4 compared frontier implementations.
  • Parts 5–6 focused on security and production operations.