Designing a Web Crawler: Production Readiness (SLOs, Observability, Reliability)
SLOs + error budgets
Based on the requirements from earlier parts, the key SLO targets are:
- Availability: 99.5% monthly
- Crawl latency: p95 under 500 ms
- Freshness: 90% of stored content less than 7 days old
From the monitoring and observability plan:
- 99.5% availability implies an error budget of ~3.6 hours/month.
Practical interpretation:
- If we burn the error budget quickly (major outage, persistent backlog), we pause feature work and prioritize reliability fixes.
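The 3.6 hours/month figure follows directly from the availability target; a quick sanity check, assuming a 30-day month:

```python
# Error budget implied by an availability SLO over a 30-day month.
HOURS_PER_MONTH = 30 * 24  # 720

def error_budget_hours(slo: float, hours: float = HOURS_PER_MONTH) -> float:
    """Allowed downtime (in hours) for a given availability SLO."""
    return (1.0 - slo) * hours

print(error_budget_hours(0.995))  # ~3.6 hours/month
```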
Observability (metrics/logs/traces; golden signals)
From the monitoring and observability plan, we track the four golden signals:
1) Latency
- URL processing p50/p95/p99
- Search query p50/p95/p99
2) Traffic
- crawl rate (URLs/sec), by type (new vs recrawl vs retries)
- queue depth / consumer lag (frontier health)
3) Errors
- HTTP 4xx/5xx rates
- timeout rates
- parse failures
- dedup DB error rate
4) Saturation
- CPU/memory for crawler nodes
- DB CPU/IOPS
- Kafka broker disk/network
- Elasticsearch heap/GC
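The latency percentiles above boil down to quantile computation; a stdlib-only sketch over raw samples (a real deployment would use histogram buckets in a metrics system like Prometheus rather than keeping raw samples):

```python
import statistics

def latency_percentiles(samples_ms):
    """Compute p50/p95/p99 from raw latency samples (milliseconds)."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    q = statistics.quantiles(samples_ms, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

samples = list(range(1, 101))  # 1..100 ms, for illustration
print(latency_percentiles(samples))
```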
Logging:
- Structured logs for crawl events (url_id, domain, http_status, duration_ms)
- Sampling for high-volume paths
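A minimal sketch of the structured crawl-event log line, using the field names listed above (JSON over stdout is an assumption; any log shipper works):

```python
import json
import time

def crawl_event(url_id, domain, http_status, duration_ms):
    """Render one structured crawl-event log line as JSON."""
    return json.dumps({
        "ts": time.time(),
        "event": "crawl",
        "url_id": url_id,
        "domain": domain,
        "http_status": http_status,
        "duration_ms": duration_ms,
    })

line = crawl_event("u123", "example.com", 200, 342)
```

Structured fields make the runbook queries later in this section (errors by domain and status class) straightforward to express in the log index.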
Tracing:
- Trace a “URL job” across: dequeue → fetch → parse → dedup → store → index
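The per-stage trace can be sketched as a context manager that times each stage of a URL job; this is a stand-in for a real tracer (e.g., OpenTelemetry spans), not a proposed implementation:

```python
import time
from contextlib import contextmanager

class UrlJobTrace:
    """Collects (stage, duration) pairs for one URL job."""
    def __init__(self, url_id):
        self.url_id = url_id
        self.stages = []

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages.append((name, time.perf_counter() - start))

trace = UrlJobTrace("u123")
for step in ("dequeue", "fetch", "parse", "dedup", "store", "index"):
    with trace.stage(step):
        pass  # real work happens here
```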
Reliability patterns (timeouts, retries, circuit breakers, bulkheads)
From the crawler node design:
- Timeouts: connect 10s, read 30s
- Retries: bounded (e.g., 3)
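The bounded-retry policy can be wrapped into a small fetch helper; a sketch with exponential backoff (the `fetch` callable and the backoff base are illustrative assumptions, not part of the design above):

```python
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0):
    """Call fetch(url); retry transient failures up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # budget exhausted; surface the failure
            time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, ...
```

Keeping the retry count bounded matters here: unbounded retries against a failing domain turn a single outage into sustained load on both sides.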
Add the standard reliability toolkit:
- Circuit breakers per domain (stop hammering failing sites)
- Bulkheads (limit concurrency per domain and per crawler)
- Backpressure (reduce dequeue when downstreams are saturated)
- Idempotency for at-least-once delivery (see Part 3)
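The per-domain circuit breaker from the toolkit above can be sketched as follows; the threshold and cooldown values are illustrative, and a production version would also support half-open probing:

```python
import time

class DomainBreaker:
    """Opens after `threshold` consecutive failures; closes after `cooldown` seconds."""
    def __init__(self, threshold=5, cooldown=60.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = {}   # domain -> consecutive failure count
        self.opened_at = {}  # domain -> time the breaker opened

    def allow(self, domain, now=None):
        now = time.monotonic() if now is None else now
        opened = self.opened_at.get(domain)
        if opened is None:
            return True
        if now - opened >= self.cooldown:
            # Cooldown elapsed: close the breaker and allow traffic again.
            del self.opened_at[domain]
            self.failures[domain] = 0
            return True
        return False

    def record(self, domain, ok, now=None):
        now = time.monotonic() if now is None else now
        if ok:
            self.failures[domain] = 0
        else:
            n = self.failures.get(domain, 0) + 1
            self.failures[domain] = n
            if n >= self.threshold:
                self.opened_at[domain] = now
```

A crawler node would check `allow(domain)` before dequeuing work for that domain, which also pairs naturally with the per-domain bulkhead limits.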
Data safety (backups, DR, RTO/RPO)
From the storage design:
- Metadata is in sharded Postgres; losing it breaks scheduling and dedup references.
- Raw HTML lives in tiered object storage, and can be replicated across regions.
Recommended baseline:
- Postgres PITR + regular snapshots
- Cross-region replication for the most critical metadata
- Clear RTO/RPO targets (in an interview, pick something realistic: RTO of a few hours, RPO of minutes to hours)
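Concretely, Postgres PITR combines continuous WAL archiving with periodic base backups; a minimal `postgresql.conf` fragment (the archive destination is a placeholder, and production setups typically use a dedicated archiving tool and durable object storage):

```ini
# postgresql.conf -- enable continuous WAL archiving for PITR
wal_level = replica
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'   # placeholder destination
```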
Deployments (canary/blue-green, rollbacks, runbooks)
From the crawler node execution model, crawlers are good candidates for Kubernetes-based rollouts.
Deployment strategy:
- canary new crawler versions (1–5% of nodes)
- monitor error rate and latency before full rollout
- quick rollback path (image rollback)
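The "monitor before full rollout" step is effectively a metrics gate; a sketch comparing the canary against the baseline fleet (the ratio thresholds and the clean-baseline floor are illustrative assumptions):

```python
def canary_gate(canary, baseline, max_error_ratio=1.5, max_latency_ratio=1.2):
    """Promote only if canary error rate and p95 latency stay close to baseline."""
    if baseline["error_rate"] > 0:
        if canary["error_rate"] > baseline["error_rate"] * max_error_ratio:
            return "rollback"
    elif canary["error_rate"] > 0.01:  # baseline is clean; allow a small floor
        return "rollback"
    if canary["p95_ms"] > baseline["p95_ms"] * max_latency_ratio:
        return "rollback"
    return "promote"

print(canary_gate({"error_rate": 0.002, "p95_ms": 480},
                  {"error_rate": 0.002, "p95_ms": 470}))  # → promote
```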
Runbooks (grounded in the monitoring and observability plan):
- queue backup / consumer lag rising
- crawl error rate spike (DNS failures, timeouts, 429 bursts)
- indexing lag (ES pressure)
- DB shard saturation
First diagnostic step for each (keep it boring and fast):
- Consumer lag rising: check lag by partition and identify the top N partitions by lag (often a few hot domains).
- Error spike: break down errors by domain + status class (2xx/3xx/4xx/5xx/timeouts) to distinguish a systemic issue from distress at a single domain.
- Indexing lag: check indexer queue depth + Elasticsearch ingestion latency/heap pressure before changing crawl rate.
- DB shard saturation: identify the hottest shard and the top write queries (URL state updates, links inserts, hash inserts) before scaling blindly.
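The first step for consumer lag boils down to a sort over a lag-by-partition snapshot (the snapshot source, e.g. Kafka consumer-group metrics, is assumed):

```python
def top_lagging(lag_by_partition, n=3):
    """Return the n partitions with the highest consumer lag."""
    return sorted(lag_by_partition.items(), key=lambda kv: kv[1], reverse=True)[:n]

lag = {"p0": 120, "p1": 98_000, "p2": 350, "p3": 41_000}
print(top_lagging(lag, n=2))  # → [('p1', 98000), ('p3', 41000)]
```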
Observability stack
```mermaid
flowchart TB
    crawlers["Crawler Nodes"] --> prom["Prometheus"]
    prom --> graf["Grafana"]
    prom --> alert["Alertmanager"]
    alert --> oncall["On-Call"]
    alert --> slack["Chat Alerts"]
    crawlers --> logstash["Log Collector"]
    logstash --> es_logs["Log Index (Elasticsearch)"]
    es_logs --> kibana["Kibana"]
    crawlers --> tracing["Tracing Agent"]
    tracing --> jaeger["Jaeger"]
```
Figure 1: Observability stack (metrics, logs, traces, alerting).
Resilience patterns (fetch path)
```mermaid
flowchart TD
    request["Fetch Request"] --> timeouts["Timeouts"]
    timeouts --> retries["Bounded Retries"]
    retries --> breaker["Circuit Breaker"]
    breaker --> bulkhead["Bulkhead Limits"]
    bulkhead --> success["Success or Fail Fast"]
```
Figure 2: Resilience toolkit on the fetch path.
Deployment strategy (canary)
```mermaid
flowchart LR
    build["New Crawler Release"] --> canary["Canary Nodes (1-5%)"]
    canary --> gate["Metrics Gate"]
    gate -->|"Pass"| rollout["Full Rollout"]
    gate -->|"Fail"| rollback["Rollback"]
```
Figure 3: Canary rollout with a metrics gate and rollback.
DR topology (simplified)
```mermaid
flowchart LR
    subgraph primary["Primary Region"]
        pg_p["PostgreSQL Primary"]
        kafka_p["Kafka Cluster"]
    end
    subgraph secondary["Secondary Region"]
        pg_r["PostgreSQL Replica"]
        kafka_r["Kafka Standby"]
    end
    store["Multi-Region Object Storage"]
    failover["DNS Failover"]
    pg_p -->|"Replication"| pg_r
    kafka_p -->|"Mirror"| kafka_r
    primary --> store
    secondary --> store
    failover --> primary
    failover --> secondary
```
Figure 4: Simplified multi-region DR topology.
Series recap:
- Parts 1–2 established the foundations and scaling approach.
- Parts 3–4 compared frontier implementations.
- Parts 5–6 focused on security and production operations.