Designing a Web Crawler: Scale Analysis
What breaks in the basic design (bottlenecks)
At small scale, a single queue + a few crawlers works fine. At 100K URLs/sec, several things become the "real" system design problems.
Per the URL frontier and queue/backpressure strategy, the frontier must simultaneously:
- keep millions of pending URLs,
- prioritize them,
- enforce politeness per domain,
- handle retries/backoff,
- and feed 100+ crawler nodes.
From the monitoring and observability plan, we also need to handle 5× spikes (up to ~500K URLs/sec), which amplify the queue and backpressure problems.
Common bottlenecks:
- Frontier hot-spotting
- If the frontier is centralized (single Redis or single DB), it becomes the throughput ceiling.
- Politeness enforcement
- Politeness is per-domain, but work is distributed. Without careful partitioning, multiple nodes can accidentally overload the same domain.
- Network latency and tail behavior
- Even if median fetch latency is low, p95/p99 latency can dominate end-to-end throughput.
- Storage write amplification
- Each successful fetch can cause multiple writes: URL state updates, content hash inserts, link insertions, content storage, indexing.
- Dedup performance
- Dedup requires a fast membership test at huge cardinality. If every URL's hash check went to the database, the database would become the bottleneck.
Capacity & constraints (math, units, peak assumptions)
Based on the requirements and scale assumptions:
- Target peak crawl throughput: 100K URLs/sec
Based on observed/assumed spike behavior from the observability plan:
- Peak spikes: 500K URLs/sec (5×)
Implication: We design steady-state for 100K/sec, but we also need:
- buffering (queue backlog) to absorb spikes (rough sizing sketched below),
- elastic scale-up where possible,
- and graceful degradation (reduce crawl rate, pause certain domains, drop low-priority work).
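As a rough illustration of what absorbing a spike means for the queue (assuming, purely for illustration, a 10-minute spike at the 500K URLs/sec peak while crawlers keep draining at the 100K URLs/sec steady state):

$(500{,}000 - 100{,}000)\ \text{URLs/sec} \times 600\ \text{s} = 240\text{M}$ queued URLs

At an assumed ~200 bytes per queued URL entry, that is roughly 48GB of backlog: comfortably handled by a disk-backed log like Kafka, but uncomfortable for a purely in-memory frontier.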
Storage constraints from the storage design:
- Metadata DB: ~6TB total across 10 shards, with heavy read/write load
- Raw HTML: tiered object storage, huge but low QPS compared to metadata
- Search index: ~90TB, query/read heavy
Quick sizing sanity-check (crawler nodes)
From the crawler node execution model, the deployment example uses 120 crawler replicas.
If we want to sustain ~100K URLs/sec at peak (based on the capacity target), then a rough per-node target is:
- $100{,}000 / 120 \approx 833$ URLs/sec/node
This is feasible only if:
- most fetches are I/O-bound (not headless rendering),
- we keep per-domain concurrency bounded (politeness),
- and we aggressively time out slow sites (so tail latency doesn’t pin threads).
(This is intentionally a back-of-the-envelope; actual sizing comes from load testing with the target mix of domains and content types.)
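One way to sanity-check the per-node number is Little's law: in-flight requests ≈ throughput × average fetch time. Assuming (conservatively, for illustration) an average fetch time of ~500ms:

$833\ \text{URLs/sec} \times 0.5\ \text{s} \approx 417$ concurrent fetches per node

That level of concurrency is straightforward for an async HTTP client but unrealistic for a thread-per-request model, which is why the "I/O-bound, not headless rendering" assumption above matters.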
Scalable HLD (sharding/replication/caching/async)
Tip: Kafka partitions (100) cap frontier consumer parallelism, and Postgres shards (10) cap metadata write parallelism. These become hard ceilings unless you re-partition.
Bottlenecks and backlog formation
```mermaid
flowchart TB
    frontier["Frontier Throughput"] --> crawlers["Crawler Concurrency"]
    crawlers --> network["Network and Tail Latency"]
    crawlers --> dedup["Dedup Lookups"]
    crawlers --> metadata_db["Metadata DB Writes"]
    crawlers --> index["Indexing Throughput"]
    network --> backlog["Queue Backlog"]
    dedup --> backlog
    metadata_db --> backlog
    index --> backlog
```
Figure 1: Where backlogs typically form at scale.
Scaled architecture overview
```mermaid
flowchart TB
    subgraph frontier_layer["Frontier Layer"]
        kafka["Kafka Topic: crawler_urls"]
        partitions["100 Partitions (Domain Hash)"]
        kafka --> partitions
    end
    subgraph crawl_layer["Crawl Layer"]
        crawlers["120 Crawler Nodes"]
        cooldown["Domain Cooldown Manager"]
        http_client["HTTP Client Pool"]
        crawlers --> cooldown
        crawlers --> http_client
    end
    subgraph data_layer["Data Layer"]
        redis["Redis Cache"]
        pg["PostgreSQL Shards (10)"]
        s3["Object Storage"]
        es["Elasticsearch"]
    end
    partitions --> crawlers
    crawlers --> redis
    crawlers --> pg
    crawlers --> s3
    crawlers --> es
```
Figure 2: Scaled architecture across frontier, crawlers, and data stores.
Frontier partitioning for politeness
```mermaid
flowchart LR
    url["URL"] --> domain["Extract Domain"]
    domain --> hash_fn["Hash(domain)"]
    hash_fn --> mod_fn["Modulo 100"]
    mod_fn --> partition["Kafka Partition"]
    partition --> consumer["Single Consumer Group Member"]
```
Figure 3: Domain-hash partitioning makes politeness enforceable.
Sharding metadata by domain
```mermaid
flowchart LR
    domain["Domain"] --> hash_fn["Hash(domain)"]
    hash_fn --> mod_fn["Modulo 10"]
    mod_fn --> shard0["Shard 0"]
    mod_fn --> shard1["Shard 1"]
    mod_fn --> shard2["Shard 2"]
    mod_fn --> shardN["Shard 9"]
```
Figure 4: Metadata sharding by domain hash (10 shards).
Dedup cache hierarchy
```mermaid
flowchart TD
    hash["Content Hash (SHA-256)"] --> bloom["Bloom Filter"]
    bloom -->|"Maybe"| recent["Redis Recent Hash Cache"]
    recent -->|"Miss"| db["Hash DB (Authoritative)"]
    bloom -->|"No"| store_new["Store New Hash"]
    db --> store_new
```
Figure 5: Dedup membership check from fast-path to authoritative store.
Raw HTML storage tiering
```mermaid
flowchart LR
    hot["Hot (0-30d)"] --> warm["Warm (30-90d)"]
    warm --> cold["Cold (90d+)"]
    cold --> archive["Archive (1y+)"]
    hot -->|"Fast Reads"| users["Debug and Reprocess"]
    archive -->|"Slow Restore"| users
```
Figure 6: Tiered storage for raw HTML retention.
Backpressure controls
```mermaid
flowchart LR
    lag["Kafka Consumer Lag"] --> autoscale["Scale Crawler Nodes"]
    index_lag["Indexing Lag"] --> throttle["Throttle Dequeue"]
    db_hot["DB Shard Saturation"] --> shed["Shed Low Priority Crawls"]
    autoscale --> reduce_lag["Reduce Lag"]
    throttle --> protect_es["Protect Search Index"]
    shed --> protect_db["Protect Metadata DB"]
```
Figure 7: Backpressure signals and the control actions they trigger.
Search read path
```mermaid
flowchart LR
    client["Internal Client"] --> api["Search API"]
    api --> es["Elasticsearch"]
    es --> api
    api --> client
```
Figure 8: Search read path (API → Elasticsearch).
1) Frontier: make it distributed
Based on the URL frontier and queue/backpressure strategy, we choose Kafka for the URL frontier:
- Topic: crawler_urls
- Partitions: 100
- Replication factor: 3
- Consumer group: crawlers (one consumer per crawler node)
Key scaling trick:
- Partition by domain hash: hash(domain) % 100
- Benefit: all URLs for a domain land in the same partition, making politeness enforcement tractable.
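A minimal sketch of a stable domain-to-partition mapping (illustrative only; the function name is an assumption, and the commented usage assumes the kafka-python client):

```python
import hashlib
from urllib.parse import urlparse

NUM_PARTITIONS = 100  # matches the crawler_urls topic configuration

def domain_partition(url: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a URL's domain to a Kafka partition deterministically.

    Uses MD5 rather than Python's built-in hash(), which is randomized
    per process, so every producer routes a given domain to the same partition.
    """
    domain = urlparse(url).netloc.lower()
    digest = hashlib.md5(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Illustrative usage with kafka-python:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="kafka:9092")
# url = "https://example.com/some/page"
# producer.send("crawler_urls", value=url.encode(), partition=domain_partition(url))
```

Passing the domain as the message key and letting the client's default partitioner hash it achieves a similar effect; the property that matters is that the domain-to-partition mapping is stable across all producers.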
2) Crawler nodes: make network IO efficient
From the crawler node design:
- Connection pooling per domain (5–10 connections)
- DNS caching (~1 hour)
- Timeouts: connect 10s, read 30s
These reduce latency and improve throughput per node without increasing aggression against any one domain.
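A sketch of what those settings look like in client configuration, assuming aiohttp as the HTTP client (the design does not mandate a specific client; the 60s total cap is an assumed value):

```python
import aiohttp

async def make_session() -> aiohttp.ClientSession:
    # Bound per-domain concurrency and cache DNS lookups, per the crawler node design.
    connector = aiohttp.TCPConnector(
        limit_per_host=10,   # 5-10 pooled connections per domain
        ttl_dns_cache=3600,  # cache DNS resolutions for ~1 hour
    )
    timeout = aiohttp.ClientTimeout(
        connect=10,          # connect timeout: 10s
        sock_read=30,        # read timeout: 30s
        total=60,            # hard cap so tail latency can't pin a task indefinitely (assumed)
    )
    return aiohttp.ClientSession(connector=connector, timeout=timeout)
```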
3) Dedup: avoid DB round-trips for the common case
From the deduplication strategy, we use a layered design:
- Layer 1: Bloom filter in Redis (fast, ~1% false positives)
- Layer 2: “recent hashes” cache in Redis (7-day TTL)
- Layer 3: authoritative DB table (content_hashes)
This keeps the “is this new?” check fast and protects the database.
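A sketch of the layered check, assuming redis-py; the Bloom filter here is a simple bitmap built on SETBIT/GETBIT, and db_has_hash / db_insert_hash are hypothetical stand-ins for queries against the content_hashes table:

```python
import hashlib
import redis

r = redis.Redis(host="redis", port=6379)

BLOOM_KEY = "content_hash_bloom"
BLOOM_BITS = 2 ** 30   # bitmap size; tune for target cardinality and ~1% false positives
NUM_HASHES = 7         # bit positions set per entry

def db_has_hash(content_hash: str) -> bool:
    """Placeholder for a SELECT against the content_hashes table."""
    raise NotImplementedError

def db_insert_hash(content_hash: str) -> None:
    """Placeholder for an INSERT into the content_hashes table."""
    raise NotImplementedError

def _bloom_offsets(content_hash: str):
    # Derive k stable bit positions from the content hash.
    for i in range(NUM_HASHES):
        d = hashlib.sha256(f"{i}:{content_hash}".encode()).digest()
        yield int.from_bytes(d[:8], "big") % BLOOM_BITS

def record_new_hash(content_hash: str) -> None:
    for off in _bloom_offsets(content_hash):
        r.setbit(BLOOM_KEY, off, 1)                           # Layer 1
    r.set(f"recent_hash:{content_hash}", 1, ex=7 * 86400)     # Layer 2 (7-day TTL)
    db_insert_hash(content_hash)                              # Layer 3 (authoritative)

def is_duplicate(content_hash: str) -> bool:
    # Layer 1: Bloom filter. A "no" is definitive; a "maybe" needs confirmation.
    if not all(r.getbit(BLOOM_KEY, off) for off in _bloom_offsets(content_hash)):
        record_new_hash(content_hash)
        return False
    # Layer 2: recent-hashes cache catches most true duplicates without touching the DB.
    if r.exists(f"recent_hash:{content_hash}"):
        return True
    # Layer 3: authoritative check; warm the recent cache on a hit.
    if db_has_hash(content_hash):
        r.set(f"recent_hash:{content_hash}", 1, ex=7 * 86400)
        return True
    record_new_hash(content_hash)  # Bloom false positive: actually a new hash
    return False
```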
4) Metadata DB: shard by domain
From the storage design:
- Shard by hash(domain) % 10
- Each shard handles ~1B URLs, enabling parallel reads/writes
- Keep critical indexes (status, next_crawl_at, domain, etc.) to support scheduling and admin queries
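Shard routing reuses the same stable-hash idea as the frontier partitioning, just with 10 buckets. A minimal sketch, assuming psycopg2 and illustrative connection strings:

```python
import hashlib
import psycopg2

NUM_SHARDS = 10
# Illustrative DSNs; a real deployment would pull these from config or service discovery.
SHARD_DSNS = [f"host=pg-shard-{i} dbname=crawler_metadata" for i in range(NUM_SHARDS)]

def shard_for_domain(domain: str) -> int:
    digest = hashlib.md5(domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def metadata_connection(domain: str):
    # All URL metadata for a domain lives on one shard, so scheduling queries
    # (status, next_crawl_at) never fan out across shards.
    return psycopg2.connect(SHARD_DSNS[shard_for_domain(domain)])
```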
5) Raw content: cheap, durable, tiered
From the storage design:
- Store compressed raw HTML in S3 (or HDFS for cost) with hot/warm/cold/archive tiers
- Keep raw content for retention/recrawl analysis without constantly pulling it into hot DB storage
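If the object store is S3, the hot/warm/cold/archive tiers map naturally onto a lifecycle policy. A hedged sketch with boto3 (bucket name, prefix, and storage-class choices are assumptions; only the day boundaries come from the tiering figure):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="crawler-raw-html",  # illustrative bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-html-tiering",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},    # warm after 30 days
                    {"Days": 90, "StorageClass": "GLACIER"},        # cold after 90 days
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},  # archive after 1 year
                ],
            }
        ]
    },
)
```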
6) Indexing: isolate it as an async pipeline
Indexing is CPU + IO heavy and can lag behind crawling.
- Decouple indexing via async workers
- Backpressure: if indexing falls behind, slow down crawl dequeue or reduce low-priority crawling
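A sketch of the backpressure loop on the crawl side, assuming a kafka-python consumer; indexing_lag(), crawl(), and the lag threshold are all hypothetical placeholders:

```python
import time
from kafka import KafkaConsumer

MAX_INDEXING_LAG = 1_000_000  # docs waiting to be indexed before we throttle (assumed)

def indexing_lag() -> int:
    """Hypothetical helper: read the indexing pipeline's backlog from metrics."""
    raise NotImplementedError

def crawl(url_bytes: bytes) -> None:
    """Hypothetical helper: fetch, dedup, store, and enqueue the page for indexing."""
    raise NotImplementedError

consumer = KafkaConsumer(
    "crawler_urls",
    group_id="crawlers",
    bootstrap_servers="kafka:9092",
)

while True:
    if indexing_lag() > MAX_INDEXING_LAG:
        # Indexing is falling behind: stop pulling new URLs instead of piling on more work.
        consumer.pause(*consumer.assignment())
        time.sleep(30)
        consumer.resume(*consumer.assignment())
        continue
    for tp, messages in consumer.poll(timeout_ms=1000, max_records=500).items():
        for msg in messages:
            crawl(msg.value)
```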
Performance characteristics (p95/p99 targets, throughput)
From the monitoring and observability plan:
- Crawl p95 latency target: 500ms
- Crawl p99 latency target: ~1000ms
How we protect latency targets:
- strict timeouts + bounded retries (avoid infinite hangs)
- circuit breakers for failing domains (Part 3)
- queue backlog monitoring (consumer lag, partition lag)
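A sketch of bounded retries with exponential backoff (the circuit-breaker piece is deferred to Part 3); the attempt count and delays are illustrative, not prescribed by the design, and aiohttp is assumed as the client:

```python
import asyncio
import aiohttp

MAX_ATTEMPTS = 3          # bounded retries: give up rather than hang on a slow or failing site
BASE_BACKOFF_SECONDS = 2  # backoff schedule of 2s, 4s, 8s (illustrative)

async def fetch_with_retries(session: aiohttp.ClientSession, url: str) -> bytes | None:
    for attempt in range(MAX_ATTEMPTS):
        try:
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.read()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == MAX_ATTEMPTS - 1:
                return None  # mark the URL failed; any later retry goes back through the frontier
            await asyncio.sleep(BASE_BACKOFF_SECONDS * 2 ** attempt)
```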
Cost lens (rough, but useful)
At this scale, the dominant cost drivers are usually:
- Raw HTML retention (object storage bytes/month → tiering choices)
- Index size + indexing throughput (Elasticsearch hardware and ops)
- Network egress (if crawlers run in cloud regions with paid egress)
The biggest lever is being strict about what you store and for how long: keep full raw HTML only as long as you need it for debugging/reprocessing, and rely on extracted text + metadata for most downstream uses.
Next: Part 3 deep dives the primary approach—Kafka frontier + politeness + retries—and how we make it reliable under failure.