Designing a Web Crawler: Deep Dive (Approach A - Kafka Frontier)
Problem statement
The hardest part of a high-throughput crawler is not “fetch a page”. It’s coordinating millions of URLs across 100+ nodes while:
- staying polite per domain,
- prioritizing freshness/importance,
- handling retries and backoff,
- and staying reliable during spikes and failures.
This is the URL frontier problem.
Proposed solution (protocols, data model, flows)
We implement the frontier using Kafka, with domain-based partitioning to make politeness enforceable.
From the URL frontier and queue/backpressure strategy:
- Topic: crawler_urls
- Partitions: 100
- Replication: 3
- Partition key: domain hash (hash(domain) % 100)
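A minimal sketch of creating that topic with the kafka-python admin client; the broker address is a placeholder assumption:

```python
# Sketch: create the frontier topic with the settings above.
# Assumes kafka-python is installed; the broker address is a placeholder.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="crawler_urls", num_partitions=100, replication_factor=3),
])
```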
From the crawler node design, each node has:
- Kafka consumer (URL dequeuer)
- Domain Cooldown Manager
- HTTP client with pooling + DNS cache
- Parser/extractor
- Dedup checker
- Result reporter
From the API contract, crawler nodes use:
- GET /api/v1/crawler/dequeue
- POST /api/v1/crawler/report
End-to-end flow
1) URL discovery
- Seeds, sitemaps, and extracted links produce candidate URLs.
- Normalize/canonicalize URLs (scheme, fragments, known tracking params) before enqueue.
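A minimal canonicalization sketch, assuming Python's standard urllib.parse; the tracking-parameter list is illustrative, not exhaustive:

```python
# Sketch of URL canonicalization before enqueue: lowercase scheme/host,
# drop fragments, and strip known tracking parameters.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_PARAMS]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(query),
        "",  # drop the fragment
    ))
```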
2) Enqueue into frontier
- Producer publishes a message keyed by domain.
- Partitioning keeps a domain’s URLs in a single partition.
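A sketch of the enqueue step with kafka-python; the message fields (url, priority, reason) mirror the frontier schema described below, and the broker address is a placeholder:

```python
# Sketch: publish a frontier message keyed by domain so all of a domain's
# URLs land in one partition (the default partitioner hashes the key).
import json
from urllib.parse import urlsplit
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def enqueue(url: str, priority: int, reason: str = "discovered") -> None:
    domain = urlsplit(url).netloc.lower()
    producer.send("crawler_urls", key=domain,
                  value={"url": url, "priority": priority, "reason": reason})
```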
Priority and freshness scheduling (what makes a frontier a frontier)
Kafka is a distributed log, not a strict priority queue, so the “frontier behavior” comes from the message schema plus consumer-side scheduling.
From the URL frontier and queue/backpressure strategy:
- priority is defined on a 0–1000 scale (higher = process first).
- Priority bands (examples): Critical, High, Medium, Low, Batch.
- Retries use exponential backoff and may lower the URL’s priority (e.g., after a 429).
- Recrawls come from a background job that enqueues URLs where next_crawl_at <= NOW().
Practical implementation sketch (per partition):
- Maintain separate in-memory queues per priority band, fed by Kafka messages.
- Dequeue with weighted random (e.g., critical > high > medium) to prevent starvation.
- For recrawls, treat the enqueue “reason” (scheduled_recrawl) as metadata so we can measure freshness vs. discovery throughput.
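A minimal per-partition scheduler sketch; the band names and weights are illustrative assumptions:

```python
# Sketch: one in-memory queue per priority band, dequeued by weighted random
# choice so lower bands still make progress (no starvation).
import random
from collections import deque

BAND_WEIGHTS = {"critical": 8, "high": 4, "medium": 2, "low": 1}

class BandScheduler:
    def __init__(self):
        self.queues = {band: deque() for band in BAND_WEIGHTS}

    def push(self, band: str, message: dict) -> None:
        self.queues[band].append(message)

    def pop(self):
        # Only consider bands that currently have work.
        ready = [b for b in BAND_WEIGHTS if self.queues[b]]
        if not ready:
            return None
        band = random.choices(ready, weights=[BAND_WEIGHTS[b] for b in ready], k=1)[0]
        return self.queues[band].popleft()
```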
This keeps the system honest about what it optimizes for: freshness and importance, while still respecting politeness.
Frontier persistence and recovery semantics (Kafka replay vs DB state)
Kafka is the durable work log, but the metadata DB is the durable state machine.
Practical split:
- Kafka: “candidate work” events (discover, retry, scheduled recrawl) that can be replayed after failures.
- Metadata DB (urls, crawl_queue): authoritative URL status and scheduling state (next_crawl_at, retry counters, last error, last fetched at).
Recovery pattern:
- If a crawler crashes mid-job, the Kafka message can be reprocessed (at-least-once). Idempotent writes (unique URL constraint, hash-keyed inserts, index upserts) keep this safe.
- A periodic reconciler re-enqueues URLs that are “due” (next_crawl_at <= now) and reclaims stuck in-flight work (e.g., heartbeat/lease expired).
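A sketch of that reconciler, assuming a DB-API connection, a status/lease column layout on urls (an assumption here), and an enqueue callable like the producer sketch above:

```python
# Sketch of the periodic reconciler. The status and lease_expires_at columns
# are assumptions about the urls table; enqueue(url, priority, reason)
# republishes to the frontier topic.
def reconcile(conn, enqueue) -> None:
    with conn.cursor() as cur:
        # Re-enqueue URLs that are due for a recrawl.
        cur.execute("""
            SELECT url, priority FROM urls
            WHERE next_crawl_at <= NOW() AND status <> 'in_flight'
            LIMIT 10000
        """)
        for url, priority in cur.fetchall():
            enqueue(url, priority, reason="scheduled_recrawl")

        # Reclaim stuck in-flight work whose lease/heartbeat has expired.
        cur.execute("""
            UPDATE urls SET status = 'pending'
            WHERE status = 'in_flight' AND lease_expires_at < NOW()
        """)
    conn.commit()
```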
3) Dequeue and politeness
- A crawler node consumes messages from its assigned partitions.
- Domain Cooldown Manager enforces per-domain delay and backs off on failures.
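A minimal Domain Cooldown Manager sketch; the base and maximum delays are illustrative assumptions:

```python
# Sketch: per-domain "earliest next fetch" timestamps, with a delay that
# resets on success and doubles on failure up to a cap.
import time

class DomainCooldownManager:
    def __init__(self, base_delay: float = 1.0, max_delay: float = 300.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.next_allowed = {}   # domain -> earliest fetch time (epoch seconds)
        self.current_delay = {}  # domain -> current backoff delay

    def ready(self, domain: str) -> bool:
        return time.time() >= self.next_allowed.get(domain, 0.0)

    def record_fetch(self, domain: str, ok: bool) -> None:
        delay = self.current_delay.get(domain, self.base_delay)
        delay = self.base_delay if ok else min(delay * 2, self.max_delay)
        self.current_delay[domain] = delay
        self.next_allowed[domain] = time.time() + delay
```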
4) Fetch + parse
- HTTP client uses pooling + timeouts (connect 10s, read 30s per the crawler node design).
- Parser extracts text + metadata + outgoing links.
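A sketch of the fetch step using the requests library with the connect/read timeouts from the crawler node design; the User-Agent string is a placeholder:

```python
# Sketch: pooled HTTP client with (connect, read) timeouts of (10s, 30s).
# Redirects are left to the explicit retry flow (see Figure 2).
import requests

session = requests.Session()  # connection pooling per host

def fetch(url: str) -> requests.Response:
    return session.get(
        url,
        timeout=(10, 30),  # (connect, read) seconds
        headers={"User-Agent": "example-crawler/1.0"},
        allow_redirects=False,
    )
```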
5) Dedup
From the deduplication strategy:
- Normalize content, compute SHA-256
- Check Bloom filter → Redis cache → DB
- Mark duplicates by linking duplicate_of_url_id in urls
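A sketch of the layered check, cheapest first; the Bloom filter object, Redis client, and DB lookup are stand-ins for whatever implementations the system uses:

```python
# Sketch: SHA-256 of normalized content, then Bloom filter -> Redis -> DB.
import hashlib

def is_duplicate(normalized_content: bytes, bloom, redis_client, db_lookup) -> bool:
    digest = hashlib.sha256(normalized_content).hexdigest()

    if digest not in bloom:
        # Bloom says "definitely not seen": record it and treat as new.
        bloom.add(digest)
        redis_client.set(digest, 1, ex=86400)  # recent-hash cache, 1-day TTL
        return False

    # Bloom says "maybe": confirm against the cache, then the hash DB.
    if redis_client.exists(digest) or db_lookup(digest):
        return True

    # Bloom false positive: content is actually new.
    redis_client.set(digest, 1, ex=86400)
    return False
```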
6) Store + index
From the storage design:
- Metadata state update to PostgreSQL shards (by domain hash % 10)
- Raw HTML stored to tiered object storage
- Index documents written to Elasticsearch
Core success path
flowchart TD
dequeue["Dequeue URL"] --> cooldown["Domain Cooldown Check"]
cooldown --> dns["DNS Resolve"]
dns --> http["HTTP Fetch"]
http --> parse["Parse HTML"]
parse --> extract["Extract Text and Links"]
extract --> hash["Compute Content Hash"]
hash --> dedup["Dedup Decision"]
dedup --> index["Index Document"]
Figure 1: Core success path from dequeue to indexing.
Error handling and retries
flowchart TD
dns["DNS Resolve"] -->|"Fail"| dns_retry["Retry With Backoff"]
dns -->|"OK"| http["HTTP Fetch"]
http --> status{"HTTP Status"}
status -->|"2xx"| parse["Parse HTML"]
status -->|"3xx"| redirect["Follow Redirect"]
status -->|"4xx"| drop["Log and Drop"]
status -->|"5xx"| backoff["Exponential Backoff"]
redirect --> http
backoff --> http
parse -->|"Fail"| failed["Mark Failed"]
parse -->|"OK"| store["Store and Index"]
Figure 2: Retry policy sketch (DNS/HTTP/redirects/parse failures).
Domain politeness at the worker
flowchart LR
kafka_partition["Kafka Partition (Domain Bucket)"] --> crawler["Crawler Node"]
crawler --> cooldown["Domain Cooldown Manager"]
cooldown --> per_domain_limit["Per-Domain Concurrency Limit"]
per_domain_limit --> http_pool["HTTP Connection Pool"]
Figure 3: Where politeness is enforced inside the crawler node.
Dedup layers (fast path to slow path)
flowchart TD
content["Normalized Content"] --> sha["SHA-256"]
sha --> bloom["Bloom Filter"]
bloom -->|"Maybe"| redis["Redis Recent Hash"]
redis -->|"Miss"| db["Hash DB"]
bloom -->|"No"| new_content["Treat As New"]
db --> new_content
Figure 4: Layered dedup to protect the DB at scale.
Idempotency keys (at-least-once safety)
flowchart TD
enqueue["Enqueue Candidate URL"] --> canonical["Canonicalize URL"]
canonical --> url_unique["Unique URL Constraint"]
report["Report Crawl Result"] --> hash_key["Hash Keyed Insert"]
hash_key --> index_upsert["Index Upsert (Doc ID = URL ID)"]
url_unique --> report
Figure 5: Idempotency strategy for at-least-once delivery.
Write path targets
flowchart LR
crawler["Crawler Node"] --> report["Report API"]
report --> pg["PostgreSQL Shard"]
report --> s3["Object Storage"]
report --> indexer["Index Worker"]
indexer --> es["Elasticsearch"]
Figure 6: Write targets for metadata, raw content, and indexing.
Failure modes + retries/timeouts/idempotency
1) The same URL appears many times
Duplicate URLs happen frequently (link graph repeats, parameters, retries). We handle this via idempotency at multiple layers:
- URL canonicalization before enqueue
- urls.url unique constraint in the metadata DB (from the data model and sharding plan)
- Content hash-based dedup (from the deduplication strategy)
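A sketch of the unique-constraint layer; the status and priority columns are assumptions about the urls table, and conn is a standard DB-API (Postgres) connection:

```python
# Sketch: inserting a canonicalized URL is a no-op if it already exists, so
# repeated discoveries and redelivered messages are safe.
def register_url(conn, canonical_url: str, priority: int) -> None:
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO urls (url, priority, status)
            VALUES (%s, %s, 'pending')
            ON CONFLICT (url) DO NOTHING
            """,
            (canonical_url, priority),
        )
    conn.commit()
```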
2) Kafka consumer lag during spikes
From the monitoring and observability plan, peak spikes can be 5×. If crawlers can’t keep up:
- consumer lag grows per partition
- freshness degrades
Mitigations:
- scale crawler nodes horizontally (add consumers)
- reduce low-priority crawling temporarily
- increase queue retention (so backlog isn’t lost)
3) Domain distress / politeness failures
If a domain starts returning 429/503 or timing out:
- apply exponential backoff (as part of the retry/backoff policy)
- open a circuit breaker for that domain (as part of crawler node reliability)
- stop consuming more URLs for that domain until cooldown expires
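A minimal per-domain circuit breaker sketch; the failure threshold and open duration are illustrative assumptions:

```python
# Sketch: after N consecutive failures the domain's circuit "opens" and the
# node skips its URLs until the cooldown expires.
import time

class DomainCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, open_seconds: float = 600.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.failures = {}    # domain -> consecutive failure count
        self.open_until = {}  # domain -> epoch seconds the circuit stays open

    def allow(self, domain: str) -> bool:
        return time.time() >= self.open_until.get(domain, 0.0)

    def record(self, domain: str, ok: bool) -> None:
        if ok:
            self.failures[domain] = 0
            return
        count = self.failures.get(domain, 0) + 1
        if count >= self.failure_threshold:
            self.open_until[domain] = time.time() + self.open_seconds
            count = 0
        self.failures[domain] = count
```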
4) Timeouts, retries, and retry storms
From the crawler node design:
- connect timeout: 10s
- read timeout: 30s
Rules of thumb:
- bounded retries (e.g., max 3)
- jittered backoff to avoid synchronized retry storms
- treat certain errors as non-retriable (e.g., persistent 404)
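A sketch of those rules: bounded attempts, full-jitter exponential backoff, and a non-retriable status set (the specific codes and limits are assumptions):

```python
# Sketch: retry policy helpers for the rules of thumb above.
import random

NON_RETRIABLE = {400, 401, 403, 404, 410}

def should_retry(status_code: int, attempt: int, max_attempts: int = 3) -> bool:
    if status_code in NON_RETRIABLE:
        return False
    return attempt < max_attempts

def retry_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    # Full jitter: uniform between 0 and the capped exponential backoff,
    # which avoids synchronized retry storms across nodes.
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```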
5) Exactly-once vs at-least-once semantics
In practice, Kafka consumption is at-least-once unless you build a full transactional pipeline. We embrace at-least-once and make all downstream writes idempotent:
- URL state updates use URL primary keys
- Content hash inserts are keyed by hash
- Index writes can be idempotent by document ID = URL ID
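A sketch of the idempotent index write using the Elasticsearch Python client (8.x-style document parameter); the index and field names are placeholders:

```python
# Sketch: the document ID is the URL ID, so a redelivered Kafka message
# overwrites the same document instead of creating a duplicate.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_document(url_id: int, title: str, text: str, content_hash: str) -> None:
    es.index(
        index="pages",
        id=str(url_id),  # doc ID = URL ID makes the write idempotent
        document={"title": title, "text": text, "content_hash": content_hash},
    )
```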
Performance & scale notes
- Partition count (100) defines parallelism for the frontier.
- Postgres shards (10) define parallelism for metadata.
- The effective system throughput is bounded by the slowest of: network fetch, storage writes, and indexing.
Trade-offs + why this approach wins
Why Kafka works well here:
- High throughput, horizontal scaling
- Durable log (replay possible)
- Consumer groups align with many crawler nodes
Trade-offs:
- Operational overhead (Kafka cluster management)
- Ordering/priority is approximate (per partition) unless you add extra scheduling logic
Next: Part 4 covers an alternative frontier design (Redis) and when it can be the better choice.