Designing a Web Crawler: Deep Dive (Approach B - Redis Frontier)
When this alternative is better
From the URL frontier and queue/backpressure strategy, Redis was considered as a centralized, in-memory option. Redis can be the right choice when:
- you’re not targeting 100K URLs/sec yet,
- you want simpler operations than Kafka,
- you want very fast priority operations (sorted sets) without building more scheduling logic,
- your crawler is small-to-medium and can tolerate a simpler HA story.
If your target is “small crawler for a product feature” rather than “internet-scale indexing”, Redis can be a pragmatic trade.
A second (independent) alternative: headless browser fetching
The “Approach B” axis doesn’t have to be about the frontier/queue. Based on the requirements, the crawler may need basic JavaScript support for dynamic content (via headless browser).
This is a separate decision from Kafka vs Redis:
- Kafka/Redis decides how we schedule work across nodes.
- Traditional fetch vs headless decides how we render/fetch a page.
Rule of thumb:
- Prefer traditional fetch-and-parse for most pages (fast and cheap).
- Use headless browsers only when the content is genuinely JS-rendered and required.
flowchart LR
traditional["Traditional Fetch and Parse"] --> fast["Fast and Cheap"]
traditional -->|"Works for"| static["Mostly Static Sites"]
headless["Headless Browser Rendering"] --> slow["Slow and Expensive"]
headless -->|"Needed for"| js["JavaScript Heavy Sites"]
Figure 1: Rendering axis is independent of frontier choice.
Architecture + key differences
Core idea
Use Redis as the “frontier brain”:
- A global priority structure (e.g., sorted set) holds candidate URLs.
- Per-domain gating is enforced by domain cooldown keys.
- Multiple crawler nodes pop work via atomic operations (Lua scripts).
Data structures (conceptual)
frontier:zset→ member = url_id/url, score = priority (and optionally next_allowed_time)domain:cooldown:{domain}→ a key with TTL; if present, domain is “cooling down”inflight:{crawler_id}→ set of URL IDs currently being processed
Politeness and cooldown
From the frontier strategy and crawler node execution model, politeness is mandatory. In Redis, you typically implement:
- a per-domain next-allowed timestamp
- jittered delays
- exponential backoff on 429/5xx/timeouts
Reliability / HA
This is where Redis differs the most from Kafka:
- Kafka provides a durable, partitioned log by design.
- Redis needs careful HA: replication + sentinel/cluster, persistence tuning, and handling failover correctness.
In other words: Redis can be “fast and simple” for moderate scale, but can become the bottleneck or operational risk at very high throughput.
Tip: Be explicit about failover semantics. If you can tolerate duplicates (and have idempotent downstream writes), Redis becomes much easier to operate.
What happens on Redis failover?
Redis failover is primarily a correctness concern: in-flight work can be lost or duplicated depending on your persistence and how you track leases.
Practical approach:
- Treat dequeue as a lease (URL moves to an
inflightset with a TTL/expiry timestamp). - If a crawler fails, a reaper job moves expired leases back to the frontier.
- After failover, assume at-least-once semantics and rely on the same idempotency protections described in Part 3 (unique URL constraints, hash-keyed inserts, index upserts).
Comparison table (complexity, cost, performance, ops)
Redis frontier mechanics
flowchart TD
redis_zset["Redis ZSET (Priority Frontier)"] --> pop["Atomic Pop (Lua)"]
pop --> crawler["Crawler Node"]
redis_cooldown["Domain Cooldown Keys"] --> pop
Figure 2: Redis frontier sketch (priority pop + cooldown gating).
Kafka vs Redis at a glance
flowchart LR
kafka["Kafka Frontier"] --> kafka_pro1["High Throughput"]
kafka --> kafka_pro2["Durable Replay"]
redis["Redis Frontier"] --> redis_pro1["Fast Priority Ops"]
redis --> redis_pro2["Simpler Setup"]
Figure 3: High-level tradeoff summary for frontier technologies.
| Dimension | Approach A (Kafka frontier) | Approach B (Redis frontier) |
|---|---|---|
| Throughput | Very high (partitioned log) | High, but central hot spots emerge |
| Ordering/priority | Approximate unless extra logic | Stronger priority semantics with ZSET |
| Politeness enforcement | Easy with domain→partition mapping | Needs careful per-domain gating logic |
| Replay / backfill | Built-in (log replay) | Requires custom persistence and requeue |
| Ops complexity | Kafka clusters + monitoring | Redis cluster/sentinel + persistence tuning |
| Failure behavior | Consumer lag / partition rebalances | Failover edge cases, hot keys, memory pressure |
| Best for | Large-scale distributed crawlers | Smaller crawlers, prototyping, simpler systems |
Next: Part 5 covers security/auth and the trust boundaries that matter for a crawler.