Designing a Web Crawler: Deep Dive (Approach B - Redis Frontier)

When this alternative is better

From the URL frontier and queue/backpressure strategy, Redis was considered as a centralized, in-memory option. Redis can be the right choice when:

  • you’re not targeting 100K URLs/sec yet,
  • you want simpler operations than Kafka,
  • you want very fast priority operations (sorted sets) without building more scheduling logic,
  • your crawler is small-to-medium and can tolerate a simpler HA story.

If your target is “small crawler for a product feature” rather than “internet-scale indexing”, Redis can be a pragmatic trade.

A second (independent) alternative: headless browser fetching

The “Approach B” axis doesn’t have to be about the frontier/queue. Based on the requirements, the crawler may need basic JavaScript support for dynamic content (via headless browser).

This is a separate decision from Kafka vs Redis:

  • Kafka/Redis decides how we schedule work across nodes.
  • Traditional fetch vs headless decides how we render/fetch a page.

Rule of thumb:

  • Prefer traditional fetch-and-parse for most pages (fast and cheap).
  • Use headless browsers only when the content is genuinely JS-rendered and required.
flowchart LR
	traditional["Traditional Fetch and Parse"] --> fast["Fast and Cheap"]
	traditional -->|"Works for"| static["Mostly Static Sites"]

	headless["Headless Browser Rendering"] --> slow["Slow and Expensive"]
	headless -->|"Needed for"| js["JavaScript Heavy Sites"]

Figure 1: Rendering axis is independent of frontier choice.

Architecture + key differences

Core idea

Use Redis as the “frontier brain”:

  • A global priority structure (e.g., sorted set) holds candidate URLs.
  • Per-domain gating is enforced by domain cooldown keys.
  • Multiple crawler nodes pop work via atomic operations (Lua scripts).

Data structures (conceptual)

  • frontier:zset → member = url_id/url, score = priority (and optionally next_allowed_time)
  • domain:cooldown:{domain} → a key with TTL; if present, domain is “cooling down”
  • inflight:{crawler_id} → set of URL IDs currently being processed

Politeness and cooldown

From the frontier strategy and crawler node execution model, politeness is mandatory. In Redis, you typically implement:

  • a per-domain next-allowed timestamp
  • jittered delays
  • exponential backoff on 429/5xx/timeouts

Reliability / HA

This is where Redis differs the most from Kafka:

  • Kafka provides a durable, partitioned log by design.
  • Redis needs careful HA: replication + sentinel/cluster, persistence tuning, and handling failover correctness.

In other words: Redis can be “fast and simple” for moderate scale, but can become the bottleneck or operational risk at very high throughput.

Tip: Be explicit about failover semantics. If you can tolerate duplicates (and have idempotent downstream writes), Redis becomes much easier to operate.

What happens on Redis failover?

Redis failover is primarily a correctness concern: in-flight work can be lost or duplicated depending on your persistence and how you track leases.

Practical approach:

  • Treat dequeue as a lease (URL moves to an inflight set with a TTL/expiry timestamp).
  • If a crawler fails, a reaper job moves expired leases back to the frontier.
  • After failover, assume at-least-once semantics and rely on the same idempotency protections described in Part 3 (unique URL constraints, hash-keyed inserts, index upserts).

Comparison table (complexity, cost, performance, ops)

Redis frontier mechanics

flowchart TD
	redis_zset["Redis ZSET (Priority Frontier)"] --> pop["Atomic Pop (Lua)"]
	pop --> crawler["Crawler Node"]
	redis_cooldown["Domain Cooldown Keys"] --> pop

Figure 2: Redis frontier sketch (priority pop + cooldown gating).

Kafka vs Redis at a glance

flowchart LR
	kafka["Kafka Frontier"] --> kafka_pro1["High Throughput"]
	kafka --> kafka_pro2["Durable Replay"]

	redis["Redis Frontier"] --> redis_pro1["Fast Priority Ops"]
	redis --> redis_pro2["Simpler Setup"]

Figure 3: High-level tradeoff summary for frontier technologies.

Dimension Approach A (Kafka frontier) Approach B (Redis frontier)
Throughput Very high (partitioned log) High, but central hot spots emerge
Ordering/priority Approximate unless extra logic Stronger priority semantics with ZSET
Politeness enforcement Easy with domain→partition mapping Needs careful per-domain gating logic
Replay / backfill Built-in (log replay) Requires custom persistence and requeue
Ops complexity Kafka clusters + monitoring Redis cluster/sentinel + persistence tuning
Failure behavior Consumer lag / partition rebalances Failover edge cases, hot keys, memory pressure
Best for Large-scale distributed crawlers Smaller crawlers, prototyping, simpler systems

Next: Part 5 covers security/auth and the trust boundaries that matter for a crawler.