Anthropic SDE Exclusive Coding Question: Web Crawler (BFS + Multithreading + System Design)

This problem is fundamentally graph traversal: pages are nodes, links are edges.


Problem Statement

Implement a web crawler:

  • Input: start_url
  • Helper function provided: get_urls(url)
    • HTTP fetch + basic link extraction is already handled
    • returns all links found on the page
  • Goal:
    1. Crawl only URLs under the same host as start_url
    2. Visit each URL at most once
    3. Return all crawled URLs

Implement a synchronous version first, then optimize it with multithreading.


1) Synchronous BFS Version

Why BFS?

  • The crawler is graph traversal
  • BFS naturally expands discovered links level by level
  • visited avoids cycles and duplicate work

Key ideas

  1. Use urlparse to get host from the seed URL
  2. Use a queue for BFS
  3. Use visited set for dedup
  4. Enqueue only same-host links

Python code (synchronous)

from collections import deque
from urllib.parse import urlparse


def crawl_sync(start_url, get_urls):
    # Restrict the crawl to pages under the same host as the seed URL.
    host = urlparse(start_url).netloc

    queue = deque([start_url])
    visited = {start_url}  # every URL is enqueued at most once
    result = []

    while queue:
        url = queue.popleft()
        result.append(url)

        for nxt in get_urls(url):
            if urlparse(nxt).netloc != host:
                continue  # skip external links
            if nxt in visited:
                continue  # skip already-discovered URLs
            visited.add(nxt)
            queue.append(nxt)

    return result

Complexity

Let V be the number of crawled same-host pages and E the number of same-host links between them.

  • Time: O(V + E)
  • Space: O(V)

2) Optimized Multithreaded Version (ThreadPoolExecutor)

The bottleneck is network I/O, not CPU. So concurrency is the main optimization.

Design

  • Use ThreadPoolExecutor to submit fetch tasks
  • Track future -> url
  • Process tasks as soon as they complete
  • Submit newly discovered URLs immediately

This directly answers the follow-up: how to process tasks immediately when ready.

Python code (multithreaded)

from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from urllib.parse import urlparse


def crawl_threaded(start_url, get_urls, max_workers=8):
    host = urlparse(start_url).netloc

    visited = {start_url}
    result = []

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each in-flight future back to the URL it is fetching.
        in_flight = {executor.submit(get_urls, start_url): start_url}

        while in_flight:
            # Wake up as soon as ANY fetch finishes, not a whole layer.
            done, _ = wait(in_flight.keys(), return_when=FIRST_COMPLETED)

            for fut in done:
                url = in_flight.pop(fut)
                result.append(url)

                try:
                    neighbors = fut.result()
                except Exception:
                    continue  # treat fetch failures as pages with no outlinks

                for nxt in neighbors:
                    if urlparse(nxt).netloc != host:
                        continue
                    if nxt in visited:
                        continue
                    visited.add(nxt)
                    # Submit newly discovered URLs immediately.
                    in_flight[executor.submit(get_urls, nxt)] = nxt

    return result

Why not strict layer-by-layer batching?

Layer batching waits for every task in a layer to finish before starting the next, so one slow page stalls the entire pool. “Complete-and-process immediately” keeps workers busy and usually crawls faster. Note also that visited and in_flight are touched only by the main thread here (workers just run get_urls), so no locks are needed.
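For contrast, a layer-batched version looks like the sketch below (same get_urls contract as above; this is an illustration of the slower design, not a recommendation). Each round blocks on executor.map until the slowest fetch in the layer returns:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def crawl_layered(start_url, get_urls, max_workers=8):
    """Layer-by-layer BFS: simpler, but each layer waits for its slowest fetch."""
    host = urlparse(start_url).netloc
    visited = {start_url}
    result = []
    layer = [start_url]

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        while layer:
            result.extend(layer)
            next_layer = []
            # executor.map blocks until the whole layer has been fetched
            for neighbors in executor.map(get_urls, layer):
                for nxt in neighbors:
                    if urlparse(nxt).netloc == host and nxt not in visited:
                        visited.add(nxt)
                        next_layer.append(nxt)
            layer = next_layer

    return result
```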


3) How urlparse helps

In this problem, urlparse is mainly used for:

  1. Same-host filtering via netloc
  2. URL normalization prep
  3. Avoiding accidental crawl of external links

Interview tip:

  • Compare normalized host (and optionally scheme/port) for strict matching.
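A minimal sketch of that tip (same_host is an illustrative helper, not part of the problem's API). urlparse already lower-cases the scheme, and .hostname lower-cases the host and strips any :port, which makes loose matching easy:

```python
from urllib.parse import urlparse


def same_host(url, seed, strict=False):
    """Loose match compares normalized hosts only; strict also
    requires matching scheme and explicit port."""
    a, b = urlparse(url), urlparse(seed)
    if a.hostname != b.hostname:  # .hostname is lower-cased, port stripped
        return False
    if strict and (a.scheme != b.scheme or a.port != b.port):
        return False
    return True
```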

4) Threads vs Processes

| Dimension         | Threads         | Processes             |
| ----------------- | --------------- | --------------------- |
| Memory            | Shared          | Isolated              |
| Overhead          | Lower           | Higher                |
| Python GIL impact | Present         | Separate interpreters |
| Best for          | I/O-bound tasks | CPU-bound tasks       |
| For this crawler  | ✅ Good fit     | Usually unnecessary   |

Conclusion: web crawling is typically I/O-bound, so threads (or async I/O) are preferred.


5) Politeness strategy (don’t overload servers)

A production crawler should be polite:

  1. Respect robots.txt
  2. Rate limit (global + per-host)
  3. Cap concurrency
  4. Retry with exponential backoff for 429/5xx
  5. Use clear User-Agent and contact info
  6. Add jitter between requests
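Items 2, 4, and 6 above can be sketched as follows. PerHostRateLimiter and fetch_with_backoff are hypothetical helpers under stated assumptions (transient failures surface as exceptions); a production crawler would likely lean on an HTTP client with retry middleware instead:

```python
import random
import time
from collections import defaultdict
from threading import Lock


class PerHostRateLimiter:
    """Minimal per-host rate limit: at most one request per `delay` seconds per host."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.last = defaultdict(float)  # host -> timestamp of last reserved slot
        self.lock = Lock()

    def wait(self, host):
        with self.lock:
            now = time.monotonic()
            sleep_for = max(0.0, self.last[host] + self.delay - now)
            self.last[host] = now + sleep_for  # reserve the next slot
        if sleep_for > 0:
            time.sleep(sleep_for)


def fetch_with_backoff(fetch, url, retries=3, base=0.5):
    """Retry transient failures (e.g. 429/5xx raised as exceptions)
    with exponential backoff plus jitter."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise
            time.sleep(base * (2 ** attempt) + random.uniform(0, base))
```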

6) Distributed design for millions of URLs

Typical large-scale architecture:

  1. URL Frontier Queue (Kafka/SQS/RabbitMQ)
  2. Fetcher Worker cluster (horizontally scalable)
  3. URL dedup service (Redis/Bloom filter + persistent KV)
  4. Parser + index pipeline (extract content/outlinks)
  5. Scheduler + monitoring (retry, DLQ, metrics)

Key reliability points:

  • At-least-once delivery + idempotent processing
  • Eventual consistency is acceptable for dedup at scale
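The dedup-service idea can be illustrated with a toy in-memory Bloom filter — a sketch only; production systems use Redis Bloom modules or a persistent KV store. The key property: no false negatives (a URL never slips through twice) at the cost of a small, tunable false-positive rate:

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter for URL dedup: set membership in a fixed-size
    bit array, with no false negatives and rare false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Carve k independent-ish hash values out of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = digest[i * 8:(i + 1) * 8]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```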

7) Duplicate content under different URLs

Same content can appear under different URLs due to tracking params, mirrors, etc.

Use two layers:

  1. URL-level dedup
    • Canonicalize URL (strip tracking params, normalize query/path)
  2. Content-level dedup
    • Exact hash (SHA-256)
    • Near-duplicate fingerprint (SimHash / MinHash)

Interview-friendly summary:

Canonicalize URL first to reduce redundant fetches, then fingerprint content to catch same-page-under-different-URL cases.
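Both layers can be sketched in a few lines. The TRACKING_PARAMS list and helper names here are illustrative, and real canonicalization has many more rules (trailing slashes, default ports, percent-encoding):

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Hypothetical denylist of common tracking parameters.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}


def canonicalize(url):
    """URL-level dedup: lower-case the host, drop the fragment and
    tracking params, and sort the remaining query params."""
    p = urlparse(url)
    query = sorted((k, v) for k, v in parse_qsl(p.query)
                   if k not in TRACKING_PARAMS)
    return urlunparse((p.scheme, p.netloc.lower(),
                       p.path or "/", "", urlencode(query), ""))


def content_fingerprint(html):
    """Content-level dedup: exact-duplicate detection via SHA-256.
    Near-duplicates need SimHash/MinHash instead."""
    return hashlib.sha256(html.encode()).hexdigest()
```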


8) Interview answer template

  1. Start with sync BFS: queue + visited + same-host filter
  2. Upgrade to concurrency: ThreadPoolExecutor + immediate ready-task processing
  3. Mention production concerns: politeness, retries, throttling
  4. Scale-out design: frontier + workers + dedup + storage
  5. Finish with duplicate-content strategy: canonical URL + content fingerprint

This structure covers coding + follow-up depth very effectively.


Common pitfalls

  • Missing visited → loops/redundant crawling
  • Incorrect host checks via substring matching
  • No concurrency cap
  • No retry/backoff on transient failures
  • Only URL dedup, no content dedup

Summary

This question is easy to start, but strong candidates stand out by completeness:

  • Baseline: correct BFS crawler
  • Better: efficient concurrent crawling with immediate task handling
  • Strong: can discuss politeness, distributed scale, and dedup architecture

🚀 Need VO/OA prep support?

oavoservice supports real interview prep across Anthropic, Google, Amazon, Meta, TikTok, Uber, Bloomberg, and more.

WeChat: Coding0201

  • ✅ High-frequency question roadmaps
  • ✅ 1v1 mock interviews (algorithms + system design)
  • ✅ Follow-up handling and communication coaching

Contact

Email: [email protected]
Telegram: @OAVOProxy