Anthropic

Anthropic SDE Exclusive Coding Question: Web Crawler (BFS + Multithreading + System Design)

2026-03-30

This problem is fundamentally graph traversal: pages are nodes, links are edges.



Problem Statement

Implement a web crawler: given a start URL and a helper get_urls(url) that returns every URL linked from that page, return all URLs reachable from the start URL that share its hostname.

Implement a synchronous version first, then optimize it to a multithreaded one.


1) Synchronous BFS Version

Why BFS?

BFS visits pages in order of link distance from the seed, pairs naturally with a queue plus a visited set, and avoids the recursion-depth risk DFS carries on deeply linked sites. Either traversal is correct here; BFS is simply the cleaner fit.

Key ideas

  1. Use urlparse to get host from the seed URL
  2. Use a queue for BFS
  3. Use visited set for dedup
  4. Enqueue only same-host links

Python code (synchronous)

from collections import deque
from urllib.parse import urlparse


def crawl_sync(start_url, get_urls):
    # Only pages on the seed's host should be crawled
    host = urlparse(start_url).netloc

    queue = deque([start_url])
    visited = {start_url}  # mark at enqueue time so a URL is never queued twice
    result = []

    while queue:
        url = queue.popleft()
        result.append(url)

        for nxt in get_urls(url):
            if urlparse(nxt).netloc != host:
                continue  # external link: skip
            if nxt in visited:
                continue  # already seen
            visited.add(nxt)
            queue.append(nxt)

    return result
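As a quick sanity check, the function can be run against a small in-memory link graph (the graph and URLs below are made up for illustration; crawl_sync is restated so the snippet runs standalone):

```python
from collections import deque
from urllib.parse import urlparse


def crawl_sync(start_url, get_urls):
    # Same implementation as above
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = {start_url}
    result = []
    while queue:
        url = queue.popleft()
        result.append(url)
        for nxt in get_urls(url):
            if urlparse(nxt).netloc == host and nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return result


# Hypothetical link graph: b.com is external and must be filtered out
GRAPH = {
    "http://a.com/": ["http://a.com/x", "http://b.com/y"],
    "http://a.com/x": ["http://a.com/", "http://a.com/z"],
    "http://a.com/z": [],
}

print(crawl_sync("http://a.com/", lambda u: GRAPH.get(u, [])))
# → ['http://a.com/', 'http://a.com/x', 'http://a.com/z']
```

Note that the cycle back to the seed and the external b.com link are both handled by the visited set and the host check.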

Complexity

Let V be the number of crawled same-host pages and E the number of same-host links between them. Each page is fetched once and each link examined once, so time is O(V + E); the queue and visited set take O(V) extra space.


2) Optimized Multithreaded Version (ThreadPoolExecutor)

The bottleneck is network I/O, not CPU. So concurrency is the main optimization.

Design

  1. Submit the seed fetch to a ThreadPoolExecutor
  2. Track in-flight futures in a dict mapping future → URL
  3. wait(..., return_when=FIRST_COMPLETED) wakes as soon as any fetch finishes
  4. Process each completed page immediately and submit its new same-host links

This directly answers the common follow-up: how to process tasks the moment they are ready, rather than waiting on a whole batch.

Python code (multithreaded)

from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from urllib.parse import urlparse


def crawl_threaded(start_url, get_urls, max_workers=8):
    host = urlparse(start_url).netloc

    visited = {start_url}
    result = []

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each in-flight future back to the URL it is fetching
        in_flight = {executor.submit(get_urls, start_url): start_url}

        while in_flight:
            # Wake as soon as ANY fetch finishes -- no layer batching
            done, _ = wait(in_flight.keys(), return_when=FIRST_COMPLETED)

            for fut in done:
                url = in_flight.pop(fut)
                result.append(url)

                try:
                    neighbors = fut.result()
                except Exception:
                    continue  # failed fetch: URL stays in result, but contributes no links

                for nxt in neighbors:
                    if urlparse(nxt).netloc != host:
                        continue  # external link
                    if nxt in visited:
                        continue
                    visited.add(nxt)
                    in_flight[executor.submit(get_urls, nxt)] = nxt

    return result

Why not strict layer-by-layer batching?

Layer batching waits for all tasks in a layer to finish before proceeding, which lowers throughput. “Complete-and-process immediately” keeps workers busier and usually crawls faster.
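For contrast, a strict layer-by-layer version looks roughly like this (a sketch under the same get_urls assumption; it finds the same URLs but pays the idle-worker cost described above whenever one page in a layer is slow):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def crawl_layered(start_url, get_urls, max_workers=8):
    host = urlparse(start_url).netloc
    visited = {start_url}
    result = []
    layer = [start_url]

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        while layer:
            result.extend(layer)
            # The entire layer must finish before any next-layer fetch starts
            pages = executor.map(get_urls, layer)
            next_layer = []
            for neighbors in pages:
                for nxt in neighbors:
                    if urlparse(nxt).netloc == host and nxt not in visited:
                        visited.add(nxt)
                        next_layer.append(nxt)
            layer = next_layer

    return result
```

The executor.map call is the bottleneck: one straggler in a layer blocks every worker until the whole layer drains.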


3) How urlparse helps

In this problem, urlparse is mainly used for:

  1. Same-host filtering via netloc
  2. URL normalization prep
  3. Avoiding accidental crawl of external links
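A minimal canonicalization helper built from these pieces might look as follows (the exact normalization rules here are a design choice for illustration, not part of the original problem):

```python
from urllib.parse import urljoin, urlparse, urlunparse


def canonicalize(base_url, link):
    # Resolve relative links against the page they were found on
    absolute = urljoin(base_url, link)
    p = urlparse(absolute)
    # Lowercase scheme/host, default empty paths to "/", drop the fragment
    # (the fragment never reaches the server, so it is pure dedup noise)
    return urlunparse((p.scheme.lower(), p.netloc.lower(), p.path or "/",
                       p.params, p.query, ""))


print(canonicalize("http://a.com/docs/page", "../x?id=1#top"))
# → http://a.com/x?id=1
```

Feeding every extracted link through a helper like this before the visited check keeps the crawler from fetching the same page under trivially different spellings.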

Interview tip: urlparse only splits a URL into parts; it does not normalize or validate it. Mentioning that a production crawler would also lowercase the host, strip fragments, and resolve relative links with urljoin shows awareness of exactly the gap interviewers probe.


4) Threads vs Processes

| Dimension | Threads | Processes |
| --- | --- | --- |
| Memory | Shared | Isolated |
| Overhead | Lower | Higher |
| Python GIL impact | Present | Separate interpreters |
| Best for | I/O-bound tasks | CPU-bound tasks |
| For this crawler | ✅ Good fit | Usually unnecessary |

Conclusion: web crawling is typically I/O-bound, so threads (or async I/O) are preferred.
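A small experiment makes the point concrete: blocked I/O releases the GIL, so waits overlap across threads. Here time.sleep stands in for network latency (timings are approximate and machine-dependent):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    time.sleep(0.1)  # simulates network latency; sleeping releases the GIL
    return url


urls = [f"http://a.com/{i}" for i in range(8)]

start = time.perf_counter()
for u in urls:
    fake_fetch(u)
sequential = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as ex:
    list(ex.map(fake_fetch, urls))
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

With 8 workers the eight 0.1 s waits overlap almost completely, while the sequential loop pays for each one in full. A CPU-bound fake_fetch would show no such speedup under the GIL.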


5) Politeness strategy (don’t overload servers)

A production crawler should be polite:

  1. Respect robots.txt
  2. Rate limit (global + per-host)
  3. Cap concurrency
  4. Retry with exponential backoff for 429/5xx
  5. Use clear User-Agent and contact info
  6. Add jitter between requests
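Points 2, 4, and 6 can be sketched together in one small scheduler (an illustrative design, not a production implementation; the class name and defaults are my own):

```python
import random
import time
from collections import defaultdict


class PoliteScheduler:
    """Per-host minimum delay plus jittered exponential backoff."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_request = defaultdict(float)  # host -> last request timestamp

    def wait_for(self, host):
        # Enforce at most one request per host per min_delay seconds
        elapsed = time.monotonic() - self.last_request[host]
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_request[host] = time.monotonic()

    @staticmethod
    def backoff_delay(attempt, base=0.5, cap=30.0):
        # Exponential backoff with full jitter, for retrying 429/5xx responses
        return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Each fetcher would call wait_for(host) before a request and sleep backoff_delay(attempt) after a retryable failure; the jitter prevents a fleet of workers from retrying in lockstep.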

6) Distributed design for millions of URLs

Typical large-scale architecture:

  1. URL Frontier Queue (Kafka/SQS/RabbitMQ)
  2. Fetcher Worker cluster (horizontally scalable)
  3. URL dedup service (Redis/Bloom filter + persistent KV)
  4. Parser + index pipeline (extract content/outlinks)
  5. Scheduler + monitoring (retry, DLQ, metrics)
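The dedup service in step 3 typically fronts the persistent store with a Bloom filter, which answers "definitely new" cheaply in memory. A toy version, with sizes chosen only for illustration:

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, small false-positive rate."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from slices of one SHA-256 digest
        digest = hashlib.sha256(url.encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[i * 4:(i + 1) * 4], "big")
            yield chunk % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```

Because a miss is guaranteed correct, workers only fall through to the slower persistent KV store on a Bloom hit, which is rare for genuinely new URLs.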

Key reliability points:

  1. Idempotent fetch tasks, so at-least-once queue delivery is safe
  2. Persisted frontier and visited state, so crawls resume after crashes
  3. Dead-letter queue for URLs that fail repeatedly
  4. Frontier partitioned by host, so each host's rate limit is enforced in one place
  5. Metrics and alerting on fetch error rates and queue depth

7) Duplicate content under different URLs

Same content can appear under different URLs due to tracking params, mirrors, etc.

Use two layers:

  1. URL-level dedup
    • Canonicalize URL (strip tracking params, normalize query/path)
  2. Content-level dedup
    • Exact hash (SHA-256)
    • Near-duplicate fingerprint (SimHash / MinHash)

Interview-friendly summary:

Canonicalize URL first to reduce redundant fetches, then fingerprint content to catch same-page-under-different-URL cases.
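A compact sketch of both layers (the tracking-parameter list is illustrative, and real systems would add SimHash/MinHash for near-duplicates on top of the exact hash shown here):

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Illustrative, not exhaustive
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid"}


def canonical_url(url):
    # Layer 1: strip tracking params and fragments, lowercase the host,
    # and sort the remaining query pairs so order differences collapse
    p = urlparse(url)
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(p.query) if k not in TRACKING_PARAMS
    ))
    return urlunparse((p.scheme.lower(), p.netloc.lower(), p.path or "/",
                       "", query, ""))


def content_fingerprint(html):
    # Layer 2: exact-duplicate hash over whitespace-normalized, lowercased text
    normalized = re.sub(r"\s+", " ", html).strip().lower()
    return hashlib.sha256(normalized.encode()).hexdigest()
```

canonical_url runs before the fetch (saving bandwidth); content_fingerprint runs after, catching mirrors and alias paths that canonicalization cannot see.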


8) Interview answer template

  1. Start with sync BFS: queue + visited + same-host filter
  2. Upgrade to concurrency: ThreadPoolExecutor + immediate ready-task processing
  3. Mention production concerns: politeness, retries, throttling
  4. Scale-out design: frontier + workers + dedup + storage
  5. Finish with duplicate-content strategy: canonical URL + content fingerprint

This structure covers coding + follow-up depth very effectively.


Common pitfalls

  1. Marking visited at dequeue time instead of enqueue time, which lets duplicates pile up in the queue
  2. Comparing full URLs instead of netloc for the same-host check
  3. Swallowing or ignoring fetch exceptions, which can silently stall the crawl
  4. Strict layer-by-layer batching, which leaves workers idle behind stragglers
  5. Unbounded concurrency that hammers a single host


Summary

This question is easy to start, but strong candidates stand out by completeness:

  1. A correct sync BFS with dedup and same-host filtering
  2. A concurrency upgrade that processes results as soon as they complete
  3. Production awareness: politeness, retries, rate limiting
  4. A credible scale-out story: frontier, workers, dedup, storage
  5. Duplicate-content handling: canonical URLs plus content fingerprints


🚀 Need VO/OA prep support?

oavoservice supports real interview prep across Anthropic, Google, Amazon, Meta, TikTok, Uber, Bloomberg, and more.

WeChat: Coding0201


Contact

Email: [email protected]
Telegram: @OAVOProxy