This problem is fundamentally graph traversal: pages are nodes, links are edges.

Problem Statement
Implement a web crawler:
- Input: `start_url`
- Helper function provided: `get_urls(url)`
  - HTTP fetch + basic link extraction is already handled
  - Returns all links found on the page
- Goal:
  - Crawl only URLs under the same host as `start_url`
  - Visit each URL at most once
  - Return all crawled URLs
Implement a synchronous version first, then optimize to multithreaded.
1) Synchronous BFS Version
Why BFS?
- The crawler is a graph traversal
- BFS naturally expands discovered links level by level
- A `visited` set avoids cycles and duplicate work
Key ideas
- Use `urlparse` to get the host from the seed URL
- Use a queue for BFS
- Use a `visited` set for dedup
- Enqueue only same-host links
Python code (synchronous)
```python
from collections import deque
from urllib.parse import urlparse

def crawl_sync(start_url, get_urls):
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = {start_url}
    result = []
    while queue:
        url = queue.popleft()
        result.append(url)
        for nxt in get_urls(url):
            if urlparse(nxt).netloc != host:
                continue
            if nxt in visited:
                continue
            visited.add(nxt)
            queue.append(nxt)
    return result
```
Complexity
Let V be the number of crawled same-host pages and E the number of same-host links.
- Time: O(V + E)
- Space: O(V)
2) Optimized Multithreaded Version (ThreadPoolExecutor)
The bottleneck is network I/O, not CPU. So concurrency is the main optimization.
Design
- Use `ThreadPoolExecutor` to submit fetch tasks
- Track a `future -> url` mapping
- Process tasks as soon as they complete
- Submit newly discovered URLs immediately
This directly answers the follow-up: how to process tasks immediately when ready.
Python code (multithreaded)
```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from urllib.parse import urlparse

def crawl_threaded(start_url, get_urls, max_workers=8):
    host = urlparse(start_url).netloc
    visited = {start_url}
    result = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        in_flight = {executor.submit(get_urls, start_url): start_url}
        while in_flight:
            done, _ = wait(in_flight.keys(), return_when=FIRST_COMPLETED)
            for fut in done:
                url = in_flight.pop(fut)
                result.append(url)
                try:
                    neighbors = fut.result()
                except Exception:
                    continue
                for nxt in neighbors:
                    if urlparse(nxt).netloc != host:
                        continue
                    if nxt in visited:
                        continue
                    visited.add(nxt)
                    in_flight[executor.submit(get_urls, nxt)] = nxt
    return result
```
Why not strict layer-by-layer batching?
Layer batching waits for all tasks in a layer to finish before proceeding, which lowers throughput. “Complete-and-process immediately” keeps workers busier and usually crawls faster.
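For contrast, a layer-batched crawler (a hypothetical sketch using `executor.map`) blocks at each level boundary, so a single slow page stalls the whole round:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def crawl_layered(start_url, get_urls, max_workers=8):
    # Layer-by-layer BFS: each executor.map call blocks until the
    # entire layer finishes before the next layer is submitted.
    host = urlparse(start_url).netloc
    visited = {start_url}
    result = []
    layer = [start_url]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        while layer:
            result.extend(layer)
            next_layer = []
            # map preserves input order, so output stays deterministic
            for neighbors in executor.map(get_urls, layer):
                for nxt in neighbors:
                    if urlparse(nxt).netloc == host and nxt not in visited:
                        visited.add(nxt)
                        next_layer.append(nxt)
            layer = next_layer
    return result
```

Functionally it visits the same set of pages; the difference is purely in worker utilization, which is why the complete-and-process-immediately version above is the better answer.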
3) How urlparse helps
In this problem, `urlparse` is mainly used for:
- Same-host filtering via `netloc`
- URL normalization prep
- Avoiding accidental crawl of external links
Interview tip:
- Compare normalized host (and optionally scheme/port) for strict matching.
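A strict check along those lines might look like this — a hypothetical `same_host` helper that compares the normalized hostname plus the effective port, treating default ports as equivalent:

```python
from urllib.parse import urlparse

DEFAULT_PORTS = {"http": 80, "https": 443}

def same_host(url_a, url_b):
    # Strict comparison: lowercased hostname plus effective port.
    # Extend the key with p.scheme if the scheme must match too.
    def key(url):
        p = urlparse(url)
        host = (p.hostname or "").lower()
        port = p.port or DEFAULT_PORTS.get(p.scheme)
        return (host, port)
    return key(url_a) == key(url_b)
```

Note that this deliberately rejects subdomains; substring checks like `"example.com" in url` would wrongly accept `evil-example.com`.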
4) Threads vs Processes
| Dimension | Threads | Processes |
|---|---|---|
| Memory | Shared | Isolated |
| Overhead | Lower | Higher |
| Python GIL impact | Present | Separate interpreters |
| Best for | I/O-bound tasks | CPU-bound tasks |
| For this crawler | ✅ Good fit | Usually unnecessary |
Conclusion: web crawling is typically I/O-bound, so threads (or async I/O) are preferred.
5) Politeness strategy (don’t overload servers)
A production crawler should be polite:
- Respect `robots.txt`
- Rate limit (global + per-host)
- Cap concurrency
- Retry with exponential backoff for 429/5xx
- Use clear User-Agent and contact info
- Add jitter between requests
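A minimal sketch of the rate-limit, backoff, and jitter points, assuming a hypothetical `PoliteFetcher` wrapper around any fetch callable (not thread-safe as written; a real crawler would guard `last_hit` with a lock):

```python
import random
import time
from urllib.parse import urlparse

class PoliteFetcher:
    """Per-host request spacing with jitter, plus exponential backoff."""

    def __init__(self, fetch, min_interval=1.0, max_retries=3):
        self.fetch = fetch              # any callable that raises on failure
        self.min_interval = min_interval
        self.max_retries = max_retries
        self.last_hit = {}              # host -> timestamp of last request

    def get(self, url):
        host = urlparse(url).netloc
        # Per-host rate limit with a little jitter between requests.
        earliest = self.last_hit.get(host, 0) + self.min_interval
        delay = earliest - time.monotonic() + random.uniform(0, 0.2)
        if delay > 0:
            time.sleep(delay)
        for attempt in range(self.max_retries):
            self.last_hit[host] = time.monotonic()
            try:
                return self.fetch(url)
            except Exception:
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
```

In production the retry branch would inspect status codes (retry only 429/5xx) and honor `Retry-After` headers rather than catching every exception.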
6) Distributed design for millions of URLs
Typical large-scale architecture:
- URL Frontier Queue (Kafka/SQS/RabbitMQ)
- Fetcher Worker cluster (horizontally scalable)
- URL dedup service (Redis/Bloom filter + persistent KV)
- Parser + index pipeline (extract content/outlinks)
- Scheduler + monitoring (retry, DLQ, metrics)
Key reliability points:
- At-least-once delivery + idempotent processing
- Eventual consistency is acceptable for dedup at scale
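For the dedup service, a Bloom filter trades a small false-positive rate (a URL may occasionally be skipped, but never crawled twice) for very low memory. A minimal in-memory sketch; a production system would back this with Redis bitmaps or RedisBloom plus a persistent KV store for exact confirmation:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False positives possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The asymmetry is what makes it safe here: "might contain" can only cause a redundant skip, never a duplicate crawl.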
7) Duplicate content under different URLs
Same content can appear under different URLs due to tracking params, mirrors, etc.
Use two layers:
- URL-level dedup
- Canonicalize URL (strip tracking params, normalize query/path)
- Content-level dedup
- Exact hash (SHA-256)
- Near-duplicate fingerprint (SimHash / MinHash)
Interview-friendly summary:
Canonicalize URL first to reduce redundant fetches, then fingerprint content to catch same-page-under-different-URL cases.
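The two layers can be sketched as follows (the `TRACKING_PARAMS` set and `canonicalize` helper are illustrative assumptions, not a complete normalizer):

```python
import hashlib
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Hypothetical blocklist of common tracking parameters.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def canonicalize(url):
    # URL-level dedup: lowercase the host, drop the fragment and tracking
    # params, and sort remaining query params so equivalent URLs compare equal.
    p = urlparse(url)
    query = sorted((k, v) for k, v in parse_qsl(p.query)
                   if k not in TRACKING_PARAMS)
    return urlunparse((p.scheme, p.netloc.lower(), p.path or "/",
                       p.params, urlencode(query), ""))

def content_fingerprint(body: bytes) -> str:
    # Content-level dedup: exact hash; near-duplicate detection
    # would use SimHash/MinHash instead.
    return hashlib.sha256(body).hexdigest()
```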
8) Interview answer template
- Start with sync BFS: queue + visited + same-host filter
- Upgrade to concurrency: ThreadPoolExecutor + immediate ready-task processing
- Mention production concerns: politeness, retries, throttling
- Scale-out design: frontier + workers + dedup + storage
- Finish with duplicate-content strategy: canonical URL + content fingerprint
This structure covers coding + follow-up depth very effectively.
Common pitfalls
- Missing `visited` set → loops/redundant crawling
- Incorrect host checks via substring matching
- No concurrency cap
- No retry/backoff on transient failures
- Only URL dedup, no content dedup
Summary
This question is easy to start, but strong candidates stand out by completeness:
- Baseline: correct BFS crawler
- Better: efficient concurrent crawling with immediate task handling
- Strong: can discuss politeness, distributed scale, and dedup architecture
🚀 Need VO/OA prep support?
oavoservice supports real interview prep across Anthropic, Google, Amazon, Meta, TikTok, Uber, Bloomberg, and more.
WeChat: Coding0201
- ✅ High-frequency question roadmaps
- ✅ 1v1 mock interviews (algorithms + system design)
- ✅ Follow-up handling and communication coaching
Contact
Email: [email protected]
Telegram: @OAVOProxy