This problem is fundamentally graph traversal: pages are nodes, links are edges.

Problem Statement
Implement a web crawler:
- Input: `start_url`
- Helper function provided: `get_urls(url)`
  - HTTP fetch + basic link extraction is already handled
  - Returns all links found on the page
- Goal:
  - Crawl only URLs under the same host as `start_url`
  - Visit each URL at most once
  - Return all crawled URLs
Implement a synchronous version first, then optimize to multithreaded.
1) Synchronous BFS Version
Why BFS?
- The crawler is a graph traversal
- BFS naturally expands discovered links level by level
- A `visited` set avoids cycles and duplicate work
Key ideas
- Use `urlparse` to get the host from the seed URL
- Use a queue for BFS
- Use a `visited` set for dedup
- Enqueue only same-host links
Python code (synchronous)
```python
from collections import deque
from urllib.parse import urlparse

def crawl_sync(start_url, get_urls):
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = {start_url}
    result = []
    while queue:
        url = queue.popleft()
        result.append(url)
        for nxt in get_urls(url):
            if urlparse(nxt).netloc != host:
                continue
            if nxt in visited:
                continue
            visited.add(nxt)
            queue.append(nxt)
    return result
```
Complexity
Let V be the number of crawled same-host pages and E the number of same-host links.
- Time: O(V + E)
- Space: O(V)
2) Optimized Multithreaded Version (ThreadPoolExecutor)
The bottleneck is network I/O, not CPU. So concurrency is the main optimization.
Design
- Use `ThreadPoolExecutor` to submit fetch tasks
- Track a `future -> url` mapping
- Process tasks as soon as they complete
- Submit newly discovered URLs immediately
This directly answers the follow-up: how to process tasks immediately when ready.
Python code (multithreaded)
```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from urllib.parse import urlparse

def crawl_threaded(start_url, get_urls, max_workers=8):
    host = urlparse(start_url).netloc
    visited = {start_url}
    result = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        in_flight = {executor.submit(get_urls, start_url): start_url}
        while in_flight:
            done, _ = wait(in_flight.keys(), return_when=FIRST_COMPLETED)
            for fut in done:
                url = in_flight.pop(fut)
                result.append(url)
                try:
                    neighbors = fut.result()
                except Exception:
                    continue
                for nxt in neighbors:
                    if urlparse(nxt).netloc != host:
                        continue
                    if nxt in visited:
                        continue
                    visited.add(nxt)
                    in_flight[executor.submit(get_urls, nxt)] = nxt
    return result
```
Why not strict layer-by-layer batching?
Layer batching waits for all tasks in a layer to finish before proceeding, which lowers throughput. “Complete-and-process immediately” keeps workers busier and usually crawls faster.
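For contrast, a layer-batched crawler (a hypothetical sketch using `executor.map`) blocks at each level boundary, so a single slow page stalls the whole round:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def crawl_layered(start_url, get_urls, max_workers=8):
    # Layer-by-layer BFS: each executor.map call blocks until the
    # entire layer finishes before the next layer is submitted.
    host = urlparse(start_url).netloc
    visited = {start_url}
    result = []
    layer = [start_url]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        while layer:
            result.extend(layer)
            next_layer = []
            # map preserves input order, so output stays deterministic
            for neighbors in executor.map(get_urls, layer):
                for nxt in neighbors:
                    if urlparse(nxt).netloc == host and nxt not in visited:
                        visited.add(nxt)
                        next_layer.append(nxt)
            layer = next_layer
    return result
```

Functionally it visits the same set of pages; the difference is purely in worker utilization, which is why the complete-and-process-immediately version above is the better answer.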
3) How urlparse helps
In this problem, `urlparse` is mainly used for:
- Same-host filtering via `netloc`
- URL normalization prep
- Avoiding accidental crawl of external links
Interview tip:
- Compare normalized host (and optionally scheme/port) for strict matching.
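A strict check along those lines might look like this — a hypothetical `same_host` helper that compares the normalized hostname plus the effective port, treating default ports as equivalent:

```python
from urllib.parse import urlparse

DEFAULT_PORTS = {"http": 80, "https": 443}

def same_host(url_a, url_b):
    # Strict comparison: lowercased hostname plus effective port.
    # Extend the key with p.scheme if the scheme must match too.
    def key(url):
        p = urlparse(url)
        host = (p.hostname or "").lower()
        port = p.port or DEFAULT_PORTS.get(p.scheme)
        return (host, port)
    return key(url_a) == key(url_b)
```

Note that this deliberately rejects subdomains; substring checks like `"example.com" in url` would wrongly accept `evil-example.com`.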
4) Threads vs Processes
| Dimension | Threads | Processes |
|---|---|---|
| Memory | Shared | Isolated |
| Overhead | Lower | Higher |
| Python GIL impact | Present | Separate interpreters |
| Best for | I/O-bound tasks | CPU-bound tasks |
| For this crawler | ✅ Good fit | Usually unnecessary |
Conclusion: web crawling is typically I/O-bound, so threads (or async I/O) are preferred.
5) Politeness strategy (don’t overload servers)
A production crawler should be polite:
- Respect `robots.txt`
- Rate limit (global + per-host)
- Cap concurrency
- Retry with exponential backoff for 429/5xx
- Use clear User-Agent and contact info
- Add jitter between requests
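A minimal sketch of the rate-limit, backoff, and jitter points, assuming a hypothetical `PoliteFetcher` wrapper around any fetch callable (not thread-safe as written; a real crawler would guard `last_hit` with a lock):

```python
import random
import time
from urllib.parse import urlparse

class PoliteFetcher:
    """Per-host request spacing with jitter, plus exponential backoff."""

    def __init__(self, fetch, min_interval=1.0, max_retries=3):
        self.fetch = fetch              # any callable that raises on failure
        self.min_interval = min_interval
        self.max_retries = max_retries
        self.last_hit = {}              # host -> timestamp of last request

    def get(self, url):
        host = urlparse(url).netloc
        # Per-host rate limit with a little jitter between requests.
        earliest = self.last_hit.get(host, 0) + self.min_interval
        delay = earliest - time.monotonic() + random.uniform(0, 0.2)
        if delay > 0:
            time.sleep(delay)
        for attempt in range(self.max_retries):
            self.last_hit[host] = time.monotonic()
            try:
                return self.fetch(url)
            except Exception:
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
```

In production the retry branch would inspect status codes (retry only 429/5xx) and honor `Retry-After` headers rather than catching every exception.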
6) Distributed design for millions of URLs
Typical large-scale architecture:
- URL Frontier Queue (Kafka/SQS/RabbitMQ)
- Fetcher Worker cluster (horizontally scalable)
- URL dedup service (Redis/Bloom filter + persistent KV)
- Parser + index pipeline (extract content/outlinks)
- Scheduler + monitoring (retry, DLQ, metrics)
Key reliability points:
- At-least-once delivery + idempotent processing
- Eventual consistency is acceptable for dedup at scale
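For the dedup service, a Bloom filter trades a small false-positive rate (a URL may occasionally be skipped, but never crawled twice) for very low memory. A minimal in-memory sketch; a production system would back this with Redis bitmaps or RedisBloom plus a persistent KV store for exact confirmation:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False positives possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The asymmetry is what makes it safe here: "might contain" can only cause a redundant skip, never a duplicate crawl.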
7) Duplicate content under different URLs
Same content can appear under different URLs due to tracking params, mirrors, etc.
Use two layers:
- URL-level dedup
- Canonicalize URL (strip tracking params, normalize query/path)
- Content-level dedup
- Exact hash (SHA-256)
- Near-duplicate fingerprint (SimHash / MinHash)
Interview-friendly summary:
Canonicalize URL first to reduce redundant fetches, then fingerprint content to catch same-page-under-different-URL cases.
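The two layers can be sketched as follows (the `TRACKING_PARAMS` set and `canonicalize` helper are illustrative assumptions, not a complete normalizer):

```python
import hashlib
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Hypothetical blocklist of common tracking parameters.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def canonicalize(url):
    # URL-level dedup: lowercase the host, drop the fragment and tracking
    # params, and sort remaining query params so equivalent URLs compare equal.
    p = urlparse(url)
    query = sorted((k, v) for k, v in parse_qsl(p.query)
                   if k not in TRACKING_PARAMS)
    return urlunparse((p.scheme, p.netloc.lower(), p.path or "/",
                       p.params, urlencode(query), ""))

def content_fingerprint(body: bytes) -> str:
    # Content-level dedup: exact hash; near-duplicate detection
    # would use SimHash/MinHash instead.
    return hashlib.sha256(body).hexdigest()
```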
8) Interview answer template
- Start with sync BFS: queue + visited + same-host filter
- Upgrade to concurrency: ThreadPoolExecutor + immediate ready-task processing
- Mention production concerns: politeness, retries, throttling
- Scale-out design: frontier + workers + dedup + storage
- Finish with duplicate-content strategy: canonical URL + content fingerprint
This structure covers coding + follow-up depth very effectively.
Common pitfalls
- Missing `visited` set → loops/redundant crawling
- Incorrect host checks via substring matching
- No concurrency cap
- No retry/backoff on transient failures
- Only URL dedup, no content dedup
Summary
This question is easy to start, but strong candidates stand out by completeness:
- Baseline: correct BFS crawler
- Better: efficient concurrent crawling with immediate task handling
- Strong: can discuss politeness, distributed scale, and dedup architecture
🚀 Need VO/OA prep support?
oavoservice supports real interview prep across Anthropic, Google, Amazon, Meta, TikTok, Uber, Bloomberg, and more.
WeChat: Coding0201
- ✅ High-frequency question roadmaps
- ✅ 1v1 mock interviews (algorithms + system design)
- ✅ Follow-up handling and communication coaching
Contact
Email: [email protected]
Telegram: @OAVOProxy