Anthropic's system design interview is nothing like a traditional FAANG loop: instead of distributed KV stores or URL shorteners, the surface is LLM serving, RAG, tool-calling agents, and model evaluation pipelines. This article walks through the four canonical question patterns from spring 2026 candidate feedback, with whiteboard scripts and a VO assist playbook.
Anthropic System Design Snapshot
| Dimension | Detail |
|---|---|
| Duration | 60 minutes |
| Format | Excalidraw / physical whiteboard |
| Pacing | 5-min clarify + 40-min design + 15-min follow-up |
| Signals | Scale + correctness + safety + extensibility |
| Mandatory tracks | LLM serving / RAG / agent / eval |
Pattern 1: Long-Context Inference Serving
Prompt
"Design Claude's 200K context inference serving stack supporting 100K QPS, p95 latency ≤ 2s, with cost under control."
Framework
- Clarify: Does 100K QPS mean requests per second or tokens per second? Average prompt length? Streaming vs non-streaming?
- Data flow: Client → LB → Tokenizer → Prefill GPU pool → Decode GPU pool → Streaming response
- Key design:
  - Prefill / decode separation: prefill is compute-bound; decode is memory-bandwidth-bound
  - Continuous batching: vLLM / SGLang-style dynamic batching
  - KV cache management: PagedAttention to limit memory fragmentation, plus CPU offload for long contexts
  - Prefix caching: share the KV cache across requests with the same prompt prefix (as in Anthropic's official prompt caching)
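- Scale math: 100K requests/sec × 200K tokens of context (treating every request as full-length, an upper bound) = 20B input tokens/sec → estimate H100 node count from per-GPU prefill throughput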
- Failure recovery: GPU node failure → router skips the unhealthy node → request is retried on another node
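The scale math in the framework can be checked with a few lines of arithmetic. This is a minimal sketch: the per-GPU prefill throughput (`ASSUMED_TOKENS_PER_GPU`) and the cache hit rate are placeholder assumptions for illustration, not measured figures, and should be replaced with numbers from real benchmarks.

```python
import math


def prefill_tokens_per_sec(requests_per_sec: int, prompt_tokens: int) -> int:
    """Input tokens that must be prefilled every second."""
    return requests_per_sec * prompt_tokens


def gpus_needed(tokens_per_sec: int, tokens_per_gpu_per_sec: int,
                cache_hit_rate: float = 0.0) -> int:
    """GPUs needed for prefill once prefix-cache hits are subtracted."""
    effective = tokens_per_sec * (1.0 - cache_hit_rate)
    return math.ceil(effective / tokens_per_gpu_per_sec)


# Numbers from the prompt: 100K requests/sec, 200K tokens each.
load = prefill_tokens_per_sec(100_000, 200_000)

# Hypothetical per-GPU prefill throughput, chosen only to make the math visible.
ASSUMED_TOKENS_PER_GPU = 10_000

print(load)                                                        # 20000000000
print(gpus_needed(load, ASSUMED_TOKENS_PER_GPU))                   # 2000000
print(gpus_needed(load, ASSUMED_TOKENS_PER_GPU, cache_hit_rate=0.5))  # 1000000
```

Under these assumed numbers the GPU count lands in the millions, which is a strong hint to go back to the clarification step and pin down the real average prompt length and prefix-cache hit rate before sizing the fleet.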
Common traps
- No prefill / decode split → 30–50% lower utilization
- KV cache without offload → OOM
- No prefix caching → recomputing identical prompts
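To make the prefix-caching trap concrete, here is a toy sketch of block-level prefix lookup using chained hashes. It is loosely modelled on the block-hash idea used by open-source serving engines, not on any specific library's implementation; `BLOCK`, `PrefixCache`, and the token values are all hypothetical, and a real cache would store KV tensors rather than just hashes.

```python
import hashlib

BLOCK = 4  # tokens per cache block; tiny on purpose for illustration


def block_hashes(tokens: list[int]) -> list[str]:
    """One chained hash per full block, so each hash identifies the whole prefix."""
    hashes, prev = [], ""
    full = len(tokens) - len(tokens) % BLOCK  # ignore the trailing partial block
    for i in range(0, full, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK]))
        prev = hashlib.sha256((prev + "|" + chunk).encode()).hexdigest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    """Tracks which prompt prefixes already have KV blocks computed."""

    def __init__(self) -> None:
        self._blocks: set[str] = set()

    def insert(self, tokens: list[int]) -> None:
        self._blocks.update(block_hashes(tokens))

    def cached_prefix_len(self, tokens: list[int]) -> int:
        """Number of leading tokens whose KV blocks can be reused."""
        reused = 0
        for h in block_hashes(tokens):
            if h not in self._blocks:
                break
            reused += BLOCK
        return reused


cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # e.g. a shared system prompt
print(cache.cached_prefix_len([1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102]))  # 8
print(cache.cached_prefix_len([9, 2, 3, 4, 5, 6, 7, 8]))                      # 0
```

Chaining each block's hash to the previous one means a single changed token early in the prompt invalidates every later block, which matches how attention over a prefix actually behaves.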
Pattern 2: 100M-Document RAG
Prompt
"Design a RAG system over 100M documents at 10 QPS with retrieval ≤ 100ms."
Framework
- Clarify: Average doc length? Update frequency? Multilingual?
- Data flow:
  - Indexing: Doc → chunker → embedding → vector DB
  - Query: Query → embedding → ANN → rerank → top-K → LLM context
- Key design:
  - Vector index: HNSW or IVF-PQ, hosted in Pinecone / Qdrant / Milvus
  - Sharding: 100M docs / 10 shards = 10M docs per shard
  - Rerank: top-100 → cross-encoder → top-10
  - Hybrid retrieval: BM25 + dense embedding fusion
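- Storage estimate: 100M vectors × 1024 dims × 2 bytes (float16) ≈ 200 GB of embeddings, or ≈ 400 GB at float32, assuming one vector per document; raw text at ~4 KB per document adds ≈ 400 GB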
- Updates: incremental indexing + periodic rebuild
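The storage and sharding numbers are easy to get wrong on a whiteboard, so it helps to write the arithmetic out. The sketch below assumes one embedding per document and about 4 KB of raw text per document; chunking a document into several vectors multiplies the embedding figure accordingly.

```python
GB = 10**9  # decimal gigabytes


def embedding_bytes(num_vectors: int, dims: int, bytes_per_value: int) -> int:
    """Raw size of the vectors alone, excluding index overhead."""
    return num_vectors * dims * bytes_per_value


NUM_DOCS = 100_000_000
DIMS = 1024
RAW_BYTES_PER_DOC = 4_000  # assumed average document size

fp16 = embedding_bytes(NUM_DOCS, DIMS, 2)  # 2 KB per vector
fp32 = embedding_bytes(NUM_DOCS, DIMS, 4)  # 4 KB per vector
raw = NUM_DOCS * RAW_BYTES_PER_DOC

print(fp16 / GB)       # 204.8
print(fp32 / GB)       # 409.6
print(raw / GB)        # 400.0
print(NUM_DOCS // 10)  # 10000000 docs per shard with 10 shards
```

Graph indexes such as HNSW add per-vector link overhead on top of these figures, so the numbers here are a lower bound on memory.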
Common traps
- Pure dense embedding without BM25 → entity / number recall drops
- No rerank → top-K recall loses 15–25 pp
- Index and query path not distributed → single-point bottleneck
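The article does not say how the BM25 and dense results should be fused. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the expected answer; the document IDs are made up and `k = 60` is the conventional default constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["d3", "d1", "d2"]   # lexical ranking (hypothetical)
dense_hits = ["d1", "d4", "d3"]  # embedding ranking (hypothetical)

print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# ['d1', 'd3', 'd4', 'd2']
```

RRF only needs ranks, not scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales. The fused list would then feed the cross-encoder rerank stage.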
Pattern 3: Tool-Calling Agent
Prompt
"Design an LLM agent with 5 tools (search / calculator / API / file IO / code execution) that's resumable, rollbackable, and auditable."
Framework
- Clarify: Single agent vs multi-agent? Concurrent tool calls?
- Data flow: User query → LLM → tool plan → execute → feedback to LLM → final answer
- Key design:
  - State machine: each step records (state_id, tool, input, output, status)
  - Checkpoint: write-ahead log around every tool call enables rollback
  - Sandbox: code execution in docker / wasm
  - Audit log: full trace of every tool call
  - Timeout / cancel: graceful exit on user abort or tool timeout
- Failure recovery:
  - Tool failure → feed error back to LLM for self-correction
  - LLM format error → retry with structured output schema
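The state machine and write-ahead log described in the framework can be sketched in a few dozen lines. This is an illustrative sketch only: `Step`, `StepLog`, `run_step`, and `resume_point` are hypothetical names, the in-memory list stands in for durable storage, and steps are assumed to run sequentially.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable


@dataclass
class Step:
    state_id: int
    tool: str
    input: Any
    output: Any = None
    status: str = "pending"  # pending -> ok | error


class StepLog:
    """Append-only log; doubles as the audit trail."""

    def __init__(self) -> None:
        self.records: list[str] = []

    def append(self, step: Step) -> None:
        # Inputs and outputs must be JSON-serializable in this sketch.
        self.records.append(json.dumps(asdict(step)))

    def latest(self) -> dict[int, dict]:
        """Replay the log, keeping the last record seen for each state_id."""
        state: dict[int, dict] = {}
        for line in self.records:
            rec = json.loads(line)
            state[rec["state_id"]] = rec
        return state


def run_step(log: StepLog, state_id: int, tool_name: str,
             tool: Callable[[Any], Any], tool_input: Any) -> Step:
    step = Step(state_id, tool_name, tool_input)
    log.append(step)  # write intent BEFORE the side effect
    try:
        step.output = tool(tool_input)
        step.status = "ok"
    except Exception as exc:
        step.output = repr(exc)  # this text is what gets fed back to the LLM
        step.status = "error"
    log.append(step)  # write the outcome AFTER the call
    return step


def resume_point(log: StepLog) -> int:
    """Next state_id to run after a crash: one past the last completed step."""
    done = [sid for sid, rec in log.latest().items() if rec["status"] == "ok"]
    return max(done, default=-1) + 1
```

A step whose last record is still `pending` after a restart is one that crashed mid-call, which is exactly the case where the design must decide between retrying and rolling back, depending on whether the tool is idempotent.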
Common traps
- No state machine → no recovery
- Tools execute raw user SQL / shell → security hole
- No audit log → cannot trace model behavior
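The "retry with structured output schema" recovery path from the framework can also be sketched. The example below validates a tool call against a hand-written schema and feeds the validation error back into the prompt; the `TOOLS` table, the JSON shape, and the `llm` callable are assumptions for illustration, not a real model API.

```python
import json
from typing import Callable

# Hypothetical schema: tool name -> required argument names.
TOOLS = {"search": {"query"}, "calculator": {"expression"}}


def parse_tool_call(raw: str) -> dict:
    """Validate model output; raise ValueError with a message the model can act on."""
    call = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    args = call.get("args")
    if not isinstance(args, dict):
        raise ValueError("'args' must be a JSON object")
    missing = TOOLS[tool] - set(args)
    if missing:
        raise ValueError(f"missing args: {sorted(missing)}")
    return call


def call_with_retry(llm: Callable[[str], str], prompt: str,
                    max_attempts: int = 3) -> dict:
    """Ask the model for a tool call, re-prompting with the error on bad output."""
    for _ in range(max_attempts):
        raw = llm(prompt)
        try:
            return parse_tool_call(raw)
        except ValueError as exc:
            prompt = f"{prompt}\nPrevious output was invalid ({exc}). Reply with valid JSON only."
    raise RuntimeError("no valid tool call after retries")
```

Validating the shape of a call is separate from making it safe: arguments that pass the schema still need to run inside the sandbox and never be interpolated into raw SQL or shell.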
Pattern 4: Model Evaluation Pipeline
Prompt
"Design an evaluation pipeline running 10 benchmarks × 1000 questions daily and powering a dashboard."
Framework
- Clarify: Eval metric? Checkpoint cadence? Compute budget?
- Data flow: Cron → pull latest checkpoint → run benchmarks concurrently → store results → dashboard
- Key design:
  - Benchmark batching: 10 benchmarks × 1,000 questions = 10K prompts per run, sent through batched inference
  - Storage: S3 (raw) + Postgres (aggregates) + ClickHouse (analytics)
  - Dashboard: Grafana / internal BI
  - Regression alert: page on-call when accuracy drops ≥ 1 pp versus the previous checkpoint
- Extensibility: new benchmark via config, no code change
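The regression alert rule above reduces to a comparison between two checkpoints' scores. A minimal sketch, assuming accuracies are stored as percentages keyed by benchmark name (the names and values below are invented):

```python
def regressions(prev: dict[str, float], curr: dict[str, float],
                threshold_pp: float = 1.0) -> dict[str, float]:
    """Benchmarks whose accuracy fell by at least threshold_pp percentage points."""
    drops: dict[str, float] = {}
    for name, prev_acc in prev.items():
        if name not in curr:
            continue  # benchmark added or removed; handle separately
        drop = prev_acc - curr[name]
        if drop >= threshold_pp:
            drops[name] = round(drop, 2)
    return drops


previous = {"bench_a": 71.0, "bench_b": 55.5}  # hypothetical scores (%)
current = {"bench_a": 69.5, "bench_b": 55.0}

print(regressions(previous, current))  # {'bench_a': 1.5}
```

With only 1,000 questions per benchmark, a 1 pp move is 10 questions, so a production version would likely want a noise estimate (for example from repeated runs) before paging anyone.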
Common traps
- Serial benchmark execution → 4h becomes 8h
- Not storing raw output → cannot debug regressions
- No regression monitor → silent model decay
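Avoiding the serial-execution trap mostly means fanning benchmarks out to workers. The sketch below uses a thread pool, which fits if each question is a network call to an inference endpoint (an assumption); `run_benchmark`, `run_all`, and the toy model are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Question = tuple[str, str]  # (prompt, expected answer)


def run_benchmark(name: str, questions: list[Question],
                  model: Callable[[str], str]) -> tuple[str, float]:
    """Score one benchmark as percent exact-match accuracy."""
    correct = sum(1 for prompt, expected in questions if model(prompt) == expected)
    return name, 100.0 * correct / len(questions)


def run_all(benchmarks: dict[str, list[Question]],
            model: Callable[[str], str], max_workers: int = 10) -> dict[str, float]:
    """Run every benchmark concurrently and collect accuracy by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_benchmark, name, qs, model)
                   for name, qs in benchmarks.items()]
        return dict(f.result() for f in futures)


# Toy data and a lookup-table "model" so the example runs offline.
benchmarks = {
    "arith": [("1+1", "2"), ("2+2", "4")],
    "caps": [("a", "A"), ("b", "X")],
}
answers = {"1+1": "2", "2+2": "4", "a": "A", "b": "B"}

print(run_all(benchmarks, answers.__getitem__))
# {'arith': 100.0, 'caps': 50.0}
```

In a real pipeline `run_benchmark` would also persist each raw model output before scoring, so that a regression can be debugged question by question.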
VO Assist Playbook
What oavoservice VO assist gives you
- Four-pattern whiteboard scripts: long-context serving / RAG / agent / eval pipeline with scale math and trade-offs
- Follow-up drills: the mentor mimics Anthropic's long-form follow-ups, repeatedly asking "why design it this way?"
- Safety dimension: each problem layered with adversarial input, prompt injection, and tool sandbox analysis
- Loop continuity: BQ + Constitution + manager round under the same mentor
What's hard about Anthropic system design
Anthropic interviewers explicitly score safety + auditability. We've seen candidates with strong RAG perf get a weak-signal mark for not addressing prompt injection. VO assist adds a safety dimension layer to every problem.
Add WeChat Coding0201 for pricing and scope.
FAQ
Should I draw diagrams?
Strongly recommended. Excalidraw is open by default, and a discussion without diagrams derails easily.
Is 60 minutes enough for one problem?
Yes, but cap clarification at 5 minutes. Anthropic prompts deliberately omit scale numbers — without proactive clarification, the design will drift.
Overlap with OpenAI / Mistral?
LLM serving + RAG overlap ~80%. Agent design and eval pipelines are more frequent at Anthropic.
Can I pass without LLM serving experience?
Hard but possible. We recommend spending a month running vLLM / SGLang yourself and deploying a RAG demo to internalize the concepts.