Anthropic System Design Interview 2026｜LLM Serving + RAG + Tool-Calling Agent VO Assist Walkthrough

Anthropic's system design interview is nothing like the traditional FAANG loop: instead of distributed KV stores or URL shorteners, the surface is LLM serving, RAG, tool-calling agents, and model evaluation pipelines. This article walks through the four canonical question patterns from spring 2026 candidate feedback, with whiteboard scripts and a VO assist playbook.

Anthropic System Design Snapshot

Dimension	Detail
Duration	60 minutes
Format	Excalidraw / physical whiteboard
Pacing	5-min clarify + 40-min design + 15-min follow-up
Signals	Scale + correctness + safety + extensibility
Mandatory tracks	LLM serving / RAG / agent / eval

Pattern 1: Long-Context Inference Serving

Prompt

"Design Claude's 200K context inference serving stack supporting 100K QPS, p95 latency ≤ 2s, with cost under control."

Framework

Clarify: Does 100K QPS mean requests per second or input tokens per second? Average prompt length? Streaming vs non-streaming?
Data flow: Client → LB → Tokenizer → Prefill GPU pool → Decode GPU pool → Streaming response
Key design:
- Prefill / Decode separation: prefill is compute-bound; decode is memory-bandwidth-bound
- Continuous batching: vLLM / SGLang-style dynamic batching
- KV cache management: long context → PagedAttention-style paged allocation, plus CPU offload for cold blocks
- Prefix caching: shared KV cache across same prompt prefixes (Anthropic's official prompt caching)
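Scale math (worst case, every request at full context): 100K requests/s × 200K tokens = 20B prompt tokens/s → estimate H100 node count from per-GPU prefill throughput
Failure recovery: GPU node failure → router skips the node → request is retried elsewhere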
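
The scale math above can be sketched in a few lines of Python. The per-GPU prefill throughput (`ASSUMED_PREFILL_TPS`) and the cache hit rate are hypothetical placeholders, not measured H100 figures; the point is the shape of the estimate, so substitute numbers from your own benchmarks.

```python
import math

# Hypothetical placeholder: prompt tokens one GPU can prefill per second.
# This is NOT a measured H100 number; replace it with benchmark data.
ASSUMED_PREFILL_TPS = 10_000


def gpus_for_prefill(requests_per_sec: float,
                     avg_prompt_tokens: float,
                     prefill_tokens_per_gpu_per_sec: float,
                     cache_hit_rate: float = 0.0) -> int:
    """GPUs needed for prefill after prefix-cache hits are subtracted."""
    tokens_per_sec = requests_per_sec * avg_prompt_tokens
    uncached = tokens_per_sec * (1.0 - cache_hit_rate)
    return math.ceil(uncached / prefill_tokens_per_gpu_per_sec)


# Worst case from the prompt: 100K requests/s, each at 200K tokens.
total_tokens_per_sec = 100_000 * 200_000  # 20 billion prompt tokens/s

no_cache = gpus_for_prefill(100_000, 200_000, ASSUMED_PREFILL_TPS)
half_cached = gpus_for_prefill(100_000, 200_000, ASSUMED_PREFILL_TPS, 0.5)
print(total_tokens_per_sec, no_cache, half_cached)
```

Whatever throughput figure is plugged in, the worst-case GPU count is far too large to be practical, which is the cue to go back to clarification: the real average prompt length and the prefix-cache hit rate dominate the answer.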

Common traps

No prefill / decode split → 30–50% lower utilization
KV cache without offload → OOM
No prefix caching → recomputing identical prompts
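
To make the prefix-caching point concrete, here is a toy sketch of block-level prefix reuse. It is an illustration only, not how vLLM, SGLang, or Anthropic implement it: the `PrefixCache` class, the block size, and the string placeholder standing in for a real KV block handle are all assumptions made for this example.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK = 4  # tokens per cache block; tiny on purpose for illustration


class PrefixCache:
    """Toy prefix cache keyed by a chained hash of token blocks."""

    def __init__(self) -> None:
        # chained hash -> placeholder for a real KV-cache block handle
        self._blocks: Dict[str, str] = {}

    @staticmethod
    def _chain(prev: str, block: Tuple[int, ...]) -> str:
        # The key depends on every earlier block, so a hit implies
        # that the whole prefix up to this block matches.
        payload = prev + ":" + ",".join(map(str, block))
        return hashlib.sha256(payload.encode()).hexdigest()

    def lookup_and_fill(self, tokens: List[int]) -> int:
        """Return how many leading tokens were cached; cache the rest."""
        prev, hits = "", 0
        full = len(tokens) - len(tokens) % BLOCK  # only whole blocks are cached
        for i in range(0, full, BLOCK):
            key = self._chain(prev, tuple(tokens[i:i + BLOCK]))
            if key in self._blocks:
                hits += BLOCK
            else:
                self._blocks[key] = "kv-block"
            prev = key
        return hits


cache = PrefixCache()
system_prompt = list(range(8))  # shared prefix, two full blocks
first = cache.lookup_and_fill(system_prompt + [100, 101, 102, 103])
second = cache.lookup_and_fill(system_prompt + [200, 201, 202, 203])
print(first, second)
```

The second request shares the 8-token system prompt, so only its final block needs prefill. In a real stack the saved work is the attention computation for those cached blocks.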

Pattern 2: 100M-Document RAG

Prompt

"Design a RAG system over 100M documents at 10 QPS with retrieval ≤ 100ms."

Framework

Clarify: Average doc length? Update frequency? Multilingual?
Data flow:
- Indexing: Doc → chunker → embedding → vector DB
- Query: Query → embedding → ANN → rerank → top-K → LLM context
Key design:
- Vector DB: HNSW / IVF-PQ; Pinecone / Qdrant / Milvus
- Sharding: 100M vectors across 10 shards = 10M per shard
- Rerank: top-100 → cross-encoder → top-10
- Hybrid retrieval: BM25 + dense embedding fusion
Storage estimate: 100M docs × 4 KB ≈ 400 GB raw text; each 1024-dim float16 vector is 2 KB, so embeddings are ≈ 200 GB at one vector per doc and ≈ 400 GB at two chunks per doc
Updates: incremental indexing + periodic rebuild
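
The storage estimate is easy to check on the whiteboard with explicit units. The sketch below assumes decimal gigabytes and a 4,000-byte average document; the chunks-per-document count is an assumption that the interviewer should be asked to confirm.

```python
GB = 10**9  # decimal gigabytes, matching the rough whiteboard estimate


def raw_text_bytes(num_docs: int, avg_doc_bytes: int) -> int:
    """Bytes needed to store the raw documents."""
    return num_docs * avg_doc_bytes


def embedding_bytes(num_docs: int, chunks_per_doc: int,
                    dims: int, bytes_per_value: int) -> int:
    """Bytes needed for the vectors alone (no index overhead)."""
    return num_docs * chunks_per_doc * dims * bytes_per_value


docs = 100_000_000
print(raw_text_bytes(docs, 4_000) / GB)          # raw text
print(embedding_bytes(docs, 1, 1024, 2) / GB)    # one float16 vector per doc
print(embedding_bytes(docs, 2, 1024, 2) / GB)    # two chunks per doc
```

An embedding footprint of about 400 GB corresponds to roughly two 1024-dim float16 vectors per document. HNSW graph links or IVF-PQ codebooks add overhead on top of these figures and are not modeled here.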

Common traps

Pure dense embedding without BM25 → entity / number recall drops
No rerank → top-K recall loses 15–25 pp
Index and query path not distributed → single-point bottleneck
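
The article does not say which fusion method to use for BM25 + dense retrieval. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than a prescribed answer. The document IDs are made up, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a doc's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; doc_id breaks ties deterministically.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["d3", "d1", "d2"]    # hypothetical lexical results
dense_hits = ["d1", "d4", "d3"]   # hypothetical embedding results
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

RRF only needs ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. The fused list would then feed the cross-encoder rerank stage.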

Pattern 3: Tool-Calling Agent

Prompt

"Design an LLM agent with 5 tools (search / calculator / API / file IO / code execution) that's resumable, rollbackable, and auditable."

Framework

Clarify: Single agent vs multi-agent? Concurrent tool calls?
Data flow: User query → LLM → tool plan → execute → feedback to LLM → final answer
Key design:
- State machine: each step records (state_id, tool, input, output, status)
- Checkpoint: write-ahead log around every tool call enables rollback
- Sandbox: code execution in docker / wasm
- Audit log: full trace of every tool call
- Timeout / cancel: graceful exit on user abort or tool timeout
Failure recovery:
- Tool failure → feed error back to LLM for self-correction
- LLM format error → retry with structured output schema
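
The state machine and write-ahead log can be sketched as follows. This is a minimal illustration under stated assumptions: an in-memory list stands in for durable storage, the `Step`, `StepLog`, and `run_plan` names are invented for the example, and the plan is a fixed list rather than something an LLM produces step by step.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    state_id: int
    tool: str
    input: str
    output: Optional[str] = None
    status: str = "PENDING"  # PENDING -> DONE | FAILED


class StepLog:
    """Append-only log; a list stands in for a durable store here."""

    def __init__(self) -> None:
        self.records: List[str] = []

    def append(self, step: Step) -> None:
        self.records.append(json.dumps(asdict(step)))

    def latest(self) -> Dict[int, Step]:
        """Replay the log; the last record per state_id wins."""
        state: Dict[int, Step] = {}
        for line in self.records:
            step = Step(**json.loads(line))
            state[step.state_id] = step
        return state


def run_plan(plan: List[Tuple[str, str]],
             tools: Dict[str, Callable[[str], str]],
             log: StepLog) -> Dict[int, Step]:
    done = log.latest()
    for state_id, (tool, arg) in enumerate(plan):
        prior = done.get(state_id)
        if prior is not None and prior.status == "DONE":
            continue  # resume: never re-run a completed step
        step = Step(state_id, tool, arg)
        log.append(step)  # write intent BEFORE the side effect
        try:
            step.output = tools[tool](arg)
            step.status = "DONE"
        except Exception as exc:  # tool failure becomes data for the LLM
            step.output = repr(exc)
            step.status = "FAILED"
        log.append(step)  # write the outcome AFTER the call
        if step.status == "FAILED":
            break
    return log.latest()
```

Calling `run_plan` a second time with the same log skips every step already marked DONE and retries from the failed one. The same records double as the audit trail, since every tool call leaves both an intent and an outcome entry. A step left in PENDING after a crash signals a call whose side effect is unknown and needs an idempotency check or compensation.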

Common traps

No state machine → no recovery
Tools execute raw user SQL / shell → security hole
No audit log → cannot trace model behavior
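
The second trap generalizes: a tool should never pass model- or user-supplied strings to `eval`, a shell, or a SQL engine unchecked. As one small illustration, a calculator tool can parse the expression and evaluate only an allowlist of node types. The `safe_calc` name and the chosen operator set are assumptions for this sketch.

```python
import ast
import operator

# Allowlist of arithmetic operators the calculator tool will evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_calc(expr: str) -> float:
    """Evaluate + - * / arithmetic; reject everything else."""

    def ev(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        # Calls, names, attributes, subscripts, etc. all land here.
        raise ValueError("disallowed expression")

    return ev(ast.parse(expr, mode="eval"))


print(safe_calc("2 * (3 + 4)"))
```

An input such as `__import__('os').system('ls')` parses to a call node and is rejected with `ValueError` before anything runs. Code execution tools still need a real sandbox (docker / wasm); parsing is a defense for narrow tools, not a substitute for isolation.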

Pattern 4: Model Evaluation Pipeline

Prompt

"Design an evaluation pipeline running 10 benchmarks × 1000 questions daily and powering a dashboard."

Framework

Clarify: Eval metric? Checkpoint cadence? Compute budget?
Data flow: Cron → pull latest checkpoint → run benchmarks concurrently → store results → dashboard
Key design:
- Benchmark batching: 10 × 1000 = 10K, batched inference
- Storage: S3 (raw) + Postgres (aggregates) + ClickHouse (analytics)
- Dashboard: Grafana / internal BI
- Regression alert: page on-call when accuracy drops ≥ 1 pp vs the previous checkpoint
Extensibility: new benchmark via config, no code change
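
The regression alert reduces to a comparison between two checkpoints. The sketch below works on integer correct-answer counts (1000 questions per benchmark, so 1 pp equals 10 questions), which avoids floating-point noise right at the threshold. The benchmark names and scores are invented for illustration.

```python
from typing import Dict


def find_regressions(prev_correct: Dict[str, int],
                     curr_correct: Dict[str, int],
                     total: int = 1000,
                     threshold_pp: float = 1.0) -> Dict[str, float]:
    """Return benchmarks whose accuracy fell by >= threshold_pp points."""
    alerts: Dict[str, float] = {}
    for bench, prev in prev_correct.items():
        if bench not in curr_correct:
            continue  # benchmark missing from this run; handle separately
        drop_pp = (prev - curr_correct[bench]) * 100 / total
        if drop_pp >= threshold_pp:
            alerts[bench] = drop_pp
    return alerts


# Hypothetical benchmark names and correct-answer counts out of 1000.
previous = {"bench_a": 820, "bench_b": 700, "bench_c": 500}
current = {"bench_a": 810, "bench_b": 695, "bench_c": 520}
print(find_regressions(previous, current))
```

With 1000 questions per benchmark, a 1 pp move can sit within sampling noise, so a good follow-up is whether to page on a single run or require the drop to persist across reruns.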

Common traps

Serial benchmark execution → wall-clock time balloons (e.g. a 4h run becomes 8h)
Not storing raw output → cannot debug regressions
No regression monitor → silent model decay
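
Avoiding the serial-execution trap can be as simple as fanning benchmarks out over a worker pool. The sketch below uses a thread pool because real inference calls are I/O-bound requests to a serving endpoint; the `model` callable and the two tiny benchmarks are stand-ins invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Question = Tuple[str, str]  # (prompt, expected answer)


def run_benchmark(name: str, questions: List[Question],
                  model: Callable[[str], str]) -> Tuple[str, float]:
    """Score one benchmark as exact-match accuracy."""
    correct = sum(1 for prompt, expected in questions if model(prompt) == expected)
    return name, correct / len(questions)


def run_all(benchmarks: Dict[str, List[Question]],
            model: Callable[[str], str],
            max_workers: int = 4) -> Dict[str, float]:
    """Run every benchmark concurrently and collect accuracy per name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_benchmark, name, qs, model)
                   for name, qs in benchmarks.items()]
        return dict(f.result() for f in futures)


def fake_model(prompt: str) -> str:
    """Stand-in for a call to the inference endpoint."""
    return prompt.upper()


suite = {
    "upper": [("a", "A"), ("b", "B")],
    "mixed": [("a", "A"), ("b", "x")],
}
print(run_all(suite, fake_model))
```

In a real pipeline the raw model outputs would also be written to object storage inside `run_benchmark`, so that a flagged regression can be debugged question by question.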

VO Assist Playbook

What oavoservice VO assist gives you

Four-pattern whiteboard scripts: long-context serving / RAG / agent / eval pipeline with scale math and trade-offs
Follow-up drills: the mentor mimics Anthropic's long-form follow-ups, asking "why design it this way?" repeatedly
Safety dimension: each problem layered with adversarial input, prompt injection, and tool sandbox analysis
Loop continuity: BQ + Constitution + manager round under the same mentor

What's hard about Anthropic system design

Anthropic interviewers explicitly score safety + auditability. We've seen candidates with strong RAG perf get a weak-signal mark for not addressing prompt injection. VO assist adds a safety dimension layer to every problem.

Add WeChat Coding0201 for pricing and scope.

FAQ

Should I draw diagrams?

Strongly recommended. Excalidraw is open by default, and a discussion without diagrams derails easily.

Is 60 minutes enough for one problem?

Yes, but cap clarification at 5 minutes. Anthropic prompts deliberately omit scale numbers, so without proactive clarification the design will drift.

How much overlap is there with OpenAI / Mistral?

LLM serving + RAG overlap ~80%. Agent design and eval pipelines are more frequent at Anthropic.

Can I pass without LLM serving experience?

Hard but possible. We recommend spending a month running vLLM / SGLang yourself and deploying a RAG demo to internalize the concepts.

Preparing for Anthropic / OpenAI / Mistral / xAI system design?

oavoservice tracks frontier AI lab system design surfaces. Mentors come from live LLM serving / RAG / agent teams and provide four-pattern whiteboard scripts, follow-up drills, safety dimension training, and full-loop continuity.

👉 Add WeChat: Coding0201 for the Anthropic system design bank + VO assist plan.

Contact

Email: [email protected]
Telegram: @OAVOProxy