Anthropic (Claude) runs a recruiting loop unlike that of any other frontier lab. The pairing of take-homes with a Constitutional AI values interview is unique to Anthropic, and the model behavior evaluation, system design, and long behavioral rounds all carry real weight. This article walks through the full loop, the signals each round tests for, and the OA assist / VO assist playbook.
Anthropic Loop Snapshot
| Round | Format | Duration | Focus |
|---|---|---|---|
| Recruiter Screen | Phone | 30 min | Background + safety values + expectations |
| OA (some roles) | CoderPad / take-home | 60–90 min | LLM inference / evaluation / RAG |
| Take-home | Async | 2–4 hours | Real business problem |
| Technical 1 | Video | 60 min | LLM engineering + model evaluation |
| Technical 2 | Video | 60 min | System design / RAG / tool calling |
| Values + BQ | Video | 60 min | Constitutional values + behavioral |
| Manager Round | Video | 30–60 min | Team fit + long-term direction |
Step 1: OA / Take-home
Surface
Not every role has an OA. Whether a Research, Research Engineer, Applied AI, or SWE candidate gets one is up to the hiring manager:
- Research Engineer: mostly take-homes (~3 hours implementing an LLM eval pipeline)
- Applied AI: a 60–90-minute CoderPad session (LLM output parsing / evaluation / RAG)
- SWE: occasionally a LeetCode-style OA, but take-homes are more common
Real Question 1: Evaluation Pipeline (Take-home)
"Given an LLM API endpoint and 100 math questions, design an evaluation pipeline that:
- Calls the API and robustly parses numeric answers
- Handles rate limits / retries
- Outputs accuracy, mis-answered questions, per-category breakdown"
Python Skeleton
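```python
import re
import time
from collections import defaultdict

def robust_parse_number(output):
    # Prefer an explicit "answer:" marker; fall back to the last number in the text.
    if not output:
        return None
    text = output.replace(',', '')  # strip thousands separators so "1,234" parses as 1234
    m = re.search(r'(?:final answer|answer)[:\s]*(-?\d+(?:\.\d+)?)', text, re.IGNORECASE)
    if m:
        return float(m.group(1))
    nums = re.findall(r'-?\d+(?:\.\d+)?', text)
    return float(nums[-1]) if nums else None

def evaluate(api_call, problems, max_retries=5):
    correct = 0
    wrong = []
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for p in problems:
        cat = p.get('category', 'uncategorized')  # assumes an optional 'category' key
        per_cat[cat][1] += 1
        for attempt in range(max_retries):
            try:
                resp = api_call(p['prompt'])
                break
            except Exception:
                # Exponential backoff for rate limits / transient errors; no sleep after the last try.
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)
        else:
            # Every attempt failed: count it as wrong instead of silently dropping it.
            wrong.append((p['id'], None, p['gt']))
            continue
        pred = robust_parse_number(resp)
        if pred is not None and abs(pred - p['gt']) < 1e-6:
            correct += 1
            per_cat[cat][0] += 1
        else:
            wrong.append((p['id'], resp, p['gt']))
    accuracy = correct / len(problems) if problems else 0.0
    breakdown = {c: ok / n for c, (ok, n) in per_cat.items()}
    return accuracy, wrong, breakdown
```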
Signals: robust parsing, retry design, readable code, unit-test coverage.
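Retry design and unit-test coverage are easier to show when the retry loop lives in its own helper. The sketch below is one way to do that, not a known Anthropic rubric item. `RateLimitError` and `call_with_retries` are hypothetical names; a real client library would raise its own exception type on HTTP 429. The helper retries only errors you list as transient, adds full jitter to the exponential backoff, and takes an injectable `sleep` so tests run instantly.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for whatever the API client raises on HTTP 429."""

def call_with_retries(fn, *args, max_retries=5, base_delay=1.0,
                      retry_on=(RateLimitError, TimeoutError), sleep=time.sleep):
    # Retry only transient errors; anything else (e.g. a bug in our own code) propagates.
    for attempt in range(max_retries):
        try:
            return fn(*args)
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last transient error
            # Exponential backoff with full jitter so parallel workers don't retry in lockstep.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

In a test, pass `sleep=delays.append` to record the backoff delays without waiting, and use a fake endpoint that fails a fixed number of times before succeeding.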
Real Question 2: Model Behavior Evaluation (CoderPad)
"Given an LLM output + a safety policy, determine compliance. Design the evaluator and explain your trade-offs."
Signal: whether you can translate policy nuance into testable code logic.
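A minimal sketch of one possible evaluator. It assumes the policy can be reduced to regex-detectable violations plus an over-refusal check, so it scores both safety and helpfulness. `Rule`, `REFUSAL_PATTERN`, and `evaluate_compliance` are hypothetical, and the sample rule is illustrative only, not taken from any real Anthropic policy. Regex rules are cheap, deterministic, and easy to unit-test, but they are brittle on paraphrase. That brittleness is the trade-off to discuss against an LLM-as-judge evaluator.

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    pattern: str  # regex whose presence in the output signals a violation (illustrative encoding)

# Crude refusal detector; a real evaluator would need far broader coverage.
REFUSAL_PATTERN = re.compile(r"\b(I can't|I cannot|I won't|I'm unable to)\b", re.IGNORECASE)

def evaluate_compliance(output, rules, request_is_benign):
    # Safety side: which policy rules does the output trip?
    violations = [r.name for r in rules if re.search(r.pattern, output, re.IGNORECASE)]
    # Helpfulness side: refusing a request labeled benign is also a failure.
    over_refusal = request_is_benign and bool(REFUSAL_PATTERN.search(output))
    return {
        "compliant": not violations and not over_refusal,
        "violations": violations,
        "over_refusal": over_refusal,
    }
```

Returning the individual signals, rather than a single boolean, makes it easier to report false refusals and policy violations as separate metrics.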
Step 2: Technicals (LLM Engineering + System Design)
Technical 1 — LLM Engineering
- Numerically stable softmax + cross-entropy (pure numpy)
- KV cache with in-place writes
- Top-k / top-p sampling in numpy
- Evaluation metric trade-offs (exact match vs LLM-as-judge)
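The first three drills above can be sketched in pure NumPy as follows. This is one standard way to write them, not a reference solution from any interview. All function and class names are hypothetical. The sampler applies top-k first and then nucleus (top-p) filtering over the renormalized top-k mass, which is one common convention. The KV cache assumes a single layer with a fixed maximum length.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Subtract the max first so exp() cannot overflow; the result is mathematically unchanged.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def softmax(logits, axis=-1):
    return np.exp(log_softmax(logits, axis=axis))

def cross_entropy(logits, targets):
    # logits: (N, V) floats, targets: (N,) int class ids -> mean negative log-likelihood.
    # Staying in log space avoids log(0) when a probability underflows.
    logp = log_softmax(logits, axis=-1)
    return -logp[np.arange(len(targets)), targets].mean()

def sample_top_k_top_p(logits, k=50, p=0.9, temperature=1.0, rng=None):
    # Draw one token id from 1-D logits: keep the top-k tokens, then the smallest
    # prefix of those whose renormalized probability mass reaches p.
    if rng is None:
        rng = np.random.default_rng()
    probs = softmax(np.asarray(logits, dtype=np.float64) / temperature)
    order = np.argsort(probs)[::-1][:k]           # top-k ids, most likely first
    top = probs[order]
    cum = np.cumsum(top) / top.sum()
    keep = int(np.searchsorted(cum, p)) + 1        # include the first index where cum >= p
    ids, weights = order[:keep], top[:keep]
    return int(rng.choice(ids, p=weights / weights.sum()))

class KVCache:
    # Preallocated single-layer cache: new keys/values are written into a fixed
    # buffer in place instead of concatenating a fresh array at every decode step.
    def __init__(self, max_len, n_heads, head_dim, dtype=np.float32):
        self.k = np.zeros((max_len, n_heads, head_dim), dtype=dtype)
        self.v = np.zeros_like(self.k)
        self.length = 0

    def append(self, k_new, v_new):
        # k_new, v_new: (T, n_heads, head_dim) for T new tokens.
        t = k_new.shape[0]
        if self.length + t > self.k.shape[0]:
            raise ValueError("KV cache is full")
        self.k[self.length:self.length + t] = k_new
        self.v[self.length:self.length + t] = v_new
        self.length += t
        # Basic slices are views into the buffer, so no copy is made here.
        return self.k[:self.length], self.v[:self.length]
```

The max-subtraction trick is what makes `softmax(np.array([1000.0, 1000.0]))` return `[0.5, 0.5]` instead of `nan`. Computing cross-entropy from `log_softmax` rather than `np.log(softmax(...))` avoids the same failure on the log side.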
Technical 2 — System Design
- "Design a RAG system over 100M docs at 10 QPS"
- "Design a tool-calling agent: resumable, rollbackable, auditable"
- "Design Claude's long-context (200k-token) serving stack"
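For the tool-calling agent prompt, one common way to get "resumable, rollbackable, auditable" is an append-only journal plus a compensating undo action per step (the saga pattern). The sketch below illustrates that idea only. `Step`, `Journal`, `run`, and `rollback` are hypothetical names. The in-memory list stands in for durable storage, and step names are assumed to be unique.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Step:
    name: str
    do: Callable[[], Any]       # the tool call itself
    undo: Callable[[], None]    # compensating action used on rollback

@dataclass
class Journal:
    # Append-only audit log; a real system would persist this durably.
    entries: list = field(default_factory=list)

    def record(self, step, status, result=None):
        self.entries.append(json.dumps({"step": step, "status": status, "result": repr(result)}))

    def completed(self):
        # Replay the log to find steps that finished and were not undone.
        done = set()
        for e in self.entries:
            rec = json.loads(e)
            if rec["status"] == "done":
                done.add(rec["step"])
            elif rec["status"] == "undone":
                done.discard(rec["step"])
        return done

def run(steps, journal):
    done = journal.completed()
    for s in steps:
        if s.name in done:      # resume: skip steps that already finished
            continue
        journal.record(s.name, "started")
        journal.record(s.name, "done", s.do())

def rollback(steps, journal):
    done = journal.completed()
    for s in reversed(steps):   # undo in reverse execution order
        if s.name in done:
            s.undo()
            journal.record(s.name, "undone")
```

A step that crashes after "started" but before "done" is re-run on resume. So the tool calls need to be idempotent (or carry an idempotency key) for this scheme to be safe. That caveat is worth raising in the interview.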
Framework
- Clarify scale numbers: QPS, doc count, context length
- Draw the data flow: every step from user to model and back to user
- Label trade-offs: recall vs latency, retrain cadence vs data drift
- Estimate cost: number of H100 nodes, total GPU-hours
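The "clarify scale numbers" and "estimate cost" steps mostly come down to a few lines of arithmetic. The sketch below uses the 100M-doc, 10 QPS, and 200k-token figures from the prompts above. The embedding dimension, the model shape (80 layers, 8 KV heads, head dim 128, bf16), and the 2 s latency are hypothetical placeholders, not Claude's actual configuration.

```python
def embedding_index_bytes(n_docs, dim, bytes_per_value=2):
    # Raw vector storage only (fp16 by default); ANN index overhead and metadata are extra.
    return n_docs * dim * bytes_per_value

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Factor of 2 covers keys and values; bytes_per_value=2 assumes bf16/fp16.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def concurrent_requests(qps, latency_s):
    # Little's law: average in-flight requests = arrival rate x time in system.
    return qps * latency_s

GB = 10**9
# 100M docs with hypothetical 768-dim fp16 embeddings.
print(embedding_index_bytes(100_000_000, 768) / GB)          # 153.6
# Hypothetical 80-layer model, 8 KV heads of dim 128, one 200k-token context.
print(kv_cache_bytes_per_token(80, 8, 128) * 200_000 / GB)   # 65.536
# 10 QPS with an assumed 2 s end-to-end latency.
print(concurrent_requests(10, 2.0))                          # 20.0
```

Under these assumptions, a single 200k-token sequence's KV cache (~65.5 GB) would take up most of an 80 GB H100 before any weights are loaded. That is the kind of number that pushes a long-context design toward KV-cache sharding, quantization, or paging.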
Step 3: Constitution + BQ Round
Unique to Anthropic.
Surface
- "If Claude refuses a legitimate user request, how do you debug it?"
- "Safety vs helpfulness — how do you trade them off?"
- "Share a paper of yours. What finding made you most uncomfortable?"
Answer principles
- Authentic > polished: Anthropic favors candidates who can articulate their actual values
- Case-by-case > absolutes: avoid "I would never X" — prefer "in situation X I would Y"
- Acknowledge limits: know what you don't know
Anthropic Loop Timing
| Step | Median |
|---|---|
| Recruiter → first round | 5–10 days |
| Take-home → technicals | 1–2 weeks |
| Full loop | 4–8 weeks |
Community-reported pass rates: OA / take-home ~40%, full onsite ~15%, offer ~8%.
OA Assist + VO Assist Playbook
What oavoservice provides
- Take-home review: a mentor reviews your code against Anthropic's rubric (correctness + safety + readability)
- LLM engineering drills: one NumPy problem per day, covering numerical stability and KV cache
- System design scripts: RAG / agent / long-context serving whiteboard scripts
- Constitution mock: mentor role-plays interviewer, presses on safety vs helpfulness
What's hard about Anthropic loops
Anthropic interviewers don't follow the STAR format; they probe with long chains of follow-up questions. We've seen candidates ace the LLM engineering rounds and then wash out on the Constitution round after three rounds of "why do you think that?". VO assist drills this follow-up loop and iteratively rewrites your values answers.
Add WeChat Coding0201 for pricing and scope.
FAQ
Do all Anthropic roles include an OA?
No. Research / Research Engineer roles favor take-homes; Applied AI occasionally uses CoderPad; SWE relies more on resume screening plus a take-home.
Take-home language?
Python in roughly 85% of cases (Anthropic is Python-first internally). LLM API tools are allowed, but you must disclose their use in the write-up.
Constitution round prep time?
At least a week. Read Anthropic's Constitutional AI paper and Acceptable Use Policy, then run 10 scenario mocks with follow-ups.
Cooldown after no offer?
12 months. Applying to a different role (e.g., Research → Applied AI) typically goes through a separate pool.
Preparing for Anthropic / OpenAI / Mistral / Cohere?
oavoservice tracks frontier AI lab OA / take-home / VO surfaces. Mentors come from live LLM / Infra / RLHF teams and provide take-home review, LLM engineering drills, system design scripts, and Constitution round mocks.
👉 Add WeChat: Coding0201 for the Anthropic full-loop OA assist + VO assist plan.
Contact
Telegram: @OAVOProxy