The Ultimate Anthropic OA Guide｜Full Loop Walkthrough from OA to Offer, with OA Assist + VO Assist

Anthropic (Claude) runs a recruiting loop that differs from other frontier labs in two ways: it leans heavily on take-homes, and it includes a values interview built around Constitutional AI. The model behavior evaluation, system design, and long behavioral rounds all carry weight as well. This article walks through the full loop, the signals each round tests, and the OA assist / VO assist playbook.

Anthropic Loop Snapshot

Round	Format	Duration	Focus
Recruiter Screen	Phone	30 min	Background + safety values + expectations
OA (some roles)	Coderpad / take-home	60–90 min	LLM inference / evaluation / RAG
Take-home	Async	2–4 hours	Real business problem
Technical 1	Video	60 min	LLM engineering + model evaluation
Technical 2	Video	60 min	System design / RAG / tool calling
Values + BQ	Video	60 min	Constitutional values + behavioral
Manager Round	Video	30–60 min	Team fit + long-term direction

Step 1: OA / Take-home

Surface

Not every role has an OA. Whether a Research, Research Engineer, Applied AI, or SWE candidate gets one is up to the hiring manager. Typical patterns:

Research Engineer: mostly take-homes (~3 hours implementing an LLM eval pipeline)
Applied AI: 60–90-minute Coderpad (LLM output parsing / evaluation / RAG)
SWE: occasionally LeetCode-style OAs, but take-home is more common

Real Question 1: Evaluation Pipeline (Take-home)

"Given an LLM API endpoint and 100 math questions, design an evaluation pipeline that:

Calls the API and robustly parses numeric answers
Handles rate limits / retries
Outputs accuracy, mis-answered questions, per-category breakdown"

Python Skeleton

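import re
import time

# Signed number, optionally with thousands separators and a decimal part.
NUMBER = r'-?\d+(?:,\d{3})*(?:\.\d+)?'

def robust_parse_number(output):
    """Return the number after an 'answer:' marker, else the last number in the text."""
    m = re.search(r'(?:final answer|answer)[:\s]*(' + NUMBER + ')', output, re.IGNORECASE)
    if m:
        return float(m.group(1).replace(',', ''))
    nums = re.findall(NUMBER, output)
    return float(nums[-1].replace(',', '')) if nums else None

def call_with_retry(api_call, prompt, max_attempts=5):
    """Call the API with exponential backoff; return None if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return api_call(prompt)
        except Exception:
            if attempt < max_attempts - 1:
                time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s between attempts
    return None

def evaluate(api_call, problems):
    correct = 0
    wrong = []
    for p in problems:
        resp = call_with_retry(api_call, p['prompt'])
        # Parse outside the retry loop so a parsing bug never triggers another API call.
        pred = robust_parse_number(resp) if resp is not None else None
        if pred is not None and abs(pred - p['gt']) < 1e-6:
            correct += 1
        else:
            # Failed calls (resp is None) are recorded too, not silently dropped.
            wrong.append((p['id'], resp, p['gt']))
    accuracy = correct / len(problems) if problems else 0.0
    return accuracy, wrong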

Signals: robust parsing, retry design, readable code, unit-test coverage.
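
The question also asks for a per-category breakdown, which the skeleton does not produce. One way to add it is sketched below. It assumes each graded result is a dict with `category` and `correct` fields; both field names are hypothetical and not part of the original question.

```python
from collections import defaultdict

def per_category_accuracy(results):
    """Group graded results by category and compute accuracy per group.

    `results` is assumed to be an iterable of dicts with a hypothetical
    'category' string and a boolean 'correct' flag.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r['category']] += 1
        hits[r['category']] += int(r['correct'])
    return {cat: hits[cat] / totals[cat] for cat in sorted(totals)}

results = [
    {'category': 'algebra', 'correct': True},
    {'category': 'algebra', 'correct': False},
    {'category': 'geometry', 'correct': True},
]
print(per_category_accuracy(results))  # {'algebra': 0.5, 'geometry': 1.0}
```

Reporting the per-category counts next to the accuracies is worth considering as well, since a 100% score on a category with two questions says little.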

Real Question 2: Model Behavior Evaluation (Coderpad)

"Given an LLM output + a safety policy, determine compliance. Design the evaluator and explain your trade-offs."

Signal: whether you can translate policy nuance into testable code logic.
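
One possible shape for such an evaluator is a small rule engine with two tiers: hard rules that block outright and soft rules that escalate to review. The sketch below is illustrative only. The rule names, regex patterns, and verdict labels are all invented for this example and do not reflect any real Anthropic policy or rubric.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    name: str
    pattern: str  # regex that signals a possible violation
    hard: bool    # True: block outright; False: escalate to review

# Hypothetical rules for illustration; a real policy would be far richer.
RULES = [
    Rule('shares_password', r'\bpassword\s*(?:is|:)\s*\S+', hard=True),
    Rule('medical_dosage', r'\b\d+\s*mg\b', hard=False),
]

def check_compliance(output, rules=RULES):
    """Return (verdict, triggered_rule_names) for one model output."""
    triggered = [r for r in rules if re.search(r.pattern, output, re.IGNORECASE)]
    names = [r.name for r in triggered]
    if any(r.hard for r in triggered):
        return 'violation', names
    if triggered:
        return 'needs_review', names
    return 'compliant', names

print(check_compliance('The password is hunter2'))
print(check_compliance('Take 200 mg twice a day'))
print(check_compliance('Paris is the capital of France'))
```

The trade-off to explain: regex rules are cheap, deterministic, and auditable, but they miss paraphrases and flag benign text. The `needs_review` tier is one way to route ambiguous cases to a human or an LLM judge instead of forcing a binary call.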

Step 2: Technicals (LLM Engineering + System Design)

Technical 1 — LLM Engineering

Numerically stable softmax + cross-entropy (pure numpy)
KV cache in-place writes
Top-k / Top-p sampling in numpy
Evaluation metric trade-offs (exact match vs LLM-as-judge)
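
For drilling the first and third topics above, here is a minimal numpy sketch of a stable log-softmax, cross-entropy, and combined top-k / top-p filtering. Function names and default values (`k=50`, `p=0.9`) are choices made for this example, not a reference implementation from any interview.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    # Subtracting the max makes the largest exponent 0, so exp() cannot overflow.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(logits, targets):
    """Mean negative log-likelihood; logits is (batch, vocab), targets is (batch,)."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

def top_k_top_p_filter(logits, k=50, p=0.9):
    """Probabilities restricted to the top-k tokens and the top-p nucleus (1-D logits)."""
    probs = np.exp(log_softmax(logits))
    order = np.argsort(-probs)  # token ids, most probable first
    # Nucleus: smallest prefix whose cumulative mass reaches p.
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    keep = np.zeros(len(probs), dtype=bool)
    keep[order[:min(k, cutoff)]] = True
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()  # renormalize over the surviving tokens

def sample(logits, k=50, p=0.9, rng=None):
    """Draw one token id from the filtered distribution."""
    if rng is None:
        rng = np.random.default_rng()
    probs = top_k_top_p_filter(logits, k, p)
    return int(rng.choice(len(probs), p=probs))

# Logits of 1000 would overflow a naive exp(); the shifted version stays finite.
print(cross_entropy(np.array([[1000.0, 1000.0]]), np.array([0])))
```

A common follow-up is why the max is subtracted: softmax is invariant to adding a constant to every logit, so the shift changes nothing mathematically while keeping every exponent at or below zero.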

Technical 2 — System Design

"Design a RAG system over 100M docs at 10 QPS"
"Design a tool-calling agent: resumable, rollbackable, auditable"
"Design Claude's long-context (200k) serving stack"

Framework

Clarify scale numbers: QPS, doc count, context length
Draw the data flow: every step user → model → user
Label trade-offs: recall vs latency, retrain cadence vs data drift
Estimate cost: H100 nodes, GPU-hours
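
The first and last steps of the framework come down to back-of-envelope arithmetic. The sketch below applies them to the "100M docs at 10 QPS" prompt. The embedding dimension (1024), float32 storage, one vector per document, and the 8-GPU figure are all assumptions picked for illustration; none come from the prompt itself.

```python
def index_size_gb(num_vectors, dim, bytes_per_value=4):
    """Raw size of a dense vector index in decimal GB (no compression, no overhead)."""
    return num_vectors * dim * bytes_per_value / 1e9

def queries_per_day(qps):
    """Sustained queries per day at a constant rate."""
    return qps * 86_400

def gpu_hours_per_month(num_gpus, days=30):
    """GPU-hours consumed by an always-on fleet."""
    return num_gpus * 24 * days

# Assumed: one 1024-dim float32 embedding per document.
print(index_size_gb(100_000_000, 1024))  # 409.6
print(queries_per_day(10))               # 864000
# Assumed: a hypothetical 8-GPU serving node running around the clock.
print(gpu_hours_per_month(8))            # 5760
```

Stating numbers like these early makes the trade-off discussion concrete: roughly 400 GB of raw vectors does not fit in one machine's memory comfortably, which motivates sharding or quantization before any diagram is drawn.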

Step 3: Constitution + BQ Round

Unique to Anthropic.

Surface

"If Claude refuses a legitimate user request, how do you debug it?"
"Safety vs helpfulness — how do you trade them off?"
"Share a paper of yours. What finding made you most uncomfortable?"

Answer principles

Authentic > polished: Anthropic favors candidates who can articulate their actual values
Case-by-case > absolutes: avoid "I would never X" — prefer "in situation X I would Y"
Acknowledge limits: know what you don't know

Anthropic Loop Timing

Step	Median
Recruiter → first round	5–10 days
Take-home → technicals	1–2 weeks
Full loop	4–8 weeks

Community-reported pass rates (approximate, not official figures): OA / take-home ~40%, full onsite ~15%, offer ~8%.

OA Assist + VO Assist Playbook

What oavoservice provides

Take-home review: mentor code-reviews along Anthropic's rubric (correctness + safety + readability)
LLM engineering drills: a daily numpy problem covering numerical stability + KV cache
System design scripts: RAG / agent / long-context serving whiteboard scripts
Constitution mock: mentor role-plays interviewer, presses on safety vs helpfulness

What's hard about Anthropic loops

Anthropic interviewers don't follow STAR. They use long follow-ups. We've seen candidates ace LLM engineering yet wash out after three rounds of "why do you think that?" on the Constitution round. VO assist drills the follow-up loop and rewrites value answers iteratively.

Add WeChat Coding0201 for pricing and scope.

FAQ

Do all Anthropic roles include an OA?

No. Research and Research Engineer roles favor take-homes; Applied AI sometimes uses a Coderpad session; SWE relies more on resume review plus a take-home.

Take-home language?

Python in roughly 85% of cases (Anthropic is Python-first internally). LLM API tools are allowed, but you must disclose them in the write-up.

Constitution round prep time?

At least a week. Read Anthropic's Constitutional AI paper and Acceptable Use Policy, then run 10 scenario mocks with follow-ups.

Cooldown after no offer?

12 months. Cross-role (Research → Applied AI) typically uses a separate pool.

Preparing for Anthropic / OpenAI / Mistral / Cohere?

oavoservice tracks frontier AI lab OA / take-home / VO surfaces. Mentors come from live LLM / Infra / RLHF teams and provide take-home review, LLM engineering drills, system design scripts, and Constitution round mocks.

👉 Add WeChat: Coding0201 for the Anthropic full-loop OA assist + VO assist plan.

Contact

Email: [email protected]
Telegram: @OAVOProxy