Behind Duolingo's language-learning product sits an engine driven by learning science plus machine learning: which question to surface, how soon a user will forget, and how to retain memory with the least practice are all decided by models. The AI Research Engineer role lives between Research Scientist and ML Engineer—you must read papers and design experiments, yet also turn models into shippable production code.
This track's virtual onsite is nothing like a standard SDE loop: coding is only one round, while the other three test ML depth, research-design intuition, and the systems skill to embed a model in the product. This article lays out the four-round VO skeletons, the high-frequency follow-ups, and a prep path, based on real feedback from this track.
1. Duolingo AI Research Engineer VO Overview
| Dimension | Details |
|---|---|
| Rounds | 4 VO rounds (coding / ML depth / research design / applied system) |
| Per round | 45-60 minutes |
| Platform | Video + shared editor (CoderPad-style for coding) |
| Language | Primarily Python; whiteboard derivations in the ML round |
| Focus | Engineering + modeling depth + experiment thinking + product delivery |
| Flow | Recruiter → tech phone screen → four VO rounds → team match |
Key insight: the most common failure on this track is not the algorithm, but "can explain the paper yet writes messy code" or "can call the library yet cannot justify the modeling choice." Each round has a distinct focus, and a clear weakness in any single round can be a veto.
2. Round 1: Coding Implementation (Sequence Sampling)
Problem Statement
Given a vocabulary of length n where each word has weight weights[i] (frequency), implement a sampler that draws k distinct words without replacement, proportional to the weights in expectation. After one build, it must support efficient repeated sampling.
Approach
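The classic approach to weighted sampling without replacement is A-Res (the Efraimidis-Spirakis reservoir algorithm): generate key = u^(1/w) for each element (u uniform in (0,1)) and keep the k largest keys. Each successive pick is then proportional to the weights of the items still remaining. This reduces sampling to a top-k selection, doable in a single pass with a size-k min-heap in O(n log k) time.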
Python Solution
import heapq
import random

class WeightedSampler:
    def __init__(self, weights):
        self.weights = weights

    def sample(self, k):
        # A-Res: key = u^(1/w), keep the largest k
        if k <= 0:
            return []  # guard: heap[0] below would fail on an empty heap
        heap = []  # min-heap holding the current top-k (key, idx)
        for i, w in enumerate(self.weights):
            if w <= 0:
                continue  # non-positive weights are never sampled
            u = random.random()
            key = u ** (1.0 / w)
            if len(heap) < k:
                heapq.heappush(heap, (key, i))
            elif key > heap[0][0]:
                heapq.heapreplace(heap, (key, i))
        return [i for _, i in heap]
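Time complexity: O(n log k) per call.
Space complexity: O(k) for the heap, in addition to the stored weights.
Follow-up: What if weights update dynamically? Answer: keep the weights in a Fenwick tree of prefix sums and locate each draw by binary search on the tree, giving O(log n) per update and per draw. To stay without replacement, zero out a drawn word's weight for the rest of the call and restore it afterward, so k draws cost O(k log n).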
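The dynamic-update follow-up can be sketched as below. This is a minimal illustration, not a reference implementation: the class name `FenwickSampler` and its methods are hypothetical, and floating-point drift in the tree is only guarded against, not eliminated (integer-valued weights stay exact).

```python
import random


class FenwickSampler:
    """Weighted sampling with O(log n) weight updates (hypothetical helper)."""

    def __init__(self, weights):
        self.n = len(weights)
        self.weights = [0.0] * self.n
        self.tree = [0.0] * (self.n + 1)  # 1-indexed Fenwick tree of partial sums
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, w):
        """Set weights[i] = w in O(log n)."""
        delta = w - self.weights[i]
        self.weights[i] = w
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & -j

    def total(self):
        """Sum of all current weights in O(log n)."""
        s, j = 0.0, self.n
        while j > 0:
            s += self.tree[j]
            j -= j & -j
        return s

    def _locate(self, target):
        """Return the first index whose prefix sum exceeds target."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= target:
                pos = nxt
                target -= self.tree[nxt]
            step >>= 1
        return min(pos, self.n - 1)  # clamp guards against float rounding

    def sample(self, k):
        """Draw up to k distinct indices, then restore the weights."""
        picked = []
        for _ in range(k):
            total = self.total()
            if total <= 0:
                break  # pool exhausted
            i = self._locate(random.random() * total)
            if self.weights[i] <= 0.0:
                continue  # float-rounding edge case: skip rather than duplicate
            picked.append((i, self.weights[i]))
            self.update(i, 0.0)  # remove from the pool for the rest of this call
        for i, w in picked:
            self.update(i, w)  # restore for the next call
        return [i for i, _ in picked]
```

Compared with A-Res, this trades the O(n log k) scan per call for O(k log n) per call after an O(n log n) build, which is the better deal when k is much smaller than n and weights change between calls.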
3. Round 2: ML Depth (Spaced-Repetition Modeling)
Scenario
Duolingo's core is an SRS (Spaced Repetition System): predict the probability a user still recalls a word after interval t, to decide the next review time. The interviewer asks you to design this "memory model" from scratch.
Modeling Approach
The classic baseline is Half-Life Regression: model memory decay as an exponential
$$p = 2^{-t / h}, \quad h = \exp(\theta \cdot x)$$
where h is the half-life produced by a linear layer over features x (past correct count, error count, word difficulty, etc.); the exponential guarantees h > 0. The full HLR loss fits both the recall probability p and the half-life h; the simplified version here keeps only the squared error on p plus an L2 penalty.
import numpy as np

def hlr_loss(theta, X, t, recalled, alpha=0.01):
    # X: (N, d) features; t: (N,) intervals; recalled: (N,) 0/1
    h = np.exp(X @ theta)  # half-life, always positive
    p = np.power(2.0, -t / h)  # predicted recall probability
    p = np.clip(p, 1e-6, 1 - 1e-6)
    # main loss: squared error on recall probability + L2
    loss = np.mean((p - recalled) ** 2) + alpha * np.sum(theta ** 2)
    return loss
Why not a plain classifier: the interviewer wants to hear that you understand the half-life h is an interpretable quantity that directly drives scheduling—a black-box classifier cannot output "how long until the next review."
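To make the scheduling link concrete: solving p = 2^(-t/h) for t gives t = -h · log2(p), so a half-life converts directly into a review interval once a target recall is chosen. The sketch below assumes a target of 0.9, which is an illustrative value and not a figure from Duolingo; the function name is hypothetical.

```python
import math


def next_review_interval(half_life, target_recall=0.9):
    """Interval t at which predicted recall 2**(-t / half_life) drops to target_recall.

    target_recall=0.9 is an assumed threshold chosen for illustration.
    """
    if not 0.0 < target_recall < 1.0:
        raise ValueError("target_recall must be in (0, 1)")
    return -half_life * math.log2(target_recall)
```

With a target of 0.5 the interval equals the half-life itself, which is a quick sanity check to state out loud in the interview.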
High-frequency follow-ups:
- Cold start (new user/new word)? → Start with word-level priors plus a population mean.
- How to evaluate? → Not accuracy, but calibration of the recall probability plus MAE on p.
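The evaluation answer can be sketched as follows: MAE between predicted recall and the observed 0/1 outcome, plus a binned calibration gap (expected calibration error). The function name, the equal-width binning, and the default of 10 bins are assumptions made for illustration.

```python
def calibration_report(p_pred, recalled, n_bins=10):
    """Return (mae, ece) for predicted recall probabilities vs. 0/1 outcomes."""
    n = len(p_pred)
    mae = sum(abs(p - y) for p, y in zip(p_pred, recalled)) / n

    # Per-bin running totals: [sum of predictions, sum of outcomes, count]
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(p_pred, recalled):
        b = min(int(p * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        bins[b][0] += p
        bins[b][1] += y
        bins[b][2] += 1

    # ECE = sum over bins of (count / n) * |mean prediction - mean outcome|,
    # which simplifies to |sum of predictions - sum of outcomes| / n per bin.
    ece = sum(abs(sp - sy) / n for sp, sy, count in bins if count)
    return mae, ece
```

A model that predicts 0.5 for everything can score a low calibration error while being useless for scheduling, so the two numbers are worth reporting together.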
4. Round 3: Research Design and Experiments (A/B + Causal)
Scenario
"We want to verify whether shipping a new review-scheduling algorithm improves 7-day retention (D7). Design an experiment."
Breakdown Framework
- Metrics: primary is D7 retention; guardrails are daily lessons and review load (avoid buying retention by piling on practice).
- Unit and assignment: randomize by user (not by session, to avoid contamination across a user's sessions); run an A/A test first to validate the bucketing.
- Sample size: back out N per arm from baseline retention and the minimum detectable effect (MDE), fixing power=0.8, α=0.05.
- Novelty effect: scheduling changes carry a novelty effect—observe for ≥2 weeks rather than judging day one.
- Causal trap: retention is a survivorship-bias minefield—looking only at active users overstates the effect; use intention-to-treat (ITT) from the assignment point.
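from math import ceil, sqrt

def required_n_per_arm(p0, mde, z_alpha=1.96, z_beta=0.84):
    # approximate sample size per arm for a two-sided two-proportion test
    # z_alpha=1.96 -> alpha=0.05 (two-sided); z_beta=0.84 -> power=0.8
    p1 = p0 + mde
    p_bar = (p0 + p1) / 2
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return num / (mde ** 2)

# baseline D7=40%, want to detect +2 percentage points
# round up: truncating would leave the test slightly underpowered
print(ceil(required_n_per_arm(0.40, 0.02)))  # roughly 9,500 users per arm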
What the interviewer is really watching: whether you proactively raise guardrail metrics and the novelty effect—just computing sample size is not enough.
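Once the experiment has run, the intention-to-treat comparison might be analyzed with a pooled two-proportion z-test, sketched below. The function name and the example counts are hypothetical; the key point is that the denominators are all users assigned to each arm, not only those still active.

```python
from math import erfc, sqrt


def two_proportion_ztest(x_ctrl, n_ctrl, x_treat, n_treat):
    """Return (lift, z, two-sided p-value) for treatment minus control.

    x_* = users retained at day 7; n_* = ALL users assigned to the arm (ITT).
    """
    p_c = x_ctrl / n_ctrl
    p_t = x_treat / n_treat
    pooled = (x_ctrl + x_treat) / (n_ctrl + n_treat)
    se = sqrt(pooled * (1 - pooled) * (1 / n_ctrl + 1 / n_treat))
    z = (p_t - p_c) / se
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return p_t - p_c, z, p_value


# Hypothetical counts: 10,000 users assigned per arm
lift, z, p_value = two_proportion_ztest(4000, 10000, 4200, 10000)
print(f"lift={lift:.3f} z={z:.2f} p={p_value:.4f}")
```

The same function can be run on a guardrail such as the share of users exceeding a review-load threshold, so that a retention win is not accepted while a guardrail regresses.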
5. Round 4: Applied System Design (Online Inference Pipeline)
Scenario
"Deploy the memory model above as a live service: when each user opens the app, return the 20 words to review today within 50ms. Tens of millions of daily actives."
Design Points
| Layer | Approach | Rationale |
|---|---|---|
| Features | Offline batch + light online features | Precompute historical stats offline; assemble only real-time features per request |
| Model serving | Light model (linear/small tree), vectorized batching | A 50ms SLA forbids a heavy model scoring word by word |
| Scheduling | Maintain each word's "next due time"; only due words enter the candidate pool | Turn "prediction" into "priority-queue top-k" |
| Storage | User-word state in a KV store (half-life, last review time) | Reads/writes are per-user, naturally sharded |
| Offline feedback | Nightly retrain on the day's feedback and update half-lives | Closed loop: review result → update h → influences tomorrow's schedule |
Core trade-off: push heavy computation offline and let the online path only "fetch due words + lightly rank," so it can sustain 50ms across tens of millions of daily actives.
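A minimal sketch of the online path under these assumptions: the per-user state has already been fetched from the KV store as a dict of `word_id -> (half_life, last_review_time)`, a word counts as due once its predicted recall falls to an assumed threshold of 0.9, and the least-remembered words are served first. All names and the threshold are hypothetical.

```python
import heapq

TARGET_RECALL = 0.9  # assumed "due" threshold, for illustration only


def predicted_recall(half_life, last_review, now):
    """Recall probability p = 2^(-t / h) with t = now - last_review."""
    return 2.0 ** (-(now - last_review) / half_life)


def words_to_review(user_state, now, limit=20):
    """Pick up to `limit` due words, lowest predicted recall first.

    user_state: dict word_id -> (half_life, last_review_time), where both
    values share the time unit of `now` and half_life is positive.
    """
    candidates = []
    for word, (half_life, last_review) in user_state.items():
        p = predicted_recall(half_life, last_review, now)
        if p <= TARGET_RECALL:  # only due words enter the candidate pool
            candidates.append((p, word))
    # Top-k selection instead of a full sort: O(m log limit) for m due words
    return [word for _, word in heapq.nsmallest(limit, candidates)]
```

Nothing here calls a model at request time: the half-lives were written by the offline job, so the request path is one KV read plus arithmetic over a single user's words.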
6. Four-Round Prep Checklist
| Round | Focus | Resources |
|---|---|---|
| Coding | Weighted sampling, streaming top-k, reservoir sampling | LeetCode + probabilistic algorithms |
| ML depth | Memory models, calibration, cold start, interpretability | Duolingo HLR paper + recommender course |
| Research design | Full A/B workflow, guardrails, novelty, ITT | Trustworthy Online Controlled Experiments (Kohavi, Tang, Xu) |
| System | Online inference, feature pipeline, scheduling queue | ML system design material |
FAQ
Q1: How does the AI Research Engineer VO differ from a standard ML Engineer loop?
The biggest difference is the weight on the research-design round and the ML deep-dive round. ML Engineer leans toward engineering delivery; Research Engineer must design experiments from scratch, justify modeling choices, and read papers. Coding is only one round, and it skews probabilistic/sampling rather than pure LeetCode.
Q2: Can I apply without an NLP/education background?
Yes. Duolingo values "modeling intuition plus experiment thinking," and memory models, recommendation, and ranking transfer well. To prepare, take one past ML project and be ready to answer three layers: why you modeled it this way, how you evaluated it, and how you validated it after launch.
Q3: What LeetCode level is the coding round?
The skeleton is around Medium, but it skews probabilistic and streaming (weighted sampling, reservoir, top-k) rather than standard DP/graph. The point is clean code plus answering complexity and dynamic-update follow-ups.
Q4: Will the ML round make me derive formulas by hand?
Yes. Half-life regression, the loss function, and calibration metrics may all be whiteboarded. You need not memorize formulas, but you must explain why exponential decay, why h must stay positive, and why accuracy is the wrong evaluation metric.
Q5: How long until results after the VO?
The recruiter usually responds within a week of the four rounds, followed by team matching. There are fewer research teams, so matching can add another 1-2 weeks; the overall pace is slower than a pure SDE loop.