Duolingo AI Research Engineer Virtual Onsite Deep Dive: 4 VO Rounds + ML Depth + Research Design

2026-06-04

Behind Duolingo's language-learning product sits an engine driven by learning science plus machine learning: which question to surface, how soon a user will forget, and how to retain memory with the least practice are all decided by models. The AI Research Engineer role lives between Research Scientist and ML Engineer—you must read papers and design experiments, yet also turn models into shippable production code.

This track's virtual onsite is nothing like a standard SDE loop: coding is only one round, while the other three test ML depth, research-design intuition, and the systems skill to embed a model in the product. This article lays out the four-round VO skeletons, the high-frequency follow-ups, and a prep path, based on real feedback from this track.

1. Duolingo AI Research Engineer VO Overview

| Dimension | Details |
| --- | --- |
| Rounds | 4 VO rounds (coding / ML depth / research design / applied system) |
| Per round | 45-60 minutes |
| Platform | Video + shared editor (CoderPad-style for coding) |
| Language | Primarily Python; whiteboard derivations in the ML round |
| Focus | Engineering + modeling depth + experiment thinking + product delivery |
| Flow | Recruiter → tech phone screen → four VO rounds → team match |

Key insight: the most common failure on this track is not the algorithm, but "can explain the paper yet writes messy code" or "can call the library yet cannot justify the modeling choice." Each round has a distinct focus, and a clear weakness in any single round can be a veto.

2. Round 1: Coding Implementation (Sequence Sampling)

Problem Statement

Given a vocabulary of size n where word i has weight weights[i] (its frequency), implement a sampler that draws k distinct words without replacement, where each successive draw picks one of the remaining words with probability proportional to its weight. After a one-time build, it must support efficient repeated sampling.

Approach

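The classic algorithm for weighted sampling without replacement is A-Res (Efraimidis and Spirakis): generate key = u^(1/w) for each element (u uniform in (0,1)) and keep the k largest keys. This reduces the problem to a weighted top-k, done in a single pass over the n elements with a size-k min-heap, for O(n log k) time. (The "exponential jumps" variant, A-ExpJ, is a separate optimization that skips ahead instead of drawing a key for every element.)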

Python Solution

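import heapq
import math
import random

class WeightedSampler:
    def __init__(self, weights):
        self.weights = weights

    def sample(self, k):
        # A-Res: key = u^(1/w), keep the largest k.
        # We compare log(key) = log(u) / w instead: the ordering is the same,
        # but tiny weights no longer underflow every key to 0.0.
        if k <= 0:
            return []  # also avoids indexing an empty heap below
        heap = []  # min-heap holding the current top-k (log_key, idx)
        for i, w in enumerate(self.weights):
            if w <= 0:
                continue  # non-positive weights can never be drawn
            u = 1.0 - random.random()  # uniform in (0, 1], so log(u) is defined
            key = math.log(u) / w
            if len(heap) < k:
                heapq.heappush(heap, (key, i))
            elif key > heap[0][0]:
                heapq.heapreplace(heap, (key, i))
        # fewer than k indices come back if fewer than k weights are positive
        return [i for _, i in heap]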

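Time complexity: O(n log k) per sample call.

Space complexity: O(k).

Follow-up: What if weights update dynamically? Answer: keep the weights in a Fenwick tree over prefix sums and locate each draw by binary search on the tree, giving O(log n) per weight update and per single draw. For k draws without replacement, zero out each drawn weight before the next draw and restore the weights afterward, for O(k log n) per sample.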
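The Fenwick-tree answer to the dynamic-weights follow-up can be sketched roughly as below. This is a minimal illustration, not a reference implementation: the class name `FenwickSampler` and its method names are hypothetical, and the sketch assumes non-negative weights. Without replacement is handled by temporarily zeroing each drawn weight and restoring it at the end.

```python
import random


class FenwickSampler:
    """Weighted sampler backed by a Fenwick (binary indexed) tree.

    Hypothetical helper for illustration; assumes non-negative weights.
    """

    def __init__(self, weights):
        self.n = len(weights)
        self.weights = [0.0] * self.n
        self.tree = [0.0] * (self.n + 1)  # 1-indexed partial sums
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, w):
        """Set weights[i] = w in O(log n)."""
        delta = w - self.weights[i]
        self.weights[i] = w
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & -j

    def total(self):
        """Sum of all weights in O(log n)."""
        s, j = 0.0, self.n
        while j > 0:
            s += self.tree[j]
            j -= j & -j
        return s

    def find(self, target):
        """Index of the first element whose prefix sum exceeds target."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= target:
                pos = nxt
                target -= self.tree[nxt]
            step >>= 1
        return min(pos, self.n - 1)  # clamp guards against float rounding

    def sample(self, k, rng=random):
        """Draw up to k distinct indices, each draw proportional to weight."""
        drawn = []
        try:
            for _ in range(k):
                total = self.total()
                if total <= 0:
                    break  # nothing left with positive weight
                i = self.find(rng.random() * total)
                drawn.append((i, self.weights[i]))
                self.update(i, 0.0)  # remove from the pool for later draws
        finally:
            for i, w in drawn:
                self.update(i, w)  # restore so the sampler is reusable
        return [i for i, _ in drawn]
```

With float weights, repeated updates can accumulate rounding error in the tree, so a production version would likely rebuild the tree periodically or use integer weights.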

3. Round 2: ML Depth (Spaced-Repetition Modeling)

Scenario

Duolingo's core is an SRS (Spaced Repetition System): predict the probability a user still recalls a word after interval t, to decide the next review time. The interviewer asks you to design this "memory model" from scratch.

Modeling Approach

The classic baseline is Half-Life Regression: model memory decay as an exponential

$$p = 2^{-t / h}, \quad h = \exp(\theta \cdot x)$$

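where h is the half-life, computed as the exponential of a linear layer over features x (past correct count, error count, word difficulty, etc.), which guarantees h > 0. The published HLR loss fits both the recall probability p and the half-life h; the simplified version below keeps only the squared error on p plus an L2 penalty on θ.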

import numpy as np

def hlr_loss(theta, X, t, recalled, alpha=0.01):
    # X: (N, d) features; t: (N,) intervals; recalled: (N,) 0/1
    h = np.exp(X @ theta)            # half-life, always positive
    p = np.power(2.0, -t / h)        # predicted recall probability
    p = np.clip(p, 1e-6, 1 - 1e-6)
    # main loss: squared error on recall probability + L2
    loss = np.mean((p - recalled) ** 2) + alpha * np.sum(theta ** 2)
    return loss

Why not a plain classifier: the interviewer wants to hear that you understand the half-life h is an interpretable quantity that directly drives scheduling—a black-box classifier cannot output "how long until the next review."
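To make the "half-life drives scheduling" point concrete, here is a small sketch that inverts p = 2^(-t/h) to get the next review interval. The function names and the 0.9 target recall are assumptions chosen for illustration; the post does not state what threshold a real scheduler would use.

```python
import math


def half_life(theta, x):
    """h = exp(theta . x), matching the formula in this section."""
    return math.exp(sum(a * b for a, b in zip(theta, x)))


def recall_probability(t, h):
    """Predicted recall after interval t for half-life h: p = 2^(-t/h)."""
    return 2.0 ** (-t / h)


def next_review_interval(h, target_recall=0.9):
    """Solve 2^(-t/h) = target_recall for t.

    target_recall=0.9 is an illustrative assumption, not a documented value.
    """
    if not 0.0 < target_recall < 1.0:
        raise ValueError("target_recall must be strictly between 0 and 1")
    return -h * math.log2(target_recall)
```

The interval scales linearly with h: a word whose half-life doubles is scheduled twice as far out, which is the kind of direct, explainable link between model output and schedule that a bare recall classifier does not provide.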

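High-frequency follow-ups: the recurring themes on this track are calibration of the predicted recall probability, cold start, and interpretability, along with why decay is modeled as exponential, why h must stay positive, and why accuracy is the wrong evaluation metric.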

4. Round 3: Research Design and Experiments (A/B + Causal)

Scenario

"We want to verify whether shipping a new review-scheduling algorithm improves 7-day retention (D7). Design an experiment."

Breakdown Framework

  1. Metrics: primary is D7 retention; guardrails are daily lessons and review load (avoid buying retention by piling on practice).
  2. Unit and assignment: randomize by user (not by session, to avoid contamination across arms); run an A/A test first to validate the bucketing.
  3. Sample size: back out N per arm from baseline retention and the minimum detectable effect (MDE), fixing power=0.8, α=0.05.
  4. Novelty effect: scheduling changes carry a novelty effect—observe for ≥2 weeks rather than judging day one.
  5. Causal trap: retention is a survivorship-bias minefield. Looking only at still-active users overstates the effect; use intention-to-treat (ITT) from the assignment point.

from math import ceil, sqrt

def required_n_per_arm(p0, mde, z_alpha=1.96, z_beta=0.84):
    # approximate per-arm sample size for a two-sided two-proportion z-test
    # z_alpha=1.96 -> alpha=0.05 (two-sided); z_beta=0.84 -> power=0.8
    p1 = p0 + mde
    p_bar = (p0 + p1) / 2
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return ceil(num / (mde ** 2))  # round up to a whole number of users

# baseline D7=40%, want to detect +2 percentage points
print(required_n_per_arm(0.40, 0.02))

What the interviewer is really watching: whether you proactively raise guardrail metrics and the novelty effect—just computing sample size is not enough.
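Once the experiment has run, the ITT point from the framework can be sketched as a pooled two-proportion z-test whose denominators are all assigned users, not only those still active. The function name and the counts in the usage note are made up for illustration.

```python
from math import erf, sqrt


def itt_two_proportion_ztest(retained_c, assigned_c, retained_t, assigned_t):
    """Two-sided pooled z-test on D7 retention, intention-to-treat.

    Denominators are every user assigned to the arm at randomization time.
    Returns (absolute lift, z statistic, two-sided p-value).
    """
    p_c = retained_c / assigned_c
    p_t = retained_t / assigned_t
    pooled = (retained_c + retained_t) / (assigned_c + assigned_t)
    se = sqrt(pooled * (1 - pooled) * (1 / assigned_c + 1 / assigned_t))
    if se == 0:
        raise ValueError("retention is 0% or 100% in both arms; test undefined")
    z = (p_t - p_c) / se
    # standard normal CDF via erf, then two-sided tail probability
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_t - p_c, z, p_value
```

For example, `itt_two_proportion_ztest(4000, 10000, 4200, 10000)` compares hypothetical arms of 10,000 assigned users each. The same call with denominators restricted to users still active on day 7 would answer a different, survivorship-biased question.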

5. Round 4: Applied System Design (Online Inference Pipeline)

Scenario

"Deploy the memory model above as a live service: when each user opens the app, return the 20 words to review today within 50ms. Tens of millions of daily actives."

Design Points

| Layer | Approach | Rationale |
| --- | --- | --- |
| Features | Offline batch + light online features | Precompute historical stats offline; assemble only real-time features per request |
| Model serving | Light model (linear/small tree), vectorized batching | A 50ms SLA forbids a heavy model scoring word by word |
| Scheduling | Maintain each word's "next due time"; only due words enter the candidate pool | Turn "prediction" into "priority-queue top-k" |
| Storage | User-word state in a KV store (half-life, last review time) | Reads/writes are per-user, naturally sharded |
| Offline feedback | Nightly retrain on the day's feedback and update half-lives | Closed loop: review result → update h → influences tomorrow's schedule |

Core trade-off: push heavy computation offline and let the online path only "fetch due words + lightly rank," so it can sustain 50ms across tens of millions of daily actives.
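The online path of "fetch due words + lightly rank" might look roughly like the sketch below. Everything here is an assumption for illustration: the `word_states` layout stands in for whatever the KV store returns for one user, and a word counts as "due" once its predicted recall falls below a hypothetical 0.9 threshold.

```python
import heapq

SECONDS_PER_DAY = 86400.0


def select_reviews(word_states, now, k=20, due_threshold=0.9):
    """Pick the k due words with the lowest predicted recall.

    word_states: dict of word_id -> (half_life_days, last_review_ts_seconds),
    a hypothetical per-user record as it might come back from a KV store.
    """
    due = []
    for word_id, (half_life_days, last_ts) in word_states.items():
        elapsed_days = (now - last_ts) / SECONDS_PER_DAY
        recall = 2.0 ** (-elapsed_days / half_life_days)
        if recall < due_threshold:  # only due words enter the candidate pool
            due.append((recall, word_id))
    # top-k by lowest recall; O(m log k) for m due words
    return [word_id for _, word_id in heapq.nsmallest(k, due)]
```

The per-request work is one cheap formula per stored word plus a bounded heap, with no model call on the hot path; the half-lives themselves are refreshed by the offline loop.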

6. Four-Round Prep Checklist

| Round | Focus | Resources |
| --- | --- | --- |
| Coding | Weighted sampling, streaming top-k, reservoir sampling | LeetCode + probabilistic algorithms |
| ML depth | Memory models, calibration, cold start, interpretability | Duolingo HLR paper + recommender course |
| Research design | Full A/B workflow, guardrails, novelty, ITT | Trustworthy Online Controlled Experiments (Kohavi, Tang, Xu) |
| System | Online inference, feature pipeline, scheduling queue | ML system design material |

FAQ

Q1: How does the AI Research Engineer VO differ from a standard ML Engineer loop?

The biggest difference is the weight on the research-design round and the ML deep-dive round. ML Engineer leans toward engineering delivery; Research Engineer must design experiments from scratch, justify modeling choices, and read papers. Coding is only one round, and it skews probabilistic/sampling rather than pure LeetCode.

Q2: Can I apply without an NLP/education background?

Yes. Duolingo values "modeling intuition plus experiment thinking," and memory models, recommendation, and ranking transfer well. To prepare, take one past ML project and be ready to answer three layers: why you modeled it this way, how you evaluated it, and how you validated it after launch.

Q3: What LeetCode level is the coding round?

The skeleton is around Medium, but it skews probabilistic and streaming (weighted sampling, reservoir, top-k) rather than standard DP/graph. The point is clean code plus answering complexity and dynamic-update follow-ups.

Q4: Will the ML round make me derive formulas by hand?

Yes. Half-life regression, the loss function, and calibration metrics may all be whiteboarded. You need not memorize formulas, but you must explain why exponential decay, why h must stay positive, and why accuracy is the wrong evaluation metric.
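One way to illustrate the accuracy point: the scheduler consumes the predicted probability itself, so two models with identical accuracy can produce very different schedules. The sketch below uses expected calibration error (ECE) with equal-width bins as one possible calibration metric; the function names and toy numbers are illustrative assumptions.

```python
def accuracy(probs, outcomes, threshold=0.5):
    """Fraction of items where (p >= threshold) matches the 0/1 outcome."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean gap between predicted and observed recall per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        observed = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(mean_p - observed)
    return ece


# toy data: 8 of 10 words recalled
outcomes = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
model_a = [0.80] * 10  # predicts the true recall rate
model_b = [0.99] * 10  # overconfident
```

Both toy models score the same accuracy because every prediction sits above 0.5, yet only `model_a` has an ECE near zero; `model_b` would push reviews far too late if its probabilities were trusted.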

Q5: How long until results after the VO?

Usually recruiter feedback within a week of the four rounds, then team match. Research teams are fewer, so matching can add another 1-2 weeks; the overall pace is slower than a pure SDE loop.

