Building Marlin-2B · Part 1 of 3

The Map Was Wrong

Aryan Jain · NemoStation · 2026-05-26 · 12 min

Marlin checks the map. The map is lying.

A 2B open-source video model is #3 on Charades-TimeLens, behind only TimeLens-8B and Gemini-2.5-Pro. On the way there we audited the standard video-caption benchmarks and found something the field hasn’t acknowledged: most of their ground-truth captions contain at least one factual error.

TL;DR

About Marlin-2B

2B-parameter video-language model · Apache 2.0 license · fine-tuned from Qwen3.5-2B-Base · built by NemoStation. Demo · Hugging Face · Part 1 of 3 in our build-journey series.

Glossary

| Term | Definition |
|---|---|
| mIoU | Mean Intersection-over-Union, the standard temporal-grounding metric. For each predicted (start, end) span, compute IoU vs the GT span; average across the test set. |
| R@1@τ | Recall@1 at IoU threshold τ: the fraction of test instances where the top-1 predicted span achieves IoU ≥ τ. |
| AutoDQ | Automatic Dense-caption Quality, DREAM-1K’s judge protocol. An LLM extracts atomic events from the GT, then checks each against the model’s caption. Yields precision / recall / F1. |
| MLLM / VLM | Multimodal Large Language Model / Vision-Language Model. LLMs that ingest images or video alongside text. |
| VTG | Video Temporal Grounding: given a natural-language query, locate the time span in a video where it happens. |
| GT | Ground Truth, the reference annotation, treated as correct for evaluation. (Whether it actually is correct is the subject of much of this post.) |
| GDN / linear attention | Gated DeltaNet, a recurrent-style attention variant with O(n) memory, used in Qwen3.5’s hybrid layer stack. Faster than full attention at long contexts. |
| DeepStack | The mechanism for injecting visual tokens via residual into the first 3 LLM layers, instead of concatenating them to the text token stream. |
| GRPO | Group Relative Policy Optimization, DeepSeek’s RL variant; samples N rollouts per prompt, computes per-rollout reward, optimizes relative advantages. |
| SimPO | Simple Preference Optimization, a DPO variant we used for the post-GRPO preference-learning stage. (Detail in Post 2.) |

The discovery

We started by evaluating Marlin-2B on the standard public dense-caption benches — CaReBench and DREAM-1K. The judges those benches use are text-only: they read our caption and the ground-truth caption side-by-side and decide which atomic facts each side gets right. The judge never sees the video.

That’s a problem for two reasons we didn’t appreciate until we dug in.

Problem 1: when the model is more accurate than the GT, it loses

Here’s a real row from our DREAM-1K eval. The benchmark video shows a yellow humanoid creature in a half-kneeling pose. Our model wrote:

“yellow humanoid creature stands in a crouched position”

The official DREAM-1K GT description said only:

“the character half kneels on the ground”

— with no mention of color. The text-only judge’s ruling on our predicted event:

“The video description mentions a character half kneeling, which is similar to a crouched position, but does not specify the color of the character.”

Marked neutral. Not a hit. Then four more “yellow …” predictions for the same clip got the same ruling — “does not specify the color”, “does not mention the color”, over and over. The judge couldn’t verify “yellow” against the actual video, so when the GT didn’t mention color, the judge defaulted to “can’t tell.” Our model was being penalized for being more accurate than the GT.

Text-only vs video-aware judge on the same input

Figure: same candidate caption, same GT, same clip. The text-only judge (left) doesn’t see the video and can’t verify “yellow” — it defaults to a neutral verdict that penalizes the correct detail. The video-aware judge (right) sees the creature is yellow and credits the candidate. The text-only protocol is what every public dense-caption benchmark uses today.
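To make the penalty concrete, here is a minimal sketch of how a precision score could be derived from per-event judge verdicts. The verdict labels and the rule that only an "entailment" verdict counts as a hit are assumptions for illustration, not DREAM-1K's actual implementation. The five neutral verdicts mirror the five "yellow ..." predictions described above; the three extra events are made up.

```python
from collections import Counter

def precision_from_verdicts(verdicts):
    """Fraction of predicted events the judge marks as supported.

    Assumption: only "entailment" counts as a hit; "neutral" and
    "contradiction" both count against the caption.
    """
    if not verdicts:
        return 0.0
    counts = Counter(verdicts)
    return counts["entailment"] / len(verdicts)

# Hypothetical clip: five correct "yellow ..." events plus three others.
text_only = ["neutral"] * 5 + ["entailment"] * 3  # judge cannot verify color
video_aware = ["entailment"] * 8                  # judge checks the frames

print(precision_from_verdicts(text_only))    # 0.375
print(precision_from_verdicts(video_aware))  # 1.0
```

Under these assumptions the caption is identical in both cases; only the judge's access to the video changes the score.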

Problem 2: the GT itself is wrong, way more often than the field admits

So we built a video-aware judge — Gemini-3-Flash with the actual video frames in the prompt — to audit the benchmarks. We pointed it at the GT side of CaReBench and DREAM-1K and asked: how often is the ground truth itself wrong about what’s in the video?

The answer:

| Benchmark | Total GT entries audited | GT with at least one factual error flagged | GT with major errors (factual ≤ 5/10) |
|---|---|---|---|
| CaReBench (consensus across 4 independent video-aware judge runs) | 795 clips (clean, both sides judged successfully) | 558 (70.2%) | 17 (2.1%) |
| CaReBench (any single video-aware judge run) | 795 | 781 (98.2%) | 163 (20.5%) |
| DREAM-1K (single-side video-aware judge, GT-only audit) | 1000 | 658 (65.8%) | 77 (7.7%) |

Ground-truth annotation errors in the standard video-caption benchmarks

Figure: GT-error rates by benchmark. What the field has implicitly been assuming is that the GT is essentially correct; what we measured with an independent video-aware audit judge is shown above. The 70.2% and 65.8% are conservative (consensus / single-side), and ~2–8% of clips have severe enough errors that the GT itself scored ≤5/10 in our audit.
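The difference between the "consensus" and "any single run" rows comes down to how per-run flags are combined. The sketch below assumes each judge run yields one boolean per clip (flagged or not); that data layout and the function names are hypothetical, and the toy flags are invented. The `pct` helper simply recomputes the percentages from the counts reported in the table.

```python
def aggregate_flags(flags_by_run):
    """flags_by_run: list of runs, each a list of booleans (one per clip).

    Returns (consensus_count, any_count): clips flagged by every run,
    and clips flagged by at least one run.
    """
    per_clip = list(zip(*flags_by_run))
    consensus = sum(all(flags) for flags in per_clip)
    any_run = sum(any(flags) for flags in per_clip)
    return consensus, any_run

def pct(count, total):
    """Percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)

# Toy example: 4 judge runs over 5 clips (made-up flags).
runs = [
    [True, True, False, True, False],
    [True, False, False, True, False],
    [True, True, False, True, True],
    [True, True, False, False, False],
]
print(aggregate_flags(runs))  # (1, 4)

# Recomputing the table's percentages from its counts.
print(pct(558, 795), pct(781, 795), pct(658, 1000))  # 70.2 98.2 65.8
```

Requiring agreement across all runs is the stricter rule, which is why the consensus figures are the lower bound.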

These aren’t subtle disagreements. They’re factual errors that a human checking the video would also flag. A few representative samples from the consensus-flagged CaReBench errors (cases where all four independent judge runs agreed the GT was wrong):

idx=3: “states the video is filmed from her perspective, but it is a front-facing/selfie camera view”
idx=4: “typo in banner text: ‘NEW CIDS’ instead of ‘NEW VIDS’”
idx=6: “claims she finishes applying eyeshadow to her left eye, which is never shown in the video”
idx=9: “describes the background as a blue and yellow gradient when it is a solid teal/mint green”
idx=12: “the athlete is wearing a sleeveless tank top, not a short-sleeve shirt”
idx=13: “describes the javelin as blue, whereas it is primarily yellow/orange with a blue grip section”
idx=15: GT describes a pole-vault clip as “male athlete during javelin throwing training”

The actual CaReBench idx=15 clip. This is unambiguously a pole vault — pole, run-up, bar, landing mat. The official GT caption describes it as “male athlete during javelin throwing training.” A text-only judge has no way to catch this; a viewer can in under a second.

And from DREAM-1K’s GT, audited single-side:

idx=0: “GT claims they grabbed two green monsters; the video shows one green and one blue-with-purple-spots.”
idx=7: “GT calls a blue figure ‘cat-like’; the video shows a mechanical praying-mantis-like insect.”
idx=11: “GT says ‘sword fight’; the actual combat is a body tackle with an axe-like weapon.”
idx=14: “GT says ‘large bird’; the creature is a dragon with scales, horns, and a reptilian tail.”

This is what the standard text-only judges have been silently grading against for the past two years. None of them have any way to see the video, so none of them can catch any of these. Every model that’s ever reported a DREAM-1K AutoDQ or CaReBench score has been ranked relative to a reference that is wrong about something in roughly two out of every three clips.

The field’s benchmarks are not a map of model capability. They’re a map of how closely each model happens to share the GT’s specific (often wrong) vocabulary.


1 — Why this gap exists, and what’s actually grading what

The taxonomy of public dense-caption + grounding benchmarks looks something like this:

A model can do well on dense captioning while getting nothing useful from temporal grounding, or vice versa, and the field’s standard taxonomy doesn’t surface the gap. Real downstream applications — highlight reels, security review, content moderation, sports analytics, ad insertion — all need both jointly.

The bigger problem is the one above: on the two benches that do try to grade dense caption quality (DREAM-1K, CaReBench), the field’s default is a text-only LLM judge. That choice is historically defensible — when DREAM-1K and CaReBench were designed, nobody had a Gemini-2.5-Flash-quality multimodal judge available. But in 2026 the limitation is no longer technological, it’s inertial. The benches still use text-only judges. The text-only judges still can’t audit the GT. The leaderboards still treat the GT as a fixed reference. None of it survives a five-minute audit.

We’re not the first to notice the field has a problem here: Zheng et al., 2023 (LLM-as-a-judge), Wang et al., 2023 (positional bias), and Saito et al., 2023 (verbosity bias) all document failure modes of LLM judges. What’s different here is the scale of the GT-quality issue. It’s not a bias of a few percent. It’s the majority of GT entries in both benchmarks we audited.


2 — Why 2B was the right size to find this

Iteration is the rate-limiting step in finding judge bugs. We re-ran the full 1000-clip CaReBench audit a dozen-plus times in three weeks across different judge prompts, which is only possible if a full eval takes under two hours. A 2B model trains comfortably on H100s (frozen ViT + DeepSpeed ZeRO-2) and a full re-eval costs an evening. The same 2B then fits on a single L4 GPU for inference, so the public demo can run cheaply too.

This isn’t a “small model is better” claim. It’s “small model is fast enough to debug the evaluator.” If iteration took eight hours instead of two, we’d have shipped the wrong training recipe before we figured out the eval was broken.

The same constraint applied to data. We regenerated ~400K grounded training captions (~200K dense captions plus ~200K temporal-grounding events) every time we tightened the teacher prompt. And we tightened it more than you’d think — story in Post 2.


3 — Why Qwen3.5-2B specifically

Qwen3.5-2B uses a hybrid attention layer stack: 24 total layers, with 18 Gated DeltaNet (linear-attention, O(n)) layers and 6 full-attention layers in a strict 3-linear-then-1-full pattern (full attention at layers 3, 7, 11, 15, 19, and 23). Verified against the Qwen3.5-2B-Base config.json layer_types field.
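The 3-linear-then-1-full pattern is easy to reproduce and sanity-check. The sketch below generates the pattern from the numbers stated above, using 0-based layer indices. The label strings `"full_attention"` and `"linear_attention"` are assumed names for illustration; the actual values in the `layer_types` field should be checked against the real config.json.

```python
NUM_LAYERS = 24
FULL_EVERY = 4  # 3 linear-attention layers, then 1 full-attention layer

# Assumed label strings; verify against the real config.json.
layer_types = [
    "full_attention" if (i + 1) % FULL_EVERY == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]

full_idx = [i for i, t in enumerate(layer_types) if t == "full_attention"]
print(full_idx)                               # [3, 7, 11, 15, 19, 23]
print(layer_types.count("linear_attention"))  # 18
print(layer_types.count("full_attention"))    # 6
```

Comparing a generated list like this against the array loaded from the config file would be the actual verification step.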

Practical consequences for this project:

We trained on H100s with ms-swift + vLLM-backed rollouts (afforded 16 rollouts per prompt during GRPO data filtering). The public demo at vlm.nemostation.com runs the trained model on a single L4 GPU.
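For readers unfamiliar with what the 16 rollouts are used for, here is a minimal sketch of the group-relative advantage computation commonly associated with GRPO: each rollout's reward is normalized against the mean and standard deviation of its own group. This is the textbook formulation, not a confirmed description of our training code; the reward values, the use of population standard deviation, and the `eps` constant are all illustrative assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    Assumption: population standard deviation with a small eps
    to avoid division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for 16 rollouts of a single prompt.
rewards = [0.1, 0.4, 0.4, 0.7, 0.2, 0.9, 0.5, 0.3,
           0.6, 0.4, 0.8, 0.1, 0.5, 0.7, 0.3, 0.6]
advantages = group_relative_advantages(rewards)

best = advantages.index(max(advantages))
print(best, rewards[best])  # 5 0.9
```

A group whose rollouts all receive the same reward produces all-zero advantages under this formulation, so it contributes no learning signal.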

Architecture deep-dive

Qwen3.5-2B hybrid attention layer stack

Figure: Qwen3.5-2B’s 24-layer stack. Full attention at layers 3, 7, 11, 15, 19, 23 (6 total); the remaining 18 layers use Gated DeltaNet (linear attention). Visual tokens are injected via DeepStack residual into the first 3 layers. Verified from the model’s config.json layer_types array.

The 6 full-attention layers keep long-range modeling intact for dense captioning. The 18 GDN layers make decode cheap enough to fit GRPO rollouts on one GPU.


4 — The TimeLens-Bench leaderboard

This is the post’s payoff section, and not coincidentally it’s the bench where we trust the numbers most. TimeLens-Bench doesn’t use an LLM judge at all — it’s pure span-IoU arithmetic on (start_sec, end_sec) predictions. No GT-quality coupling, no text-judge blind spots. Whatever a model scores here is roughly what it’s actually doing.
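The span-IoU arithmetic is short enough to show in full. The sketch below follows the mIoU and R@1@τ definitions from the glossary, with one predicted span per query. The function names and toy spans are illustrative, and this is not the official TimeLens-Bench evaluation script.

```python
def span_iou(pred, gt):
    """Temporal IoU of two (start_sec, end_sec) spans."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Return mIoU and R@1@tau (both as percentages) over a test set."""
    ious = [span_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    report = {"mIoU": 100.0 * sum(ious) / n}
    for tau in thresholds:
        report[f"R@1@{tau}"] = 100.0 * sum(iou >= tau for iou in ious) / n
    return report

# Toy data: one partial overlap, one exact match, one complete miss.
preds = [(2.0, 6.0), (10.0, 12.0), (0.0, 5.0)]
gts = [(4.0, 8.0), (10.0, 12.0), (20.0, 25.0)]
report = evaluate(preds, gts)
print(report)
```

In the toy data the first pair overlaps for 2 seconds out of a 6-second union (IoU of one third), so it counts toward R@1@0.3 but not R@1@0.5.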

We evaluated Marlin-2B on the refined TimeLens-Bench (Zhang et al., 2025, released in December 2025), which provides manually re-annotated, much higher-quality versions of Charades-STA, ActivityNet-STA, and QVHighlights. Our sample counts match the paper’s exactly (3363 / 4500 / 1541), so the comparison is apples-to-apples on the same benchmark data.

| Model | Params | Charades-TimeLens (mIoU) | ActivityNet-TimeLens (mIoU) | QVHighlights-TimeLens (mIoU) |
|---|---|---|---|---|
| Gemini-2.5-Pro | proprietary | 52.8 | 58.1 | 70.4 |
| TimeLens-8B (RL on Qwen3-VL-8B) | 8B | 55.2 | 53.2 | 65.5 |
| Gemini-2.5-Flash | proprietary | 48.6 | 52.5 | 64.3 |
| Qwen3-VL-8B | 8B | 48.3 | 46.8 | 59.4 |
| GPT-5 | proprietary | 40.5 | 42.9 | 56.8 |
| TimeLens-7B (RL on Qwen2.5-VL-7B) | 7B | 48.8 | 46.2 | 56.0 |
| Marlin-2B GRPO (cp-450), ours | 2B | 50.05 | 45.88 | 55.50 |
| GPT-4o | proprietary | 41.8 | 40.4 | 52.1 |
| Marlin-2B SFT (cp-11894), ours | 2B | 45.73 | 42.10 | 52.46 |
| MiMo-VL-7B | 7B | 39.6 | 35.5 | 41.5 |
| Qwen2.5-VL-7B | 7B | 39.3 | 31.4 | 31.6 |

Numbers for non-Marlin models are from Zhang et al., 2025, Table 1. Marlin-2B numbers are from our own evaluation on the same refined TimeLens-Bench data. The full R@1@τ breakdown is in Appendix A.

Charades-TimeLens leaderboard — Marlin-2B is #3

Figure: Charades-TimeLens mIoU across the leaderboard. Marlin-2B GRPO (orange, 2B params) lands at #3 — the smallest model in the top five, with only TimeLens-8B and Gemini-2.5-Pro above it. Marlin-2B SFT (still 2B, no RL) already beats GPT-4o, GPT-5, MiMo-VL-7B, and Qwen2.5-VL-7B.
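The ranking claim can be checked directly from the Charades-TimeLens column of the table. The snippet below copies those mIoU values into a dictionary (shortened model names are ours) and sorts them in descending order.

```python
# Charades-TimeLens mIoU values copied from the leaderboard table.
charades_miou = {
    "Gemini-2.5-Pro": 52.8,
    "TimeLens-8B": 55.2,
    "Gemini-2.5-Flash": 48.6,
    "Qwen3-VL-8B": 48.3,
    "GPT-5": 40.5,
    "TimeLens-7B": 48.8,
    "Marlin-2B GRPO": 50.05,
    "GPT-4o": 41.8,
    "Marlin-2B SFT": 45.73,
    "MiMo-VL-7B": 39.6,
    "Qwen2.5-VL-7B": 39.3,
}

ranked = sorted(charades_miou, key=charades_miou.get, reverse=True)
print(ranked.index("Marlin-2B GRPO") + 1)  # 3
print(ranked.index("Marlin-2B SFT") + 1)   # 7
print(ranked[:5])
```

The four models ranked below the SFT checkpoint are GPT-4o, GPT-5, MiMo-VL-7B, and Qwen2.5-VL-7B, matching the figure caption.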

Sorted by Charades-TimeLens mIoU, our 2B model is #3 — behind only TimeLens-8B (8B) and Gemini-2.5-Pro. Specifically:

How did a 2B model end up in this leaderboard position? That’s Post 2.


5 — Where this leaves the field

None of this is a strawman of careless researchers. The text-only AutoDQ protocol predates the era of high-quality multimodal judges — it was the only reasonable choice when nobody had a Gemini-2.5-Flash-quality judge to evaluate against video. CaReBench and DREAM-1K were both significant contributions in their time. The reason the field’s defaults are broken in 2026 is inertia, not negligence.

But the inertia matters. With ~70% of CaReBench GT entries containing at least one consensus-flagged factual error, and ~66% of DREAM-1K GT entries flagged in our single-run, GT-only audit, the dense-caption leaderboards aren’t measuring what they claim to. A model that scores well on AutoDQ is to a large degree vocabulary-matching the (often wrong) reference rather than describing the video accurately. A model that adds correct dense detail (e.g., color, on-screen text, camera angle) is penalized for diverging from the noisy GT. The incentive structure points toward writing captions that resemble the reference, not toward seeing what’s in the video.

We did three things about it. Training a model that does both grounding and dense captioning jointly is Post 2. Rebuilding the judge so it actually grades against the video — and releasing a new benchmark whose GT we generated under that judge — is Post 3.

And — quietly — there’s a fourth thing. There’s a single framing trick we found that makes Gemini-3-Flash hallucinate less, whether it’s writing captions during data curation or grading them at eval time. We used it on the teacher side for months while building the training set. We used it again when we rebuilt the judge. We’ll show you the recipe in Post 3.


5.5 — Limitations and honest disclosure

Indie credibility on Hugging Face and r/LocalLLaMA depends on owning the caveats up front. So:


6 — What’s next

Post 2 — coming next week. How we built the ~400K-clip grounded training set, the SFT → GRPO → SimPO recipe that lifted Marlin-2B from 45.73 to 50.05 mIoU on Charades-TimeLens, and the four bugs that nearly killed each training stage. Plus: the teacher-side version of the prompting trick we hinted at above.

Post 3 — gated on the videoeval-v2 v1.0 freeze. We’ll release the code, the judge prompts (including that prompting trick), and val_set — a new 1000-clip benchmark with timestamped GT, audited against video at generation time, designed to let you measure dense captioning and temporal grounding on the same model. Apache 2.0, full open release.

If you’re shipping a video VLM, your AutoDQ / CaReBench numbers are probably grading your model against a reference in which roughly 70% of entries contain at least one factual error. We’d be happy to help you check: open an issue on the Marlin-2B HF repo with a sample of your judge outputs and we’ll look at them with you.


Drafted with Claude Opus 4.7, reviewed and edited by the NemoStation team.

Acknowledgements: the TimeLens team at ARC Lab / Tencent PCG for the refined VTG benchmark we evaluate on; the ms-swift team for the training framework; the Qwen team for the base model; Hazy Research for visual style inspiration on the post format; and the maintainers of the open-source video datasets we built on top of (Charades, ActivityNet, QVHighlights, CaReBench, DREAM-1K).


Appendix A — Full TimeLens-Bench numbers

Per-split, per-model R@1@{0.3, 0.5, 0.7} and mIoU, each table sorted by mIoU. Marlin-2B rows are our own evaluation on the refined TimeLens-Bench annotations (n counts match the paper exactly: 3363 / 4500 / 1541). All other rows are from Zhang et al., 2025, Table 1.

Charades-TimeLens (n = 3363)

| Model | R@1@0.3 | R@1@0.5 | R@1@0.7 | mIoU |
|---|---|---|---|---|
| TimeLens-8B | 76.6 | 63.0 | 35.2 | 55.2 |
| Gemini-2.5-Pro | 74.1 | 61.1 | 34.0 | 52.8 |
| Marlin-2B GRPO (cp-450) | 70.71 | 55.63 | 30.60 | 50.05 |
| TimeLens-7B | 70.5 | 55.6 | 28.4 | 48.8 |
| Gemini-2.5-Flash | 68.7 | 56.1 | 30.6 | 48.6 |
| Qwen3-VL-8B | 69.2 | 53.4 | 27.5 | 48.3 |
| Marlin-2B SFT (cp-11894) | 65.24 | 50.25 | 27.00 | 45.73 |
| GPT-4o | 60.6 | 44.5 | 23.5 | 41.8 |
| GPT-5 | 59.3 | 42.0 | 22.0 | 40.5 |
| MiMo-VL-7B | 57.9 | 42.6 | 20.5 | 39.6 |
| Qwen2.5-VL-7B | 59.7 | 37.8 | 16.6 | 39.3 |

ActivityNet-TimeLens (n = 4500)

| Model | R@1@0.3 | R@1@0.5 | R@1@0.7 | mIoU |
|---|---|---|---|---|
| Gemini-2.5-Pro | 72.3 | 64.2 | 47.1 | 58.1 |
| TimeLens-8B | 68.9 | 58.4 | 40.6 | 53.2 |
| Gemini-2.5-Flash | 66.8 | 57.5 | 41.3 | 52.5 |
| Qwen3-VL-8B | 62.1 | 51.2 | 34.4 | 46.8 |
| TimeLens-7B | 62.8 | 51.0 | 32.6 | 46.2 |
| Marlin-2B GRPO (cp-450) | 59.33 | 49.60 | 33.38 | 45.88 |
| GPT-5 | 57.4 | 44.9 | 30.4 | 42.9 |
| Marlin-2B SFT (cp-11894) | 54.76 | 45.27 | 29.84 | 42.10 |
| GPT-4o | 55.2 | 41.4 | 25.8 | 40.4 |
| MiMo-VL-7B | 49.3 | 38.7 | 22.4 | 35.5 |
| Qwen2.5-VL-7B | 44.1 | 31.0 | 16.1 | 31.4 |

QVHighlights-TimeLens (n = 1541)

| Model | R@1@0.3 | R@1@0.5 | R@1@0.7 | mIoU |
|---|---|---|---|---|
| Gemini-2.5-Pro | 84.1 | 75.9 | 61.1 | 70.4 |
| TimeLens-8B | 80.2 | 71.6 | 55.5 | 65.5 |
| Gemini-2.5-Flash | 78.2 | 69.4 | 55.0 | 64.3 |
| Qwen3-VL-8B | 74.2 | 64.6 | 49.3 | 59.4 |
| GPT-5 | 72.4 | 60.4 | 46.4 | 56.8 |
| TimeLens-7B | 74.1 | 62.7 | 43.1 | 56.0 |
| Marlin-2B GRPO (cp-450) | 69.05 | 58.21 | 44.06 | 55.50 |
| Marlin-2B SFT (cp-11894) | 65.41 | 54.57 | 41.79 | 52.46 |
| GPT-4o | 69.0 | 54.8 | 38.5 | 52.1 |
| MiMo-VL-7B | 57.1 | 42.6 | 28.4 | 41.5 |
| Qwen2.5-VL-7B | 41.5 | 27.8 | 15.2 | 31.6 |

Appendix B — Sources and reproducibility

Every number in this post is verifiable against on-disk files. For anyone who wants to spot-check:

Bench audit (Section “The discovery”, Diagram A)

TimeLens-Bench numbers (Section 4, Diagram B, Appendix A)

Architecture (Section 3, Diagram 1)

If you spot a number that doesn’t reproduce against these sources, open an issue on the Marlin-2B HF repo and we’ll fix it.