Building Marlin-2B · Part 1 of 3

Part 1: The Map Was Wrong

Aryan Jain NemoStation 2026-05-26 18 min

Marlin checks the map

A month ago, we came across the Tarsier model, which was trained specifically for dense video captioning. Something felt off, though. The captions were dense, but they carried no timestamps for when the described events occurred. We searched the existing video-understanding benchmarks for models that handle this well, and the results were subpar. Most benchmarks don’t measure the temporal axis at all, and the ones that do try to measure it rely on text-based judges, which introduce several biases. So we built Marlin-2B and a new audit pipeline. The rest of this post covers what we found and where the field’s evaluation actually breaks.

Marlin-2B: dense captioning with timestamps Marlin-2B: natural-language temporal grounding queries
Marlin-2B emits time-anchored events for dense captioning (left) and resolves natural-language grounding queries into time spans (right) — both from a single forward pass.

TL;DR

The standard video-captioning benchmarks (CaReBench, DREAM-1K) carry minor factual errors in their ground truth, and they sometimes miss details that the model being benchmarked actually got right. Those correct details then get marked as hallucinations. Fixing this requires a video-aware judge rather than a text-only one. We built such a judge, audited both benches, and found GT errors in 70.2% of CaReBench clips and 65.8% of DREAM-1K entries. Based on this audit, we’re releasing our own benchmark, Argus, with video-verified ground truth, in Post 3. We then trained Marlin-2B, a 2B open-source VLM, which sits #5 among open-source models on TimeLens-Bench (no LLM judge involved) and beats GPT-5 by +3.9 mIoU on the combined average despite being 3.5× smaller.

About Marlin-2B

2B-parameter vision-language model · Apache 2.0 license · fine-tuned from Qwen3.5-2B-Base · built by NemoStation. Demo · Hugging Face · Part 1 of 3 in our build-journey series.


The discovery

We wanted Marlin-2B to be good at the video tasks we actually use: locating an event in a long clip, generating dense captions with timestamps. So we read the recent video-VLM literature, picked DREAM-1K, CaReBench, and the refined TimeLens-Bench, and built a training mix on top of the Tarsier paper’s recipe (Charades-Ego, COIN, LSMDC, ActivityNet, YouCook, WebVid subsets). The plan was to train against the field’s standard benchmarks and ship a useful open-source model.

But the benchmarks themselves were broken.

We started by evaluating Marlin-2B on CaReBench and DREAM-1K. The judges those benches use are text-only: they read our caption and the ground-truth caption side-by-side and decide which atomic facts each side gets right. The judge never sees the video. That choice produces two problems.

Problem 1: when the model is more accurate than the GT, it loses

Here’s a real row from our DREAM-1K eval. The benchmark video shows a yellow humanoid creature in a half-kneeling pose. Our model wrote:

“yellow humanoid creature stands in a crouched position”

The official DREAM-1K GT description said only:

“the character half kneels on the ground”

The GT made no mention of color. The text-only judge’s ruling on our predicted event:

“The video description mentions a character half kneeling, which is similar to a crouched position, but does not specify the color of the character.”

Marked neutral. Not a hit. Then four more “yellow …” predictions for the same clip got the same ruling, “does not specify the color”, “does not mention the color”, over and over. The judge couldn’t verify “yellow” against the actual video, so when the GT didn’t mention color, the judge defaulted to “can’t tell.” Our model was being penalized for being more accurate than the GT.
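To make the cost of a neutral verdict concrete, here is a minimal sketch of precision scoring in an AutoDQ-style protocol. It assumes three verdict labels (entailment, neutral, contradiction) and that only entailment counts as a hit. The function name and the example verdict lists are illustrative, not taken from the DREAM-1K code.

```python
from collections import Counter

def event_precision(verdicts):
    """Fraction of predicted events the judge marked as entailed.

    Assumption: anything other than "entailment" (including "neutral")
    counts as a miss, matching the "Marked neutral. Not a hit." outcome.
    """
    if not verdicts:
        return 0.0
    counts = Counter(verdicts)
    return counts["entailment"] / len(verdicts)

# Hypothetical clip: 5 correct "yellow ..." events judged neutral because the
# GT never mentions color, plus 3 events that the GT does cover.
text_only = ["neutral"] * 5 + ["entailment"] * 3
# The same 8 events if the judge could confirm "yellow" from the frames.
video_aware = ["entailment"] * 8

print(event_precision(text_only))    # 0.375
print(event_precision(video_aware))  # 1.0
```

Under these assumptions the caption itself is unchanged between the two runs; only the judge's ability to verify the extra detail moves the score.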

Text-only vs video-aware judge on the same input

Figure: same candidate caption, same GT, same clip. The text-only judge (left) doesn’t see the video and can’t verify “yellow” — it defaults to a neutral verdict that penalizes the correct detail. The video-aware judge (right) sees the creature is yellow and credits the candidate. The text-only protocol is what every public dense-caption benchmark uses today.

Problem 2: the GT itself is wrong, way more often than the field admits

So we built a video-aware judge, Gemini-3-Flash with the actual video frames in the prompt, to audit the benchmarks. We pointed it at the GT side of CaReBench and DREAM-1K and asked: how often is the ground truth itself wrong about what’s in the video?

The answer:

| Benchmark | Total GT entries audited | GT with at least one factual error flagged | GT with major errors (factual ≤ 5/10) |
|---|---|---|---|
| CaReBench (consensus across 4 independent video-aware judge runs) | 795 clips (clean, both sides judged successfully) | 558 (70.2%) | 17 (2.1%) |
| CaReBench (any single video-aware judge run) | 795 | 781 (98.2%) | 163 (20.5%) |
| DREAM-1K (single-side video-aware judge, GT-only audit) | 1000 | 658 (65.8%) | 77 (7.7%) |

Ground-truth annotation errors in the standard video-caption benchmarks

Figure: GT-error rates by benchmark. What the field has implicitly been assuming is that the GT is essentially correct; what we measured with an independent video-aware audit judge is shown above. The 70.2% and 65.8% are conservative (consensus / single-side), and ~2–8% of clips have severe enough errors that the GT itself scored ≤5/10 in our audit.
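The difference between the consensus row and the any-single-run row can be sketched as a simple aggregation over judge runs. This is an illustration under stated assumptions: each run is represented as a dict from clip ID to its errors_gt list, "consensus" means every run returned a non-empty list, and "any run" means at least one did. The toy clip IDs and error strings are made up.

```python
def flag_counts(runs):
    """Count, per clip, how many judge runs returned a non-empty errors_gt list.

    runs: one dict per judge run, mapping clip_id -> errors_gt (list of str).
    """
    clip_ids = set()
    for run in runs:
        clip_ids.update(run)
    return {cid: sum(1 for run in runs if run.get(cid)) for cid in clip_ids}

def audit_summary(runs):
    """Summarize how many clips are flagged by all runs vs by at least one."""
    counts = flag_counts(runs)
    n_runs = len(runs)
    return {
        "clips": len(counts),
        "consensus": sum(1 for n in counts.values() if n == n_runs),
        "any_run": sum(1 for n in counts.values() if n >= 1),
    }

def pct(flagged, total):
    """Percentage rounded to one decimal, as reported in the table."""
    return round(100 * flagged / total, 1)

# Toy data: 4 runs over 3 hypothetical clips.
runs = [
    {"a": ["wrong color"], "b": [], "c": []},
    {"a": ["wrong color"], "b": ["extra action"], "c": []},
    {"a": ["wrong color"], "b": [], "c": []},
    {"a": ["wrong color"], "b": [], "c": []},
]
print(audit_summary(runs))  # {'clips': 3, 'consensus': 1, 'any_run': 2}

# The table's percentages follow directly from its counts.
print(pct(558, 795), pct(781, 795), pct(658, 1000))  # 70.2 98.2 65.8
```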

A few consensus-flagged CaReBench errors (all four independent judge runs agreed the GT was wrong):

And from DREAM-1K, audited single-side:

DREAM-1K clip 1. No hand-raising, no staircase — the characters just walk around the space.

This is what the standard text-only judges have been grading against for the past two years. None of them have any way to see the video, so none of them can catch any of these. Every model that’s ever reported a DREAM-1K AutoDQ or CaReBench score has been ranked relative to a reference that is wrong about something in roughly two out of every three clips.

The field’s benchmarks are not a map of model capability. They’re a map of how closely each model happens to share the GT’s specific (often wrong) vocabulary.


1 – Why this gap exists, and what’s actually grading what

Here’s the taxonomy of public dense-caption + grounding benchmarks and what’s actually scoring each one:

| Benchmark | What it grades | How it scores |
|---|---|---|
| DREAM-1K (Wang et al, 2024) | Dense captions, no grounding | Text-only AutoDQ judge |
| CaReBench | Fine-grained captions, GT has no timestamps | Text-only LLM judge |
| TimeLens-Bench (Zhang et al, 2025) | Temporal grounding only (refined Charades-STA, ANet-STA, QVHighlights) | Pure span-IoU math, no LLM |
| ActivityNet-Captions | Both nominally | CIDEr / METEOR / BLEU (n-gram) |

The TimeLens paper documents that on legacy Charades-STA “open-source models deceptively surpass state-of-the-art proprietary models” due to annotation-quality issues. Same kind of problem we found on CaReBench, just on a different bench.

A model can do well on dense captioning while getting nothing useful from temporal grounding, or vice versa, and the field’s standard taxonomy doesn’t surface the gap. Real downstream applications (highlight reels, security review, content moderation, sports analytics, ad insertion) all need both jointly.

The bigger problem is the one above: on the two benches that do try to grade dense caption quality (DREAM-1K, CaReBench), the field’s default is a text-only LLM judge. That choice was historically defensible. When DREAM-1K and CaReBench were designed, nobody had a Gemini-2.5-Flash-quality multimodal judge available. But in 2026 the limitation is inertial rather than technological. The benches still use text-only judges. The text-only judges still can’t audit the GT. The leaderboards still treat the GT as a fixed reference. None of it survives a five-minute audit.

Other work has documented failure modes of LLM judges. Zheng et al, 2023 (LLM-as-a-judge), Wang et al, 2023 (positional bias), and Saito et al, 2023 (verbosity bias) cover the well-known ones. What’s different here is the scale: in both dense-caption benchmarks we audited, a majority of GT entries carry at least one flagged factual error, not just a few percent.


2 – Why we kept the model small

Iteration is the rate-limiting step in finding judge bugs. We re-ran the full 1000-clip CaReBench audit a dozen-plus times in three weeks across different judge prompts, which is only possible if a full eval takes under two hours. A 2B model trains comfortably on H100s (frozen ViT + DeepSpeed ZeRO-2) and a full re-eval costs an evening. The same 2B then fits on a single L4 GPU for inference, so the public demo can run cheaply too.

The audit judge (Gemini-3-Flash) is what surfaces the GT errors. The small 2B model is what made it affordable to re-run that audit a dozen-plus times across different prompts; without the iteration count, the errors stay invisible.

The point of keeping the model small was iteration cost. If a full eval cycle took eight hours instead of two, we’d have shipped the wrong training recipe before figuring out the eval was broken.

The same constraint applied to data. We regenerated ~400K grounded training captions (~200K dense captions plus ~200K temporal-grounding events) every time we tightened the teacher prompt. And we tightened it more than you’d think. Full story in Post 2.


3 – Why Qwen3.5-2B specifically

Qwen3.5-2B uses a hybrid attention layer stack: 24 total layers, with 18 Gated DeltaNet (linear-attention, O(n)) layers and 6 full-attention layers in a strict 3-linear-then-1-full pattern (full attention at zero-indexed layers 3, 7, 11, 15, 19, and 23). We verified this against the layer_types field in the Qwen3.5-2B-Base config.json.
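The 3-linear-then-1-full pattern is easy to sanity-check in a few lines. The sketch below rebuilds the layout from the description alone. The strings "linear_attention" and "full_attention" are placeholder labels chosen for this example, and the actual values in the config.json layer_types array may be spelled differently.

```python
def hybrid_layer_types(num_layers=24, full_every=4):
    """Return one label per layer for a 3-linear-then-1-full stack.

    Layer i (zero-indexed) uses full attention when (i + 1) is a
    multiple of full_every; every other layer uses linear attention.
    """
    return [
        "full_attention" if (i + 1) % full_every == 0 else "linear_attention"
        for i in range(num_layers)
    ]

layer_types = hybrid_layer_types()
full_idx = [i for i, t in enumerate(layer_types) if t == "full_attention"]

print(full_idx)  # [3, 7, 11, 15, 19, 23]
print(layer_types.count("linear_attention"),
      layer_types.count("full_attention"))  # 18 6
```

To run the same check against a downloaded config, load the layer_types list from the JSON and compare it to this generated list after mapping the labels.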

Practical consequences for this project:

We trained on H100s with ms-swift and vLLM-backed rollouts, which let us afford 16 rollouts per prompt during GRPO data filtering. The public demo at vlm.nemostation.com runs the trained model on a single L4 GPU.

Qwen3.5-2B hybrid attention layer stack

Figure: Qwen3.5-2B’s 24-layer stack. Full attention at layers 3, 7, 11, 15, 19, 23 (6 total); the remaining 18 layers use Gated DeltaNet (linear attention). Visual tokens are injected via DeepStack residual into the first 3 layers. Verified from the model’s config.json layer_types array.

The 6 full-attention layers keep long-range modeling intact for dense captioning. The 18 GDN layers make decode cheap enough to fit GRPO rollouts on one GPU.


4 – The TimeLens-Bench leaderboard

This is the post’s payoff section, and not coincidentally it’s the bench where we trust the numbers most. TimeLens-Bench doesn’t use an LLM judge at all; it’s pure span-IoU arithmetic on (start_sec, end_sec) predictions. No GT-quality coupling, no text-judge blind spots. Whatever a model scores here is roughly what it’s actually doing.
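For readers who have not implemented it, the span-IoU arithmetic looks roughly like the sketch below. It follows the standard definitions of temporal IoU, mIoU, and R@1@τ; the function names and toy spans are ours, and this is not the TimeLens-Bench evaluation script.

```python
def span_iou(pred, gt):
    """Temporal IoU between two (start_sec, end_sec) spans."""
    p0, p1 = pred
    g0, g1 = gt
    inter = max(0.0, min(p1, g1) - max(p0, g0))
    union = (p1 - p0) + (g1 - g0) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """mIoU: average IoU of the top-1 prediction over all test instances."""
    ious = [span_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)

def recall_at_1(preds, gts, tau):
    """R@1@tau: fraction of instances whose top-1 span reaches IoU >= tau."""
    hits = sum(1 for p, g in zip(preds, gts) if span_iou(p, g) >= tau)
    return hits / len(preds)

# Toy example: partial overlap, exact match, and a complete miss.
preds = [(2.0, 6.0), (0.0, 10.0), (0.0, 1.0)]
gts = [(4.0, 8.0), (0.0, 10.0), (5.0, 6.0)]

print(round(mean_iou(preds, gts), 3))          # 0.444
print(round(recall_at_1(preds, gts, 0.3), 3))  # 0.667
print(round(recall_at_1(preds, gts, 0.5), 3))  # 0.333
```

Nothing in this computation involves a language model, which is why the score cannot inherit a text judge's blind spots.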

We evaluated Marlin-2B on the refined TimeLens-Bench, released by Zhang et al, 2025 in December 2025, which provides manually re-annotated, much higher-quality versions of Charades-STA, ActivityNet-STA, and QVHighlights. Our sample counts match the paper’s exactly (3363 / 4500 / 1541), so the comparison is apples-to-apples on the same benchmark data.

| # | Model | Params | Charades-TimeLens (mIoU) | ActivityNet-TimeLens (mIoU) | QVHighlights-TimeLens (mIoU) | Avg |
|---|---|---|---|---|---|---|
| 1 | Gemini-2.5-Pro | proprietary | 52.8 | 58.1 | 70.4 | 62.9 |
| 2 | TimeLens-8B (RL on Qwen3-VL-8B) | 8B | 55.2 | 53.2 | 65.5 | 60.3 |
| 3 | Gemini-2.5-Flash | proprietary | 48.6 | 52.5 | 64.3 | 57.4 |
| 4 | Qwen3-VL-235B-A22B | 235B MoE | 47.8 | 52.1 | 64.6 | 56.8 |
| 5 | Gemini-2.0-Flash | proprietary | 46.7 | 49.3 | 60.8 | 54.1 |
| 6 | Qwen3-VL-8B | 8B | 48.3 | 46.8 | 59.4 | 53.4 |
| 7 | TimeLens-7B (RL on Qwen2.5-VL-7B) | 7B | 48.8 | 46.2 | 56.0 | 52.7 |
| 8 | Marlin-2B (shipped), ours | 2B | 48.82 | 46.50 | 56.32 | 51.8 |
| 9 | GPT-5 | proprietary | 40.5 | 42.9 | 56.8 | 47.9 |
| 10 | Marlin-2B SFT, ours | 2B | 45.73 | 42.10 | 52.46 | 47.9 |
| 11 | GPT-4o | proprietary | 41.8 | 40.4 | 52.1 | 45.6 |
| 12 | MiMo-VL-7B | 7B | 39.6 | 35.5 | 41.5 | 39.7 |
| 13 | Qwen2.5-VL-7B | 7B | 39.3 | 31.4 | 31.6 | 32.7 |

Numbers for non-Marlin models are from Zhang et al, 2025, Table 1. Marlin-2B numbers from our own evaluation on the same refined TimeLens-Bench data. Full R@1@τ breakdown in the appendix.
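The post does not spell out how the Avg column is computed. One reading that reproduces the listed value for the rows checked here is the unweighted mean of all twelve Appendix A metrics (R@1@0.3, R@1@0.5, R@1@0.7, and mIoU on each of the three splits), whereas the mean of the three mIoU values alone does not. The sketch below checks that reading for two rows; treat it as an assumption about the column, not the authors' stated definition.

```python
# Appendix A values per split: (R@1@0.3, R@1@0.5, R@1@0.7, mIoU).
ROWS = {
    "Gemini-2.5-Pro": {
        "charades": (74.1, 61.1, 34.0, 52.8),
        "activitynet": (72.3, 64.2, 47.1, 58.1),
        "qvhighlights": (84.1, 75.9, 61.1, 70.4),
    },
    "GPT-4o": {
        "charades": (60.6, 44.5, 23.5, 41.8),
        "activitynet": (55.2, 41.4, 25.8, 40.4),
        "qvhighlights": (69.0, 54.8, 38.5, 52.1),
    },
}

def combined_average(splits):
    """Unweighted mean over every metric on every split (assumed definition)."""
    values = [v for metrics in splits.values() for v in metrics]
    return sum(values) / len(values)

def mean_miou(splits):
    """Unweighted mean of the mIoU column only, for comparison."""
    return sum(metrics[3] for metrics in splits.values()) / len(splits)

for model, splits in ROWS.items():
    print(model, round(combined_average(splits), 1), round(mean_miou(splits), 1))
# Gemini-2.5-Pro 62.9 60.4
# GPT-4o 45.6 44.8
```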

Charades-TimeLens leaderboard, Marlin-2B is #3

Figure: TimeLens-Bench combined average across the leaderboard. Marlin-2B (orange, 2B params) lands at #8 overall and #5 among open-source models, beating GPT-5 by +3.9 mIoU despite being 3.5× smaller. It’s the smallest model in the open-source top 5.

Sorted by combined leaderboard average, Marlin-2B sits at #8 overall and #5 among open-source models. Specifically:

The training recipe that got us here (SFT → GRPO → SimPO) is the subject of Post 2.


5 – Where this leaves the field

The text-only AutoDQ protocol predates the era of high-quality multimodal judges; it was the only reasonable choice when nobody had a Gemini-2.5-Flash-quality judge available. CaReBench and DREAM-1K were both significant contributions in their time. The reason the field’s defaults are broken in 2026 is inertia.

But the inertia matters. With ~70% of CaReBench GT entries containing at least one consensus-flagged factual error, and ~66% of DREAM-1K GT entries flagged in the single-side audit, the dense-caption leaderboards are mostly measuring how closely each model’s vocabulary matches the wrong reference. A model that adds correct dense detail (color, on-screen text, camera angle) gets penalized for diverging from the noisy GT. The incentive structure points toward writing captions that resemble the reference rather than describing the video.

We did three things about it:

One technical detail worth flagging now: we found that asking the judge to enumerate visible frame-level evidence before scoring any axis consistently reduces Gemini-3-Flash hallucinations. We use the same mechanic on the training-data teacher and on the audit judge. Full prompt and ablation in Post 3.
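As a rough illustration of the evidence-first mechanic, the sketch below assembles a judge prompt that asks for visible evidence before any score. The wording, the function name, and the axis list are hypothetical placeholders written for this example; the actual prompt is the one promised for Post 3.

```python
def build_evidence_first_prompt(caption, axes):
    """Build a judge prompt that requests visible evidence before any score.

    Hypothetical wording: this is not the prompt used for the audit.
    """
    steps = [
        "You are shown frames sampled from a video, followed by a caption.",
        "Step 1 (evidence): list what is visibly present in the frames, "
        "such as objects, colors, actions, and on-screen text.",
        "Step 2 (scores): only after Step 1, score the caption from 0 to 10 "
        "on each axis below, citing items from your Step 1 list.",
        "Axes: " + ", ".join(axes),
        "Caption: " + caption,
    ]
    return "\n".join(steps)

prompt = build_evidence_first_prompt(
    "yellow humanoid creature stands in a crouched position",
    ["factual"],
)
print(prompt)
```

The ordering is the point: the judge has to commit to what it sees before it is allowed to grade, so a score has something concrete to be checked against.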


5.5 – Limitations and honest disclosure

Indie credibility on Hugging Face and r/LocalLLaMA depends on owning the caveats up front. So:


6 – What’s next

With the benchmarks corrected and our north star set, the next step (training the model) became the easier part of the journey. To make captions timestamp-aware, we trained Marlin-2B on dense captioning and temporal grounding jointly. I’ll cover how we trained Marlin-2B in the next post.

If you’re shipping a video VLM, your AutoDQ / CaReBench numbers are probably grading your model against a 70%-broken reference. We’d be happy to help you check. Open an issue on the Marlin-2B HF repo with a sample of your judge outputs and we’ll look at them with you.


Written by Aryan Jain.

Acknowledgements: the TimeLens team at ARC Lab / Tencent PCG for the refined VTG benchmark we evaluate on; the ms-swift team for the training framework; the Qwen team for the base model; and the maintainers of the open-source video datasets we built on top of (Charades, ActivityNet, QVHighlights, CaReBench, DREAM-1K).


Appendix A – Full TimeLens-Bench numbers

Per-split, per-model R@1@{0.3, 0.5, 0.7} and mIoU. Marlin-2B rows are our own evaluation on the refined TimeLens-Bench annotations (n counts match the paper exactly: 3363 / 4500 / 1541). All other rows are from Zhang et al, 2025, Table 1.

Charades-TimeLens (n = 3363)

| Model | R@1@0.3 | R@1@0.5 | R@1@0.7 | mIoU |
|---|---|---|---|---|
| TimeLens-8B | 76.6 | 63.0 | 35.2 | 55.2 |
| Gemini-2.5-Pro | 74.1 | 61.1 | 34.0 | 52.8 |
| TimeLens-7B | 70.5 | 55.6 | 28.4 | 48.8 |
| Marlin-2B (shipped) | 69.10 | 54.09 | 29.59 | 48.82 |
| Gemini-2.5-Flash | 68.7 | 56.1 | 30.6 | 48.6 |
| Qwen3-VL-8B | 69.2 | 53.4 | 27.5 | 48.3 |
| Qwen3-VL-235B-A22B | 71.7 | 50.8 | 24.5 | 47.8 |
| Gemini-2.0-Flash | 66.4 | 53.5 | 27.1 | 46.7 |
| Marlin-2B SFT | 65.24 | 50.25 | 27.00 | 45.73 |
| GPT-4o | 60.6 | 44.5 | 23.5 | 41.8 |
| GPT-5 | 59.3 | 42.0 | 22.0 | 40.5 |
| MiMo-VL-7B | 57.9 | 42.6 | 20.5 | 39.6 |
| Qwen2.5-VL-7B | 59.7 | 37.8 | 16.6 | 39.3 |

ActivityNet-TimeLens (n = 4500)

| Model | R@1@0.3 | R@1@0.5 | R@1@0.7 | mIoU |
|---|---|---|---|---|
| Gemini-2.5-Pro | 72.3 | 64.2 | 47.1 | 58.1 |
| TimeLens-8B | 68.9 | 58.4 | 40.6 | 53.2 |
| Gemini-2.5-Flash | 66.8 | 57.5 | 41.3 | 52.5 |
| Qwen3-VL-235B-A22B | 69.0 | 57.5 | 39.3 | 52.1 |
| Gemini-2.0-Flash | 62.9 | 54.0 | 37.7 | 49.3 |
| Qwen3-VL-8B | 62.1 | 51.2 | 34.4 | 46.8 |
| Marlin-2B (shipped) | 59.82 | 50.38 | 34.49 | 46.50 |
| TimeLens-7B | 62.8 | 51.0 | 32.6 | 46.2 |
| GPT-5 | 57.4 | 44.9 | 30.4 | 42.9 |
| Marlin-2B SFT | 54.76 | 45.27 | 29.84 | 42.10 |
| GPT-4o | 55.2 | 41.4 | 25.8 | 40.4 |
| MiMo-VL-7B | 49.3 | 38.7 | 22.4 | 35.5 |
| Qwen2.5-VL-7B | 44.1 | 31.0 | 16.1 | 31.4 |

QVHighlights-TimeLens (n = 1541)

| Model | R@1@0.3 | R@1@0.5 | R@1@0.7 | mIoU |
|---|---|---|---|---|
| Gemini-2.5-Pro | 84.1 | 75.9 | 61.1 | 70.4 |
| TimeLens-8B | 80.2 | 71.6 | 55.5 | 65.5 |
| Qwen3-VL-235B-A22B | 79.6 | 70.2 | 54.5 | 64.6 |
| Gemini-2.5-Flash | 78.2 | 69.4 | 55.0 | 64.3 |
| Gemini-2.0-Flash | 76.2 | 66.4 | 48.3 | 60.8 |
| Qwen3-VL-8B | 74.2 | 64.6 | 49.3 | 59.4 |
| GPT-5 | 72.4 | 60.4 | 46.4 | 56.8 |
| Marlin-2B (shipped) | 69.57 | 58.14 | 46.20 | 56.32 |
| TimeLens-7B | 74.1 | 62.7 | 43.1 | 56.0 |
| Marlin-2B SFT | 65.41 | 54.57 | 41.79 | 52.46 |
| GPT-4o | 69.0 | 54.8 | 38.5 | 52.1 |
| MiMo-VL-7B | 57.1 | 42.6 | 28.4 | 41.5 |
| Qwen2.5-VL-7B | 41.5 | 27.8 | 15.2 | 31.6 |

Appendix B – Methodology and reproducibility

Bench audit (Section “The discovery”, GT-error figure)

The 70.2% and 65.8% headline numbers come from running an independent video-aware judge (Gemini-3-Flash, frames + GT caption + axis-scoring prompt) over the GT side of each benchmark and counting how many GT entries received a non-empty errors_gt list. Specifics:

TimeLens-Bench numbers (Section 4 leaderboard + Appendix A)

Architecture (Section 3 layer diagram)

What’s public today vs coming with Post 3

Until then, if a number doesn’t seem to add up or you’d like the raw judge output for a specific clip, open an issue on the Marlin-2B HF repo and we’ll share that slice directly.


Appendix C – Glossary

| Term | Definition |
|---|---|
| mIoU | Mean Intersection-over-Union: the standard temporal-grounding metric. For each predicted (start, end) span, compute IoU vs the GT span; average across the test set. |
| R@1@τ | Recall@1 at IoU threshold τ: fraction of test instances where the top-1 predicted span achieves IoU ≥ τ. |
| AutoDQ | Automatic Description Quality: DREAM-1K’s judge protocol. An LLM extracts atomic events from both the GT and the model’s caption, then checks each event against the other side’s description. Yields precision / recall / F1. |
| MLLM / VLM | Multimodal Large Language Model / Vision-Language Model. LLMs that ingest images or video alongside text. |
| VTG | Video Temporal Grounding: given a natural-language query, locate the time span in a video where it happens. |
| GT | Ground Truth: the reference annotation, treated as correct for evaluation. (Whether it actually is correct is the subject of much of this post.) |
| GDN / linear attention | Gated DeltaNet: a recurrent-style linear-attention variant whose cost grows as O(n) in sequence length, used in Qwen3.5’s hybrid layer stack. Faster than full attention at long contexts. |
| DeepStack | The mechanism for injecting visual tokens via residual into the first 3 LLM layers, instead of concatenating them to the text token stream. |
| GRPO | Group Relative Policy Optimization: DeepSeek’s RL variant; samples N rollouts per prompt, computes per-rollout reward, optimizes relative advantages. |
| SimPO | Simple Preference Optimization: a DPO variant we used for the post-GRPO preference-learning stage. (Detail in Post 2.) |
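The "relative advantages" in the GRPO entry can be illustrated with the usual group normalization: each rollout's reward is centered on the group mean and scaled by the group's standard deviation. The sketch below is a generic version under stated assumptions. It uses the population standard deviation with a small epsilon (implementations differ on both), and the reward values are made up rather than taken from Marlin-2B training.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-rollout rewards within one prompt's group of rollouts.

    Rollouts that beat the group mean get a positive advantage and the
    rest get a negative one; eps avoids division by zero when all
    rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 4 rollouts of the same prompt.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])

# A group with identical rewards carries no learning signal.
print(group_relative_advantages([0.7, 0.7, 0.7]))  # [0.0, 0.0, 0.0]
```

Because the signal is relative within the group, more rollouts per prompt give a less noisy estimate of the group mean, which is one reason the rollout count matters.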