
Wrong sports, wrong camera angles, misspelled on-screen text, claims of actions that don’t happen on screen. Every model that’s ever reported a DREAM-1K AutoDQ or CaReBench score has been ranked against a reference that is wrong about something in roughly two out of every three clips.
That’s the headline. The rest of this post is the audit, the methodology, and a 2B open-source model that lands #3 on the cleaner TimeLens-Bench leaderboard while the field is still chasing the wrong target.
The two standard dense-caption benchmarks (CaReBench, DREAM-1K) ship with broken ground truth: 70.2% of CaReBench clips and 65.8% of DREAM-1K entries have at least one factual error that a video-aware judge flags. Their judges are text-only and can’t catch any of it; worse, when a model adds correct dense detail the GT omits, the judge marks it as a hallucination. On TimeLens-Bench, which uses no LLM judge, Marlin-2B (2B params, open source) lands #3 on the Charades-TimeLens split, ahead of GPT-5, GPT-4o, and Qwen2.5-VL-7B. Posts 2 and 3 cover the training journey and the audit-judge release.
About Marlin-2B
2B-parameter video-language model · Apache 2.0 · fine-tuned from Qwen3.5-2B-Base · built by NemoStation. Demo · Hugging Face · Part 1 of 3 in our build-journey series.
We started by evaluating Marlin-2B on the standard public dense-caption benches, CaReBench and DREAM-1K. Both use text-only LLM judges that read our model’s caption next to the ground-truth caption and decide which side gets each atomic fact right. The judge never sees the video. That choice produces two problems.
Here’s a real row from our DREAM-1K eval. The clip shows a yellow humanoid creature in a half-kneeling pose. Our model wrote:
“yellow humanoid creature stands in a crouched position”
The DREAM-1K GT said:
“the character half kneels on the ground”
No mention of color. The text-only judge ruled on our predicted event:
“The video description mentions a character half kneeling, which is similar to a crouched position, but does not specify the color of the character.”
The judge marked it neutral, didn’t credit the prediction, and gave the same ruling to four more “yellow…” predictions on the same clip, each time citing that the GT does not specify color. The judge can’t watch the video, so when the candidate adds correct detail the GT happens to omit, that detail is unverifiable and gets dropped. Our model was being penalized for being more accurate than the reference.
Figure: same candidate caption, same GT, same clip. Left, the text-only judge can’t verify “yellow” and defaults to neutral. Right, the video-aware judge sees the creature is yellow and credits the candidate. The text-only protocol is what every public dense-caption benchmark uses today.
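To make the penalty concrete, here is a minimal sketch of how neutral rulings drag down a precision-style score. It assumes precision is simply the share of predicted events the judge credits, which simplifies the real AutoDQ scoring. The five neutral rulings mirror the "yellow" predictions described above; the total of 8 events and the verdict labels are hypothetical.

```python
from collections import Counter


def precision(verdicts):
    """Share of predicted events the judge credits as entailed."""
    counts = Counter(verdicts)
    return counts["entailed"] / len(verdicts)


# Hypothetical clip: 8 predicted events, 5 of which mention "yellow".
# Text-only judge: the GT never mentions color, so those 5 come back neutral.
text_only = ["entailed"] * 3 + ["neutral"] * 5
# Video-aware judge: it can see the creature is yellow and credits all 8.
video_aware = ["entailed"] * 8

print(precision(text_only))    # 0.375
print(precision(video_aware))  # 1.0
```

The caption is identical in both cases; only the judge's ability to verify the extra detail changes the score.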
To measure how often this was happening, we built a video-aware audit judge: Gemini-3-Flash with the actual video frames in the prompt, scoring the GT caption directly. Across roughly 2,000 GT entries:
| Benchmark | Entries audited | GT with at least one factual error | GT with major errors (≤5/10) |
|---|---|---|---|
| CaReBench, consensus across 4 independent runs | 795 | 558 (70.2%) | 17 (2.1%) |
| CaReBench, any single run | 795 | 781 (98.2%) | 163 (20.5%) |
| DREAM-1K, single-side audit | 1000 | 658 (65.8%) | 77 (7.7%) |
Figure: GT-error rates by benchmark. The field has assumed GT is essentially correct; the audit shows otherwise. 70.2% and 65.8% are the conservative numbers; 2-8% of clips are bad enough that the GT itself scored ≤5/10.
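The gap between the consensus row and the any-single-run row comes down to how per-run flags are combined. The sketch below assumes "consensus" means every run returned a non-empty errors_gt list for the entry, and "any single run" means at least one did. The function name and toy data are illustrative, not the audit code.

```python
def audit_rates(runs):
    """Return (consensus_rate, any_run_rate) over per-run error lists.

    runs[r][i] is the errors_gt list that run r produced for GT entry i.
    """
    n_entries = len(runs[0])
    consensus = any_run = 0
    for i in range(n_entries):
        flags = [bool(run[i]) for run in runs]  # non-empty list means "flagged"
        consensus += all(flags)
        any_run += any(flags)
    return consensus / n_entries, any_run / n_entries


# Toy data: 4 runs over 4 GT entries (not real audit output).
runs = [
    [["wrong color"], [], ["typo"], []],
    [["wrong color"], [], [], []],
    [["wrong color"], ["camera angle"], ["typo"], []],
    [["wrong color"], [], ["typo"], []],
]
consensus_rate, any_run_rate = audit_rates(runs)
print(consensus_rate, any_run_rate)  # 0.25 0.75

# The table's percentages follow directly from its counts.
for flagged, total in [(558, 795), (781, 795), (658, 1000)]:
    print(f"{100 * flagged / total:.1f}%")  # 70.2%, 98.2%, 65.8%
```

Requiring agreement across all runs is the stricter rule, which is why the consensus rate is the conservative number.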
A few representative consensus-flagged errors from CaReBench:
- idx=3: states the video is filmed from her perspective, but it is a front-facing/selfie camera view
- idx=4: typo in banner text, “NEW CIDS” instead of “NEW VIDS”
- idx=6: claims she finishes applying eyeshadow to her left eye, which is never shown in the video
- idx=9: describes the background as a blue and yellow gradient when it is a solid teal/mint green
- idx=12: the athlete is wearing a sleeveless tank top, not a short-sleeve shirt
- idx=13: describes the javelin as blue, whereas it is primarily yellow/orange with a blue grip section
And from DREAM-1K:
- idx=0: GT claims they grabbed two green monsters; the video shows one green and one blue-with-purple-spots
- idx=1: GT narrates “one character raises their hand and then puts it down. The other character walks away and is followed by the former, both characters go up a staircase together.” None of those events happen in the clip; the characters just walk around the space.
DREAM-1K idx=1. No hand-raising, no staircase. The characters just walk around the space.
- idx=7: GT calls a blue figure “cat-like”; the video shows a mechanical praying-mantis-like insect
- idx=11: GT says “sword fight”; the actual combat is a body tackle with an axe-like weapon
- idx=14: GT says “large bird”; the creature is a dragon with scales, horns, and a reptilian tail
This is what the standard text-only judges have been grading against for the past two years. They can’t watch the video, so they can’t catch any of it. The leaderboards aren’t a map of model capability; they’re a map of how closely each model’s vocabulary happens to match the GT’s vocabulary, even when the GT is wrong.
Here’s how the major dense-caption and grounding benchmarks evaluate models:
| Benchmark | What it scores | How it scores |
|---|---|---|
| DREAM-1K (Wang et al, 2024) | Dense captions, no grounding | Text-only AutoDQ judge |
| CaReBench | Fine-grained captions, GT has no timestamps | Text-only LLM judge |
| TimeLens-Bench (Zhang et al, 2025) | Temporal grounding only | Pure span-IoU math, no LLM |
| ActivityNet-Captions | Both nominally | CIDEr / METEOR / BLEU (n-gram) |
On the two benches that try to grade dense caption quality, the field’s default is a text-only LLM judge. That made sense when the benchmarks were designed; Gemini-2.5-Flash didn’t exist yet. The limitation is inertial now, not technological. The benches still use text-only judges, the text-only judges still can’t audit the GT, and five minutes of running a video-aware judge over the GT side exposes the problem at scale.
LLM-judge failure modes aren’t a new observation. Zheng et al, 2023; Wang et al, 2023 on positional bias; and Saito et al, 2023 on verbosity bias have all documented them. What’s different here is scale: in both benchmarks we audited, the majority of GT entries contain at least one factual error, not just a few percent.
Iteration is the rate-limiting step in finding judge bugs. We re-ran the full 1000-clip CaReBench audit a dozen-plus times in three weeks across different judge prompts. That cadence is only affordable if a full eval finishes in under two hours. A 2B model trains comfortably on H100s with a frozen ViT and DeepSpeed ZeRO-2, and a full re-eval costs an evening. The same 2B fits on a single L4 GPU for inference, so the public demo runs cheaply.
The audit judge (Gemini-3-Flash) is what surfaces the GT errors. The small model is what made it affordable to re-run that audit many times across different prompts. Without the iteration count, the errors stay invisible.
The same constraint applied to data. We regenerated roughly 400K grounded training captions (200K dense captions plus 200K temporal-grounding events) every time we tightened the teacher prompt. We tightened it more than you’d think; that story is in Post 2.
Qwen3.5-2B uses a hybrid attention layer stack: 24 layers total, with 18 Gated DeltaNet layers (linear attention, O(n)) and 6 full-attention layers in a strict pattern of 3 linear then 1 full, repeating. Full attention sits at layers 3, 7, 11, 15, 19, and 23. Verified from the model’s config.json layer_types array.
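The "3 linear then 1 full" pattern can be checked with a few lines. This is a sketch that rebuilds the pattern from the description above rather than reading the real config; layer indices are 0-based, and the label strings "linear_attention" and "full_attention" are assumed for illustration and may differ from the actual layer_types values.

```python
NUM_LAYERS = 24
FULL_EVERY = 4  # 3 linear-attention layers, then 1 full-attention layer

# Label strings are illustrative; check config.json for the actual values.
layer_types = [
    "full_attention" if (i + 1) % FULL_EVERY == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]

full_layers = [i for i, kind in enumerate(layer_types) if kind == "full_attention"]
print(full_layers)                            # [3, 7, 11, 15, 19, 23]
print(layer_types.count("linear_attention"))  # 18
```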
Practical consequences for this project:
We trained on H100s with ms-swift and vLLM-backed rollouts, which gave us 16 rollouts per prompt during GRPO data filtering. The public demo at vlm.nemostation.com runs the trained model on a single L4 GPU.
Figure: Qwen3.5-2B’s 24-layer stack. Full attention at layers 3, 7, 11, 15, 19, 23 (six total). The other 18 use Gated DeltaNet (linear attention). Visual tokens are injected via DeepStack residual into the first 3 layers.
The 6 full-attention layers keep long-range modeling intact for dense captioning. The 18 GDN layers make decode cheap enough to fit GRPO rollouts on one GPU.
TimeLens-Bench is the bench we trust most because there’s no LLM judge in the loop. Scoring is pure span-IoU arithmetic on (start_sec, end_sec) predictions. Whatever a model scores here is roughly what it’s actually doing.
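For readers who haven't computed it before, span IoU fits in a few lines. This is a minimal sketch of the standard temporal IoU and its mean, not the TimeLens-Bench evaluation script; the function names and example spans are made up for illustration.

```python
def span_iou(pred, gt):
    """Temporal IoU between two (start_sec, end_sec) spans."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average span IoU over paired predictions and references."""
    return sum(span_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


preds = [(2.0, 6.0), (10.0, 12.0)]
gts = [(4.0, 8.0), (20.0, 25.0)]
print(span_iou(preds[0], gts[0]))  # 2s overlap over a 6s union, about 0.333
print(span_iou(preds[1], gts[1]))  # no overlap: 0.0
print(mean_iou(preds, gts))        # about 0.167
```

There is nothing for a judge to interpret here: the same predictions and annotations always give the same number.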
We evaluated Marlin-2B on the refined TimeLens-Bench (Zhang et al, 2025, Dec 2025), which is a manually re-annotated version of Charades-STA, ActivityNet-STA, and QVHighlights. Our sample counts match the paper’s exactly (3363 / 4500 / 1541), so this is apples-to-apples on the same data.
| Model | Params | Charades-TimeLens | ActivityNet-TimeLens | QVHighlights-TimeLens |
|---|---|---|---|---|
| Gemini-2.5-Pro | proprietary | 52.8 | 58.1 | 70.4 |
| TimeLens-8B (RL on Qwen3-VL-8B) | 8B | 55.2 | 53.2 | 65.5 |
| Gemini-2.5-Flash | proprietary | 48.6 | 52.5 | 64.3 |
| Qwen3-VL-8B | 8B | 48.3 | 46.8 | 59.4 |
| GPT-5 | proprietary | 40.5 | 42.9 | 56.8 |
| TimeLens-7B (RL on Qwen2.5-VL-7B) | 7B | 48.8 | 46.2 | 56.0 |
| Marlin-2B GRPO (cp-450), ours | 2B | 50.05 | 45.88 | 55.50 |
| GPT-4o | proprietary | 41.8 | 40.4 | 52.1 |
| Marlin-2B SFT (cp-11894), ours | 2B | 45.73 | 42.10 | 52.46 |
| MiMo-VL-7B | 7B | 39.6 | 35.5 | 41.5 |
| Qwen2.5-VL-7B | 7B | 39.3 | 31.4 | 31.6 |
Non-Marlin rows from Zhang et al, 2025, Table 1. Marlin-2B rows from our own evaluation on the same refined TimeLens-Bench data. Full R@1@τ breakdown in Appendix A.
Figure: Charades-TimeLens mIoU across the leaderboard. Marlin-2B GRPO (orange, 2B params) lands at #3, ahead of TimeLens-7B, Gemini-2.5-Flash, and Qwen3-VL-8B.
Sorted by Charades-TimeLens mIoU, our 2B model is #3 overall, behind only TimeLens-8B (8B) and Gemini-2.5-Pro.
How a 2B model ended up in this leaderboard position is Post 2.
None of this is a strawman of careless researchers. The text-only AutoDQ protocol predates the era of high-quality multimodal judges. CaReBench and DREAM-1K were both significant contributions in their time. The reason the field’s defaults are broken in 2026 is inertia.
The inertia matters. With roughly 70% of CaReBench GT entries containing at least one consensus-flagged factual error, and 66% of DREAM-1K GT entries similarly flagged, the dense-caption leaderboards aren’t measuring what they claim to. A model that scores well on AutoDQ is mostly vocabulary-matching a wrong reference. A model that adds correct dense detail (color, on-screen text, camera angle) gets penalized for diverging from a noisy GT. The incentive structure points toward writing captions that resemble the reference rather than seeing what’s in the video.
We did three things about it:
One thing worth flagging now: we found a prompt-side fix that consistently reduces Gemini-3-Flash hallucinations. The short version is asking the judge to enumerate visible frame-level evidence before scoring any axis. We use it on both the training-data teacher and the audit judge. Full prompt and ablation in Post 3.
- The 70.2% and 65.8% figures count every GT entry with a non-empty errors_gt list. Some flagged items are minor (misspelled watermark, small color disagreement). The more conservative version (GT scored ≤5/10; on CaReBench, in all 4 independent runs) is 2.1% on CaReBench and 7.7% on DREAM-1K. The 70% number is the loose definition; the 2% number is the tight one. Both are real.
- […] gemini-3-flash-preview versions. We pin the model version in every metrics.json and recommend others do the same.
- […] FPS_MAX_FRAMES=240, Qwen2.5-VL / Qwen3-VL / MiMo-VL use 2 FPS sampling from the TimeLens paper, GPT and Gemini use protocols described in the paper’s appendix. The benchmark data is identical for everyone. Forcing every model into one shared frame budget would penalize models trained at higher resolutions.

In Post 2, we walk through how we built the roughly 400K-clip grounded training set and the SFT → GRPO → SimPO recipe that lifted Marlin-2B from 45.73 to 50.05 mIoU on Charades-TimeLens. One interesting finding was that 4 bugs killed each training stage; we’ll walk through each of them. The teacher-side version of the frame-evidence prompting trick is also documented there.
In Post 3, we release videoeval-v2: the audit judge code, the judge prompts, the 1000-clip GT-error annotations on CaReBench and DREAM-1K, and val_set, a new 1000-clip benchmark with timestamped GT audited against video at generation time. Apache 2.0, full open release.
If you’re shipping a video VLM, your AutoDQ / CaReBench numbers are probably grading your model against a 70%-broken reference. Open an issue on the Marlin-2B HF repo with a sample of your judge outputs and we’ll look at them with you.
Written by Aryan Jain.
Acknowledgements: the TimeLens team at ARC Lab / Tencent PCG for the refined VTG benchmark we evaluate on; the ms-swift team for the training framework; the Qwen team for the base model; Hazy Research for visual style inspiration; and the maintainers of the open-source video datasets we built on top of (Charades, ActivityNet, QVHighlights, CaReBench, DREAM-1K).
Per-split, per-model R@1@{0.3, 0.5, 0.7} and mIoU. Marlin-2B rows are our own evaluation on the refined TimeLens-Bench annotations (n counts match the paper exactly: 3363 / 4500 / 1541). All other rows are from Zhang et al, 2025, Table 1.

Charades-TimeLens

| Model | R@1@0.3 | R@1@0.5 | R@1@0.7 | mIoU |
|---|---|---|---|---|
| TimeLens-8B | 76.6 | 63.0 | 35.2 | 55.2 |
| Gemini-2.5-Pro | 74.1 | 61.1 | 34.0 | 52.8 |
| Marlin-2B GRPO (cp-450) | 70.71 | 55.63 | 30.60 | 50.05 |
| TimeLens-7B | 70.5 | 55.6 | 28.4 | 48.8 |
| Gemini-2.5-Flash | 68.7 | 56.1 | 30.6 | 48.6 |
| Qwen3-VL-8B | 69.2 | 53.4 | 27.5 | 48.3 |
| Marlin-2B SFT (cp-11894) | 65.24 | 50.25 | 27.00 | 45.73 |
| GPT-4o | 60.6 | 44.5 | 23.5 | 41.8 |
| GPT-5 | 59.3 | 42.0 | 22.0 | 40.5 |
| MiMo-VL-7B | 57.9 | 42.6 | 20.5 | 39.6 |
| Qwen2.5-VL-7B | 59.7 | 37.8 | 16.6 | 39.3 |

ActivityNet-TimeLens

| Model | R@1@0.3 | R@1@0.5 | R@1@0.7 | mIoU |
|---|---|---|---|---|
| Gemini-2.5-Pro | 72.3 | 64.2 | 47.1 | 58.1 |
| TimeLens-8B | 68.9 | 58.4 | 40.6 | 53.2 |
| Gemini-2.5-Flash | 66.8 | 57.5 | 41.3 | 52.5 |
| Qwen3-VL-8B | 62.1 | 51.2 | 34.4 | 46.8 |
| TimeLens-7B | 62.8 | 51.0 | 32.6 | 46.2 |
| Marlin-2B GRPO (cp-450) | 59.33 | 49.60 | 33.38 | 45.88 |
| GPT-5 | 57.4 | 44.9 | 30.4 | 42.9 |
| Marlin-2B SFT (cp-11894) | 54.76 | 45.27 | 29.84 | 42.10 |
| GPT-4o | 55.2 | 41.4 | 25.8 | 40.4 |
| MiMo-VL-7B | 49.3 | 38.7 | 22.4 | 35.5 |
| Qwen2.5-VL-7B | 44.1 | 31.0 | 16.1 | 31.4 |

QVHighlights-TimeLens

| Model | R@1@0.3 | R@1@0.5 | R@1@0.7 | mIoU |
|---|---|---|---|---|
| Gemini-2.5-Pro | 84.1 | 75.9 | 61.1 | 70.4 |
| TimeLens-8B | 80.2 | 71.6 | 55.5 | 65.5 |
| Gemini-2.5-Flash | 78.2 | 69.4 | 55.0 | 64.3 |
| Qwen3-VL-8B | 74.2 | 64.6 | 49.3 | 59.4 |
| GPT-5 | 72.4 | 60.4 | 46.4 | 56.8 |
| TimeLens-7B | 74.1 | 62.7 | 43.1 | 56.0 |
| Marlin-2B GRPO (cp-450) | 69.05 | 58.21 | 44.06 | 55.50 |
| Marlin-2B SFT (cp-11894) | 65.41 | 54.57 | 41.79 | 52.46 |
| GPT-4o | 69.0 | 54.8 | 38.5 | 52.1 |
| MiMo-VL-7B | 57.1 | 42.6 | 28.4 | 41.5 |
| Qwen2.5-VL-7B | 41.5 | 27.8 | 15.2 | 31.6 |
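
The R@1@τ columns above are threshold counts over per-sample IoUs. A minimal sketch, assuming one top-1 predicted span per test instance and already-computed IoU values; the function name and IoU values are invented for illustration.

```python
def recall_at_1(ious, tau):
    """Fraction of instances whose top-1 span reaches IoU >= tau."""
    return sum(iou >= tau for iou in ious) / len(ious)


# Made-up per-sample IoUs for four test instances.
ious = [0.82, 0.55, 0.31, 0.10]
for tau in (0.3, 0.5, 0.7):
    print(tau, recall_at_1(ious, tau))
# 0.3 0.75
# 0.5 0.5
# 0.7 0.25
```

Raising τ can only keep or shrink the count, which is why each row in the tables decreases from left to right across the three R@1 columns.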
Bench audit. The 70.2% and 65.8% headline numbers come from running an independent video-aware judge (Gemini-3-Flash, frames + GT caption + axis-scoring prompt) over the GT side of each benchmark and counting how many entries received a non-empty errors_gt list. CaReBench used 4 independent runs (consensus across all 4 = 70.2%; any single run = 98.2%; severe ≤5/10 in all 4 = 2.1%). DREAM-1K used a single-side video-aware judge on the GT (any-error = 65.8%; severe = 7.7%).
TimeLens-Bench numbers. Non-Marlin rows from Zhang et al, 2025, Table 1 verbatim. Marlin-2B rows from our own evaluation on the same refined TimeLens-Bench annotations. Scoring is pure span-IoU arithmetic; anyone with the same weights and annotations reproduces the numbers deterministically.
Architecture. Qwen3.5-2B layer_types array directly from the public Qwen/Qwen3.5-2B-Base config.json.
What’s public today vs coming with Post 3. Public now: Marlin-2B model weights, the TimeLens paper baseline numbers, the Qwen3.5-2B architecture details. Releasing with Post 3: the audit judge code, the judge prompts (including the frame_evidence mechanic), the 1000-clip GT-error annotations on CaReBench and DREAM-1K, and the new val_set benchmark. If a number doesn’t seem to add up before then, open an issue on the Marlin-2B HF repo and we’ll share that slice directly.
| Term | Definition |
|---|---|
| mIoU | Mean Intersection-over-Union. The standard temporal-grounding metric: for each predicted (start, end) span, compute IoU vs the GT span, then average. |
| R@1@τ | Recall@1 at IoU threshold τ. Fraction of test instances where the top-1 predicted span achieves IoU ≥ τ. |
| AutoDQ | DREAM-1K’s text-only judge protocol. An LLM extracts atomic events from the GT, then checks each against the model’s caption. Outputs precision / recall / F1. |
| MLLM / VLM | Multimodal Large Language Model / Vision-Language Model. Models that take images or video alongside text. |
| VTG | Video Temporal Grounding. Given a natural-language query, find the time span in the video where it happens. |
| GT | Ground truth: the reference annotation, treated as correct for evaluation. (Whether it actually is correct is the point of this post.) |
| GDN | Gated DeltaNet. A recurrent-style linear-attention variant with O(n) memory. Used for the 18 non-full-attention layers in Qwen3.5. |
| DeepStack | A residual mechanism that injects visual tokens directly into the first 3 LLM layers, instead of concatenating them to the text sequence. |
| GRPO | Group Relative Policy Optimization, DeepSeek’s RL variant. Samples N rollouts per prompt, computes per-rollout reward, optimizes relative advantages. |
| SimPO | Simple Preference Optimization. A DPO variant we used for the post-GRPO preference-learning stage. Details in Post 2. |
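
As a companion to the GRPO entry, here is a minimal sketch of the "relative advantages" step as it is usually described: each rollout's reward is normalized against the other rollouts for the same prompt. The reward values are hypothetical, and the choice of population standard deviation and the eps term are assumptions, not details of our training setup.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 4 rollouts of one prompt.
rewards = [0.0, 0.5, 0.5, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # worst rollout negative, middle two zero, best positive

# A group where every rollout earns the same reward carries no signal.
print(group_relative_advantages([0.3, 0.3, 0.3, 0.3]))  # all zeros
```

The second case shows why prompts with identical rewards across every rollout contribute nothing to the update.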