
A month ago, we came across the Tarsier model, which was specifically trained for dense video captioning. Something felt off, though. The captions were dense, but there was no information about the time frames where those events occurred. We searched through the existing video-understanding benchmarks for models that were good at this, but the results were subpar. Most benchmarks don’t focus on the temporal axis at all, and the ones that do try to measure it use text-based judges, which introduce several biases. So we built Marlin-2B and a new audit pipeline. The rest of this post is what we found, and where the field’s evaluation actually breaks.
The standard video-captioning benchmarks (CaReBench, DREAM-1K) carry minor factual errors in their ground truth, and they sometimes miss details that the model being benchmarked actually got right. Those correct details then get marked as hallucinations. Fixing this requires a video-aware judge rather than a text-only one. We built such a judge, audited both benches, and found GT errors in 70.2% of CaReBench clips and 65.8% of DREAM-1K entries. Based on this audit, we’re releasing our own benchmark, Argus, with video-verified ground truth, in Post 3. We then trained Marlin-2B, a 2B open-source VLM, which sits #5 among open-source models on TimeLens-Bench (no LLM judge involved) and beats GPT-5 by +3.9 points on the combined average despite being 3.5× smaller.
About Marlin-2B
2B-parameter video-language model · Apache 2.0 license · fine-tuned from Qwen3.5-2B-Base · built by NemoStation. Demo · Hugging Face · Part 1 of 3 in our build-journey series.
We wanted Marlin-2B to be good at the video tasks we actually use: locating an event in a long clip and generating dense captions with timestamps. So we read the recent video-VLM literature, picked DREAM-1K, CaReBench, and the refined TimeLens-Bench, and built a training mix on top of the Tarsier paper’s recipe (Charades-Ego, COIN, LSMDC, ActivityNet, YouCook, WebVid subsets). The plan was to train against the field’s standard benchmarks and ship a useful open-source model.
But the benchmarks themselves were broken.
We started by evaluating Marlin-2B on CaReBench and DREAM-1K. The judges those benches use are text-only: they read our caption and the ground-truth caption side-by-side and decide which atomic facts each side gets right. The judge never sees the video. That choice produces two problems.
Here’s a real row from our DREAM-1K eval. The benchmark video shows a yellow humanoid creature in a half-kneeling pose. Our model wrote:
“yellow humanoid creature stands in a crouched position”
The official DREAM-1K GT description said only:
“the character half kneels on the ground”
The GT made no mention of color. The text-only judge’s ruling on our predicted event:
“The video description mentions a character half kneeling, which is similar to a crouched position, but does not specify the color of the character.”
Marked neutral. Not a hit. Then four more “yellow …” predictions for the same clip got the same ruling, “does not specify the color”, “does not mention the color”, over and over. The judge couldn’t verify “yellow” against the actual video, so when the GT didn’t mention color, the judge defaulted to “can’t tell.” Our model was being penalized for being more accurate than the GT.
Figure: same candidate caption, same GT, same clip. The text-only judge (left) doesn’t see the video and can’t verify “yellow” — it defaults to a neutral verdict that penalizes the correct detail. The video-aware judge (right) sees the creature is yellow and credits the candidate. The text-only protocol is what every public dense-caption benchmark uses today.
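To make the penalty concrete, here is a toy sketch of how neutral verdicts drag a precision-style score down. It assumes precision is simply the share of predicted events the judge credits as hits; the verdict labels and the `precision` helper are illustrative names, not the benchmark's actual code.

```python
from collections import Counter

# Hypothetical verdict labels; "neutral" mirrors the ruling quoted above.
ENTAILED, NEUTRAL, CONTRADICTED = "entailed", "neutral", "contradicted"


def precision(verdicts):
    """Share of predicted events credited as hits (only 'entailed' counts)."""
    if not verdicts:
        return 0.0
    return Counter(verdicts)[ENTAILED] / len(verdicts)


# Five "yellow ..." events from one clip. The text-only judge cannot
# confirm the color, so every event comes back neutral.
text_only = [NEUTRAL] * 5

# Assumed outcome for illustration: a judge that sees the frames
# confirms the color and credits each event.
video_aware = [ENTAILED] * 5

print(precision(text_only))    # 0.0
print(precision(video_aware))  # 1.0
```

The captions are identical in both cases; only the judge's ability to verify them changes the score.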
So we built a video-aware judge, Gemini-3-Flash with the actual video frames in the prompt, to audit the benchmarks. We pointed it at the GT side of CaReBench and DREAM-1K and asked: how often is the ground truth itself wrong about what’s in the video?
The answer:
| Benchmark | Total GT entries audited | GT with at least one factual error flagged | GT with major errors (factual ≤ 5/10) |
|---|---|---|---|
| CaReBench (consensus across 4 independent video-aware judge runs) | 795 clips (clean, both sides judged successfully) | 558 (70.2%) | 17 (2.1%) |
| CaReBench (any single video-aware judge run) | 795 | 781 (98.2%) | 163 (20.5%) |
| DREAM-1K (single-side video-aware judge, GT-only audit) | 1000 | 658 (65.8%) | 77 (7.7%) |
Figure: GT-error rates by benchmark. What the field has implicitly been assuming is that the GT is essentially correct; what we measured with an independent video-aware audit judge is shown above. The 70.2% and 65.8% are conservative (consensus / single-side), and ~2–8% of clips have severe enough errors that the GT itself scored ≤5/10 in our audit.
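The difference between the "consensus" and "any single run" rows comes down to how per-run verdicts are aggregated. A minimal sketch, assuming each judge run returns an `errors_gt` list and a `factual` score out of 10 per clip (the record layout and function names here are illustrative, not the released code):

```python
def audit_counts(clips, major_threshold=5):
    """clips: one list of judge-run dicts per clip. Each run dict carries
    'errors_gt' (list of flagged errors) and 'factual' (score out of 10)."""

    def flagged(judge_run):
        return len(judge_run["errors_gt"]) > 0

    def major(judge_run):
        return judge_run["factual"] <= major_threshold

    return {
        "total": len(clips),
        # Consensus: every run must agree before the clip is counted.
        "consensus_flagged": sum(all(flagged(r) for r in runs) for runs in clips),
        "consensus_major": sum(all(major(r) for r in runs) for runs in clips),
        # Any-run: a single run is enough to count the clip.
        "any_run_flagged": sum(any(flagged(r) for r in runs) for runs in clips),
        "any_run_major": sum(any(major(r) for r in runs) for runs in clips),
    }


def make_run(errors, factual):
    return {"errors_gt": errors, "factual": factual}


# Toy data: three clips, four judge runs each (made up for illustration).
toy_clips = [
    [make_run(["wrong color"], 4)] * 4,                    # all runs agree
    [make_run(["typo"], 8)] + [make_run([], 9)] * 3,       # one run flags
    [make_run([], 10)] * 4,                                # clean
]
counts = audit_counts(toy_clips)
print(counts)

# Sanity check on the reported percentages from the table above.
for flagged_count, total in [(558, 795), (781, 795), (658, 1000)]:
    print(f"{flagged_count}/{total} = {100 * flagged_count / total:.1f}%")
```

Requiring agreement across runs is the stricter rule, which is why the consensus row is the lower of the two CaReBench numbers.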
A few consensus-flagged CaReBench errors (all four independent judge runs agreed the GT was wrong):
- clip 3: GT says the video is filmed from her perspective, but it’s a front-facing/selfie camera view.
- clip 4: Banner text typo. GT writes “NEW CIDS” but the on-screen banner reads “NEW VIDS”.
- clip 6: GT claims she finishes applying eyeshadow to her left eye, which is never shown.
- clip 9: Background described as a blue and yellow gradient when it’s a solid teal/mint green.
- clip 12: GT says short-sleeve shirt; the athlete is wearing a sleeveless tank top.
- clip 13: GT says the javelin is blue; it’s primarily yellow/orange with a blue grip section.
And from DREAM-1K, audited single-side:
- clip 0: GT claims they grabbed two green monsters; the video shows one green and one blue-with-purple-spots.
- clip 1: GT narrates a structured sequence: hand raise, hand down, walking away, climbing a staircase together. None of those events happen in the clip; the characters just walk around the space. (shown below)
- clip 7: GT calls a blue figure “cat-like”; the video shows a mechanical praying-mantis-like insect.
- clip 11: GT says “sword fight”; the actual combat is a body tackle with an axe-like weapon.
- clip 14: GT says “large bird”; the creature is a dragon with scales, horns, and a reptilian tail.
DREAM-1K clip 1. No hand-raising, no staircase — the characters just walk around the space.
This is what the standard text-only judges have been grading against for the past two years. None of them have any way to see the video, so none of them can catch any of these. Every model that’s ever reported a DREAM-1K AutoDQ or CaReBench score has been ranked relative to a reference that is wrong about something in roughly two out of every three clips.
The field’s benchmarks are not a map of model capability. They’re a map of how closely each model happens to share the GT’s specific (often wrong) vocabulary.
Here’s the taxonomy of public dense-caption + grounding benchmarks and what’s actually scoring each one:
| Benchmark | What it grades | How it scores |
|---|---|---|
| DREAM-1K (Wang et al, 2024) | Dense captions, no grounding | Text-only AutoDQ judge |
| CaReBench | Fine-grained captions, GT has no timestamps | Text-only LLM judge |
| TimeLens-Bench (Zhang et al, 2025) | Temporal grounding only (refined Charades-STA, ANet-STA, QVHighlights) | Pure span-IoU math, no LLM |
| ActivityNet-Captions | Both nominally | CIDEr / METEOR / BLEU (n-gram) |
The TimeLens paper documents that on legacy Charades-STA “open-source models deceptively surpass state-of-the-art proprietary models” due to annotation-quality issues. Same kind of problem we found on CaReBench, just on a different bench.
A model can do well on dense captioning while getting nothing useful from temporal grounding, or vice versa, and the field’s standard taxonomy doesn’t surface the gap. Real downstream applications (highlight reels, security review, content moderation, sports analytics, ad insertion) all need both jointly.
The bigger problem is the one above: on the two benches that do try to grade dense caption quality (DREAM-1K, CaReBench), the field’s default is a text-only LLM judge. That choice was historically defensible. When DREAM-1K and CaReBench were designed, nobody had a Gemini-2.5-Flash-quality multimodal judge available. But in 2026 the limitation is inertial rather than technological. The benches still use text-only judges. The text-only judges still can’t audit the GT. The leaderboards still treat the GT as a fixed reference. None of it survives a five-minute audit.
Other work has documented failure modes of LLM judges. Zheng et al, 2023 (LLM-as-a-judge), Wang et al, 2023 (positional bias), and Saito et al, 2023 (verbosity bias) cover the well-known ones. What’s different here is the scale: the majority of every dense-caption benchmark’s GT has factual errors, not just a few percent.
Iteration is the rate-limiting step in finding judge bugs. We re-ran the full 1000-clip CaReBench audit a dozen-plus times in three weeks across different judge prompts, which is only possible if a full eval takes under two hours. A 2B model trains comfortably on H100s (frozen ViT + DeepSpeed ZeRO-2) and a full re-eval costs an evening. The same 2B then fits on a single L4 GPU for inference, so the public demo can run cheaply too.
The audit judge (Gemini-3-Flash) is what surfaces the GT errors. The small 2B model is what made it affordable to re-run that audit a dozen-plus times across different prompts; without the iteration count, the errors stay invisible.
The point of keeping the model small was iteration cost. If a full eval cycle took eight hours instead of two, we’d have shipped the wrong training recipe before figuring out the eval was broken.
The same constraint applied to data. We regenerated ~400K grounded training captions (~200K dense captions plus ~200K temporal-grounding events) every time we tightened the teacher prompt. And we tightened it more than you’d think. Full story in Post 2.
Qwen3.5-2B uses a hybrid attention layer stack: 24 total layers, with 18 Gated DeltaNet (linear-attention, O(n)) layers and 6 full-attention layers in a strict 3-linear-then-1-full pattern (full attention at layers 3, 7, 11, 15, 19, and 23). Verified against the Qwen3.5-2B-Base config.json layer_types field.
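The 3-linear-then-1-full pattern is easy to reproduce and check. The sketch below rebuilds it from the numbers in this paragraph; the string labels `"linear_attention"` and `"full_attention"` are assumptions about how the config spells the two layer kinds, so compare against the actual `layer_types` field before relying on them.

```python
NUM_LAYERS = 24
FULL_ATTENTION_INTERVAL = 4  # every 4th layer is full attention (3 linear, then 1 full)

# Assumed label strings; the real config.json may spell these differently.
LINEAR, FULL = "linear_attention", "full_attention"

layer_types = [
    FULL if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else LINEAR
    for i in range(NUM_LAYERS)
]

full_layers = [i for i, kind in enumerate(layer_types) if kind == FULL]
print(full_layers)                # [3, 7, 11, 15, 19, 23]
print(layer_types.count(LINEAR))  # 18
print(layer_types.count(FULL))    # 6
```

Layer indices are zero-based here, which is what makes the full-attention layers land on 3, 7, 11, 15, 19, and 23.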
Practical consequences for this project:
We trained on H100s with ms-swift + vLLM-backed rollouts (afforded 16 rollouts per prompt during GRPO data filtering). The public demo at vlm.nemostation.com runs the trained model on a single L4 GPU.
Figure: Qwen3.5-2B’s 24-layer stack. Full attention at layers 3, 7, 11, 15, 19, 23 (6 total); the remaining 18 layers use Gated DeltaNet (linear attention). Visual tokens are injected via DeepStack residual into the first 3 layers. Verified from the model’s config.json layer_types array.
The 6 full-attention layers keep long-range modeling intact for dense captioning. The 18 GDN layers make decode cheap enough to fit GRPO rollouts on one GPU.
This is the post’s payoff section, and not coincidentally it’s the bench where we trust the numbers most. TimeLens-Bench doesn’t use an LLM judge at all; it’s pure span-IoU arithmetic on (start_sec, end_sec) predictions. No GT-quality coupling, no text-judge blind spots. Whatever a model scores here is roughly what it’s actually doing.
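For readers who haven't worked with span-IoU scoring, here is a minimal sketch of the arithmetic, using the standard definitions of temporal IoU, mIoU, and R@1@τ. The function names and the toy spans are ours; this is not the TimeLens-Bench evaluation script.

```python
def span_iou(pred, gt):
    """Temporal IoU between two (start_sec, end_sec) spans."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average IoU over all (prediction, ground truth) pairs."""
    ious = [span_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)


def recall_at_1(preds, gts, tau):
    """Fraction of instances whose top-1 span reaches IoU >= tau."""
    hits = sum(span_iou(p, g) >= tau for p, g in zip(preds, gts))
    return hits / len(preds)


# Toy spans in seconds (made up for illustration).
preds = [(2.0, 8.0), (0.0, 5.0), (0.0, 2.0)]
gts = [(4.0, 10.0), (0.0, 5.0), (5.0, 9.0)]

print([span_iou(p, g) for p, g in zip(preds, gts)])  # [0.5, 1.0, 0.0]
print(mean_iou(preds, gts))                          # 0.5
print(recall_at_1(preds, gts, tau=0.7))
```

Nothing in this computation reads a caption, which is why a wrong or incomplete text reference can't leak into the score.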
We evaluated Marlin-2B on the refined TimeLens-Bench (released by Zhang et al, 2025 in December 2025, manually re-annotated, much higher-quality versions of Charades-STA, ActivityNet-STA, and QVHighlights). Our sample counts match the paper’s exactly (3363 / 4500 / 1541), so the comparison is apples-to-apples on the same benchmark data.
| # | Model | Params | Charades-TimeLens | ActivityNet-TimeLens | QVHighlights-TimeLens | Avg |
|---|---|---|---|---|---|---|
| 1 | Gemini-2.5-Pro | proprietary | 52.8 | 58.1 | 70.4 | 62.9 |
| 2 | TimeLens-8B (RL on Qwen3-VL-8B) | 8B | 55.2 | 53.2 | 65.5 | 60.3 |
| 3 | Gemini-2.5-Flash | proprietary | 48.6 | 52.5 | 64.3 | 57.4 |
| 4 | Qwen3-VL-235B-A22B | 235B MoE | 47.8 | 52.1 | 64.6 | 56.8 |
| 5 | Gemini-2.0-Flash | proprietary | 46.7 | 49.3 | 60.8 | 54.1 |
| 6 | Qwen3-VL-8B | 8B | 48.3 | 46.8 | 59.4 | 53.4 |
| 7 | TimeLens-7B (RL on Qwen2.5-VL-7B) | 7B | 48.8 | 46.2 | 56.0 | 52.7 |
| 8 | Marlin-2B (shipped) — ours | 2B | 48.82 | 46.50 | 56.32 | 51.8 |
| 9 | GPT-5 | proprietary | 40.5 | 42.9 | 56.8 | 47.9 |
| 10 | Marlin-2B SFT — ours | 2B | 45.73 | 42.10 | 52.46 | 47.9 |
| 11 | GPT-4o | proprietary | 41.8 | 40.4 | 52.1 | 45.6 |
| 12 | MiMo-VL-7B | 7B | 39.6 | 35.5 | 41.5 | 39.7 |
| 13 | Qwen2.5-VL-7B | 7B | 39.3 | 31.4 | 31.6 | 32.7 |
Numbers for non-Marlin models are from Zhang et al, 2025, Table 1. Marlin-2B numbers from our own evaluation on the same refined TimeLens-Bench data. Full R@1@τ breakdown in the appendix.
Figure: TimeLens-Bench combined average across the leaderboard. Marlin-2B (orange, 2B params) lands at #8 overall and #5 among open-source models, beating GPT-5 by +3.9 points despite being 3.5× smaller. It’s the smallest model in the open-source top 5.
Sorted by combined leaderboard average, Marlin-2B sits at #8 overall and #5 among open-source models. Specifically:
The training recipe that got us here (SFT → GRPO → SimPO) is the subject of Post 2.
The text-only AutoDQ protocol predates the era of high-quality multimodal judges; it was the only reasonable choice when nobody had a Gemini-2.5-Flash-quality judge available. CaReBench and DREAM-1K were both significant contributions in their time. The reason the field’s defaults are broken in 2026 is inertia.
But the inertia matters. With ~70% of CaReBench GT entries containing at least one consensus-flagged factual error, and ~66% of DREAM-1K GT entries similarly flagged, the dense-caption leaderboards are mostly measuring how closely each model’s vocabulary matches the wrong reference. A model that adds correct dense detail (color, on-screen text, camera angle) gets penalized for diverging from the noisy GT. The incentive structure points toward writing captions that resemble the reference rather than describing the video.
We did three things about it:
One technical detail worth flagging now: we found that asking the judge to enumerate visible frame-level evidence before scoring any axis consistently reduces Gemini-3-Flash hallucinations. We use the same mechanic on the training-data teacher and on the audit judge. Full prompt and ablation in Post 3.
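As a rough illustration of the evidence-first idea, the sketch below builds a prompt that asks for frame evidence before any score and then checks that a reply respects that order. The prompt wording, the `scores` key, and both helper functions are hypothetical; only the `frame_evidence` and `errors_gt` field names come from our pipeline, and the real prompt ships with Post 3.

```python
import json

# Hypothetical prompt wording, written for this sketch only.
PROMPT_TEMPLATE = (
    "You are given video frames and a caption.\n"
    "Step 1: list what is visibly present in the frames as `frame_evidence`.\n"
    "Step 2: only then score each axis from 0 to 10 in `scores`.\n"
    "Step 3: list caption claims the frames contradict in `errors_gt`.\n"
    "Return JSON with the keys in that order.\n\n"
    "Caption: {caption}"
)


def build_prompt(caption):
    return PROMPT_TEMPLATE.format(caption=caption)


def is_evidence_first(raw_reply):
    """True if the reply lists non-empty frame evidence before its scores."""
    reply = json.loads(raw_reply)
    keys = list(reply)  # json.loads keeps keys in document order
    return (
        bool(reply.get("frame_evidence"))
        and "scores" in reply
        and keys.index("frame_evidence") < keys.index("scores")
    )


good_reply = json.dumps({
    "frame_evidence": ["yellow humanoid creature, half kneeling"],
    "scores": {"factual": 6},
    "errors_gt": [],
})
bad_reply = json.dumps({"scores": {"factual": 9}, "frame_evidence": []})

print(is_evidence_first(good_reply))  # True
print(is_evidence_first(bad_reply))   # False
```

A check like this is cheap to run on every judge output, so replies that skip the evidence step can be retried rather than scored.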
Indie credibility on Hugging Face and r/LocalLLaMA depends on owning the caveats up front. So:
- The 70% number counts any GT entry with a non-empty errors_gt list. Some of those errors are minor (a misspelled watermark, a small color disagreement). The more conservative number, clips where the GT scored ≤5/10 in all four independent runs, is 2.1% on CaReBench. The 70% number is the loose definition; the 2% number is the tight one. Both are real.
- […] gemini-3-flash-preview versions. We pin the model version in every metrics.json and recommend others do the same.
- […] FPS_MAX_FRAMES=240; Qwen2.5-VL / Qwen3-VL / MiMo-VL use the 2 FPS sampling from the TimeLens paper; GPT and Gemini use the protocols described in the paper’s appendix. The benchmark data is identical for everyone. This is standard practice; forcing every model into one shared frame budget would systematically penalize models trained at higher resolutions.
In Post 2, we walk through how we built the ~400K-clip grounded training set and the SFT → GRPO → SimPO recipe that lifted Marlin-2B from 47.9 to 51.8 on the TimeLens-Bench combined average. Four bugs nearly killed each stage; we go through each of them.
In Post 3, we release videoeval-v2: the audit judge code, the judge prompts, and Argus, a new benchmark with timestamped GT, audited against video at generation time. Apache 2.0, full open release.
If you’re shipping a video VLM, your AutoDQ / CaReBench numbers are probably grading your model against a 70%-broken reference. We’d be happy to help you check. Open an issue on the Marlin-2B HF repo with a sample of your judge outputs and we’ll look at them with you.
Written by Aryan Jain.
Acknowledgements: the TimeLens team at ARC Lab / Tencent PCG for the refined VTG benchmark we evaluate on; the ms-swift team for the training framework; the Qwen team for the base model; and the maintainers of the open-source video datasets we built on top of (Charades, ActivityNet, QVHighlights, CaReBench, DREAM-1K).
Per-split, per-model R@1@{0.3, 0.5, 0.7} and mIoU. Marlin-2B rows are our own evaluation on the refined TimeLens-Bench annotations (n counts match the paper exactly: 3363 / 4500 / 1541). All other rows are from Zhang et al, 2025, Table 1.
Charades-TimeLens
| Model | R@1@0.3 | R@1@0.5 | R@1@0.7 | mIoU |
|---|---|---|---|---|
| TimeLens-8B | 76.6 | 63.0 | 35.2 | 55.2 |
| Gemini-2.5-Pro | 74.1 | 61.1 | 34.0 | 52.8 |
| TimeLens-7B | 70.5 | 55.6 | 28.4 | 48.8 |
| Marlin-2B (shipped) | 69.10 | 54.09 | 29.59 | 48.82 |
| Gemini-2.5-Flash | 68.7 | 56.1 | 30.6 | 48.6 |
| Qwen3-VL-8B | 69.2 | 53.4 | 27.5 | 48.3 |
| Qwen3-VL-235B-A22B | 71.7 | 50.8 | 24.5 | 47.8 |
| Gemini-2.0-Flash | 66.4 | 53.5 | 27.1 | 46.7 |
| Marlin-2B SFT | 65.24 | 50.25 | 27.00 | 45.73 |
| GPT-4o | 60.6 | 44.5 | 23.5 | 41.8 |
| GPT-5 | 59.3 | 42.0 | 22.0 | 40.5 |
| MiMo-VL-7B | 57.9 | 42.6 | 20.5 | 39.6 |
| Qwen2.5-VL-7B | 59.7 | 37.8 | 16.6 | 39.3 |

ActivityNet-TimeLens
| Model | R@1@0.3 | R@1@0.5 | R@1@0.7 | mIoU |
|---|---|---|---|---|
| Gemini-2.5-Pro | 72.3 | 64.2 | 47.1 | 58.1 |
| TimeLens-8B | 68.9 | 58.4 | 40.6 | 53.2 |
| Gemini-2.5-Flash | 66.8 | 57.5 | 41.3 | 52.5 |
| Qwen3-VL-235B-A22B | 69.0 | 57.5 | 39.3 | 52.1 |
| Gemini-2.0-Flash | 62.9 | 54.0 | 37.7 | 49.3 |
| Qwen3-VL-8B | 62.1 | 51.2 | 34.4 | 46.8 |
| Marlin-2B (shipped) | 59.82 | 50.38 | 34.49 | 46.50 |
| TimeLens-7B | 62.8 | 51.0 | 32.6 | 46.2 |
| GPT-5 | 57.4 | 44.9 | 30.4 | 42.9 |
| Marlin-2B SFT | 54.76 | 45.27 | 29.84 | 42.10 |
| GPT-4o | 55.2 | 41.4 | 25.8 | 40.4 |
| MiMo-VL-7B | 49.3 | 38.7 | 22.4 | 35.5 |
| Qwen2.5-VL-7B | 44.1 | 31.0 | 16.1 | 31.4 |

QVHighlights-TimeLens
| Model | R@1@0.3 | R@1@0.5 | R@1@0.7 | mIoU |
|---|---|---|---|---|
| Gemini-2.5-Pro | 84.1 | 75.9 | 61.1 | 70.4 |
| TimeLens-8B | 80.2 | 71.6 | 55.5 | 65.5 |
| Qwen3-VL-235B-A22B | 79.6 | 70.2 | 54.5 | 64.6 |
| Gemini-2.5-Flash | 78.2 | 69.4 | 55.0 | 64.3 |
| Gemini-2.0-Flash | 76.2 | 66.4 | 48.3 | 60.8 |
| Qwen3-VL-8B | 74.2 | 64.6 | 49.3 | 59.4 |
| GPT-5 | 72.4 | 60.4 | 46.4 | 56.8 |
| Marlin-2B (shipped) | 69.57 | 58.14 | 46.20 | 56.32 |
| TimeLens-7B | 74.1 | 62.7 | 43.1 | 56.0 |
| Marlin-2B SFT | 65.41 | 54.57 | 41.79 | 52.46 |
| GPT-4o | 69.0 | 54.8 | 38.5 | 52.1 |
| MiMo-VL-7B | 57.1 | 42.6 | 28.4 | 41.5 |
| Qwen2.5-VL-7B | 41.5 | 27.8 | 15.2 | 31.6 |
Bench audit (Section “The discovery”, GT-error figure)
The 70.2% and 65.8% headline numbers come from running an independent video-aware judge (Gemini-3-Flash, frames + GT caption + axis-scoring prompt) over the GT side of each benchmark and counting how many GT entries received a non-empty errors_gt list. Specifics:
TimeLens-Bench numbers (Section 4 leaderboard + Appendix A)
Architecture (Section 3 layer diagram)
- layer_types array: directly from the public Qwen/Qwen3.5-2B-Base config.json.
What’s public today vs coming with Post 3
[…] frame_evidence mechanic), the 1000-clip GT-error annotations on CaReBench and DREAM-1K, and the new Argus benchmark. Once those are out, every number in this post will be reproducible end-to-end.
Until then, if a number doesn’t seem to add up or you’d like the raw judge output for a specific clip, open an issue on the Marlin-2B HF repo and we’ll share that slice directly.
| Term | Definition |
|---|---|
| mIoU | Mean Intersection-over-Union — the standard temporal-grounding metric. For each predicted (start, end) span, compute IoU vs the GT span; average across the test set. |
| R@1@τ | Recall@1 at IoU threshold τ — fraction of test instances where the top-1 predicted span achieves IoU ≥ τ. |
| AutoDQ | Automatic Dense-caption Quality — DREAM-1K’s judge protocol. LLM extracts atomic events from GT, then checks each against the model’s caption. Yields precision / recall / F1. |
| MLLM / VLM | Multimodal Large Language Model / Vision-Language Model. LLMs that ingest images or video alongside text. |
| VTG | Video Temporal Grounding — given a natural-language query, locate the time span in a video where it happens. |
| GT | Ground Truth — the reference annotation, treated as correct for evaluation. (Whether it actually is correct is the subject of much of this post.) |
| GDN / linear attention | Gated DeltaNet, a recurrent-style linear-attention layer whose compute grows as O(n) in sequence length and whose recurrent state has a fixed size; used in Qwen3.5’s hybrid layer stack. Faster than full attention at long contexts. |
| DeepStack | The mechanism for injecting visual tokens via residual into the first 3 LLM layers, instead of concatenating them to the text token stream. |
| GRPO | Group Relative Policy Optimization — DeepSeek’s RL variant; samples N rollouts per prompt, computes per-rollout reward, optimizes relative advantages. |
| SimPO | Simple Preference Optimization — a DPO variant we used for the post-GRPO preference-learning stage. (Detail in Post 2.) |
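The "relative advantages" in the GRPO entry can be sketched in a few lines. This follows the commonly described formulation, where each rollout's reward is normalized by the mean and standard deviation of its own group; the choice of population standard deviation and the `eps` value are assumptions here, and this is not our training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group of rollouts."""
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # assumption: population std within the group
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Toy rewards for four rollouts of one prompt (made up for illustration).
rewards = [0.1, 0.5, 0.9, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)

# A group where every rollout earns the same reward carries no signal.
print(group_relative_advantages([0.7, 0.7, 0.7, 0.7]))
```

Rollouts that beat their group's mean get a positive advantage and the rest a negative one, so no separate value model is needed to decide which samples to reinforce.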