TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

1Nanjing University    2ARC Lab, Tencent PCG    3Shanghai AI Lab
arXiv 2025
Paper | Code | 🤗 Model & Data | TimeLens-Bench Leaderboard

Overview

We rethink the key factors for building multimodal large language models (MLLMs) with strong video temporal grounding (VTG) capabilities, along two primary dimensions:

§1 Data Quality

  • §1.1 Through rigorous diagnosis, we expose critical quality issues in existing VTG datasets.
  • §1.2 We curate TimeLens-Bench, a reliable evaluation suite, by manually refining three popular VTG benchmarks. Its necessity is validated by the dramatic model re-rankings observed relative to the legacy benchmarks.
  • §1.3 We construct TimeLens-100K, a large-scale, high-quality VTG training dataset, via an automated re-annotation pipeline.

§2 Algorithmic Design

  • §2.1 For time representation, we find that an interleaved textual prefix is the most effective strategy while remaining simple.
  • §2.2 For the training paradigm, we demonstrate that thinking-free reinforcement learning with verifiable rewards (RLVR) excels in both performance and efficiency.
  • §2.3 For RLVR training recipes, we identify two critical factors for success: early stopping when reward metrics plateau, and difficulty-based data sampling.

As illustrated below, each of our contributions yields a significant performance gain. These efforts culminate in the TimeLens models, which achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. (Details in our Leaderboard.)

Overview

Rethinking Data Quality

Diagnosing Existing Datasets

We begin by establishing strict quality criteria for VTG annotations, covering query clarity, event existence, and more. Based on these criteria, we introduce a rigorous Diagnose-then-Refine pipeline to manually audit and correct existing datasets.

Applying this pipeline to three widely used VTG benchmarks (Charades-STA, ActivityNet Captions, and QVHighlights), we expose an alarmingly high proportion of annotation errors.

While the error distribution varies by category, all datasets exhibit consistently high overall error rates.

Dataset diagnosis examples
Dataset diagnosis statistics

TimeLens-Bench: Reliable Evaluation Suite

Through meticulous refinement and correction of the three aforementioned benchmarks, we present TimeLens-Bench, a comprehensive evaluation suite featuring both domain diversity and high-quality annotations.

Benchmarking frontier models on both the original and refined benchmarks reveals drastically contrasting performance trends.

On the original benchmarks, proprietary models receive poor scores, while the performance of open-source models is deceptively inflated. Conversely, TimeLens-Bench exposes the true performance gap: proprietary models demonstrate stronger capabilities, while open-source models suffer substantial score drops.

TimeLens-Bench
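
The leaderboard reports the standard VTG metrics: Recall@1 at IoU thresholds 0.3/0.5/0.7 (R1@τ) and mean IoU (mIoU). Below is a minimal sketch of how these are typically computed, assuming a single predicted segment per query; multi-span details (e.g., in QVHighlights) are simplified away.

```python
# Minimal sketch of standard VTG metrics, assuming one predicted segment per
# query, each segment given as (start_sec, end_sec).

def temporal_iou(pred, gt):
    """IoU between two temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Returns R1@{thresholds} and mIoU (all in %) over a set of queries."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    scores = {f"R1@{t}": 100.0 * sum(iou >= t for iou in ious) / len(ious)
              for t in thresholds}
    scores["mIoU"] = 100.0 * sum(ious) / len(ious)
    return scores

# One accurate and one partially correct prediction:
print(evaluate([(2.0, 9.5), (10.0, 20.0)], [(2.1, 10.0), (15.0, 25.0)]))
```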

High-Quality Training Data

To address the even more severe noise in large-scale training corpora, we develop an automated re-annotation pipeline powered by Gemini-2.5-Pro. This process yields TimeLens-100K, a large-scale, high-quality VTG training dataset. Compared to the original noisy data, TimeLens-100K significantly improves model performance, validating its enhanced quality.

TimeLens-100K
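
For a rough idea of what such an automated re-annotation loop can look like, here is a hypothetical sketch; the prompt wording, the `call_mllm` helper, and the acceptance rule are illustrative placeholders, not the exact TimeLens-100K pipeline.

```python
# Hypothetical sketch of an automated re-annotation loop (NOT the exact
# TimeLens-100K pipeline): a strong MLLM audits each noisy (query, segment)
# pair and only confidently grounded, possibly rewritten pairs are kept.
import json

PROMPT = """Watch the video and audit this temporal annotation.
Query: "{query}"
Annotated segment: {start:.1f}s - {end:.1f}s
Reply with JSON: {{"valid": true/false, "query": "...", "start": float, "end": float}},
rewriting the query if it is ambiguous and correcting the segment if it is wrong."""

def reannotate(samples, call_mllm):
    """`call_mllm(video_path, prompt) -> str` is a placeholder for an actual
    Gemini-2.5-Pro API call returning the model's JSON reply."""
    clean = []
    for s in samples:
        reply = call_mllm(s["video"], PROMPT.format(**s))
        try:
            ans = json.loads(reply)
        except json.JSONDecodeError:
            continue                      # unparsable reply: drop the sample
        if ans.get("valid"):
            clean.append({"video": s["video"], "query": ans["query"],
                          "start": float(ans["start"]), "end": float(ans["end"])})
    return clean
```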

Exploring Algorithmic Designs

Timestamp Encoding

To enable MLLMs to perform temporal grounding, a critical design decision is timestamp encoding (i.e., aligning the timestamp of each frame with its corresponding features). We conduct a comprehensive comparison of different timestamp encoding methods:

  • Position-embedding based methods adapt position embeddings in LLMs to represent the temporal position of each frame.
  • Visual overlay methods directly overlay timestamps or frame indices onto each frame.
  • Textual encoding methods convert timestamps into text tokens using the MLLM's text tokenizer. There are two main variants: the interleaved approach inserts timestamp tokens before the visual tokens of each frame, while the non-interleaved approach adds an instruction specifying the timestamps of all frames into the prompt.
Timestamp encoding illustration

Our experiments reveal that an interleaved textual prefix with raw timestamps achieves the best performance of all approaches while remaining simple and intuitive.
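
To make the interleaved variant concrete, here is a minimal sketch of how such a prompt could be assembled; the `<image>` placeholder and the instruction wording are illustrative assumptions, not TimeLens's exact chat template.

```python
# Illustrative sketch of the interleaved textual prefix: each frame's visual
# tokens are preceded by its raw timestamp rendered as plain text.

def build_interleaved_prompt(frame_timestamps, query):
    """frame_timestamps: sampled frame times in seconds, e.g. [0.0, 2.0, 4.0]."""
    parts = [f"{t:.1f}s: <image>" for t in frame_timestamps]   # prefix, then the frame
    parts.append(f'Query: "{query}" Answer with the start and end time of the moment.')
    return "\n".join(parts)

print(build_interleaved_prompt([0.0, 2.0, 4.0], "the person opens the fridge"))
```

The non-interleaved variant would instead list all frame timestamps once in the instruction and keep the visual tokens contiguous.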

Timestamp encoding exp

Optimization Paradigms

Supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) are the two primary training paradigms for improving MLLMs' VTG capability. However, some key questions remain unanswered:

  • Under equal training budgets (rather than equal amounts of training data), is RLVR superior to SFT?
  • For VTG, which appears to be a predominantly perception-oriented rather than reasoning-oriented task, is the explicit thinking process in RLVR necessary?
  • Does a preceding SFT phase facilitate subsequent RLVR training by raising the performance ceiling of the final model?

Our results reveal that pure thinking-free RLVR offers the best combination of simplicity, performance, and training efficiency.
Adding a preceding SFT phase before RLVR yields no significant performance gain.
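
As a concrete illustration, the verifiable reward in this thinking-free setting can simply be the temporal IoU between the parsed prediction and the ground-truth segment, with zero reward for malformed outputs; the expected answer format below is an assumed convention, not necessarily the exact one used in training.

```python
# Minimal sketch of a verifiable VTG reward for thinking-free RLVR: the model
# directly outputs a segment, which is parsed and scored by temporal IoU.
import re

def parse_segment(text):
    """Extract a (start, end) pair in seconds from the model's answer."""
    nums = re.findall(r"\d+(?:\.\d+)?", text)
    if len(nums) < 2:
        return None
    start, end = float(nums[0]), float(nums[1])
    return (start, end) if end > start else None

def vtg_reward(response, gt):
    pred = parse_segment(response)
    if pred is None:
        return 0.0                         # unparsable answer gets zero reward
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(vtg_reward("The moment occurs from 12.5 to 18.0 seconds.", (13.0, 19.0)))
```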

RLVR optimization paradigm

Effective RL Recipes

Building on the finding that thinking-free RLVR is the optimal training paradigm, we use the TimeLens-100K training corpus to further explore effective recipes for RLVR training, focusing on two key questions:

(i) How long should we train?

  • We track reward metrics and evaluate model checkpoints at different training steps.
  • When the temporal IoU reward and the within-group reward standard deviation plateau, model performance peaks.
  • Continued training beyond this point causes performance degradation.
  • Best practice: perform early stopping once reward metrics plateau to save computation and prevent degradation (see the sketch below).
early_stopping
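
A rough sketch of such a plateau-based stopping rule; the window size and tolerance are illustrative choices, not the paper's exact values.

```python
# Plateau-based early stopping for RLVR (illustrative sketch): stop once the
# batch-mean IoU reward no longer improves between two consecutive windows.
from collections import deque

class RewardPlateauStopper:
    def __init__(self, window=50, tol=1e-3):
        self.window = window
        self.tol = tol
        self.history = deque(maxlen=2 * window)

    def should_stop(self, mean_iou_reward):
        """Call once per RL step with the batch-mean IoU reward."""
        self.history.append(mean_iou_reward)
        if len(self.history) < 2 * self.window:
            return False                   # not enough steps to compare windows
        old = list(self.history)[:self.window]
        new = list(self.history)[self.window:]
        return (sum(new) / len(new)) - (sum(old) / len(old)) < self.tol

# Inside the training loop (sketch):
# if stopper.should_stop(batch_reward_mean):
#     save_checkpoint(model); break
```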

(ii) How to effectively sample training data?

  • We estimate the difficulty of each training sample by running offline inference with the model to be trained and computing IoU metrics.
  • We then sample data from Gaussian distributions with varying means μ to obtain training sets of distinct difficulty levels.
  • Model performance improves with higher average sample difficulty, plateauing at difficulty > 0.75.
  • Key insight: selecting training samples of sufficiently high difficulty is crucial for RLVR performance (a sketch follows below).
data_sampling
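
A sketch of this sampling strategy, under the assumption that difficulty is defined as 1 − IoU of the base model's offline prediction; the Gaussian-weighted selection is an illustrative reading of the procedure, not the exact implementation.

```python
# Difficulty-based data sampling (illustrative sketch).
import math
import random

def estimate_difficulty(samples, offline_iou):
    """`offline_iou(sample) -> float` is a placeholder for scoring the base
    model's offline prediction against the annotation; difficulty = 1 - IoU."""
    return [1.0 - offline_iou(s) for s in samples]

def sample_by_difficulty(samples, difficulties, mu, sigma=0.1, k=10_000, seed=0):
    """Draw a training set whose difficulty concentrates around `mu` by
    weighting each sample with a Gaussian kernel centred at mu."""
    rng = random.Random(seed)
    weights = [math.exp(-((d - mu) ** 2) / (2 * sigma ** 2)) for d in difficulties]
    return rng.choices(samples, weights=weights, k=k)

# e.g. a hard training subset centred at difficulty ~0.8:
# hard_set = sample_by_difficulty(samples, difficulties, mu=0.8)
```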

TimeLens-Bench Leaderboard

Instructions for viewing the leaderboard:

  • Separated View vs. Combined View: The Separated View displays Open-Source and Proprietary models in separate tables. The Combined View shows all models together.
  • Sort by Metrics: You can click on any metric column header to sort the models by that specific metric.

To submit your results to the leaderboard, please contact junzhang00@foxmail.com.

Proprietary Models

# | Model | Charades-TimeLens (R1@0.3 / R1@0.5 / R1@0.7 / mIoU) | ActivityNet-TimeLens (R1@0.3 / R1@0.5 / R1@0.7 / mIoU) | QVHighlights-TimeLens (R1@0.3 / R1@0.5 / R1@0.7 / mIoU) | Avg.

Open-Source Models

# | Model | Charades-TimeLens (R1@0.3 / R1@0.5 / R1@0.7 / mIoU) | ActivityNet-TimeLens (R1@0.3 / R1@0.5 / R1@0.7 / mIoU) | QVHighlights-TimeLens (R1@0.3 / R1@0.5 / R1@0.7 / mIoU) | Avg.

BibTeX

@article{timelens2025,
  title={TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs},
  author={First Author and Second Author and Third Author},
  journal={arXiv preprint},
  year={2025},
  url={https://your-domain.com/your-project-page}
}