Saliency-Guided DETR for Moment Retrieval and Highlight Detection | ✓ Link | 71.10 | 52.80 | | | | | SG-DETR (w/ PT) | 2024-10-02 |
LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval | ✓ Link | 70.65 | 49.58 | | | | | LLaVA-MR | 2024-11-21 |
FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding | ✓ Link | 70.32 | 49.87 | | | | | FlashVTG | 2024-12-18 |
Saliency-Guided DETR for Moment Retrieval and Highlight Detection | ✓ Link | 70.20 | 49.50 | | | | | SG-DETR | 2024-10-02 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 70.03 | 48.95 | | | | | InternVideo2-6B | 2024-03-22 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 68.36 | 45.03 | | | | | InternVideo2-1B | 2024-03-22 |
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | ✓ Link | 67.1 | 43.0 | | | | | VideoChat-T (FT) | 2024-10-25 |
UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection | ✓ Link | 63.98 | 44.46 | 91.94 | 67.72 | | | UniMD+Sync. | 2024-04-07 |
LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection | ✓ Link | 62.58 | 41.56 | | | 73.92 | 53.44 | LD-DETR | 2025-01-18 |
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval | ✓ Link | 61.96 | 41.05 | | | 73.33 | 52.94 | VideoLights-B-pt | 2024-12-02 |
UnLoc: A Unified Framework for Video Localization Tasks | ✓ Link | 60.8 | 38.4 | 88.2 | 61.1 | | | UnLoc-L | 2023-08-21 |
BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos | ✓ Link | 59.95 | 39.38 | | | | | BAM-DETR | 2023-11-30 |
Background-aware Moment Detection for Video Moment Retrieval | ✓ Link | 59.48 | 38.33 | | | | | BM-DETR | 2023-06-05 |
Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection | ✓ Link | 59.25 | 36.64 | | | | | UVCOM | 2023-11-28 |
Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding | ✓ Link | 58.44 | 36.34 | | | | | CG-DETR | 2023-11-15 |
Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval | ✓ Link | 58.31 | 36.49 | | | | | LLMEPET | 2024-07-21 |
UnLoc: A Unified Framework for Video Localization Tasks | ✓ Link | 58.1 | 35.4 | 87.4 | 59.1 | | | UnLoc-B | 2023-08-21 |
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection | ✓ Link | 57.31 | 32.55 | | | | | QD-DETR (Only Video) | 2023-03-24 |
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding | ✓ Link | 57.18 | 36.05 | | | | | video-mamba-suite | 2024-03-14 |
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries | ✓ Link | 55.65 | 34.17 | | | | | Moment-DETR w/ PT (on 10K HowTo100M videos) | 2021-07-20 |
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries | ✓ Link | 53.63 | 31.37 | | | | | Moment-DETR | 2021-07-20 |
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection | ✓ Link | 49.35 | 26.16 | 89.41 | 54.95 | | | UMT (VO) | 2022-03-23 |
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | ✓ Link | 48.7 | 24.0 | | | | 45.43 | VideoChat-T (ZS) | 2024-10-25 |
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection | ✓ Link | 48.31 | 29.25 | 88.79 | 56.08 | | | UMT (VA) | 2022-03-23 |
SimVTP: Simple Video Text Pre-training with Masked Autoencoders | | 44.7 | 26.3 | 83.7 | 55.1 | | | SimVTP | 2022-12-07 |