Saliency-Guided DETR for Moment Retrieval and Highlight Detection | ✓ Link | 58.80 | 74.20 | 60.40 | 76.20 | 60.80 | SG-DETR (w/ PT) | 2024-10-02 |
Saliency-Guided DETR for Moment Retrieval and Highlight Detection | ✓ Link | 54.10 | 72.20 | 56.60 | 73.20 | 55.80 | SG-DETR | 2024-10-02 |
LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval | ✓ Link | 52.73 | 76.59 | 61.48 | 69.41 | 54.40 | LLaVA-MR | 2024-11-21 |
FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding | ✓ Link | 52.00 | 70.69 | 53.96 | 72.33 | 53.85 | FlashVTG | 2024-12-18 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 49.24 | 71.42 | 56.45 | | | InternVideo2-6B | 2024-03-22 |
Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding | ✓ Link | 47.97 | 68.48 | 53.11 | 69.40 | 49.12 | CG-DETR (w/ PT) | 2023-11-15 |
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval | ✓ Link | 47.94 | 70.36 | 55.25 | 69.53 | 49.17 | VideoLights-B-pt | 2024-12-02 |
Length-Aware DETR for Robust Moment Retrieval | ✓ Link | 47.93 | 63.94 | 51.10 | 65.65 | 49.44 | LA-DETR | 2024-12-30 |
BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos | ✓ Link | 46.91 | 64.07 | 48.12 | 65.61 | 47.51 | BAM-DETR (w/ audio) | 2023-11-30 |
BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos | ✓ Link | 46.67 | 63.88 | 47.92 | 66.33 | 48.22 | BAM-DETR (w/ PT ASR Captions) | 2023-11-30 |
LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection | ✓ Link | 46.41 | 66.80 | 51.04 | 67.61 | 46.99 | LD-DETR | 2025-01-18 |
$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding | ✓ Link | 46.17 | 68.03 | 49.35 | 69.04 | 47.56 | R^2-Tuning | 2024-03-31 |
BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos | ✓ Link | 45.36 | 62.71 | 48.64 | 64.57 | 46.33 | BAM-DETR | 2023-11-30 |
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding | ✓ Link | 45.18 | 66.65 | 52.19 | 64.37 | 46.68 | video-mamba-suite | 2024-03-14 |
Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval | ✓ Link | 44.05 | 66.73 | 49.94 | 65.76 | 43.91 | LLMEPET | 2024-07-21 |
Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection | ✓ Link | 43.8 | 64.53 | 48.31 | 64.78 | 43.65 | UVCOM (w/ PT ASR Captions) | 2023-11-28 |
UniVTG: Towards Unified Video-Language Temporal Grounding | ✓ Link | 43.63 | 65.43 | 50.06 | 64.06 | 45.02 | UniVTG (w/ PT) | 2023-07-31 |
Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection | ✓ Link | 43.18 | 63.55 | 47.47 | 63.37 | 42.67 | UVCOM | 2023-11-28 |
Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding | ✓ Link | 42.86 | 65.43 | 48.38 | 64.51 | 42.77 | CG-DETR | 2023-11-15 |
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection | ✓ Link | 40.62 | 64.1 | 46.1 | 64.3 | 40.5 | QD-DETR (w/ PT) | 2023-03-24 |
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection | ✓ Link | 40.19 | 63.06 | 45.10 | 63.04 | 40.10 | QD-DETR (w/ audio) | 2023-03-24 |
Background-aware Moment Detection for Video Moment Retrieval | ✓ Link | 40.08 | 60.12 | 43.05 | 63.08 | 40.18 | BM-DETR | 2023-06-05 |
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection | ✓ Link | 40.0 | 63.2 | 45.2 | 63.4 | 40.4 | QD-DETR (only Video w/ PT ASR Captions) | 2023-03-24 |
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection | ✓ Link | 39.86 | 62.40 | 44.98 | 62.52 | 39.88 | QD-DETR (only Video) | 2023-03-24 |
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection | ✓ Link | 38.08 | | | | | UMT (w/ audio + PT ASR Captions) | 2022-03-23 |
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries | ✓ Link | 36.14 | 59.78 | 40.33 | 60.51 | 35.36 | Moment-DETR (w/ PT ASR Captions) | 2021-07-20 |
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection | ✓ Link | 36.12 | | | | | UMT | 2022-03-23 |
UniVTG: Towards Unified Video-Language Temporal Grounding | ✓ Link | 35.47 | 58.86 | 40.86 | 57.60 | 35.59 | UniVTG | 2023-07-31 |
[]() | | 32.3 | 54.5 | 36.5 | | | SeViLA-Localizer | |
UnLoc: A Unified Framework for Video Localization Tasks | ✓ Link | | 66.1 | 46.7 | | | UnLoc-L | 2023-08-21 |
UnLoc: A Unified Framework for Video Localization Tasks | ✓ Link | | 64.5 | 48.8 | | | UnLoc-B | 2023-08-21 |
Boundary-Denoising for Video Activity Localization | ✓ Link | | 59.27 | 45.07 | | | DenoiseLoc | 2023-04-06 |