Paper | Code | R@1,IoU=0.1 | R@1,IoU=0.3 | R@1,IoU=0.5 | R@10,IoU=0.1 | R@10,IoU=0.3 | R@10,IoU=0.5 | R@100,IoU=0.1 | R@100,IoU=0.3 | R@100,IoU=0.5 | R@5,IoU=0.1 | R@5,IoU=0.5 | R@50,IoU=0.1 | R@50,IoU=0.3 | R@50,IoU=0.5 | R@5,IoU=0.3 | ModelName | ReleaseDate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos | ✓ Link | 17.3 | 12.7 | 6.7 | | | | | | | | | | | | | ReVisionLLM | 2024-11-22 |
DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos | ✓ Link | 13.25 | 10.96 | 7.06 | | | | | | | 27.73 | 16.13 | | | | 23.68 | DeCafNet | 2025-05-22 |
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos | ✓ Link | 12.43 | 9.48 | 5.61 | | | | | | | 25.12 | 10.86 | | | | 18.72 | RGNet | 2023-12-11 |
Localizing Moments in Long Video Via Multimodal Guidance | ✓ Link | 9.3 | 4.65 | 2.16 | 24.30 | 17.73 | 11.09 | 47.35 | 39.58 | 29.68 | 18.96 | 7.4 | 39.79 | 32.23 | 23.21 | 13.06 | Zero-Shot CLIP + Guidance Model | 2023-02-26 |
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions | ✓ Link | 6.57 | 3.13 | 1.39 | 20.26 | 14.13 | 8.38 | 47.73 | 36.98 | 24.99 | 15.05 | 5.44 | 37.92 | 28.71 | 18.80 | 9.85 | CLIP | 2021-12-01 |
Localizing Moments in Long Video Via Multimodal Guidance | ✓ Link | 5.60 | 4.28 | 2.48 | 23.64 | 19.86 | 13.72 | 55.59 | 49.38 | 39.12 | 16.07 | 8.78 | 45.35 | 39.77 | 30.22 | | VLG-Net + Guidance Model | 2023-02-26 |
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions | ✓ Link | 3.50 | 2.63 | 1.61 | 18.32 | 15.2 | 10.18 | 49.65 | 43.95 | 34.18 | 11.74 | 6.23 | 38.41 | 33.68 | 25.33 | 9.49 | VLG-Net | 2021-12-01 |
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions | ✓ Link | 0.09 | 0.04 | 0.01 | 0.88 | 0.39 | 0.14 | 8.47 | 3.80 | 1.40 | 0.44 | 0.07 | 4.33 | 1.92 | 0.71 | 0.19 | Random Chance | 2021-12-01 |
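
The column headers follow the usual R@k, IoU=θ convention for temporal grounding: the percentage of queries for which at least one of the top-k predicted moments overlaps the ground-truth moment with temporal IoU of at least θ. The sketch below illustrates that reading on toy data. The function names, the `(start, end)` interval format in seconds, and the `>=` comparison are assumptions made for illustration; they are not taken from any of the listed papers' evaluation code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) intervals, assumed to be in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(ranked_preds, gts, k, iou_threshold):
    """Percentage of queries with a top-k prediction at IoU >= iou_threshold.

    ranked_preds: one list of (start, end) predictions per query, best first.
    gts: one ground-truth (start, end) interval per query.
    """
    hits = 0
    for preds, gt in zip(ranked_preds, gts):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return 100.0 * hits / len(gts)


# Toy data (hypothetical): two queries, two ranked predictions each.
ranked_preds = [
    [(10.0, 20.0), (0.0, 5.0)],    # top-1 overlaps the ground truth (IoU = 8/12)
    [(50.0, 60.0), (30.0, 41.0)],  # only the second prediction overlaps (IoU = 10/11)
]
gts = [(12.0, 22.0), (30.0, 40.0)]

print(recall_at_k(ranked_preds, gts, k=1, iou_threshold=0.5))  # 50.0
print(recall_at_k(ranked_preds, gts, k=2, iou_threshold=0.5))  # 100.0
```

Under this definition, recall can only grow with k and shrink with θ, which matches the pattern within each row of the table.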