OpenCodePapers
Zero-Shot Video Retrieval on LSMDC
Leaderboard
| Paper | Code | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | text-to-video Median Rank | text-to-video Mean Rank | video-to-text R@1 | video-to-text R@5 | video-to-text R@10 | Model Name | Release Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ | 33.8 | 55.9 | 62.2 | - | - | 30.1 | 47.7 | 54.8 | InternVideo2-6B | 2024-03-22 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ | 32.0 | 52.4 | 59.4 | - | - | 27.3 | 44.2 | 51.6 | InternVideo2-1B | 2024-03-22 |
| HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ | 27.7 | 46.5 | 54.6 | 7 | - | - | - | - | VAST, HowToCaption-finetuned | 2023-10-07 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ | 25.2 | 43.0 | 50.5 | - | - | 23.2 | 37.7 | 44.2 | UMT-L (ViT-L/16) | 2023-03-28 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ | 24.1 | 43.8 | 52.0 | - | - | - | - | - | mPLUG-2 | 2023-02-01 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ | 19.5 | 35.9 | 45.0 | - | - | - | - | - | BT-Adapter | 2023-09-27 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - | 18.3 | 36.7 | 44.2 | - | - | - | - | - | HiTeA-17M | 2022-12-30 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ | 17.6 | 32.4 | 40.2 | - | - | 13.2 | 27.8 | 34.9 | InternVideo | 2022-12-06 |
| HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ | 17.3 | 31.7 | 38.6 | 29 | - | - | - | - | HowToCaption | 2023-10-07 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | ✓ | 17.2 | 32.4 | 39.1 | - | - | - | - | - | Yatai Ji et al. | 2022-11-24 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - | 15.5 | 31.1 | 39.8 | - | - | - | - | - | HiTeA-5M | 2022-12-30 |
| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ | 15.1 | 28.5 | 36.4 | 28 | 117 | - | - | - | CLIP4Clip | 2021-04-18 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ | 14.7 | 29.2 | 38.2 | 24 | - | - | - | - | Clover | 2022-07-16 |
| Bridging Video-text Retrieval with Multiple Choice Questions | ✓ | 12.2 | 25.9 | 32.2 | 42.0 | - | - | - | - | Y. Ge et al. | 2022-01-13 |
| MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | ✓ | 11.1 | 24.7 | 30.6 | 50.7 | - | - | - | - | MILES | 2022-04-26 |
| Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | ✓ | 4.2 | 11.6 | 17.1 | - | - | - | - | - | SSML | 2020-03-06 |
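
The leaderboard does not say how its metrics are computed. The sketch below shows the conventional way R@K, Median Rank and Mean Rank are usually derived from a text-video similarity matrix. It is an illustration only, not the evaluation code of any listed paper. The function name `retrieval_metrics`, the assumption that query i matches video i, and the tie-handling rule are all hypothetical choices made for this example.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Text-to-video retrieval metrics from a similarity matrix.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct video: 1 + number of videos scored strictly
        # higher (assumed tie rule: ties do not push the rank down).
        ranks.append(1 + sum(1 for score in row if score > row[i]))

    # R@K: percentage of queries whose correct video is ranked in the top K.
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["Median Rank"] = median(ranks)  # lower is better
    metrics["Mean Rank"] = mean(ranks)      # lower is better
    return metrics


# Toy 3x3 example; real LSMDC evaluations use a much larger test set.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
m = retrieval_metrics(sim, ks=(1, 2))
print(m)
```

Under the same assumptions, video-to-text metrics would come from applying the function to the transposed matrix, so that each row corresponds to a video query scored against all texts.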