OpenCodePapers

Zero-Shot Video Retrieval on MSVD

Leaderboard

Paper	Code	text-to-video R@1	text-to-video R@5	text-to-video R@10	text-to-video Median Rank	text-to-video Mean Rank	video-to-text R@1	video-to-text R@5	video-to-text R@10	video-to-text Median Rank	Model Name	Release Date
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	✓ Link	59.3	84.4	89.6			83.1	94.2	97.0		InternVideo2-6B	2024-03-22
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	✓ Link	58.1	83.0	88.4			83.3	94.3	96.9		InternVideo2-1B	2024-03-22
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	✓ Link	54.8	80.9	87.2	1						VAST, HowToCaption-finetuned	2023-10-07
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	✓ Link	54.1	81.1	88.1	1.0		69.7	91.8	97.9	1.0	LanguageBind(ViT-L/14)	2023-10-03
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	✓ Link	53.9	80.4	87.8	1		72.0	91.4	96.3	1	LanguageBind(ViT-H/14)	2023-10-03
vid-TLDR: Training Free Token merging for Light-weight Video Transformer	✓ Link	50.0	77.6	85.5			75.7	90.0	95.1		vid-TLDR (UMT-L)	2024-03-20
Unmasked Teacher: Towards Training-Efficient Video Foundation Models	✓ Link	49.0	76.9	84.7			74.5	89.7	92.8		UMT-L (ViT-L/16)	2023-03-28
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	✓ Link	44.5	73.3	82.1	2						HowToCaption	2023-10-07
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval	✓ Link	44.4	76.2	87.0	2.0						MILES	2022-04-26
Bridging Video-text Retrieval with Multiple Choice Questions	✓ Link	43.6	74.9	84.9	2.0						Y. Ge et al.	2022-01-13
InternVideo: General Video Foundation Models via Generative and Discriminative Learning	✓ Link	43.4					67.6				InternVideo	2022-12-06
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	✓ Link	38.5	66.9	76.8	2	17.8					CLIP4Clip	2021-04-18
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval		36.9	68.6	81.0	2		34.4	69.0	79.2	3	LaT	2022-07-11
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning	✓ Link	13.66	35.7	47.74							SSML	2020-03-06
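
The columns above report Recall@K (R@1, R@5, R@10, in percent; higher is better) and the median and mean rank of the correct match (lower is better). As a rough illustration of how such numbers are usually derived, here is a minimal Python sketch. It assumes a square similarity matrix in which the correct candidate for query i sits at index i, which is a simplification: protocols that allow several correct matches per query need a different rank definition, and the papers listed may differ in detail. The function name `retrieval_metrics` and the toy scores are hypothetical and not taken from any of the listed codebases.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute Recall@K, median rank, and mean rank from a similarity matrix.

    Assumption: sim[i][j] is the similarity between query i and candidate j,
    and the single correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct candidate: 1 plus the number of candidates
        # that score strictly higher than it.
        ranks.append(1 + sum(1 for score in row if score > row[i]))

    n = len(ranks)
    # Recall@K: percentage of queries whose correct candidate is in the top K.
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = sum(ranks) / n
    return metrics


# Toy example: 3 text queries scored against 3 videos (made-up numbers).
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
print(retrieval_metrics(toy_sim, ks=(1, 5)))
```

Under the same one-to-one assumption, the video-to-text columns would come from the transposed matrix, with videos as queries and texts as candidates.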