Paper | Code | T2V R@1 | T2V R@5 | T2V R@10 | T2V MedR | T2V MnR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MedR | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 55.9 | 78.3 | 85.1 | | | 53.7 | 77.5 | 84.1 | | InternVideo2-6B | 2024-03-22 |
Gramian Multimodal Representation Learning and Alignment | ✓ Link | 54.8 | | 83.9 | | | 52.9 | | 82.9 | | GRAM | 2024-12-16 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 51.9 | 75.3 | 82.5 | | | 50.9 | 73.4 | 81.8 | | InternVideo2-1B | 2024-03-22 |
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ Link | 50.0 | 73.2 | 81.4 | 1 | | | | | | VAST, HowToCaption-finetuned | 2023-10-07 |
Make Your Training Flexible: Towards Deployment-Efficient Video Models | ✓ Link | 49.9 | 71.0 | 79.6 | | | 49.4 | 73.9 | 82.4 | | FluxViT-B | 2025-03-18 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 49.3 | 68.3 | 73.9 | | | | | | | VAST | 2023-05-29 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 47.1 | 69.7 | 79.0 | | | | | | | mPLUG-2 | 2023-02-01 |
Make Your Training Flexible: Towards Deployment-Efficient Video Models | ✓ Link | 45.0 | 67.5 | 75.8 | | | 44.9 | 68.2 | 76.5 | | FluxViT-S | 2025-03-18 |
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ✓ Link | 44.8 | 70.0 | 78.7 | 2 | | 40.9 | 66.4 | 75.7 | 2 | LanguageBind (ViT-H/14) | 2023-10-03 |
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ✓ Link | 42.8 | 67.5 | 76.0 | 2 | | 38.3 | 65.8 | 77.8 | 3 | LanguageBind (ViT-L/14) | 2023-10-03 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 42.6 | 64.4 | 73.1 | | | 38.6 | 59.8 | 69.6 | | UMT-L (ViT-L/16) | 2023-03-28 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 42.1 | 63.9 | 72.4 | | | 37.7 | 59.8 | 69.4 | | vid-TLDR (UMT-L) | 2024-03-20 |
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ Link | 40.9 | 64.7 | 73.5 | | | | | | | BT-Adapter | 2023-09-27 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 40.7 | | | | | 39.6 | | | | InternVideo | 2022-12-06 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 37.6 | 63.8 | 72.6 | | | | | | | Florence | 2021-11-22 |
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ Link | 37.6 | 62.0 | 73.3 | 3 | | | | | | HowToCaption | 2023-10-07 |
ImageBind: One Embedding Space To Bind Them All | ✓ Link | 36.8 | 61.8 | 70.0 | | | | | | | ImageBind | 2023-05-09 |
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 34.6 | 58.4 | 66.6 | | | | | | | OmniVL | 2022-09-15 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 34.4 | 60.0 | 69.9 | | | | | | | HiTeA-17M | 2022-12-30 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 34.0 | 56.7 | 66.7 | | | | | | | Singularity-17M | 2022-06-07 |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | 32.0 | 57.0 | 66.9 | 4 | 34.0 | | | | | CLIP4Clip | 2021-04-18 |
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | ✓ Link | 30.9 | 54.4 | 65.0 | | | | | | | Yatai Ji et al. | 2022-11-24 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 29.9 | 54.2 | 62.9 | | | | | | | HiTeA-5M | 2022-12-30 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 28.4 | 50.2 | 59.5 | | | | | | | Singularity-5M | 2022-06-07 |
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 26.4 | 49.5 | 60.0 | 6 | | | | | | Clover | 2022-07-16 |
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | ✓ Link | 26.1 | 47.2 | 56.9 | 7 | | | | | | MILES | 2022-04-26 |
Bridging Video-text Retrieval with Multiple Choice Questions | ✓ Link | 26.0 | 46.4 | 56.4 | 7 | | | | | | Y. Ge et al. | 2022-01-13 |
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling | ✓ Link | 25.9 | 49.5 | 59.7 | | | | | | | VIOLET | 2021-11-24 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 24.7 | 46.9 | 57.2 | 7 | | | | | | FROZEN | 2021-04-01 |
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ Link | 24.1 | 44.7 | 55.4 | 8 | | | | | | ALPRO | 2021-12-17 |
Object-aware Video-language Pre-training for Retrieval | ✓ Link | 23.4 | 47.5 | 55.6 | 8 | | | | | | OA-Trans | 2021-12-01 |
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval | | 23.4 | 44.1 | 53.3 | 8 | | 17.2 | 36.2 | 47.9 | 12 | LaT | 2022-07-11 |
Learning Audio-Video Modalities from Image Captions | | 19.4 | 39.5 | 50.3 | | | | | | | A. Nagrani et al. | 2022-04-01 |
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | ✓ Link | 14.6 | 34.4 | 44.1 | 15 | | | | | | HD-VILA | 2021-11-19 |
Multi-granularity Correspondence Learning from Long-term Noisy Videos | ✓ Link | 10.7 | 24.1 | | | | | | | | Norton | 2024-01-30 |
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | ✓ Link | 10.4 | 22.2 | 30.0 | | | | | | | VideoCLIP | 2021-09-28 |
End-to-End Learning of Visual Representations from Uncurated Instructional Videos | ✓ Link | 9.9 | 24.0 | 32.4 | | 29.5 | | | | | MIL-NCE | 2019-12-13 |
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | | 9.8 | 25.0 | 33.4 | | | | | | | TACo | 2021-08-23 |
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | ✓ Link | 8.0 | 21.3 | 29.3 | | | | | | | SSML | 2020-03-06 |
Multi-modal Transformer for Video Retrieval | ✓ Link | | 14.4 | | 66 | 148.1 | | | | | MMT | 2020-07-21 |
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | ✓ Link | | | 29.7 | 49 | | | | | | VATT-MBS | 2021-04-22 |
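Every row above reports the standard retrieval metrics: Recall@K (R@1/R@5/R@10, higher is better), Median Rank (MedR), and Mean Rank (MnR, both lower is better), for the text-to-video (T2V) and video-to-text (V2T) directions. Below is a minimal sketch of how these metrics are typically computed from a query-candidate similarity matrix; the function name, the diagonal ground-truth convention, and the random toy matrix are illustrative assumptions, not taken from any of the cited codebases.

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """Compute R@1/R@5/R@10, MedR, and MnR from a similarity matrix.

    sim[i, j] = similarity of query i to candidate j; the matching
    candidate for query i is assumed to sit at index i (diagonal GT).
    """
    # Sort candidates by descending similarity for each query.
    order = np.argsort(-sim, axis=1)
    gt = np.arange(sim.shape[0])[:, None]
    # 1-indexed rank of the ground-truth candidate per query.
    ranks = np.argmax(order == gt, axis=1) + 1

    return {
        "R@1":  float(np.mean(ranks <= 1) * 100),
        "R@5":  float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MedR": float(np.median(ranks)),
        "MnR":  float(np.mean(ranks)),
    }

# Toy example: 1000 text queries against 1000 videos (random scores).
rng = np.random.default_rng(0)
sim = rng.standard_normal((1000, 1000))
print(retrieval_metrics(sim))       # T2V columns
print(retrieval_metrics(sim.T))     # V2T columns: same metric, transposed matrix
```

The V2T columns are obtained by the same computation on the transposed similarity matrix, which is why methods in the table can report both directions from a single encoder pass.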