OpenCodePapers

Zero-Shot Video Retrieval on LSMDC

Leaderboard

Paper	Code	text-to-video R@1	text-to-video R@5	text-to-video R@10	text-to-video Median Rank	text-to-video Mean Rank	video-to-text R@1	video-to-text R@5	video-to-text R@10	ModelName	ReleaseDate
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	✓ Link	33.8	55.9	62.2			30.1	47.7	54.8	InternVideo2-6B	2024-03-22
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	✓ Link	32.0	52.4	59.4			27.3	44.2	51.6	InternVideo2-1B	2024-03-22
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	✓ Link	27.7	46.5	54.6	7					VAST, HowToCaption-finetuned	2023-10-07
Unmasked Teacher: Towards Training-Efficient Video Foundation Models	✓ Link	25.2	43.0	50.5			23.2	37.7	44.2	UMT-L (ViT-L/16)	2023-03-28
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	✓ Link	24.1	43.8	52.0						mPLUG-2	2023-02-01
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	✓ Link	19.5	35.9	45.0						BT-Adapter	2023-09-27
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training		18.3	36.7	44.2						HiTeA-17M	2022-12-30
InternVideo: General Video Foundation Models via Generative and Discriminative Learning	✓ Link	17.6	32.4	40.2			13.2	27.8	34.9	InternVideo	2022-12-06
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	✓ Link	17.3	31.7	38.6	29					HowToCaption	2023-10-07
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning	✓ Link	17.2	32.4	39.1						Yatai Ji et al.	2022-11-24
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training		15.5	31.1	39.8						HiTeA-5M	2022-12-30
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	✓ Link	15.1	28.5	36.4	28	117				CLIP4Clip	2021-04-18
Clover: Towards A Unified Video-Language Alignment and Fusion Model	✓ Link	14.7	29.2	38.2	24					Clover	2022-07-16
Bridging Video-text Retrieval with Multiple Choice Questions	✓ Link	12.2	25.9	32.2	42.0					Y. Ge et al.	2022-01-13
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval	✓ Link	11.1	24.7	30.6	50.7					MILES	2022-04-26
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning	✓ Link	4.2	11.6	17.1						SSML	2020-03-06
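
For readers unfamiliar with the column headers, the sketch below shows one common way R@K (percentage of queries whose correct match ranks in the top K), Median Rank, and Mean Rank are computed from a query-by-candidate similarity matrix. It is an illustration only, not the evaluation code of any paper listed above. The function name `retrieval_metrics`, the assumption that the correct candidate for query i sits at index i, and the optimistic handling of ties are all choices made for this example.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K, median rank, and mean rank.

    sim[i][j] is the similarity of query i to candidate j.
    Assumption: the ground-truth candidate for query i is at index i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct candidate: 1 + number of candidates scoring
        # strictly higher (ties are resolved in the query's favor).
        ranks.append(1 + sum(1 for score in row if score > row[i]))

    metrics = {
        f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks
    }
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


def transpose(sim):
    """Swap queries and candidates, e.g. to go from text-to-video to video-to-text."""
    return [list(col) for col in zip(*sim)]


if __name__ == "__main__":
    # Toy 3x3 example: rows are text queries, columns are videos.
    sim = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.5],
        [0.1, 0.2, 0.7],
    ]
    print("text-to-video:", retrieval_metrics(sim))
    print("video-to-text:", retrieval_metrics(transpose(sim)))
```

Higher is better for R@K and lower is better for the two rank columns. Under these assumptions, the video-to-text columns come from the same matrix with rows and columns swapped.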