video-retrieval-on-msvd

Video Retrieval

Leaderboard

Paper	Code	text-to-video R@1	text-to-video R@5	text-to-video R@10	text-to-video Median Rank	text-to-video Mean Rank	text-to-video R@50	video-to-text R@1	video-to-text R@5	video-to-text R@10	video-to-text Median Rank	video-to-text Mean Rank	ModelName	ReleaseDate
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	✓ Link	61.4						85.2					InternVideo2-6B	2024-03-22
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations		59.0	84.0	90.3	1.0	7.6		73.0	94.5	96.6	1.0	7.6	HunYuan_tvr (huge)	2022-04-07
InternVideo: General Video Foundation Models via Generative and Discriminative Learning	✓ Link	58.4						76.3					InternVideo	2022-12-06
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations		58.2	83.5	90.1	1	7.8		69.1	91.5	95.0	1.0	3.8	HunYuan_tvr	2022-04-07
vid-TLDR: Training Free Token merging for Light-weight Video Transformer	✓ Link	57.9	83.8	89.4				82.7	94.5	96.3			vid-TLDR (UMT-L)	2024-03-20
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending		57.5	83.6	89.9									VLAB	2023-05-22
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization		56.8	83.1	89.2	1.0	8.8							MDMMT-2	2022-03-14
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	✓ Link	56.1	81.7	88.8	1.0	8.4							Side4Video	2023-11-27
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	✓ Link	51.8	87.6	87.6	1	8.9		69.3	90.6	94.6	1	3.1	CAMoE	2021-09-09
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?	✓ Link	51.8	80.8	88.3	1	8.3		70.0	93.2	96.2	1	2.4	Cap4Video	2022-12-31
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval	✓ Link	50.6	80.3	88.4	1	8.4		68.4	90.1	95.0	1	3.0	CenterCLIP (ViT-B/16)	2022-05-02
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval	✓ Link	50.4	80.6			8.4		66.8		90.4		4.2	X-CLIP	2022-07-15
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning	✓ Link	48.7	78.4	86.3	2.0	9.8							DMAE (ViT-B/32)	2023-09-20
Cross Modal Retrieval with Querybank Normalisation	✓ Link	48.0	77.9	86.2	2.0								QB-Norm+CLIP2Video	2021-12-23
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	✓ Link	47.9	77.2	84.8		15.6		60.3	86.4	92	1.0	4.5	DiffusionRet+QB-Norm	2023-03-17
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval	✓ Link	47.3	77.4	85.5	2.0	9.6		68.9	93.1	97.1	1.0	2.4	PAU	2023-09-29
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval	✓ Link	47.2	77.4	86.0	2.0	9.3		66.4	90.0	94.2	1.0	3.3	X-Pool	2022-03-28
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	✓ Link	46.6	75.9	84.1	2.0	15.7		61.9	88.3	92.9	1.0	4.5	DiffusionRet	2023-03-17
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	✓ Link	46.2	76.1	84.6	2	10.0		62.0	87.3	92.6	1		CLIP4Clip	2021-04-18
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval	✓ Link	45.4	76.0	84.6									LAFF	2021-12-03
A Straightforward Framework For Video Retrieval Using CLIP	✓ Link	37	64.1	73.8	3.0			59.9	85.2	90.7	1		CLIP	2021-02-24
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	✓ Link	33.7	64.7	76.3	3								FROZEN	2021-04-01
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning	✓ Link	20.3	49.0	63.3	6.0								SSML	2020-03-06
Use What You Have: Video Retrieval Using Representations From Collaborative Experts	✓ Link	19.8	49.0	63.8	6.0	23.1	89.0						Collaborative Experts	2019-07-31
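
The leaderboard columns (R@1, R@5, R@10, R@50, Median Rank, Mean Rank) are standard ranking metrics. The sketch below shows one common way to compute them from a query-by-candidate similarity matrix. It is illustrative only: the function name `retrieval_metrics`, the toy similarity values, and the assumption that query i has exactly one correct candidate at index i are not taken from this page, and the individual papers may use different evaluation protocols (for example, several captions per video).

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10, 50)):
    """Compute R@K (in percent), median rank, and mean rank.

    sim[i][j] is the similarity of query i to candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank of the correct candidate: 1 + number of candidates scored strictly higher.
        ranks.append(1 + sum(1 for s in row if s > target))

    n = len(ranks)
    # R@K: percentage of queries whose correct candidate is ranked within the top K.
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = sum(ranks) / n
    return metrics


# Toy example (hypothetical scores): 3 text queries against 3 videos.
sim = [
    [0.9, 0.1, 0.2],  # correct video ranked 1st
    [0.3, 0.2, 0.8],  # correct video ranked 3rd
    [0.1, 0.4, 0.7],  # correct video ranked 1st
]
metrics = retrieval_metrics(sim)
print(metrics)
```

Higher is better for R@K, while lower is better for Median Rank and Mean Rank. For video-to-text, the same function could be applied to the transposed matrix under the same one-to-one assumption.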

OpenCodePapers
