Paper | Code | T2V R@1 | T2V R@5 | T2V R@10 | T2V MdR | T2V MnR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MdR | V2T MnR | Model | Date
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 62.9 | 84.5 | 90.8 | 1.0 | 9.3 | 64.8 | 84.9 | 91.1 | 1.0 | 5.5 | HunYuan_tvr (huge) | 2022-04-07 |
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | ✓ Link | 57.7 | 80.5 | 88.2 | 1.0 | | | | | | | CLIP-ViP | 2022-09-14 |
PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval | | 55.9 | 79.8 | 87.6 | 1.0 | 10.7 | 54.5 | 78.3 | 87.3 | 1.0 | 7.5 | PIDRo | 2023-01-01 |
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | ✓ Link | 55.5 | 79.4 | 87.1 | 1.0 | 10.0 | 55.7 | 79.2 | 87.2 | 1.0 | 7.3 | DMAE (ViT-B/16) | 2023-09-20 |
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 55.0 | | | | | 55.5 | 78.4 | 85.8 | 1.0 | 7.7 | HunYuan_tvr | 2022-04-07 |
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | | 54.7 | 77.7 | 86.0 | | | | | | | | MuLTI | 2023-03-10 |
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | ✓ Link | 54.1 | 79.5 | 87.8 | 1 | | | | | | | STAN | 2023-01-26 |
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning | ✓ Link | 54.1 | 78.8 | 86.9 | | | | | | | | EERCF | 2024-01-01 |
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval | ✓ Link | 54.0 | 79.3 | 87.4 | | | | | | | | TS2-Net | 2022-07-16 |
RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ Link | 53.4 | 76.1 | 84.4 | | | | | | | | RTQ | 2023-12-01 |
Disentangled Representation Learning for Text-Video Retrieval | ✓ Link | 53.3 | 80.3 | 87.6 | 1 | 11.4 | 56.2 | 79.9 | 87.4 | 1.0 | 7.6 | DRL | 2022-03-14 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 53.1 | 77.6 | 84.7 | | | | | | | | mPLUG-2 | 2023-02-01 |
CLIP2TV: Align, Match and Distill for Video-Text Retrieval | | 52.9 | 78.5 | 86.5 | 1 | 12.8 | 54.1 | 77.4 | 85.7 | 1 | 9.0 | CLIP2TV | 2021-11-10 |
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | ✓ Link | 52.3 | 75.5 | 84.2 | 1.0 | 12.8 | | | | | | Side4Video | 2023-11-27 |
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 51.6 | 78.1 | 85.3 | 1 | | 51.8 | 80.2 | 88.0 | 1 | | EMCL-Net++ | 2022-11-21 |
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | ✓ Link | 51.4 | 75.7 | 83.9 | 1 | 12.4 | 49.0 | 75.2 | 85.0 | 2 | 8.0 | Cap4Video | 2022-12-31 |
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning | ✓ Link | 49.8 | 75.1 | 83.9 | | | 47.3 | 76.0 | 84.3 | | | SuMA (ViT-B/16) | 2023-02-19 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 49.6 | 76.7 | 84.2 | | | | | | | | X2-VLM (large) | 2022-11-22 |
Unified Coarse-to-Fine Alignment for Video-Text Retrieval | ✓ Link | 49.4 | 72.1 | 83.5 | | | 47.1 | 74.3 | 83.0 | | | UCoFiA | 2023-09-18 |
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | ✓ Link | 49.3 | 75.8 | 84.8 | 2.0 | 12.2 | 48.9 | 76.8 | 84.5 | 2.0 | 8.1 | X-CLIP | 2022-07-15 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 49.0 | 75.2 | 82.7 | 2.0 | 12.1 | 47.7 | 73.8 | 84.5 | 2.0 | 8.8 | DiffusionRet | 2023-03-17 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 48.9 | 75.2 | 83.1 | 2.0 | 12.1 | 49.3 | 74.3 | 83.8 | 2.0 | 8.5 | DiffusionRet+QB-Norm | 2023-03-17 |
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 48.8 | 75.6 | 85.3 | 2 | 12.4 | 50.3 | 74.6 | 83.8 | 2 | 9.9 | CAMoE | 2021-09-09 |
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | ✓ Link | 48.6 | 74.6 | 83.4 | 2.0 | 12.0 | 46.8 | 74.3 | 84.3 | 2.0 | 8.9 | HBI | 2023-03-25 |
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | ✓ Link | 48.5 | 72.7 | 82.5 | 2.0 | 14.0 | 48.3 | 73.0 | 83.2 | 2.0 | 9.7 | PAU | 2023-09-29 |
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval | ✓ Link | 48.4 | 73.8 | 82.0 | 2 | 13.8 | 47.7 | 75.0 | 83.3 | 2 | 10.2 | CenterCLIP (ViT-B/16) | 2022-05-02 |
Holistic Features are almost Sufficient for Text-to-Video Retrieval | ✓ Link | 48.0 | 75.9 | 83.5 | | | | | | | | TeachCLIP (ViT-B/16) | 2024-01-01 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 47.6 | 74.1 | 84.2 | | | | | | | | X2-VLM (base) | 2022-11-22 |
Cross Modal Retrieval with Querybank Normalisation | ✓ Link | 47.2 | 73.0 | 83.0 | 2 | | | | | | | QB-Norm+CLIP2Video | 2021-12-23 |
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | ✓ Link | 46.9 | 72.8 | 82.2 | 2 | 14.3 | 44.4 | 73.3 | 84.0 | 2.0 | 9.0 | X-Pool | 2022-03-28 |
Holistic Features are almost Sufficient for Text-to-Video Retrieval | ✓ Link | 46.8 | 74.3 | 82.6 | | | | | | | | TeachCLIP | 2024-01-01 |
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 46.8 | 73.1 | 83.1 | 2 | | 46.5 | 73.5 | 83.5 | 2 | | EMCL-Net | 2022-11-21 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 46.8 | 71.2 | 81.9 | | | | | | | | HiTeA | 2022-12-30 |
VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 46.5 | 71.5 | 80.4 | | | | | | | | VindLU | 2022-12-09 |
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | ✓ Link | 45.8 | 71.5 | 82.0 | | | | | | | | LAFF | 2021-12-03 |
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | ✓ Link | 45.6 | 72.6 | 81.7 | 2 | 14.6 | 43.3 | 72.3 | 82.1 | 2 | 10.2 | CLIP2Video | 2021-06-21 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 41.5 | 68.7 | 77.0 | | | | | | | | Singularity | 2022-06-07 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 41.3 | 73.5 | 82.5 | | | | | | | | All-in-one + MELTR | 2023-03-23 |
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 40.5 | 69.8 | 79.4 | 2 | | | | | | | Clover | 2022-07-16 |
MDMMT: Multidomain Multimodal Transformer for Video Retrieval | ✓ Link | 38.9 | 69.0 | 79.7 | 2 | 16.5 | | | | | | MDMMT | 2021-03-19 |
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval | | 38.9 | 63.1 | 73.9 | 3 | | | | | | | MAC | 2022-12-02 |
All in One: Exploring Unified Video-Language Pre-training | ✓ Link | 37.9 | 68.1 | 77.1 | | | | | | | | All-in-one-B | 2022-03-14 |
Bridging Video-text Retrieval with Multiple Choice Questions | ✓ Link | 37.6 | 64.8 | 75.1 | 3 | | | | | | | BridgeFormer | 2022-01-13 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 37.6 | 63.8 | 72.6 | | | | | | | | Florence | 2021-11-22 |
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | | 36.8 | 63.8 | 73.2 | 2 | | | | | | | COTS | 2022-04-15 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 35.5 | 67.2 | 78.4 | 3 | | | | | | | VIOLET + MELTR | 2023-03-23 |
A Straightforward Framework For Video Retrieval Using CLIP | ✓ Link | 31.2 | 53.7 | 64.2 | 4 | | 27.2 | 51.7 | 62.6 | 5 | | CLIP | 2021-02-24 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 31.1 | 55.7 | 68.3 | 4 | | | | | | | UniVL + MELTR | 2023-03-23 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 31.0 | 59.5 | 70.5 | 3 | | | | | | | FROZEN | 2021-04-01 |
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | ✓ Link | 30.9 | 55.4 | 66.8 | | | | | | | | VideoCLIP | 2021-09-28 |
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | | 28.4 | 57.8 | 71.2 | 4 | | | | | | | TACo | 2021-08-23 |
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | ✓ Link | 28.1 | 55.5 | 67.4 | 4 | | | | | | | VLM | 2021-05-20 |
Multi-modal Transformer for Video Retrieval | ✓ Link | 26.6 | 57.1 | 69.6 | 4 | 24.0 | | | | | | MMT-Pretrained | 2020-07-21 |
Bridging Video-text Retrieval with Multiple Choice Questions | ✓ Link | 26.0 | 46.4 | 56.4 | 7 | | | | | | | BridgeFormer (Zero-shot) | 2022-01-13 |
Multi-modal Transformer for Video Retrieval | ✓ Link | 24.6 | 54.0 | 67.1 | 4 | 26.7 | | | | | | MMT | 2020-07-21 |
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 20.9 | 48.8 | 62.4 | 6 | 28.2 | | | | | | Collaborative Experts | 2019-07-31 |
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | ✓ Link | 14.9 | 40.2 | 52.8 | 9 | | | | | | | HT-Pretrained | 2019-06-07 |
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | ✓ Link | 12.1 | 35.0 | 48.0 | 12 | | | | | | | HT | 2019-06-07 |
A Joint Sequence Fusion Model for Video Question Answering and Retrieval | ✓ Link | 10.2 | 31.2 | 43.2 | 13 | | | | | | | JSFusion | 2018-08-07 |
OmniVec: Learning robust representations with cross modal sharing | | | | 89.4 | | | | | | | | OmniVec | 2023-11-07 |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | | | 81.6 | 2 | 15.3 | 42.7 | 70.9 | 80.6 | 2 | | CLIP4Clip | 2021-04-18 |
OmniVec: Learning robust representations with cross modal sharing | | | | 78.6 | | | | | | | | OmniVec (pretrained) | 2023-11-07 |
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | ✓ Link | | | | | | 42.8 | | | | | Socratic Models | 2022-04-01 |
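The metrics in the table (Recall@K, median rank MdR, mean rank MnR) are all derived from a query-by-candidate similarity matrix. A minimal sketch of the standard computation, assuming the correct candidate for query i sits on the diagonal (the usual convention for paired retrieval benchmarks; `retrieval_metrics` is an illustrative name, not from any of the listed codebases):

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute Recall@K (as %), median rank and mean rank from a
    query-by-candidate similarity matrix, where the correct match
    for query i is assumed to be candidate i (the diagonal)."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # candidates sorted by descending similarity
    # 1-based rank of the correct candidate for each query
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1) + 1
    return {
        "R@1": 100.0 * np.mean(ranks <= 1),
        "R@5": 100.0 * np.mean(ranks <= 5),
        "R@10": 100.0 * np.mean(ranks <= 10),
        "MdR": float(np.median(ranks)),
        "MnR": float(np.mean(ranks)),
    }

# Toy example: 3 queries; queries 0 and 1 rank their match first,
# query 2 ranks its match last (rank 3).
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.5, 0.4, 0.2]])
m = retrieval_metrics(sim)
```

Note that lower is better for MdR and MnR, while higher is better for Recall@K; the text-to-video and video-to-text columns come from scoring the same similarity matrix along its two axes.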