Paper | Code | R@1 (T→V) | R@5 (T→V) | R@10 (T→V) | MnR (T→V) | MdR (T→V) | R@1 (V→T) | R@5 (V→T) | R@10 (V→T) | MdR (V→T) | MnR (V→T) | | | Model | Date
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Gramian Multimodal Representation Learning and Alignment | ✓ Link | 64.0 | | 89.3 | | | 64.8 | | 91.5 | | | | | GRAM | 2024-12-16
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 63.9 | 84.3 | 89.6 | | | | | | | | | | VAST | 2023-05-29 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 62.8 | | | | | 60.2 | | | | | | | InternVideo2-6B | 2024-03-22 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 59.9 | 83.5 | 89.6 | | | | | | | | | | VALOR | 2023-04-17 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 58.8 | 81.0 | 87.1 | | | 58.6 | 81.6 | 86.5 | | | | | UMT-L (ViT-L/16) | 2023-03-28 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 58.1 | 81.0 | 81.6 | | | 58.7 | 81.6 | 86.9 | | | | | vid-TLDR (UMT-L) | 2024-03-20 |
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 57.9 | | | | | | | | | | | | COSA | 2023-06-15 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 55.2 | | | | | 57.9 | | | | | | | InternVideo | 2022-12-06 |
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 55.1 | 78.8 | 87.6 | | | | | | | | | | VLAB | 2023-05-22 |
| | 52.4 | 73.9 | 82.0 | | | | | | | | | 1 | Aurora (ours, r=64) | |
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment | | 52.0 | 76.6 | 86.1 | | | | | | | | | | TEFAL | 2023-07-24
Unified Coarse-to-Fine Alignment for Video-Text Retrieval | ✓ Link | 49.4 | 72.1 | 83.5 | | | | | | | | | | UCoFiA | 2023-09-18 |
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | | 47.8 | 74.2 | 83.8 | | | | | | | | | | OmniVL | 2022-09-15 |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | 44.5 | 71.4 | 81.6 | | | | | | | | | | CLIP4Clip-seqTransf | 2021-04-18 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 38.6 | 74.4 | 84.7 | | | | | | | | | | All-in-one + MELTR | 2023-03-23 |
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 37.2 | 64.8 | 75.8 | | | | | | | | | | VIOLETv2 | 2022-09-04 |
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | ✓ Link | 35.6 | 65.3 | 78.0 | | | | | | | | 3 | | HD-VILA | 2021-11-19
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 34.3 | 57.8 | 67.0 | | | 64.7 | 85.2 | 91.4 | | | | | VideoCoCa (zero-shot) | 2022-12-09 |
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | | 33.7 | 60.5 | 70.8 | 37.8 | 3.0 | | | | | | | | MDMMT-2 | 2022-03-14 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 33.6 | 63.7 | 77.8 | | 3 | | | | | | | | VIOLET + MELTR | 2023-03-23 |
CLIP2TV: Align, Match and Distill for Video-Text Retrieval | | 33.1 | 58.9 | 68.9 | 44.7 | 3 | | | | | | | | CLIP2TV | 2021-11-10 |
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 32.9 | 58.3 | 68.4 | 42.6 | 3 | 59.8 | 86.2 | 92.8 | 1 | 3.8 | | | CAMoE | 2021-09-09 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 32.5 | 61.5 | 71.2 | | | | | | | | | | FROZEN | 2021-04-01 |
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | | 32.1 | 60.8 | 70.2 | | 3 | | | | | | | | COTS | 2022-04-15 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 30.0 | 52.4 | 61.6 | | | 49.9 | 73.4 | 81.4 | | | | | CoCa (zero-shot) | 2022-05-04 |
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | ✓ Link | 29.8 | 55.5 | 66.2 | 45.4 | 4 | 54.6 | 82.1 | 90.8 | 1 | 5.3 | | | CLIP2Video | 2021-06-21 |
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | ✓ Link | 29.1 | 54.9 | 65.8 | | | | | | | | | | LAFF | 2021-12-03 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 28.5 | 55.5 | 67.6 | | 4 | | | | | | | | UniVL + MELTR | 2023-03-23 |
Video and Text Matching with Conditioned Embeddings | ✓ Link | 26.0 | 56.7 | | | 3 | 26.7 | 56.5 | | 3 | | | | Ours | 2021-10-21
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | | 24.8 | 52.1 | 64.0 | | 5 | | | | | | | | TACo | 2021-08-23 |
MDMMT: Multidomain Multimodal Transformer for Video Retrieval | ✓ Link | 23.1 | 49.8 | 61.8 | 52.8 | 6 | | | | | | | | MDMMT | 2021-03-19 |
A Straightforward Framework For Video Retrieval Using CLIP | ✓ Link | 21.4 | 41.1 | 50.4 | | 10 | 40.3 | 69.7 | 79.2 | 2 | | | | CLIP | 2021-02-24 |
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | ✓ Link | 21.2 | 49.6 | 63.1 | | 6 | | | | | | | | UniVL | 2020-02-15 |
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | ✓ Link | 14.9 | | 52.8 | | 9 | | 40.2 | | | | | | Text-Video Embedding | 2019-06-07 |
RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval | ✓ Link | 10.7 | 29.6 | 41.2 | | 17 | | | | | | | | RoME | 2022-06-26 |
A Joint Sequence Fusion Model for Video Question Answering and Retrieval | ✓ Link | 10.2 | | 43.2 | | 13 | | 31.2 | | | | | | JSFusion | 2018-08-07 |
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 10.0 | 29.0 | 41.2 | 86.8 | 16 | 15.6 | 40.9 | 55.2 | 8.3 | 38.1 | | | Collaborative Experts | 2019-07-31 |
Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval | ✓ Link | 7.0 | 20.9 | 29.7 | 213.8 | 29.7 | 12.5 | 32.1 | 42.2 | 16 | 134 | | | JEMC | 2018-06-11 |
Temporal Tessellation: A Unified Approach for Video Analysis | ✓ Link | 4.7 | | 24.1 | | 41 | | 16.6 | | | | | | Kaufman | 2016-12-21 |
Learning Language-Visual Embedding for Movie Understanding with Natural-Language | | 4.2 | | 19.9 | | 55 | | 12.9 | | | | | | C+LSTM+SA+FC7 | 2016-09-26 |
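The metrics reported above (R@1/R@5/R@10, median rank MdR, mean rank MnR) are the standard retrieval measures: R@K is the percentage of queries whose ground-truth match appears in the top K ranked candidates, and MdR/MnR are the median and mean rank of that match. A minimal pure-Python sketch of how they are computed from a query-candidate similarity matrix (a hypothetical helper for illustration, not code from any listed paper):

```python
def retrieval_metrics(sim):
    """Compute R@K, MdR and MnR from sim[i][j] = score(query i, candidate j),
    where candidate i is the ground-truth match for query i."""
    n = len(sim)
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scored strictly above the true match.
        ranks.append(1 + sum(1 for j in range(n) if row[j] > row[i]))
    ranks.sort()
    recall = lambda k: 100.0 * sum(r <= k for r in ranks) / n  # percentage
    mid = n // 2
    mdr = ranks[mid] if n % 2 else (ranks[mid - 1] + ranks[mid]) / 2
    return {"R@1": recall(1), "R@5": recall(5), "R@10": recall(10),
            "MdR": mdr, "MnR": sum(ranks) / n}
```

Higher R@K is better, while lower MdR/MnR is better, which is why the strongest models in the table combine high R@1 with a median rank of 1.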