Video retrieval results on LSMDC (Rohrbach et al., 2015). R@k is recall at rank k in percent (higher is better); MedR and MnR are the median and mean rank of the correct match (lower is better). Metrics are reported for both text-to-video (t2v) and video-to-text (v2t) retrieval; a short sketch of how they are computed follows the table.

Paper | Code | t2v R@1 | t2v R@5 | t2v R@10 | t2v MedR | t2v MnR | v2t R@1 | v2t R@5 | v2t R@10 | v2t MedR | v2t MnR | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 46.4 | | | | | 46.7 | | | | | InternVideo2-6B | 2024-03-22 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 43.1 | 64.5 | 71.4 | | | 40.7 | 63.6 | 70.2 | | | vid-TLDR (UMT-L) | 2024-03-20 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 43.0 | 65.5 | 73.0 | | | 41.4 | 64.3 | 71.5 | | | UMT-L (ViT-L/16) | 2023-03-28 |
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 40.4 | 80.1 | 92.8 | 2.0 | 3.9 | 34.6 | 71.8 | 91.8 | 2.0 | 4.3 | HunYuan_tvr (huge) | 2022-04-07 |
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 39.4 | | | | | | | | | | COSA | 2023-06-15 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 34.4 | 55.2 | 65.1 | | | | | | | | mPLUG-2 | 2023-02-01 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 34.2 | 56.0 | 64.1 | | | | | | | | VALOR | 2023-04-17 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 34.0 | | | | | 34.9 | | | | | InternVideo | 2022-12-06 |
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | ✓ Link | 30.7 | 51.4 | 60.6 | 5 | | | | | | | CLIP-ViP | 2022-09-14 |
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 29.7 | 46.4 | 55.4 | 7 | 56.4 | 30.1 | 47.5 | 55.7 | 7 | 48.9 | HunYuan_tvr | 2022-04-07 |
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | ✓ Link | 29.2 | 49.5 | 58.8 | 6 | | | | | | | STAN | 2023-01-26 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 28.7 | 50.3 | 59.0 | | | | | | | | HiTeA | 2022-12-30 |
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | | 26.9 | 46.7 | 55.9 | 6.7 | 48.0 | | | | | | MDMMT-2 | 2022-03-14 |
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | ✓ Link | 26.1 | | | | | 26.9 | | | | | X-CLIP | 2022-07-15 |
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 25.9 | 46.4 | 53.7 | | 8 | 26.7 | 44.7 | 54.4 | | 8 | EMCL-Net++ | 2022-11-21 |
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 25.9 | 46.1 | 53.7 | | 54.4 | | | | | | CAMoE | 2021-09-09 |
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | ✓ Link | 25.2 | 43.7 | 53.5 | 8.0 | 53.2 | 22.7 | 42.6 | 51.2 | 10.0 | 47.4 | X-Pool | 2022-03-28 |
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 24.8 | 44.0 | 54.5 | 8 | | | | | | | Clover | 2022-07-16 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 24.4 | 43.1 | 54.3 | 8.0 | 40.7 | 23.0 | 43.5 | 51.5 | 9.0 | 40.2 | DiffusionRet | 2023-03-17 |
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval | ✓ Link | 24.2 | 46.2 | 55.9 | 8 | 47.3 | 24.5 | 46.4 | 55.8 | 7 | 41.3 | CenterCLIP (ViT-B/16) | 2022-05-02 |
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 24.0 | 43.5 | 54.1 | | | | | | | | VIOLETv2 | 2022-09-04 |
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 23.9 | 42.4 | 50.9 | | | 22.2 | 40.6 | 49.2 | | 12 | EMCL-Net | 2022-11-21 |
Cross Modal Retrieval with Querybank Normalisation | ✓ Link | 22.4 | 40.1 | 49.5 | 11.0 | | | | | | | QB-Norm+CLIP4Clip | 2021-12-23 |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | 21.6 | 41.8 | 49.8 | | 58.0 | | | | | | CLIP4Clip | 2021-04-18 |
MDMMT: Multidomain Multimodal Transformer for Video Retrieval | ✓ Link | 18.8 | 38.5 | 47.9 | 12.3 | 58.0 | | | | | | MDMMT | 2021-03-19 |
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | ✓ Link | 17.4 | 34.1 | 44.1 | 15 | | | | | | | HD-VILA | 2021-11-19 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 15.0 | 30.8 | 39.8 | 20.0 | | | | | | | FROZEN | 2021-04-01 |
Video and Text Matching with Conditioned Embeddings | ✓ Link | 14.9 | 33.2 | | | | 15.3 | 34.1 | | | | Ours | 2021-10-21 |
Multi-modal Transformer for Video Retrieval | ✓ Link | 13.5 | 29.9 | 40.1 | 19.3 | | | | | | | MMT-Pretrained | 2020-07-21 |
Multi-modal Transformer for Video Retrieval | ✓ Link | 13.2 | 29.2 | 38.8 | 21 | | | | | | | MMT | 2020-07-21 |
A Straightforward Framework For Video Retrieval Using CLIP | ✓ Link | 11.3 | 22.7 | 29.2 | 56.5 | | 6.8 | 16.4 | 22.1 | 73 | | CLIP | 2021-02-24 |
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 11.2 | 26.9 | 34.8 | 25 | | | | | | | Collaborative Experts | 2019-07-31 |
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data | ✓ Link | 10.1 | 25.6 | 34.6 | 27 | | | | | | | MoEE | 2018-04-07 |
A Joint Sequence Fusion Model for Video Question Answering and Retrieval | ✓ Link | 9.1 | 21.2 | 34.1 | 36 | | | | | | | JSFusion | 2018-08-07 |
Learning from Video and Text via Large-Scale Discriminative Clustering | ✓ Link | 7.3 | 19.2 | 27.1 | 52 | | | | | | | Large-Scale Discriminative Clustering | 2017-07-27 |
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | ✓ Link | 7.2 | 19.6 | 27.9 | 40 | | | | | | | Text-Video Embedding | 2019-06-07 |
End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering | | 5.1 | 16.3 | 25.2 | 46 | | | | | | | CT-SAN | 2016-10-10 |
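The recall and rank metrics above can all be derived from a model's text-video similarity matrix. Below is a minimal sketch in NumPy, assuming the standard evaluation protocol in which the matrix is square and the correct video for text query i sits at column i; the data here is random and purely illustrative, not any listed model's scores.

```python
import numpy as np

def retrieval_metrics(sim):
    """R@1/R@5/R@10 (%), median rank (MedR) and mean rank (MnR) from a
    query-by-candidate similarity matrix whose ground-truth pairs lie on
    the diagonal."""
    order = np.argsort(-sim, axis=1)            # candidates sorted by descending similarity
    gt = np.arange(sim.shape[0])[:, None]       # ground-truth index for each query
    ranks = np.argmax(order == gt, axis=1) + 1  # 1-indexed rank of the correct match
    return {
        "R@1":  100.0 * np.mean(ranks <= 1),
        "R@5":  100.0 * np.mean(ranks <= 5),
        "R@10": 100.0 * np.mean(ranks <= 10),
        "MedR": float(np.median(ranks)),
        "MnR":  float(np.mean(ranks)),
    }

# Illustrative only: 1,000 text queries vs. 1,000 videos (LSMDC-style 1k test set).
rng = np.random.default_rng(0)
sim = rng.normal(size=(1000, 1000))
sim[np.arange(1000), np.arange(1000)] += 2.0    # give correct pairs a higher score
print(retrieval_metrics(sim))                   # t2v direction
print(retrieval_metrics(sim.T))                 # v2t direction uses the transpose
```

Transposing the similarity matrix swaps the query and candidate roles, which is why the t2v and v2t columns in the table come from the same matrix but generally differ.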