Paper | Code | T2V R@1 | T2V R@5 | T2V R@10 | T2V R@50 | T2V MdR | T2V MnR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MdR | V2T MnR | | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 74.2 | | | | | | 71.9 | | | | | | InternVideo2-6B | 2024-03-22 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 72.3 | 91.2 | 94.2 | | | | 68.5 | 89.8 | 93.8 | | | | vid-TLDR (UMT-L) | 2024-03-20 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 72.0 | 89.0 | 91.4 | | | | | | | | | | VAST | 2023-05-29 |
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 70.5 | | | | | | | | | | | | COSA | 2023-06-15 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 70.4 | 90.1 | 93.5 | | | | 65.7 | 89.6 | 93.3 | | | | UMT-L (ViT-L/16) | 2023-03-28 |
Gramian Multimodal Representation Learning and Alignment | ✓ Link | 67.3 | | 90.1 | | | | 63.5 | | 91.6 | | | | GRAM | 2024-12-16 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 61.5 | 85.3 | 90.4 | | | | | | | | | | VALOR | 2023-04-17 |
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | ✓ Link | 61.2 | 87.2 | 91.5 | | | | | | | | | | TESTA (ViT-B/16) | 2023-10-29 |
VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 61.2 | 85.8 | 91.0 | | | | | | | | | | VindLU | 2022-12-09 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 57.9 | | | | | | 59.1 | | | | | | InternVideo | 2022-12-06 |
RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ Link | 57.6 | 84.1 | 89.9 | | | | | | | | | | RTQ | 2023-12-01 |
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 56.8 | 81.6 | 88.7 | | | | | | | | | | VLAB | 2023-05-22 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 56.5 | 81.7 | 89.7 | | | | | | | | | | HiTeA | 2022-12-30 |
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | | 56.5 | 80.2 | 87.0 | | | | | | | | | | MuLTI | 2023-03-10 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 56.4 | 79.1 | 85.2 | | | | | | | | | | mPLUG-2 | 2023-02-01 |
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | ✓ Link | 55.3 | 82.0 | 89.3 | | 1.0 | | | | | | | | CLIP-ViP | 2022-09-14 |
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | ✓ Link | 54.6 | 78.4 | 85.1 | | 1.0 | | | | | | | | STAN | 2023-01-26 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 53.9 | 79.4 | 86.9 | | | | | | | | | | Singularity | 2022-06-07 |
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | ✓ Link | 52.7 | 79.3 | 86.6 | | 1.0 | 10.5 | | | | | | | DMAE (ViT-B/32) | 2023-09-20 |
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 52.7 | 77.8 | 85.2 | | 1.0 | 13.7 | 54.1 | 78.3 | 86.8 | 1.0 | 9.1 | | HunYuan_tvr (huge) | 2022-04-07 |
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 52.4 | 79.5 | 85.4 | | | | | | | | | | OmniVL | 2022-09-15 |
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 52.1 | 78.2 | 85.7 | | 1.0 | 11.1 | 54.8 | 79.9 | 87.2 | 1.0 | 7.1 | | HunYuan_tvr | 2022-04-07 |
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | ✓ Link | 52.0 | 79.4 | 87.5 | | 1.0 | 10.5 | 51.2 | 78.5 | 87.4 | 1.0 | 7.3 | | Cap4Video | 2022-12-31 |
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 50.1 | 76.7 | 85.6 | | 1.0 | | | | | | | | Clover | 2022-07-16 |
Disentangled Representation Learning for Text-Video Retrieval | ✓ Link | 49.0 | 76.5 | 84.5 | | 2.0 | 11.5 | 49.9 | | 83.3 | 2.0 | 7.9 | | DRL | 2022-03-14 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 48.9 | 75.5 | 83.3 | | 2.0 | 14.1 | 50.3 | 75.1 | 82.9 | 1.0 | 10.3 | | DiffusionRet+QB-Norm | 2023-03-17 |
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | ✓ Link | 48.6 | 76.0 | 84.5 | | 2.0 | 12.9 | 48.1 | 74.2 | 85.7 | 2.0 | 9.8 | | PAU | 2023-09-29 |
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 47.9 | 76.5 | 84.1 | | | | | | | | | | VIOLETv2 | 2022-09-04 |
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | ✓ Link | 47.8 | 79.3 | | | | 12.6 | 47.8 | | 76.8 | | 10.5 | | X-CLIP | 2022-07-15 |
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | ✓ Link | 46.9 | 74.9 | 82.7 | | 2.0 | 12.1 | 46.2 | 73.0 | 82.7 | 2.0 | 8.7 | | HBI | 2023-03-25 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 46.7 | 74.7 | 82.7 | | 2.0 | 14.3 | 46.2 | 74.3 | 82.2 | 2.0 | 10.7 | | DiffusionRet | 2023-03-17 |
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 43.8 | 71.4 | 79.9 | | 2.0 | 16.3 | 45.5 | | 80.5 | 2.0 | 10.2 | | CAMoE | 2021-09-09 |
Cross Modal Retrieval with Querybank Normalisation | ✓ Link | 43.5 | 71.4 | 80.9 | | 2.0 | | | | | | | | QB-Norm+CLIP4Clip | 2021-12-23 |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | 43.4 | 70.2 | 80.6 | | 2.0 | 17.5 | | | | | | | CLIP4Clip | 2021-04-18 |
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ Link | 35.9 | 67.5 | 78.8 | | 3.0 | | | | | | | | ALPRO | 2021-12-17 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 31.0 | 59.8 | 72.4 | | 3.0 | | | | | | | | FROZEN | 2021-04-01 |
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | ✓ Link | 28.8 | 57.4 | 69.1 | | 4.0 | | | | | | | | HD-VILA | 2021-11-19 |
Rudder: A Cross Lingual Video and Text Retrieval Dataset | ✓ Link | 16.3 | | 56.5 | | 8.0 | 40.2 | 15.0 | | 54.9 | 8.0 | 39.6 | | PO Loss | 2021-03-09 |
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 16.1 | 41.1 | 54.4 | 82.7 | 8.3 | 43.7 | | | | | | | Collaborative Experts | 2019-07-31 |
Aurora (ours) | | 53.1 | 77.4 | 85.3 | | 1.0 | | | | | | | | Aurora (ours, r=64) | |
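For reference, the R@K, MdR (median rank), and MnR (mean rank) figures reported above can all be derived from a single text-video similarity matrix. The sketch below is illustrative only (the function name `retrieval_metrics` and the toy matrix are not from any of the listed papers); it assumes the standard benchmark convention that the i-th text caption is the ground-truth match for the i-th video:

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10, 50)):
    """Compute Recall@K (%), median rank (MdR) and mean rank (MnR)
    from sim[i, j] = score(text_i, video_j), assuming text i matches video i."""
    # Sort candidates per query, best match first.
    order = np.argsort(-sim, axis=1)
    gt = np.arange(sim.shape[0])
    # 1-based rank of the ground-truth video for each text query.
    ranks = np.argmax(order == gt[:, None], axis=1) + 1
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MdR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    return metrics

# Toy example: 3 text queries vs. 3 videos.
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.7, 0.2, 0.3]])   # query 2's true video is ranked 2nd
m = retrieval_metrics(sim)          # T2V metrics; use sim.T for V2T
```

The video-to-text columns are the same computation run on the transposed matrix, which is why T2V and V2T numbers can differ for one model.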