Video Retrieval on ActivityNet

Leaderboard

Paper	Code	text-to-video R@1	text-to-video R@5	text-to-video R@10	text-to-video R@50	text-to-video Mean Rank	text-to-video Median Rank	video-to-text R@1	video-to-text R@5	video-to-text Mean Rank	video-to-text Median Rank	video-to-text R@10	video-to-text R@50	ModelName	ReleaseDate
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	✓ Link	74.1						69.7						InternVideo2-6B	2024-03-22
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	✓ Link	70.5	90.9	95.5										VAST	2023-05-29
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	✓ Link	70.1	90.8	95.3										VALOR	2023-04-17
Gramian Multimodal Representation Learning and Alignment	✓ Link	69.9		96.1				66.9				95.4		GRAM	2024-12-16
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	✓ Link	67.3												COSA	2023-06-15
Unmasked Teacher: Towards Training-Efficient Video Foundation Models	✓ Link	66.8	89.1	94.9				64.4	89.1			94.8		UMT-L (ViT-L/16)	2023-03-28
vid-TLDR: Training Free Token merging for Light-weight Video Transformer	✓ Link	66.7	88.6	94.4				63.9	88.7			94.5		vid-TLDR (UMT-L)	2024-03-20
InternVideo: General Video Foundation Models via Generative and Discriminative Learning	✓ Link	62.2						62.8						InternVideo	2022-12-06
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment	✓ Link	61.4	85.7	92.6			1							CLIP-ViP	2022-09-14
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations		57.3	84.8	93.1		4.0	1	57.7	85.7	3.4	1	93.9		HunYuan_tvr	2022-04-07
VindLU: A Recipe for Effective Video-and-Language Pretraining	✓ Link	55.0	81.4	89.7										VindLU	2022-12-09
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding	✓ Link	54.8	80.8	89.6										TESTA (ViT-B/16)	2023-10-29
RTQ: Rethinking Video-language Understanding Based on Image-text Model	✓ Link	53.5	81.4	91.9										RTQ	2023-12-01
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning	✓ Link	53.4	80.7	89.2		5.3	1.0							DMAE (ViT-B/32)	2023-09-20
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	✓ Link	51.0	77.7	87.6		6.3	1							CAMoE	2021-09-09
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	✓ Link	50.6	78.7		98.1		1	50.6	78.9		1		98.4	EMCL-Net++	2022-11-21
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training		49.7	77.1	86.7										HiTeA	2022-12-30
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	✓ Link	48.1		85.7		6.8	2.0	47.4	76.3	6.7	2.0	86.7		DiffusionRet+QB-Norm	2023-03-17
Revealing Single Frame Bias for Video-and-Language Learning	✓ Link	47.1	75.5	85.5										Singularity	2022-06-07
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval	✓ Link	46.2	77.0	87.6		5.7	2	46.7	77.1	5.5	2	88.0		CenterCLIP (ViT-B/16)	2022-05-02
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval	✓ Link	46.2	75.5			6.8		46.4	75.9	6.4				X-CLIP	2022-07-15
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	✓ Link	45.8	75.6	86.3		6.5	2.0	43.8	75.3	6.3	2.0	86.7		DiffusionRet	2023-03-17
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning	✓ Link	42.2	73.0	84.6		6.6	2.0	42.4	73.0	6.5	2.0	86.0		HBI	2023-03-25
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	✓ Link	41.2	72.7				2	42.7	74		2		98.3	EMCL-Net	2022-11-21
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	✓ Link	40.5	73.4		98.2	7.5	2							CLIP4Clip	2021-04-18
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment		30.4	61.2		93.4		3.0							TACo	2021-08-23
Multi-modal Transformer for Video Retrieval	✓ Link	28.7	61.4		94.5	16	3.3							MMT-Pretrained	2020-07-21
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions	✓ Link	28.5	57.4		94		4							HD-VILA	2021-11-19
Video and Text Matching with Conditioned Embeddings	✓ Link	25.4	59.1					26.1	60					Ours	2021-10-21
Multi-modal Transformer for Video Retrieval	✓ Link	22.7	54.2		93.2	20.8	5							MMT	2020-07-21
Use What You Have: Video Retrieval Using Representations From Collaborative Experts	✓ Link	20.5	47.7	63.9	91.4	23.1	6							Collaborative Experts	2019-07-31
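
The column headers above are standard retrieval metrics: R@K is the percentage of queries whose correct item appears in the top K results (higher is better), while Mean Rank and Median Rank summarize the position of the correct item (lower is better). The sketch below shows one common way these could be computed. It assumes a square similarity matrix in which query i is paired with candidate i. The function names `retrieval_ranks` and `summarize` and the toy scores are illustrative only and are not taken from any listed paper's evaluation code.

```python
from statistics import mean, median


def retrieval_ranks(similarity):
    """Return the 1-based rank of the ground-truth item for each query.

    Assumption: similarity[i][j] scores query i against candidate j, and
    the correct candidate for query i is at index i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of candidates scored strictly higher than the
        # correct one (ties are resolved in the query's favor).
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def summarize(ranks, ks=(1, 5, 10, 50)):
    """Recall@K as a percentage, plus mean and median rank."""
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Mean Rank"] = mean(ranks)
    metrics["Median Rank"] = median(ranks)
    return metrics


# Toy 4x4 text-to-video similarity matrix (made-up numbers).
sim = [
    [0.90, 0.10, 0.20, 0.30],  # query 0: correct video ranked 1st
    [0.80, 0.50, 0.10, 0.20],  # query 1: correct video ranked 2nd
    [0.10, 0.20, 0.70, 0.30],  # query 2: correct video ranked 1st
    [0.60, 0.55, 0.40, 0.10],  # query 3: correct video ranked 4th
]

t2v = summarize(retrieval_ranks(sim), ks=(1, 5))

# Video-to-text swaps the roles of queries and candidates, which
# corresponds to evaluating the transposed matrix.
sim_t = [list(col) for col in zip(*sim)]
v2t = summarize(retrieval_ranks(sim_t), ks=(1, 5))

print(t2v)
print(v2t)
```

Individual papers may differ in tie handling, test split, and whether re-ranking is applied at inference, so scores in the table are not guaranteed to be computed identically.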
