OpenCodePapers

Zero-Shot Video Retrieval on DiDeMo
Dataset Link
Leaderboard
| Paper | Code | t→v R@1 | t→v R@5 | t→v R@10 | v→t R@1 | v→t R@5 | v→t R@10 | t→v MedR | v→t MedR | Model | Release Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 57.9 | 80.0 | 84.6 | 57.1 | 79.9 | 85.0 | | | InternVideo2-6B | 2024-03-22 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 57.0 | 80.0 | 85.1 | 54.3 | 77.2 | 83.5 | | | InternVideo2-1B | 2024-03-22 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 55.5 | 74.3 | 79.6 | | | | | | VAST | 2023-05-29 |
| Gramian Multimodal Representation Learning and Alignment | ✓ Link | 54.2 | 80.7 | | 52.3 | 80.3 | | | | GRAM | 2024-12-16 |
| vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 52.0 | 74.0 | 81.0 | 52.0 | 75.9 | 83.8 | | | vid-TLDR (UMT-L) | 2024-03-20 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 48.6 | 72.9 | 79.0 | 49.9 | 74.8 | 81.4 | | | UMT-L (ViT-L/16) | 2023-03-28 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 45.7 | 71.1 | 79.2 | | | | | | mPLUG-2 | 2023-02-01 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 43.2 | 69.3 | 79.0 | | | | | | HiTeA-17M | 2022-12-30 |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ✓ Link | 39.9 | 66.1 | 74.6 | 39.8 | 67.8 | 76.2 | 2 | | LanguageBind (ViT-H/14) | 2023-10-03 |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ✓ Link | 39.7 | 65.5 | 73.8 | 38.4 | 66.6 | 77.9 | 2.0 | | LanguageBind (ViT-L/14) | 2023-10-03 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 37.1 | 61.7 | 69.9 | | | | | | Singularity-17M | 2022-06-07 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 36.9 | 61.1 | 69.3 | | | | | | Singularity-5M | 2022-06-07 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 36.1 | 60.1 | 70.3 | | | | | | HiTeA-5M | 2022-12-30 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ Link | 35.6 | 61.9 | 72.6 | | | | | | BT-Adapter | 2023-09-27 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 33.3 | 58.7 | 68.5 | | | | | | OmniVL | 2022-09-15 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 31.5 | 57.6 | 68.2 | 33.5 | 60.3 | 71.1 | | | InternVideo | 2022-12-06 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 29.5 | 55.2 | 66.3 | | | | 4 | | Clover | 2022-07-16 |
| MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | ✓ Link | 27.2 | 50.3 | 63.6 | | | | 5.0 | | MILES | 2022-04-26 |
| Bridging Video-text Retrieval with Multiple Choice Questions | ✓ Link | 25.6 | 50.6 | 61.1 | | | | 5.0 | | Y. Ge et al. | 2022-01-13 |
| Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ Link | 23.8 | 47.3 | 57.9 | | | | 6 | | ALPRO | 2021-12-17 |
| Object-aware Video-language Pre-training for Retrieval | ✓ Link | 23.5 | 50.4 | 59.8 | | | | 6.0 | | OA-Trans | 2021-12-01 |
| VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling | ✓ Link | 23.5 | 49.8 | 59.8 | | | | | | VIOLET | 2021-11-24 |
| LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval | | 22.6 | 45.9 | 58.9 | 22.5 | 45.2 | 56.8 | 7 | 7 | LaT | 2022-07-11 |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 21.1 | 46.0 | 56.2 | | | | | | FROZEN | 2021-04-01 |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 20.2 | 46.4 | 58.5 | | | | 7 | | M. Bain et al. | 2021-04-01 |
| VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | ✓ Link | 16.6 | 46.9 | | | | | | | VideoCLIP | 2021-09-28 |
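The metrics reported above are standard for retrieval benchmarks: Recall@K (the percentage of queries whose correct match ranks in the top K) and Median Rank (the median position of the correct match, lower is better). A minimal sketch of how they are computed from a text-video similarity matrix — assuming, as is conventional, that the ground-truth video for query i sits at index i:

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute Recall@K and Median Rank from a similarity matrix.

    sim[i, j] is the similarity of text query i to video j; the correct
    video for query i is assumed to be at index i (diagonal ground truth).
    """
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # candidates sorted by descending similarity
    # 1-based rank of the correct (diagonal) video for each query
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    return metrics

# Toy example: 4 queries whose correct videos rank 1st, 1st, 2nd, and 4th.
sim = np.array([
    [0.9, 0.1, 0.2, 0.3],
    [0.2, 0.8, 0.1, 0.4],
    [0.3, 0.7, 0.6, 0.1],
    [0.5, 0.6, 0.4, 0.2],
])
print(retrieval_metrics(sim))  # R@1 = 50.0, R@5 = R@10 = 100.0, MedR = 1.5
```

Video-to-text numbers come from the same computation applied to the transposed similarity matrix. Note this sketch assumes a single correct match per query; DiDeMo evaluation typically concatenates a video's descriptions into one paragraph query so this assumption holds.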