Paper | Code | T2V R@1 | T2V R@5 | T2V R@10 | T2V R@50 | T2V MdR | T2V MnR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MdR | V2T MnR | | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 74.2 | | | | | | 71.9 | | | | | | InternVideo2-6B | 2024-03-22 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 72.3 | 91.2 | 94.2 | | | | 68.5 | 89.8 | 93.8 | | | | vid-TLDR (UMT-L) | 2024-03-20 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 72.0 | 89.0 | 91.4 | | | | | | | | | | VAST | 2023-05-29 |
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 70.5 | | | | | | | | | | | | COSA | 2023-06-15 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 70.4 | 90.1 | 93.5 | | | | 65.7 | 89.6 | 93.3 | | | | UMT-L (ViT-L/16) | 2023-03-28 |
Gramian Multimodal Representation Learning and Alignment | ✓ Link | 67.3 | | 90.1 | | | | 63.5 | | 91.6 | | | | GRAM | 2024-12-16 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 61.5 | 85.3 | 90.4 | | | | | | | | | | VALOR | 2023-04-17 |
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | ✓ Link | 61.2 | 87.2 | 91.5 | | | | | | | | | | TESTA (ViT-B/16) | 2023-10-29 |
VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 61.2 | 85.8 | 91.0 | | | | | | | | | | VindLU | 2022-12-09 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 57.9 | | | | | | 59.1 | | | | | | InternVideo | 2022-12-06 |
RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ Link | 57.6 | 84.1 | 89.9 | | | | | | | | | | RTQ | 2023-12-01 |
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 56.8 | 81.6 | 88.7 | | | | | | | | | | VLAB | 2023-05-22 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 56.5 | 81.7 | 89.7 | | | | | | | | | | HiTeA | 2022-12-30 |
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | | 56.5 | 80.2 | 87.0 | | | | | | | | | | MuLTI | 2023-03-10 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 56.4 | 79.1 | 85.2 | | | | | | | | | | mPLUG-2 | 2023-02-01 |
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | ✓ Link | 55.3 | 82.0 | 89.3 | | 1.0 | | | | | | | | CLIP-ViP | 2022-09-14 |
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | ✓ Link | 54.6 | 78.4 | 85.1 | | 1.0 | | | | | | | | STAN | 2023-01-26 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 53.9 | 79.4 | 86.9 | | | | | | | | | | Singularity | 2022-06-07 |
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | ✓ Link | 52.7 | 79.3 | 86.6 | | 1.0 | 10.5 | | | | | | | DMAE (ViT-B/32) | 2023-09-20 |
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 52.7 | 77.8 | 85.2 | | 1.0 | 13.7 | 54.1 | 78.3 | 86.8 | 1.0 | 9.1 | | HunYuan_tvr (huge) | 2022-04-07 |
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 52.4 | 79.5 | 85.4 | | | | | | | | | | OmniVL | 2022-09-15 |
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 52.1 | 78.2 | 85.7 | | 1.0 | 11.1 | 54.8 | 79.9 | 87.2 | 1.0 | 7.1 | | HunYuan_tvr | 2022-04-07 |
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | ✓ Link | 52.0 | 79.4 | 87.5 | | 1.0 | 10.5 | 51.2 | 78.5 | 87.4 | 1.0 | 7.3 | | Cap4Video | 2022-12-31 |
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 50.1 | 76.7 | 85.6 | | 1.0 | | | | | | | | Clover | 2022-07-16 |
Disentangled Representation Learning for Text-Video Retrieval | ✓ Link | 49.0 | 76.5 | 84.5 | | 2.0 | 11.5 | 49.9 | | 83.3 | 2.0 | 7.9 | | DRL | 2022-03-14 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 48.9 | 75.5 | 83.3 | | 2.0 | 14.1 | 50.3 | 75.1 | 82.9 | 1.0 | 10.3 | | DiffusionRet+QB-Norm | 2023-03-17 |
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | ✓ Link | 48.6 | 76.0 | 84.5 | | 2.0 | 12.9 | 48.1 | 74.2 | 85.7 | 2.0 | 9.8 | | PAU | 2023-09-29 |
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 47.9 | 76.5 | 84.1 | | | | | | | | | | VIOLETv2 | 2022-09-04 |
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | ✓ Link | 47.8 | 79.3 | | | | 12.6 | 47.8 | | 76.8 | | 10.5 | | X-CLIP | 2022-07-15 |
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | ✓ Link | 46.9 | 74.9 | 82.7 | | 2.0 | 12.1 | 46.2 | 73.0 | 82.7 | 2.0 | 8.7 | | HBI | 2023-03-25 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 46.7 | 74.7 | 82.7 | | 2.0 | 14.3 | 46.2 | 74.3 | 82.2 | 2.0 | 10.7 | | DiffusionRet | 2023-03-17 |
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 43.8 | 71.4 | 79.9 | | 2.0 | 16.3 | 45.5 | | 80.5 | 2.0 | 10.2 | | CAMoE | 2021-09-09 |
Cross Modal Retrieval with Querybank Normalisation | ✓ Link | 43.5 | 71.4 | 80.9 | | 2.0 | | | | | | | | QB-Norm+CLIP4Clip | 2021-12-23 |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | 43.4 | 70.2 | 80.6 | | 2.0 | 17.5 | | | | | | | CLIP4Clip | 2021-04-18 |
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ Link | 35.9 | 67.5 | 78.8 | | 3.0 | | | | | | | | ALPRO | 2021-12-17 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 31.0 | 59.8 | 72.4 | | 3.0 | | | | | | | | FROZEN | 2021-04-01 |
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | ✓ Link | 28.8 | 57.4 | 69.1 | | 4.0 | | | | | | | | HD-VILA | 2021-11-19 |
Rudder: A Cross Lingual Video and Text Retrieval Dataset | ✓ Link | 16.3 | | 56.5 | | 8.0 | 40.2 | 15.0 | | 54.9 | 8.0 | 39.6 | | PO Loss | 2021-03-09 |
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 16.1 | 41.1 | 54.4 | 82.7 | 8.3 | 43.7 | | | | | | | Collaborative Experts | 2019-07-31 |
Aurora (ours) | | 53.1 | 77.4 | 85.3 | | 1.0 | | | | | | | | Aurora (ours, r=64) | |
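For reference, the R@K, MdR (median rank), and MnR (mean rank) figures reported above can all be derived from a single text-video similarity matrix. The sketch below is illustrative only (the function name `retrieval_metrics` and the toy matrix are not from any of the listed papers); it assumes the standard benchmark convention that the i-th text caption is the ground-truth match for the i-th video:

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10, 50)):
    """Compute Recall@K (%), median rank (MdR) and mean rank (MnR)
    from sim[i, j] = score(text_i, video_j), assuming text i matches video i."""
    # Sort candidates per query, best match first.
    order = np.argsort(-sim, axis=1)
    gt = np.arange(sim.shape[0])
    # 1-based rank of the ground-truth video for each text query.
    ranks = np.argmax(order == gt[:, None], axis=1) + 1
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MdR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    return metrics

# Toy example: 3 text queries vs. 3 videos.
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.7, 0.2, 0.3]])   # query 2's true video is ranked 2nd
m = retrieval_metrics(sim)          # T2V metrics; use sim.T for V2T
```

The video-to-text columns are the same computation run on the transposed matrix, which is why T2V and V2T numbers can differ for one model.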