InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 74.1 | | | | | | 69.7 | | | | | | InternVideo2-6B | 2024-03-22 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 70.5 | 90.9 | 95.5 | | | | | | | | | | VAST | 2023-05-29 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 70.1 | 90.8 | 95.3 | | | | | | | | | | VALOR | 2023-04-17 |
Gramian Multimodal Representation Learning and Alignment | ✓ Link | 69.9 | | 96.1 | | | | 66.9 | | | | 95.4 | | GRAM | 2024-12-16 |
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 67.3 | | | | | | | | | | | | COSA | 2023-06-15 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 66.8 | 89.1 | 94.9 | | | | 64.4 | 89.1 | | | 94.8 | | UMT-L (ViT-L/16) | 2023-03-28 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 66.7 | 88.6 | 94.4 | | | | 63.9 | 88.7 | | | 94.5 | | vid-TLDR (UMT-L) | 2024-03-20 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 62.2 | | | | | | 62.8 | | | | | | InternVideo | 2022-12-06 |
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | ✓ Link | 61.4 | 85.7 | 92.6 | | | 1 | | | | | | | CLIP-ViP | 2022-09-14 |
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | | 57.3 | 84.8 | 93.1 | | 4.0 | 1 | 57.7 | 85.7 | 3.4 | 1 | 93.9 | | HunYuan_tvr | 2022-04-07 |
VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 55.0 | 81.4 | 89.7 | | | | | | | | | | VindLU | 2022-12-09 |
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | ✓ Link | 54.8 | 80.8 | 89.6 | | | | | | | | | | TESTA (ViT-B/16) | 2023-10-29 |
RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ Link | 53.5 | 81.4 | 91.9 | | | | | | | | | | RTQ | 2023-12-01 |
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | ✓ Link | 53.4 | 80.7 | 89.2 | | 5.3 | 1.0 | | | | | | | DMAE (ViT-B/32) | 2023-09-20 |
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | ✓ Link | 51.0 | 77.7 | 87.6 | | 6.3 | 1 | | | | | | | CAMoE | 2021-09-09 |
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 50.6 | 78.7 | | 98.1 | 1 | | 50.6 | 78.9 | 1 | | | 98.4 | EMCL-Net++ | 2022-11-21 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 49.7 | 77.1 | 86.7 | | | | | | | | | | HiTeA | 2022-12-30 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 48.1 | | 85.7 | | 6.8 | 2.0 | 47.4 | 76.3 | 6.7 | 2.0 | 86.7 | | DiffusionRet+QB-Norm | 2023-03-17 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 47.1 | 75.5 | 85.5 | | | | | | | | | | Singularity | 2022-06-07 |
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval | ✓ Link | 46.2 | 77.0 | 87.6 | | 5.7 | 2 | 46.7 | 77.1 | 5.5 | 2 | 88.0 | | CenterCLIP (ViT-B/16) | 2022-05-02 |
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | ✓ Link | 46.2 | 75.5 | | | 6.8 | | 46.4 | 75.9 | 6.4 | | | | X-CLIP | 2022-07-15 |
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | ✓ Link | 45.8 | 75.6 | 86.3 | | 6.5 | 2.0 | 43.8 | 75.3 | 6.3 | 2.0 | 86.7 | | DiffusionRet | 2023-03-17 |
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | ✓ Link | 42.2 | 73.0 | 84.6 | | 6.6 | 2.0 | 42.4 | 73.0 | 6.5 | 2.0 | 86.0 | | HBI | 2023-03-25 |
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 41.2 | 72.7 | | | 2 | | 42.7 | 74 | 2 | | | 98.3 | EMCL-Net | 2022-11-21 |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | 40.5 | 73.4 | | 98.2 | 7.5 | 2 | | | | | | | CLIP4Clip | 2021-04-18 |
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | | 30.4 | 61.2 | | 93.4 | | 3.0 | | | | | | | TACo | 2021-08-23 |
Multi-modal Transformer for Video Retrieval | ✓ Link | 28.7 | 61.4 | | 94.5 | 16 | 3.3 | | | | | | | MMT-Pretrained | 2020-07-21 |
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | ✓ Link | 28.5 | 57.4 | | 94 | | 4 | | | | | | | HD-VILA | 2021-11-19 |
Video and Text Matching with Conditioned Embeddings | ✓ Link | 25.4 | 59.1 | | | | | 26.1 | 60 | | | | | Ours | 2021-10-21 |
Multi-modal Transformer for Video Retrieval | ✓ Link | 22.7 | 54.2 | | 93.2 | 20.8 | 5 | | | | | | | MMT | 2020-07-21 |
Use What You Have: Video Retrieval Using Representations From Collaborative Experts | ✓ Link | 20.5 | 47.7 | 63.9 | 91.4 | 23.1 | 6 | | | | | | | Collaborative Experts | 2019-07-31 |