Paper | Code | T2V R@1 | T2V R@5 | T2V R@10 | T2V MedR | T2V MnR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MedR | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 55.9 | 78.3 | 85.1 | | | 53.7 | 77.5 | 84.1 | | InternVideo2-6B | 2024-03-22 |
Gramian Multimodal Representation Learning and Alignment | ✓ Link | 54.8 | | 83.9 | | | 52.9 | | 82.9 | | GRAM | 2024-12-16 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 51.9 | 75.3 | 82.5 | | | 50.9 | 73.4 | 81.8 | | InternVideo2-1B | 2024-03-22 |
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ Link | 50.0 | 73.2 | 81.4 | 1 | | | | | | VAST, HowToCaption-finetuned | 2023-10-07 |
Make Your Training Flexible: Towards Deployment-Efficient Video Models | ✓ Link | 49.9 | 71.0 | 79.6 | | | 49.4 | 73.9 | 82.4 | | FluxViT-B | 2025-03-18 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 49.3 | 68.3 | 73.9 | | | | | | | VAST | 2023-05-29 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 47.1 | 69.7 | 79.0 | | | | | | | mPLUG-2 | 2023-02-01 |
Make Your Training Flexible: Towards Deployment-Efficient Video Models | ✓ Link | 45.0 | 67.5 | 75.8 | | | 44.9 | 68.2 | 76.5 | | FluxViT-S | 2025-03-18 |
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ✓ Link | 44.8 | 70.0 | 78.7 | 2 | | 40.9 | 66.4 | 75.7 | 2 | LanguageBind (ViT-H/14) | 2023-10-03 |
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ✓ Link | 42.8 | 67.5 | 76.0 | 2 | | 38.3 | 65.8 | 77.8 | 3 | LanguageBind (ViT-L/14) | 2023-10-03 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 42.6 | 64.4 | 73.1 | | | 38.6 | 59.8 | 69.6 | | UMT-L (ViT-L/16) | 2023-03-28 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 42.1 | 63.9 | 72.4 | | | 37.7 | 59.8 | 69.4 | | vid-TLDR (UMT-L) | 2024-03-20 |
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ Link | 40.9 | 64.7 | 73.5 | | | | | | | BT-Adapter | 2023-09-27 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 40.7 | | | | | 39.6 | | | | InternVideo | 2022-12-06 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 37.6 | 63.8 | 72.6 | | | | | | | Florence | 2021-11-22 |
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | ✓ Link | 37.6 | 62.0 | 73.3 | 3 | | | | | | HowToCaption | 2023-10-07 |
ImageBind: One Embedding Space To Bind Them All | ✓ Link | 36.8 | 61.8 | 70.0 | | | | | | | ImageBind | 2023-05-09 |
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 34.6 | 58.4 | 66.6 | | | | | | | OmniVL | 2022-09-15 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 34.4 | 60.0 | 69.9 | | | | | | | HiTeA-17M | 2022-12-30 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 34.0 | 56.7 | 66.7 | | | | | | | Singularity-17M | 2022-06-07 |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | ✓ Link | 32.0 | 57.0 | 66.9 | 4 | 34.0 | | | | | CLIP4Clip | 2021-04-18 |
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | ✓ Link | 30.9 | 54.4 | 65.0 | | | | | | | Yatai Ji et al. | 2022-11-24 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 29.9 | 54.2 | 62.9 | | | | | | | HiTeA-5M | 2022-12-30 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 28.4 | 50.2 | 59.5 | | | | | | | Singularity-5M | 2022-06-07 |
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 26.4 | 49.5 | 60.0 | 6 | | | | | | Clover | 2022-07-16 |
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | ✓ Link | 26.1 | 47.2 | 56.9 | 7 | | | | | | MILES | 2022-04-26 |
Bridging Video-text Retrieval with Multiple Choice Questions | ✓ Link | 26.0 | 46.4 | 56.4 | 7 | | | | | | Y. Ge et al. | 2022-01-13 |
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling | ✓ Link | 25.9 | 49.5 | 59.7 | | | | | | | VIOLET | 2021-11-24 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 24.7 | 46.9 | 57.2 | 7 | | | | | | FROZEN | 2021-04-01 |
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ Link | 24.1 | 44.7 | 55.4 | 8 | | | | | | ALPRO | 2021-12-17 |
Object-aware Video-language Pre-training for Retrieval | ✓ Link | 23.4 | 47.5 | 55.6 | 8 | | | | | | OA-Trans | 2021-12-01 |
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval | | 23.4 | 44.1 | 53.3 | 8 | | 17.2 | 36.2 | 47.9 | 12 | LaT | 2022-07-11 |
Learning Audio-Video Modalities from Image Captions | | 19.4 | 39.5 | 50.3 | | | | | | | A. Nagrani et al. | 2022-04-01 |
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | ✓ Link | 14.6 | 34.4 | 44.1 | 15 | | | | | | HD-VILA | 2021-11-19 |
Multi-granularity Correspondence Learning from Long-term Noisy Videos | ✓ Link | 10.7 | 24.1 | | | | | | | | Norton | 2024-01-30 |
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | ✓ Link | 10.4 | 22.2 | 30.0 | | | | | | | VideoCLIP | 2021-09-28 |
End-to-End Learning of Visual Representations from Uncurated Instructional Videos | ✓ Link | 9.9 | 24.0 | 32.4 | | 29.5 | | | | | MIL-NCE | 2019-12-13 |
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | | 9.8 | 25.0 | 33.4 | | | | | | | TACo | 2021-08-23 |
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | ✓ Link | 8.0 | 21.3 | 29.3 | | | | | | | SSML | 2020-03-06 |
Multi-modal Transformer for Video Retrieval | ✓ Link | | 14.4 | | 66 | 148.1 | | | | | MMT | 2020-07-21 |
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | ✓ Link | | | 29.7 | 49 | | | | | | VATT-MBS | 2021-04-22 |
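Every row above reports the standard retrieval metrics: Recall@K (R@1/R@5/R@10, higher is better), Median Rank (MedR), and Mean Rank (MnR, both lower is better), for the text-to-video (T2V) and video-to-text (V2T) directions. Below is a minimal sketch of how these metrics are typically computed from a query-candidate similarity matrix; the function name, the diagonal ground-truth convention, and the random toy matrix are illustrative assumptions, not taken from any of the cited codebases.

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """Compute R@1/R@5/R@10, MedR, and MnR from a similarity matrix.

    sim[i, j] = similarity of query i to candidate j; the matching
    candidate for query i is assumed to sit at index i (diagonal GT).
    """
    # Sort candidates by descending similarity for each query.
    order = np.argsort(-sim, axis=1)
    gt = np.arange(sim.shape[0])[:, None]
    # 1-indexed rank of the ground-truth candidate per query.
    ranks = np.argmax(order == gt, axis=1) + 1

    return {
        "R@1":  float(np.mean(ranks <= 1) * 100),
        "R@5":  float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MedR": float(np.median(ranks)),
        "MnR":  float(np.mean(ranks)),
    }

# Toy example: 1000 text queries against 1000 videos (random scores).
rng = np.random.default_rng(0)
sim = rng.standard_normal((1000, 1000))
print(retrieval_metrics(sim))       # T2V columns
print(retrieval_metrics(sim.T))     # V2T columns: same metric, transposed matrix
```

The V2T columns are obtained by the same computation on the transposed similarity matrix, which is why methods in the table can report both directions from a single encoder pass.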