Paper | Code | T2V R@1 | T2V R@5 | T2V R@10 | V2T R@1 | V2T R@5 | V2T R@10 | T2V MdR | V2T MdR | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 57.9 | 80.0 | 84.6 | 57.1 | 79.9 | 85.0 | | | InternVideo2-6B | 2024-03-22 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 57.0 | 80.0 | 85.1 | 54.3 | 77.2 | 83.5 | | | InternVideo2-1B | 2024-03-22 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 55.5 | 74.3 | 79.6 | | | | | | VAST | 2023-05-29 |
Gramian Multimodal Representation Learning and Alignment | ✓ Link | 54.2 | | 80.7 | 52.3 | | 80.3 | | | GRAM | 2024-12-16 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 52.0 | 74.0 | 81.0 | 52.0 | 75.9 | 83.8 | | | vid-TLDR (UMT-L) | 2024-03-20 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 48.6 | 72.9 | 79.0 | 49.9 | 74.8 | 81.4 | | | UMT-L (ViT-L/16) | 2023-03-28 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 45.7 | 71.1 | 79.2 | | | | | | mPLUG-2 | 2023-02-01 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 43.2 | 69.3 | 79.0 | | | | | | HiTeA-17M | 2022-12-30 |
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ✓ Link | 39.9 | 66.1 | 74.6 | 39.8 | 67.8 | 76.2 | 2 | | LanguageBind(ViT-H/14) | 2023-10-03 |
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ✓ Link | 39.7 | 65.5 | 73.8 | 38.4 | 66.6 | 77.9 | 2 | | LanguageBind(ViT-L/14) | 2023-10-03 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 37.1 | 61.7 | 69.9 | | | | | | Singularity-17M | 2022-06-07 |
Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 36.9 | 61.1 | 69.3 | | | | | | Singularity-5M | 2022-06-07 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 36.1 | 60.1 | 70.3 | | | | | | HiTeA-5M | 2022-12-30 |
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ Link | 35.6 | 61.9 | 72.6 | | | | | | BT-Adapter | 2023-09-27 |
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | | 33.3 | 58.7 | 68.5 | | | | | | OmniVL | 2022-09-15 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 31.5 | 57.6 | 68.2 | 33.5 | 60.3 | 71.1 | | | InternVideo | 2022-12-06 |
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 29.5 | 55.2 | 66.3 | | | | 4 | | Clover | 2022-07-16 |
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | ✓ Link | 27.2 | 50.3 | 63.6 | | | | 5 | | MILES | 2022-04-26 |
Bridging Video-text Retrieval with Multiple Choice Questions | ✓ Link | 25.6 | 50.6 | 61.1 | | | | 5 | | Y. Ge et al. | 2022-01-13 |
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ Link | 23.8 | 47.3 | 57.9 | | | | 6 | | ALPRO | 2021-12-17 |
Object-aware Video-language Pre-training for Retrieval | ✓ Link | 23.5 | 50.4 | 59.8 | | | | 6 | | OA-Trans | 2021-12-01 |
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling | ✓ Link | 23.5 | 49.8 | 59.8 | | | | | | VIOLET | 2021-11-24 |
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval | | 22.6 | 45.9 | 58.9 | 22.5 | 45.2 | 56.8 | 7 | 7 | LaT | 2022-07-11 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 21.1 | 46.0 | 56.2 | | | | | | FROZEN | 2021-04-01 |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | ✓ Link | 20.2 | 46.4 | 58.5 | | | | 7 | | M. Bain et al. | 2021-04-01 |
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | ✓ Link | 16.6 | 46.9 | | | | | | | VideoCLIP | 2021-09-28 |