| Paper | Code | Accuracy | Score | Model | Date |
|---|---|---|---|---|---|
| Composing Ensembles of Pre-trained Models via Iterative Consensus | | 61.2 | | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | 2022-10-20 |
| Composing Ensembles of Pre-trained Models via Iterative Consensus | | 58.4 | | GPT-2 + CLIP-32 (Zero-Shot) | 2022-10-20 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 56.1 | | VideoCoCa | 2022-12-09 |
| Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | | 51.13 | | Mirasol3B | 2023-11-09 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 50.4 | | VAST | 2023-05-29 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 49.9 | | COSA | 2023-06-15 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ Link | 49.8 | | MA-LMM | 2024-04-08 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 49.1 | 3.3 | VideoChat2 | 2023-11-28 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 48.6 | | VALOR | 2023-04-17 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 47.9 | | UMT-L (ViT-L/16) | 2023-03-28 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 47.5 | 3.3 | LLaMA-VID-13B (2 Token) | 2023-11-28 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 47.4 | 3.3 | LLaMA-VID-7B (2 Token) | 2023-11-28 |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | ✓ Link | 46.4 | 3.3 | Chat-UniVi-13B | 2023-11-14 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ Link | 46.1 | 3.6 | BT-Adapter (zero-shot) | 2023-09-27 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | ✓ Link | 45.7 | 3.1 | MovieChat | 2023-07-31 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | ✓ Link | 45.3 | 3.3 | Video-LLaVA | 2023-11-16 |
| TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | ✓ Link | 45.0 | | TESTA (ViT-B/16) | 2023-10-29 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 44.8 | | FrozenBiLM+ | 2023-08-18 |
| VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 44.7 | | VindLU | 2022-12-09 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 44.1 | | Singularity-temporal | 2022-06-07 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 43.2 | | FrozenBiLM | 2022-06-16 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 43.1 | | Singularity | 2022-06-07 |
| Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval | ✓ Link | 41.4 | | Text + Text (no Multimodal Pretext Training) | 2022-06-05 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 40.0 | | All-in-one+ | 2023-08-18 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 39.7 | | VIOLET+ | 2023-08-18 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | ✓ Link | 38.9 | | Just Ask (fine-tune) | 2020-12-01 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | ✓ Link | 38.2 | | LocVLM-Vid-B+ | 2024-04-11 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | ✓ Link | 37.4 | | LocVLM-Vid-B | 2024-04-11 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ✓ Link | 35.2 | 2.7 | Video-ChatGPT | 2023-06-08 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | ✓ Link | 34.2 | 2.7 | LLaMA Adapter V2 | 2023-04-28 |
| ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ✓ Link | 31.8 | | E-SA | 2019-06-06 |
| ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ✓ Link | 27.1 | | E-MN | 2019-06-06 |
| VideoChat: Chat-Centric Video Understanding | ✓ Link | 26.5 | 2.2 | VideoChat | 2023-05-10 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 25.9 | | FrozenBiLM (0-shot) | 2022-06-16 |
| ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ✓ Link | 25.1 | | E-VQA | 2019-06-06 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | ✓ Link | 12.2 | | Just Ask (0-shot) | 2020-12-01 |