Paper | Code | Accuracy | Model | Date |
LinVT: Empower Your Image-level Large Language Model to Understand Videos | ✓ Link | 85.5 | LinVT-Qwen2-VL (7B) | 2024-12-06 |
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | ✓ Link | 85.5 | InternVL-2.5(8B) | 2024-12-06 |
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding | ✓ Link | 84.5 | VideoLLaMA3(7B) | 2025-01-22 |
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ Link | 84.1 | PLM-8B | 2025-04-17 |
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | ✓ Link | 83.73 | BIMBA-LLaVA-Qwen2-7B | 2025-03-12 |
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ Link | 83.4 | PLM-3B | 2025-04-17 |
Video Instruction Tuning With Synthetic Data | | 83.2 | LLaVA-Video | 2024-10-03 |
NVILA: Efficient Frontier Visual Language Models | ✓ Link | 82.2 | NVILA(8B) | 2024-12-05 |
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | ✓ Link | 81.8 | Oryx-1.5(7B) | 2024-09-19 |
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ Link | 81.2 | Qwen2-VL(7B) | 2024-09-18 |
LongVILA: Scaling Long-Context Visual Language Models for Long Videos | ✓ Link | 80.7 | LongVILA(7B) | 2024-08-19 |
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | ✓ Link | 80.3 | PLM-1B | 2025-04-17 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 80.2 | LLaVA-OV(72B) | 2024-08-06 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 79.5 | VideoChat2_HD_mistral | 2023-11-28 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 79.4 | LLaVA-OV(7B) | 2024-08-06 |
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | ✓ Link | 79.1 | LLaVA-NeXT-Interleave(14B) | 2024-07-10 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 78.6 | VideoChat2_mistral | 2023-11-28 |
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | ✓ Link | 78.6 | mPLUG-Owl3(8B) | 2024-08-09 |
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | ✓ Link | 78.2 | LLaVA-NeXT-Interleave(7B) | 2024-07-10 |
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | ✓ Link | 77.9 | LLaVA-NeXT-Interleave(DPO) | 2024-07-10 |
Vamos: Versatile Action Models for Video Understanding | ✓ Link | 77.3 | Vamos | 2023-11-22 |
ViLA: Efficient Video-Language Alignment for Video Question Answering | ✓ Link | 75.6 | ViLA (3B) | 2023-12-13 |
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ Link | 75.6 | VideoLLaMA2.1(7B) | 2024-06-11 |
Large Language Models are Temporal and Causal Reasoners for Video Question Answering | ✓ Link | 75.5 | LLaMA-VQA (33B) | 2023-10-24 |
ViLA: Efficient Video-Language Alignment for Video Question Answering | ✓ Link | 74.4 | ViLA (3B, 4 frames) | 2023-12-13 |
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion | ✓ Link | 73.9 | CREMA | 2024-02-08 |
Self-Chained Image-Language Model for Video Localization and Question Answering | ✓ Link | 73.8 | SeViLA | 2023-05-11 |
Text-Conditioned Resampler For Long Form Video Understanding | | 73.5 | TCR | 2023-12-19 |
Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | ✓ Link | 72.1 | LSTP | 2024-02-25 |
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | | 72.0 | Mirasol3B | 2023-11-09 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 68.6 | VideoChat2 | 2023-11-28 |
RTQ: Rethinking Video-language Understanding Based on Image-text Model | ✓ Link | 63.2 | RTQ | 2023-12-01 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 63.1 | HiTeA | 2022-12-30 |
Contrastive Video Question Answering via Video Graph Transformer | ✓ Link | 60.7 | CoVGT(PT) | 2023-02-27 |
Semi-Parametric Video-Grounded Text Generation | | 60.6 | SeViT | 2023-01-27 |
ViperGPT: Visual Inference via Python Execution for Reasoning | ✓ Link | 60.0 | ViperGPT(0-shot) | 2023-03-14 |
Contrastive Video Question Answering via Video Graph Transformer | ✓ Link | 60.0 | CoVGT | 2023-02-27 |
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | ✓ Link | 58.83 | GF | 2024-01-03 |
Verbs in Action: Improving verb understanding in video-language models | ✓ Link | 58.6 | VFC | 2023-04-13 |
ATM: Action Temporality Modeling for Video Question Answering | | 58.3 | ATM | 2023-09-05 |
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | ✓ Link | 57.2 | MIST | 2022-12-19 |
Video Graph Transformer for Video Question Answering | ✓ Link | 56.9 | VGT(PT) | 2022-07-12 |
Paxion: Patching Action Knowledge in Video-Language Foundation Models | ✓ Link | 56.9 | PAXION | 2023-05-18 |
Video Graph Transformer for Video Question Answering | ✓ Link | 55.0 | VGT | 2022-07-12 |
Revisiting the "Video" in Video-Language Understanding | ✓ Link | 54.3 | ATP | 2022-06-03 |
(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering | | 53.4 | P3D-G | 2022-02-18 |
Video as Conditional Graph Hierarchy for Multi-Granular Question Answering | ✓ Link | 51.4 | HQGA | 2021-12-12 |