Paper | Code | Accuracy | Score | Model | Date |
--- | --- | --- | --- | --- | --- |
Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ Link | 80.3 | 4.2 | Tarsier (34B) | 2024-06-30 |
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | ✓ Link | 80.3 | 3.9 | Flash-VStream | 2024-06-12 |
LinVT: Empower Your Image-level Large Language Model to Understand Videos | ✓ Link | 80.2 | 4.4 | LinVT-Qwen2-VL (7B) | 2024-12-06 |
VILA: On Pre-training for Visual Language Models | ✓ Link | 80.1 | | VILA1.5-40B | 2023-12-12 |
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ Link | 79.9 | 4.2 | PLLaVA (34B) | 2024-04-25 |
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | ✓ Link | 79.9 | 4.1 | SlowFast-LLaVA-34B | 2024-07-22 |
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | ✓ Link | 79.6 | 4.1 | IG-VLM-34B | 2024-03-27 |
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | ✓ Link | 79.4 | 4.1 | TS-LLaVA-34B | 2024-11-17 |
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | ✓ Link | 77.1 | 4.0 | PPLLaVA-7B | 2024-11-04 |
Elysium: Exploring Object-level Perception in Videos via MLLM | ✓ Link | 75.8 | 3.7 | Elysium | 2024-03-25 |
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | ✓ Link | 75.2 | 2.9 | MovieChat | 2023-07-31 |
ST-LLM: Large Language Models Are Effective Temporal Learners | ✓ Link | 74.6 | 3.9 | ST-LLM | 2024-03-30 |
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | ✓ Link | 73.92 | | MiniGPT4-video-7B | 2024-04-04 |
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ✓ Link | 73.2 | 3.9 | Video-LaVIT | 2024-02-05 |
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | ✓ Link | 72.4 | 3.6 | VideoGPT+ | 2024-06-13 |
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token | ✓ Link | 70.9 | 4.0 | LLaVA-Mini | 2025-01-07 |
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | ✓ Link | 70.7 | 3.9 | Video-LLaVA-7B | 2023-11-16 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 70.0 | 3.9 | VideoChat2 | 2023-11-28 |
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 70.0 | 3.7 | LLaMA-VID-13B (2 Token) | 2023-11-28 |
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 69.7 | 3.7 | LLaMA-VID-7B (2 Token) | 2023-11-28 |
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | ✓ Link | 69.3 | 3.7 | Chat-UniVi-7B | 2023-11-14 |
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ Link | 67.0 | 3.6 | BT-Adapter (zero-shot) | 2023-09-27 |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ✓ Link | 64.9 | 3.3 | Video-ChatGPT-7B | 2023-06-08 |
VideoChat: Chat-Centric Video Understanding | ✓ Link | 56.3 | 2.8 | Video Chat-7B | 2023-05-10 |
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | ✓ Link | 54.9 | 3.1 | LLaMA Adapter-7B | 2023-04-28 |
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | ✓ Link | 51.6 | 2.5 | Video LLaMA-7B | 2023-06-05 |
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 33.8 | | FrozenBiLM | 2022-06-16 |