Paper | Code | Accuracy | Score | Model | Date |
--- | --- | --- | --- | --- | --- |
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | ✓ Link | 72.4 | 3.4 | Flash-VStream | 2024-06-12 |
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ Link | 68.7 | 3.6 | PLLaVA (34B) | 2024-04-25 |
Elysium: Exploring Object-level Perception in Videos via MLLM | ✓ Link | 67.5 | 3.2 | Elysium | 2024-03-25 |
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | ✓ Link | 67.4 | 3.7 | SlowFast-LLaVA-34B | 2024-07-22 |
Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ Link | 66.4 | 3.7 | Tarsier (34B) | 2024-06-30 |
LinVT: Empower Your Image-level Large Language Model to Understand Videos | ✓ Link | 66.2 | 4.0 | LinVT-Qwen2-VL (7B) | 2024-12-06 |
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | ✓ Link | 66.2 | 3.6 | TS-LLaVA-34B | 2024-11-17 |
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | ✓ Link | 64.3 | 3.5 | PPLLaVA-7B | 2024-11-04 |
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | ✓ Link | 63.8 | 3.5 | IG-VLM | 2024-03-27 |
ST-LLM: Large Language Models Are Effective Temporal Learners | ✓ Link | 63.2 | 3.4 | ST-LLM | 2024-03-30 |
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | ✓ Link | 62.1 | 3.5 | CAT-7B | 2024-03-07 |
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | ✓ Link | 60.6 | 3.6 | VideoGPT+ | 2024-06-13 |
Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | | 60.5 | 3.3 | Vista-LLaMA-7B | 2023-12-12 |
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | ✓ Link | 59.73 | | MiniGPT4-video-7B | 2024-04-04 |
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token | ✓ Link | 59.5 | 3.6 | LLaVA-Mini | 2025-01-07 |
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ✓ Link | 59.3 | 3.3 | Video-LaVIT | 2024-02-05 |
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | ✓ Link | 59.2 | 3.5 | Video-LLaVA-7B | 2023-11-16 |
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 58.9 | 3.3 | LLaMA-VID-13B (2 Token) | 2023-11-28 |
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 57.7 | 3.2 | LLaMA-VID-7B (2 Token) | 2023-11-28 |
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | ✓ Link | 56.8 | | SUM-shot+Vicuna | 2023-12-16 |
OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation | ✓ Link | 55.3 | 3.3 | Omni-VideoAssistant | 2023-08-08 |
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | ✓ Link | 55.0 | 3.1 | Chat-UniVi-7B | 2023-11-14 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 54.1 | 3.3 | VideoChat2 | 2023-11-28 |
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | ✓ Link | 52.7 | 2.6 | MovieChat | 2023-07-31 |
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ Link | 51.2 | 2.9 | BT-Adapter (zero-shot) | 2023-09-27 |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ✓ Link | 49.3 | 2.8 | Video-ChatGPT-7B | 2023-06-08 |
VideoChat: Chat-Centric Video Understanding | ✓ Link | 45.0 | 2.5 | Video Chat-7B | 2023-05-10 |
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | ✓ Link | 43.8 | 2.7 | LLaMA Adapter-7B | 2023-04-28 |
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | ✓ Link | 29.6 | 1.8 | Video LLaMA-7B | 2023-06-05 |