Paper | Code | Accuracy (%) | Score | Model | Date |
--- | --- | --- | --- | --- | --- |
Tarsier: Recipes for Training and Evaluating Large Video Description Models | ✓ Link | 61.6 | 3.7 | Tarsier (34B) | 2024-06-30 |
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ Link | 60.9 | 3.7 | PLLaVA (34B) | 2024-04-25 |
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | ✓ Link | 60.7 | 3.6 | PPLLaVA-7B | 2024-11-04 |
LinVT: Empower Your Image-level Large Language Model to Understand Videos | ✓ Link | 60.1 | 3.6 | LinVT-Qwen2-VL(7B) | 2024-12-06 |
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | ✓ Link | 59.2 | 3.5 | SlowFast-LLaVA-34B | 2024-07-22 |
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | ✓ Link | 58.9 | 3.5 | TS-LLaVA-34B | 2024-11-17 |
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | ✓ Link | 58.4 | 3.5 | IG-VLM | 2024-03-27 |
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token | ✓ Link | 53.5 | 3.5 | LLaVA-Mini | 2025-01-07 |
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | ✓ Link | 51.9 | 3.4 | Flash-VStream | 2024-06-12 |
ST-LLM: Large Language Models Are Effective Temporal Learners | ✓ Link | 50.9 | 3.3 | ST-LLM | 2024-03-30 |
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | ✓ Link | 50.6 | 3.6 | VideoGPT+ | 2024-06-13 |
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | ✓ Link | 50.2 | 3.5 | CAT-7B | 2024-03-07 |
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ✓ Link | 50.1 | 3.3 | Video-LaVIT | 2024-02-05 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 49.1 | 3.3 | VideoChat2 | 2023-11-28 |
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 47.5 | 3.3 | LLaMA-VID-13B (2 Token) | 2023-11-28 |
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 47.4 | 3.3 | LLaMA-VID-7B (2 Token) | 2023-11-28 |
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | ✓ Link | 46.4 | 3.6 | Chat-UniVi-13B | 2023-11-14 |
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | ✓ Link | 46.3 | - | MiniGPT4-video-7B | 2024-04-04 |
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | ✓ Link | 46.1 | 3.3 | Chat-UniVi | 2023-11-14 |
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ Link | 46.1 | 3.2 | BT-Adapter (zero-shot) | 2023-09-27 |
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | ✓ Link | 45.7 | 3.1 | MovieChat | 2023-07-31 |
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | ✓ Link | 45.3 | 3.3 | Video-LLaVA | 2023-11-16 |
Elysium: Exploring Object-level Perception in Videos via MLLM | ✓ Link | 43.4 | 2.9 | Elysium | 2024-03-25 |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ✓ Link | 35.2 | 2.7 | Video-ChatGPT | 2023-06-08 |
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | ✓ Link | 34.2 | 2.7 | LLaMA Adapter | 2023-04-28 |
VideoChat: Chat-Centric Video Understanding | ✓ Link | 26.5 | 2.2 | Video Chat | 2023-05-10 |
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 24.7 | - | FrozenBiLM | 2022-06-16 |
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | ✓ Link | 12.4 | 1.1 | Video LLaMA | 2023-06-05 |