Paper | Code | Mean | Correctness of Information | Detail Orientation | Contextual Understanding | Temporal Understanding | Consistency | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | ✓ Link | 3.73 | 3.85 | 3.56 | 4.21 | 3.21 | 3.81 | PPLLaVA-7B-dpo | 2024-11-04 |
Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback | ✓ Link | 3.49 | 3.63 | 3.25 | 4.00 | 3.23 | 3.32 | VLM-RLAIF | 2024-02-06 |
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | ✓ Link | 3.38 | | | | | | TS-LLaVA-34B | 2024-11-17 |
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ Link | 3.32 | 3.60 | 3.20 | 3.90 | 2.67 | 3.25 | PLLaVA-34B | 2024-04-25 |
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | ✓ Link | 3.32 | 3.32 | 3.20 | 3.88 | 3.00 | 3.20 | PPLLaVA-7B | 2024-11-04 |
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | ✓ Link | 3.32 | | | | | | SlowFast-LLaVA-34B | 2024-07-22 |
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | ✓ Link | 3.28 | 3.27 | 3.18 | 3.74 | 2.83 | 3.39 | VideoGPT+ | 2024-06-13 |
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | ✓ Link | 3.17 | 3.40 | 2.80 | 3.61 | 2.89 | 3.13 | IG-VLM-GPT4v | 2024-03-27 |
ST-LLM: Large Language Models Are Effective Temporal Learners | ✓ Link | 3.15 | 3.23 | 3.05 | 3.74 | 2.93 | 2.81 | ST-LLM-7B | 2024-03-30 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 3.10 | 3.40 | 2.91 | 3.72 | 2.65 | 2.84 | VideoChat2_HD_mistral | 2023-11-28 |
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | ✓ Link | 3.07 | 3.08 | 2.95 | 3.49 | 2.81 | 2.89 | CAT-7B | 2024-03-07 |
LITA: Language Instructed Temporal-Localization Assistant | ✓ Link | 3.04 | 2.94 | 2.98 | 3.43 | 2.68 | 3.19 | LITA-13B | 2024-03-27 |
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 2.99 | 3.07 | 3.05 | 3.60 | 2.58 | 2.63 | LLaMA-VID-13B (2 Token) | 2023-11-28 |
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | ✓ Link | 2.99 | 2.89 | 2.91 | 3.46 | 2.39 | 2.81 | Chat-UniVi | 2023-11-14 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 2.98 | 3.02 | 2.88 | 3.51 | 2.66 | 2.81 | VideoChat2 | 2023-11-28 |
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 2.89 | 2.96 | 3.00 | 3.53 | 2.46 | 2.51 | LLaMA-VID-7B (2 Token) | 2023-11-28 |
VTimeLLM: Empower LLM to Grasp Video Moments | ✓ Link | 2.85 | 2.78 | 3.10 | 3.40 | 2.49 | 2.47 | VTimeLLM | 2023-11-30 |
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ Link | 2.69 | 2.68 | 2.69 | 3.27 | 2.34 | 2.46 | BT-Adapter | 2023-09-27 |
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ Link | 2.46 | 2.16 | 2.46 | 2.89 | 2.13 | 2.20 | BT-Adapter (zero-shot) | 2023-09-27 |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ✓ Link | 2.38 | 2.40 | 2.52 | 2.62 | 1.98 | 2.37 | Video-ChatGPT | 2023-06-08 |
VideoChat: Chat-Centric Video Understanding | ✓ Link | 2.29 | 2.23 | 2.50 | 2.53 | 1.94 | 2.24 | VideoChat | 2023-05-10 |
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | ✓ Link | 2.16 | 2.03 | 2.32 | 2.30 | 1.98 | 2.15 | LLaMA Adapter | 2023-04-28 |
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | ✓ Link | 1.98 | 1.96 | 2.18 | 2.16 | 1.82 | 1.79 | Video-LLaMA | 2023-06-05 |