OpenCodePapers

Video-based Generative Performance Benchmarking
Leaderboard
| Paper | Code | Mean | Correctness of Information | Detail Orientation | Contextual Understanding | Temporal Understanding | Consistency | Model Name | Release Date |
|---|---|---|---|---|---|---|---|---|---|
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | ✓ Link | 3.73 | 3.85 | 3.56 | 4.21 | 3.21 | 3.81 | PPLLaVA-7B-dpo | 2024-11-04 |
| Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback | ✓ Link | 3.49 | 3.63 | 3.25 | 4.00 | 3.23 | 3.32 | VLM-RLAIF | 2024-02-06 |
| TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | ✓ Link | 3.38 | | | | | | TS-LLaVA-34B | 2024-11-17 |
| PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | ✓ Link | 3.32 | 3.60 | 3.20 | 3.90 | 2.67 | 3.25 | PLLaVA-34B | 2024-04-25 |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | ✓ Link | 3.32 | 3.32 | 3.20 | 3.88 | 3.00 | 3.20 | PPLLaVA-7B | 2024-11-04 |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | ✓ Link | 3.32 | | | | | | SlowFast-LLaVA-34B | 2024-07-22 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | ✓ Link | 3.28 | 3.27 | 3.18 | 3.74 | 2.83 | 3.39 | VideoGPT+ | 2024-06-13 |
| An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | ✓ Link | 3.17 | 3.40 | 2.80 | 3.61 | 2.89 | 3.13 | IG-VLM-GPT4v | 2024-03-27 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | ✓ Link | 3.15 | 3.23 | 3.05 | 3.74 | 2.93 | 2.81 | ST-LLM-7B | 2024-03-30 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 3.10 | 3.40 | 2.91 | 3.72 | 2.65 | 2.84 | VideoChat2_HD_mistral | 2023-11-28 |
| CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | ✓ Link | 3.07 | 3.08 | 2.95 | 3.49 | 2.81 | 2.89 | CAT-7B | 2024-03-07 |
| LITA: Language Instructed Temporal-Localization Assistant | ✓ Link | 3.04 | 2.94 | 2.98 | 3.43 | 2.68 | 3.19 | LITA-13B | 2024-03-27 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 2.99 | 3.07 | 3.05 | 3.60 | 2.58 | 2.63 | LLaMA-VID-13B (2 Token) | 2023-11-28 |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | ✓ Link | 2.99 | 2.89 | 2.91 | 3.46 | 2.39 | 2.81 | Chat-UniVi | 2023-11-14 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 2.98 | 3.02 | 2.88 | 3.51 | 2.66 | 2.81 | VideoChat2 | 2023-11-28 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | ✓ Link | 2.89 | 2.96 | 3.00 | 3.53 | 2.46 | 2.51 | LLaMA-VID-7B (2 Token) | 2023-11-28 |
| VTimeLLM: Empower LLM to Grasp Video Moments | ✓ Link | 2.85 | 2.78 | 3.10 | 3.40 | 2.49 | 2.47 | VTimeLLM | 2023-11-30 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ Link | 2.69 | 2.68 | 2.69 | 3.27 | 2.34 | 2.46 | BT-Adapter | 2023-09-27 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | ✓ Link | 2.46 | 2.16 | 2.46 | 2.89 | 2.13 | 2.20 | BT-Adapter (zero-shot) | 2023-09-27 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ✓ Link | 2.38 | 2.40 | 2.52 | 2.62 | 1.98 | 2.37 | Video-ChatGPT | 2023-06-08 |
| VideoChat: Chat-Centric Video Understanding | ✓ Link | 2.29 | 2.23 | 2.50 | 2.53 | 1.94 | 2.24 | Video Chat | 2023-05-10 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | ✓ Link | 2.16 | 2.03 | 2.32 | 2.30 | 1.98 | 2.15 | LLaMA Adapter | 2023-04-28 |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | ✓ Link | 1.98 | 1.96 | 2.18 | 2.16 | 1.82 | 1.79 | Video LLaMA | 2023-06-05 |
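The Mean column appears to be the unweighted average of the five per-dimension scores, rounded to two decimals (this is an assumption inferred from the rows above; it matches most entries exactly). A minimal sketch using the PPLLaVA-7B-dpo row:

```python
# Sketch (assumption): the leaderboard "Mean" is the plain average of the
# five per-dimension scores. Values taken from the PPLLaVA-7B-dpo row.
scores = {
    "Correctness of Information": 3.85,
    "Detail Orientation": 3.56,
    "Contextual Understanding": 4.21,
    "Temporal Understanding": 3.21,
    "Consistency": 3.81,
}

mean = round(sum(scores.values()) / len(scores), 2)
print(mean)  # → 3.73, matching the Mean column for this row
```

A few rows (e.g. CAT-7B, Chat-UniVi) deviate slightly from this formula, so the source may round or weight differently in those cases.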