OpenCodePapers

Video Question Answering on MSRVTT-QA

Video Question Answering
Leaderboard
| Paper | Code | Accuracy (%) | Model | Release Date |
|---|---|---|---|---|
| Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | | 50.42 | Mirasol3B | 2023-11-09 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 50.1 | VAST | 2023-05-29 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 49.2 | VALOR | 2023-04-17 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 49.2 | COSA | 2023-06-15 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ Link | 48.5 | MA-LMM | 2024-04-08 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 48.0 | mPLUG-2 | 2023-02-01 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 47.0 | FrozenBiLM | 2022-06-16 |
| Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | ✓ Link | 46.2 | HBI | 2023-03-25 |
| Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 45.8 | EMCL-Net | 2022-11-21 |
| VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 44.6 | VindLU | 2022-12-09 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 44.5 | VIOLETv2 | 2022-09-04 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 43.9 | Singularity-temporal | 2022-06-07 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 43.5 | Singularity | 2022-06-07 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 16.7 | FrozenBiLM (0-shot) | 2022-06-16 |
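The accuracy column above is commonly computed as exact-match accuracy: the model's predicted answer string must match the single ground-truth answer for the question. A minimal sketch of that metric, assuming a simple lowercase-and-strip normalization (the function names and normalization step are illustrative, not a reference scorer):

```python
def normalize(answer: str) -> str:
    """Normalize an answer string before comparison (assumed preprocessing)."""
    return answer.strip().lower()

def exact_match_accuracy(predictions, references):
    """Percentage of questions whose predicted answer matches the reference."""
    assert len(predictions) == len(references) and references
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)

# Toy example: 2 of 3 answers match after normalization.
preds = ["a man", "Dog", "cooking"]
refs = ["a man", "cat", "cooking"]
print(round(exact_match_accuracy(preds, refs), 2))  # → 66.67
```

Published numbers may differ slightly depending on the answer vocabulary and normalization each paper uses, so treat this as the general scoring idea rather than the exact evaluation script behind every row.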