OpenCodePapers
Video Question Answering on MSRVTT-QA
Results over time
Leaderboard
| Paper | Code | Accuracy | Model | Release Date |
|---|---|---|---|---|
| Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | – | 50.42 | Mirasol3B | 2023-11-09 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 50.1 | VAST | 2023-05-29 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 49.2 | VALOR | 2023-04-17 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 49.2 | COSA | 2023-06-15 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ Link | 48.5 | MA-LMM | 2024-04-08 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 48.0 | mPLUG-2 | 2023-02-01 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 47.0 | FrozenBiLM | 2022-06-16 |
| Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | ✓ Link | 46.2 | HBI | 2023-03-25 |
| Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 45.8 | EMCL-Net | 2022-11-21 |
| VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 44.6 | VindLU | 2022-12-09 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 44.5 | VIOLETv2 | 2022-09-04 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 43.9 | Singularity-temporal | 2022-06-07 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 43.5 | Singularity | 2022-06-07 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 16.7 | FrozenBiLM (0-shot) | 2022-06-16 |