OpenCodePapers

Video Question Answering on MSRVTT-QA

Video Question Answering
Leaderboard
| Paper | Code | Accuracy (%) | Model | Release Date |
|---|---|---|---|---|
| Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | | 50.42 | Mirasol3B | 2023-11-09 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 50.1 | VAST | 2023-05-29 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 49.2 | VALOR | 2023-04-17 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 49.2 | COSA | 2023-06-15 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ Link | 48.5 | MA-LMM | 2024-04-08 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 48.0 | mPLUG-2 | 2023-02-01 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 47.0 | FrozenBiLM | 2022-06-16 |
| Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | ✓ Link | 46.2 | HBI | 2023-03-25 |
| Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 45.8 | EMCL-Net | 2022-11-21 |
| VindLU: A Recipe for Effective Video-and-Language Pretraining | ✓ Link | 44.6 | VindLU | 2022-12-09 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 44.5 | VIOLETv2 | 2022-09-04 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 43.9 | Singularity-temporal | 2022-06-07 |
| Revealing Single Frame Bias for Video-and-Language Learning | ✓ Link | 43.5 | Singularity | 2022-06-07 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | ✓ Link | 16.7 | FrozenBiLM (0-shot) | 2022-06-16 |
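The accuracy column above is commonly computed as exact-match accuracy: the model's predicted answer string must match the single ground-truth answer for the question. A minimal sketch of that metric, assuming a simple lowercase-and-strip normalization (the function names and normalization step are illustrative, not a reference scorer):

```python
def normalize(answer: str) -> str:
    """Normalize an answer string before comparison (assumed preprocessing)."""
    return answer.strip().lower()

def exact_match_accuracy(predictions, references):
    """Percentage of questions whose predicted answer matches the reference."""
    assert len(predictions) == len(references) and references
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)

# Toy example: 2 of 3 answers match after normalization.
preds = ["a man", "Dog", "cooking"]
refs = ["a man", "cat", "cooking"]
print(round(exact_match_accuracy(preds, refs), 2))  # → 66.67
```

Published numbers may differ slightly depending on the answer vocabulary and normalization each paper uses, so treat this as the general scoring idea rather than the exact evaluation script behind every row.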