OpenCodePapers

visual-question-answering-on-msvd-qa-1

Visual Question Answering (VQA)
Leaderboard
Paper | Code | Accuracy | Model Name | Release Date
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | - | 0.61 | VLAB | 2023-05-22
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ | 0.606 | MA-LMM | 2024-04-08
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | ✓ | 0.602 | MaMMUT (ours) | 2023-03-29
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ | 0.60 | VALOR | 2023-04-17
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ | 0.60 | VAST | 2023-05-29
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ | 0.60 | COSA | 2023-06-15
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ | 0.581 | mPLUG-2 | 2023-02-01
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - | 0.569 | VideoCoCa | 2022-12-09
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ | 0.568 | GIT | 2022-05-27
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ | 0.558 | FrozenBiLM+ | 2023-08-18
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - | 0.556 | HiTeA | 2022-12-30
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ | 0.555 | InternVideo | 2022-12-06
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ | 0.552 | UMT-L (ViT-L/16) | 2023-03-28
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ | 0.549 | vid-TLDR (UMT-L) | 2024-03-20
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ | 0.547 | VIOLETv2 | 2022-09-04
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | - | 0.547 | MuLTI | 2023-03-10
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 0.546 | X2-VLM (large) | 2022-11-22
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 0.528 | X2-VLM (base) | 2022-11-22
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ | 0.524 | Clover | 2022-07-16
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ | 0.517 | VIOLET + MELTR | 2023-03-23
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | - | 0.510 | OmniVL | 2022-09-15
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ | 0.495 | VIOLET+ | 2023-08-18
Video Question Answering with Iterative Video-Text Co-Tokenization | - | 0.486 | Co-Tokenization | 2022-08-01
All in One: Exploring Unified Video-Language Pre-training | ✓ | 0.483 | All-in-one-B | 2022-03-14
Lightweight Recurrent Cross-modal Encoder for Video Question Answering | ✓ | 0.478 | LRCE | 2023-06-30
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ | 0.477 | JustAsk+ | 2023-08-18
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ | 0.469 | GIT+MDF | 2023-07-09
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ | 0.467 | AIO+MIF | 2023-07-09
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ | 0.459 | ALPRO | 2021-12-17
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ | 0.438 | All-in-one+ | 2023-08-18
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | ✓ | 0.390 | DualVGR | 2021-07-10
Hierarchical Conditional Relation Networks for Video Question Answering | ✓ | 0.361 | HCRN | 2020-02-25
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | ✓ | 0.351 | SSML | 2020-03-06
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering | ✓ | 0.337 | HMEMA | 2019-04-08
Motion-Appearance Co-Memory Networks for Video Question Answering | - | 0.317 | Co-Mem | 2018-03-29
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | ✓ | 0.313 | ST-VQA | 2017-04-14