OpenCodePapers

Visual Question Answering (VQA) on MSRVTT-QA
Leaderboard
| Paper | Code | Accuracy | Model | Release Date |
|---|---|---|---|---|
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 0.496 | VLAB | 2023-05-22 |
| MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | ✓ | 0.495 | MaMMUT | 2023-03-29 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ | 0.480 | mPLUG-2 | 2023-02-01 |
| MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | | 0.478 | MuLTI | 2023-03-10 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 0.474 | Flamingo | 2022-04-29 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ | 0.471 | InternVideo | 2022-12-06 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ | 0.471 | UMT-L (ViT-L/16) | 2023-03-28 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ | 0.470 | FrozenBiLM+ | 2023-08-18 |
| vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ | 0.470 | vid-TLDR (UMT-L) | 2024-03-20 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 0.463 | VideoCoCa | 2022-12-09 |
| Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | ✓ | 0.462 | HBI | 2023-03-25 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 0.459 | HiTeA | 2022-12-30 |
| Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ | 0.458 | EMCL-Net | 2022-11-21 |
| Video Question Answering with Iterative Video-Text Co-Tokenization | | 0.457 | Co-Tokenization | 2022-08-01 |
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 0.455 | X2-VLM (large) | 2022-11-22 |
| X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ | 0.450 | X2-VLM (base) | 2022-11-22 |
| All in One: Exploring Unified Video-Language Pre-training | ✓ | 0.443 | All-in-one-B | 2022-03-14 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 0.441 | OmniVL | 2022-09-15 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ | 0.441 | Clover | 2022-07-16 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ | 0.440 | AIO+MIF | 2023-07-09 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ | 0.438 | AIO+MDF | 2023-07-09 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ | 0.423 | GIT+MDF | 2023-07-09 |
| Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ | 0.421 | ALPRO | 2021-12-17 |
| Lightweight Recurrent Cross-modal Encoder for Video Question Answering | ✓ | 0.420 | LRCE | 2023-06-30 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ | 0.418 | JustAsk+ | 2023-08-18 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ | 0.395 | All-in-one+ | 2023-08-18 |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | ✓ | 0.374 | CLIPBERT | 2021-02-11 |
| Hierarchical Conditional Relation Networks for Video Question Answering | ✓ | 0.356 | HCRN | 2020-02-25 |
| DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | ✓ | 0.355 | DualVGR | 2021-07-10 |
| Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering | ✓ | 0.330 | HMEMA | 2019-04-08 |
| Motion-Appearance Co-Memory Networks for Video Question Answering | | 0.320 | Co-Mem | 2018-03-29 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 0.310 | Flamingo (32-shot) | 2022-04-29 |
| TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | ✓ | 0.309 | ST-VQA | 2017-04-14 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 0.174 | Flamingo (0-shot) | 2022-04-29 |
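The original page also plotted state-of-the-art accuracy over time. That trend can be recomputed from the leaderboard itself: sort entries by release date and keep each model that improves on the best accuracy seen so far. A minimal sketch, with a handful of rows hardcoded from the table above (the helper name `running_best` is ours, not part of any leaderboard API):

```python
# Sketch: recompute the state-of-the-art-over-time trend from leaderboard rows.
# Rows are (release date, accuracy, model), copied from the table above.
from datetime import date

rows = [
    (date(2017, 4, 14), 0.309, "ST-VQA"),
    (date(2019, 4, 8), 0.330, "HMEMA"),
    (date(2021, 2, 11), 0.374, "CLIPBERT"),
    (date(2022, 4, 29), 0.474, "Flamingo"),
    (date(2023, 5, 22), 0.496, "VLAB"),
]

def running_best(rows):
    """Yield entries that improve on the best accuracy seen so far, in date order."""
    best = float("-inf")
    for day, acc, model in sorted(rows):
        if acc > best:
            best = acc
            yield day, acc, model

for day, acc, model in running_best(rows):
    print(f"{day}  {acc:.3f}  {model}")
```

With the full table loaded, the same function reproduces the curve the interactive chart displayed: a new point appears only when a model beats the previous best.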