
Visual Question Answering on GQA test-dev

Task: Visual Question Answering (VQA)
Dataset: GQA (test-dev split)
Results over time: interactive accuracy-over-release-date chart (not reproduced here).
Leaderboard
| Paper | Code | Accuracy (%) | Model | Release Date |
|---|---|---|---|---|
| Coarse-to-Fine Reasoning for Visual Question Answering | ✓ Link | 72.1 | CFR | 2021-10-06 |
| Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models | | 67.3 | PaLI-X-VPD | 2023-12-05 |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | ✓ Link | 64.9 | CuMo-7B | 2024-05-09 |
| Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ✓ Link | 64.4 | Video-LaVIT | 2024-02-05 |
| Learning by Abstraction: The Neural State Machine | ✓ Link | 62.95 | NSM | 2019-07-09 |
| Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | | 62.4 | Lyrics | 2023-12-08 |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | ✓ Link | 60.0 | LXMERT (Pre-train + scratch) | 2019-08-20 |
| Language-Conditioned Graph Networks for Relational Reasoning | ✓ Link | 55.8 | single-hop + LCGN (ours) | 2019-05-10 |
| HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | ✓ Link | 47.9 | HYDRA | 2024-03-19 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 44.7 | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 44.4 | BLIP-2 ViT-L FlanT5 XL (zero-shot) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 44.2 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 2023-01-30 |
| Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | ✓ Link | 41.9 | PNP-VQA | 2022-10-17 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 36.4 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 34.6 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 33.9 | BLIP-2 ViT-L OPT 2.7B (zero-shot) | 2023-01-30 |
| A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | ✓ Link | 29.3 | FewVLM (zero-shot) | 2021-10-16 |
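
The Accuracy column is GQA's standard metric: the percentage of questions whose predicted answer exactly matches the ground-truth answer after simple normalization. Below is a minimal sketch of that computation; the file layout (a questions JSON keyed by question ID, and a predictions list of `questionId`/`prediction` pairs) follows the conventions of the official GQA evaluation script but is an assumption here, not a verified interface.

```python
# Sketch of GQA-style accuracy: exact string match between predicted and
# ground-truth answers, averaged over all questions. File formats are
# illustrative assumptions, not the official GQA eval.py interface.
import json


def normalize(ans: str) -> str:
    # GQA answers are short phrases; lowercase and strip before comparing.
    return ans.strip().lower()


def gqa_accuracy(questions_path: str, predictions_path: str) -> float:
    # questions: {question_id: {"answer": "...", ...}, ...}
    with open(questions_path) as f:
        questions = json.load(f)
    # predictions: [{"questionId": "...", "prediction": "..."}, ...]
    with open(predictions_path) as f:
        predictions = json.load(f)

    correct = 0
    for p in predictions:
        gold = questions[p["questionId"]]["answer"]
        correct += normalize(p["prediction"]) == normalize(gold)
    return 100.0 * correct / len(predictions)


if __name__ == "__main__":
    # Hypothetical file names for a test-dev run.
    acc = gqa_accuracy("testdev_questions.json", "predictions.json")
    print(f"Accuracy: {acc:.2f}")
```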