OpenCodePapers

Visual Question Answering on OK-VQA

Task: Visual Question Answering (VQA)
Leaderboard
| Paper | Code | Accuracy | Exact Match (EM) | Recall@5 | Model Name | Release Date |
|---|---|---|---|---|---|---|
| Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models | | 66.8 | | | PaLI-X-VPD | 2023-12-05 |
| PaLM-E: An Embodied Multimodal Language Model | ✓ | 66.1 | | | PaLM-E-562B | 2023-03-06 |
| PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ | 66.1 | | | PaLI-X (Single-task FT) | 2023-05-29 |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ | 64.5 | | | PaLI 17B | 2022-09-14 |
| Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering | ✓ | 62.5 | | | Prophet | 2023-03-03 |
| Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | ✓ | 62.08 | 62.01 | 89.32 | RA-VQA-v2 (BLIP 2) | 2023-09-29 |
| A Simple Baseline for Knowledge-Based Visual Question Answering | | 61.2 | | | A Simple Baseline for KB-VQA | 2023-10-20 |
| PromptCap: Prompt-Guided Task-Aware Image Captioning | ✓ | 60.4 | | | PromptCap | 2022-11-15 |
| REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory | ✓ | 59.1 | | | REVEAL (WIT + CC12M + Wikidata + VQA-2) | 2022-12-10 |
| Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | | 58.2 | | | Lyrics | 2023-12-08 |
| REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | ✓ | 58.0 | | | REVIVE (Ensemble) | 2022-06-02 |
| REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | ✓ | 56.6 | | | REVIVE (Single) | 2022-06-02 |
| Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | ✓ | 54.85 | | | RA-VQA-v2 (T5-large) | 2023-09-29 |
| Retrieval Augmented Visual Question Answering with Outside Knowledge | ✓ | 54.48 | 59.41 | 82.84 | RA-VQA (T5-large) | 2022-10-07 |
| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ | 52.4 | | | VK-OOD | 2023-02-11 |
| Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ | 52.4 | | | VK-OOD | 2023-09-21 |
| Retrieval Augmented Visual Question Answering with Outside Knowledge | ✓ | 51.22 | 55.77 | 81.25 | RA-VQA-FrDPR (T5-large) | 2022-10-07 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 50.6 | | | Flamingo-80B | 2022-04-29 |
| Transform-Retrieve-Generate: Natural Language-Centric Outside-Knowledge Visual Question Answering | | 50.50 | | | TRiG (T5-Large) | 2022-01-01 |
| HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | ✓ | 48.6 | | | HYDRA | 2024-03-19 |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | ✓ | 48.0 | | | PICa | 2021-09-10 |
| LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | ✓ | 47.01 | | | LaKo | 2022-07-26 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 45.9 | | | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 2023-01-30 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 44.7 | | | Flamingo-9B | 2022-04-29 |
| VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | ✓ | 43.1 | | | VLC-BERT | 2022-10-24 |
| LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | ✓ | 42.03 | | | T5 (Tan and Bansal, 2019) + Prefixes | 2022-07-26 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 41.2 | | | Flamingo-3B | 2022-04-29 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 40.7 | | | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 39.4 | | | BLIP-2 ViT-L FlanT5 XL (zero-shot) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 36.4 | | | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 2023-01-30 |
| Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | ✓ | 35.9 | | | PNP-VQA | 2022-10-17 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 31.7 | | | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 30.2 | | | BLIP-2 ViT-L OPT 2.7B (zero-shot) | 2023-01-30 |
| A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | ✓ | 16.5 | | | FewVLM | 2021-10-16 |
| Language Models are General-Purpose Interfaces | ✓ | 11.4 | | | MetaLM | 2022-06-13 |
| Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | | 10.5 | | | VLKD (ViT-B/16) | 2021-11-16 |
| Multimodal Few-Shot Learning with Frozen Language Models | | 5.9 | | | Frozen | 2021-06-25 |
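For context, the Accuracy column on OK-VQA is the standard VQA soft accuracy: a predicted answer scores min(matching annotators / 3, 1), so an answer given by at least 3 of the ~10 human annotators counts as fully correct. Below is a minimal sketch of that formula; the official evaluator additionally normalizes answers (articles, punctuation, number words) and averages over leave-one-annotator-out subsets, which this simplified version omits.

```python
def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA soft accuracy for one question.

    An answer matching >= 3 of the human annotations scores 1.0;
    fewer matches score matches/3. Only lowercasing and whitespace
    stripping are applied here (the official script does more).
    """
    pred = prediction.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


# Example: 8 of 10 annotators said "dog", so "dog" is fully correct,
# while "bird" (1 match) earns partial credit of 1/3.
answers = ["dog"] * 8 + ["bird", "cat"]
full = vqa_soft_accuracy("Dog", answers)    # 1.0
partial = vqa_soft_accuracy("bird", answers)  # 1/3
```

Dataset-level accuracy (the numbers in the table, scaled to percent) is then the mean of this per-question score over the test set.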