| Paper | Code | Accuracy | Exact Match (EM) | Recall@5 | Model | Date |
|---|---|---|---|---|---|---|
| Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models | | 66.8 | | | PaLI-X-VPD | 2023-12-05 |
| PaLM-E: An Embodied Multimodal Language Model | ✓ | 66.1 | | | PaLM-E-562B | 2023-03-06 |
| PaLI-X: On Scaling up a Multilingual Vision and Language Model | ✓ | 66.1 | | | PaLI-X (Single-task FT) | 2023-05-29 |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | ✓ | 64.5 | | | PaLI 17B | 2022-09-14 |
| Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering | ✓ | 62.5 | | | Prophet | 2023-03-03 |
| Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | ✓ | 62.08 | 62.01 | 89.32 | RA-VQA-v2 (BLIP 2) | 2023-09-29 |
| A Simple Baseline for Knowledge-Based Visual Question Answering | | 61.2 | | | A Simple Baseline for KB-VQA | 2023-10-20 |
| PromptCap: Prompt-Guided Task-Aware Image Captioning | ✓ | 60.4 | | | PromptCap | 2022-11-15 |
| REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory | ✓ | 59.1 | | | ReVeaL (WIT + CC12M + Wikidata + VQA-2) | 2022-12-10 |
| Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | | 58.2 | | | Lyrics | 2023-12-08 |
| REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | ✓ | 58.0 | | | REVIVE (Ensemble) | 2022-06-02 |
| REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | ✓ | 56.6 | | | REVIVE (Single) | 2022-06-02 |
| Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | ✓ | 54.85 | | | RA-VQA-v2 (T5-large) | 2023-09-29 |
| Retrieval Augmented Visual Question Answering with Outside Knowledge | ✓ | 54.48 | 59.41 | 82.84 | RA-VQA (T5-large) | 2022-10-07 |
| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ | 52.4 | | | VK-OOD | 2023-02-11 |
| Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | ✓ | 52.4 | | | VK-OOD | 2023-09-21 |
| Retrieval Augmented Visual Question Answering with Outside Knowledge | ✓ | 51.22 | 55.77 | 81.25 | RA-VQA-FrDPR (T5-large) | 2022-10-07 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 50.6 | | | Flamingo80B | 2022-04-29 |
| Transform-Retrieve-Generate: Natural Language-Centric Outside-Knowledge Visual Question Answering | | 50.50 | | | TRiG (T5-Large) | 2022-01-01 |
| HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | ✓ | 48.6 | | | HYDRA | 2024-03-19 |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | ✓ | 48.0 | | | PICa | 2021-09-10 |
| LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | ✓ | 47.01 | | | LaKo | 2022-07-26 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 45.9 | | | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 2023-01-30 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 44.7 | | | Flamingo9B | 2022-04-29 |
| VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | ✓ | 43.1 | | | VLC-BERT | 2022-10-24 |
| LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | ✓ | 42.03 | | | T5 (Tan and Bansal, 2019) + Prefixes | 2022-07-26 |
| Flamingo: a Visual Language Model for Few-Shot Learning | ✓ | 41.2 | | | Flamingo3B | 2022-04-29 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 40.7 | | | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 39.4 | | | BLIP-2 ViT-L FlanT5 XL (zero-shot) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 36.4 | | | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 2023-01-30 |
| Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | ✓ | 35.9 | | | PNP-VQA | 2022-10-17 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 31.7 | | | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 30.2 | | | BLIP-2 ViT-L OPT 2.7B (zero-shot) | 2023-01-30 |
| A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | ✓ | 16.5 | | | FewVLM | 2021-10-16 |
| Language Models are General-Purpose Interfaces | ✓ | 11.4 | | | MetaLM | 2022-06-13 |
| Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | | 10.5 | | | VLKD (ViT-B/16) | 2021-11-16 |
| Multimodal Few-Shot Learning with Frozen Language Models | | 5.9 | | | Frozen | 2021-06-25 |