Coarse-to-Fine Reasoning for Visual Question Answering | ✓ Link | 72.1 | CFR | 2021-10-06 |
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models | | 67.3 | PaLI-X-VPD | 2023-12-05 |
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | ✓ Link | 64.9 | CuMo-7B | 2024-05-09 |
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ✓ Link | 64.4 | Video-LaVIT | 2024-02-05 |
Learning by Abstraction: The Neural State Machine | ✓ Link | 62.95 | NSM | 2019-07-09 |
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | | 62.4 | Lyrics | 2023-12-08 |
LXMERT: Learning Cross-Modality Encoder Representations from Transformers | ✓ Link | 60.0 | LXMERT (Pre-train + scratch) | 2019-08-20 |
Language-Conditioned Graph Networks for Relational Reasoning | ✓ Link | 55.8 | single-hop + LCGN | 2019-05-10 |
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | ✓ Link | 47.9 | HYDRA | 2024-03-19 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 44.7 | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 2023-01-30 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 44.4 | BLIP-2 ViT-L FlanT5 XL (zero-shot) | 2023-01-30 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 44.2 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 2023-01-30 |
Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | ✓ Link | 41.9 | PNP-VQA | 2022-10-17 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 36.4 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 2023-01-30 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 34.6 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 2023-01-30 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 33.9 | BLIP-2 ViT-L OPT 2.7B (zero-shot) | 2023-01-30 |
A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | ✓ Link | 29.3 | FewVLM (zero-shot) | 2021-10-16 |