Paper | Code | BLEU-4 | CIDEr | METEOR | SPICE | ROUGE-L | BLEU-1 | BLEU-2 | BLEU-3 | CLIP-Score | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | ✓ Link | 46.5 | 155.1 | 32.0 | 26.0 | | | | | | mPLUG | 2022-05-24 |
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | ✓ Link | 44.9 | 154.9 | 32.5 | 26.6 | | | | | | OFA | 2022-02-07 |
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 44.1 | 151.1 | 32.2 | 26.3 | | | | | | GIT | 2022-05-27 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 43.7 | 145.8 | | | | | | | | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 2023-01-30 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 43.5 | 145.2 | | | | | | | | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 2023-01-30 |
Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning | ✓ Link | 42.7 | 143.7 | 30.6 | 24.7 | 61.1 | 83.5 | | | | ExpansionNet v2 (No VL pretraining) | 2022-08-13 |
Scaling Up Vision-Language Pre-training for Image Captioning | | 42.6 | 145.5 | 31.4 | 25.5 | | | | | | LEMON | 2021-11-24 |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 42.4 | 144.5 | | | | | | | | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 2023-01-30 |
GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | ✓ Link | 42.4 | 144.2 | 30.6 | 24.3 | 60.7 | 84.2 | | | | GRIT (No VL pretraining - base) | 2022-07-20 |
Prompt Tuning for Generative Multimodal Pretrained Models | ✓ Link | 41.81 | 141.4 | 31.51 | 24.42 | | | | | | Prompt Tuning | 2022-08-04 |
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | ✓ Link | 41.7 | 140.0 | 30.6 | 24.5 | | | | | | Oscar | 2020-04-13 |
Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | ✓ Link | 41.4 | 139.9 | 30.4 | 24.0 | 60.4 | 83.4 | | | | Xmodal-Ctx | 2022-05-09 |
Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | ✓ Link | 41.3 | 142.2 | | 24.9 | | | | | | Xmodal-Ctx + OSCAR | 2022-05-09 |
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ Link | 41.3 | 140.8 | | | | | | | | X-VLM (base) | 2021-11-16 |
VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ Link | 41.0 | 140.9 | 31.1 | 25.2 | | | | | | VinVL | 2021-01-02 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 40.9 | 143.6 | 33.9 | 24.7 | | | | | | CoCa | 2022-05-04 |
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 40.6 | 143.3 | 33.4 | 25.4 | | | | | | SimVLM | 2021-08-24 |
Prismer: A Vision-Language Model with Multi-Task Experts | ✓ Link | 40.4 | 136.5 | 31.4 | 24.4 | | | | | | Prismer | 2023-03-04 |
Position-guided Text Prompt for Vision-Language Pre-training | ✓ Link | 40.1 | 135.0 | 30.4 | 23.7 | | | | | | PTP-BLIP (14M) | 2022-12-19 |
L-Verse: Bidirectional Generation Between Image and Text | ✓ Link | 39.9 | | 31.4 | 23.3 | 60.4 | | | | | L-Verse | 2021-11-22 |
Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | ✓ Link | 39.7 | 135.9 | 30.0 | 23.7 | 59.5 | 81.5 | | | | Xmodal-Ctx | 2022-05-09 |
X-Linear Attention Networks for Image Captioning | ✓ Link | 39.7 | 132.8 | 29.5 | 23.4 | 59.1 | 80.9 | 65.8 | 51.5 | | X-Transformer | 2020-03-31 |
Visual Commonsense R-CNN | ✓ Link | 39.5 | | 29.3 | | 59.3 | | | | | AoANet + VC | 2020-02-27 |
A Better Variant of Self-Critical Sequence Training | ✓ Link | 39.4 | 129.6 | 28.9 | 22.8 | 58.7 | 80.7 | 65.6 | 51.3 | | Transformer_NSC | 2020-03-22 |
Meshed-Memory Transformer for Image Captioning | ✓ Link | 39.1 | 131.2 | 29.2 | 22.6 | 58.6 | 80.8 | | | | Meshed-Memory Transformer | 2019-12-17 |
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | ✓ Link | 38.2 | 126.2 | 29.5 | | | | | | | LaDiC (30 steps) | 2024-04-16 |
Fine-grained Image Captioning with CLIP Reward | ✓ Link | 38.2 | 124.9 | 28.7 | | 58.5 | | | | | CLIP Text Encoder (RL w/ CIDEr-reward) | 2022-05-26 |
RefineCap: Concept-Aware Refinement for Image Captioning | | 37.8 | 127.2 | 28.3 | 22.5 | 58.0 | 80.2 | 64.5 | 49.9 | | RefineCap (w/ REINFORCE) | 2021-09-08 |
Reflective Decoding Network for Image Captioning | | 37.3 | 125.2 | 28.1 | | 57.4 | 80.2 | | | | RDN | 2019-08-30 |
SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation | ✓ Link | 37.2 | 121.8 | 28.3 | 21.5 | | | | | | SmallCap (d=16, Large) | 2022-09-30 |
ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 33.53 | 113.08 | 27.45 | 21.05 | | | | | | ClipCap (Transformer) | 2021-11-18 |
ClipCap: CLIP Prefix for Image Captioning | ✓ Link | 32.15 | 108.35 | 27.1 | 20.12 | | | | | | ClipCap (MLP + GPT2 tuning) | 2021-11-18 |
Text-Only Training for Image Captioning using Noise-Injected CLIP | ✓ Link | 26.4 | 91.8 | 25.1 | | | | | | | CapDec | 2022-11-01 |
From Captions to Visual Concepts and Back | ✓ Link | 25.7 | | 23.6 | | | | | | | From Captions to Visual Concepts and Back | 2014-11-18 |
Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | | 16.7 | 58.3 | 19.7 | 13.4 | | | | | | VLKD (ViT-B/16) | 2021-11-16 |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | | 152.5 | | 25.7 | | | | | | VALOR | 2023-04-17 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | | 149.0 | | 27.0 | | | | | | VAST | 2023-05-29 |
VirTex: Learning Visual Representations from Textual Annotations | ✓ Link | | 94.0 | | 18.5 | | | | | | VirTex (ResNet-101) | 2020-06-11 |
Language Is Not All You Need: Aligning Perception with Language Models | | | 84.7 | | 16.8 | | | | | | KOSMOS-1 (1.6B) (zero-shot) | 2023-02-27 |
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | ✓ Link | | | | 22.4 | 58.7 | | | | | LaDiC | 2024-04-16 |
FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions | ✓ Link | | | | | | | | | 78.5 | BLIP-FuseCap | 2023-05-28 |
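
For context on the metric columns: BLEU-1/2/3/4, METEOR, ROUGE-L, CIDEr, and SPICE are conventionally computed with the COCO caption evaluation toolkit (most papers above report on the Karpathy test split). Below is a minimal scoring sketch, assuming `pycocotools` and `pycocoevalcap` are installed; the annotation and results file paths are placeholders, with the results file in the standard COCO format, i.e. a JSON list of `{"image_id": ..., "caption": ...}` entries.

```python
# Minimal sketch: scoring generated captions with the COCO caption
# evaluation toolkit (pip install pycocotools pycocoevalcap).
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Placeholder paths: ground-truth annotations and model predictions.
coco = COCO("annotations/captions_val2014.json")
coco_res = coco.loadRes("captions_val2014_results.json")

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # score only captioned images
coco_eval.evaluate()

# The toolkit reports Bleu_1..Bleu_4, METEOR, ROUGE_L, CIDEr, and SPICE on a
# 0-1 scale (CIDEr can exceed 1); leaderboards report them multiplied by 100,
# which is why the CIDEr column above spans roughly 58-155.
for metric, score in coco_eval.eval.items():
    print(f"{metric}: {100 * score:.1f}")
```

The CLIP-Score column (the BLIP-FuseCap entry) is a reference-free metric computed separately from this toolkit, so it is not directly comparable to the n-gram-based columns.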