OpenCodePapers

Image Captioning on COCO Captions

Task: Image Captioning · Dataset: COCO Captions
Leaderboard
| Paper | Code | BLEU-4 | CIDEr | METEOR | SPICE | ROUGE-L | BLEU-1 | BLEU-2 | BLEU-3 | CLIPScore | Model | Release date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | ✓ | 46.5 | 155.1 | 32.0 | 26.0 | | | | | | mPLUG | 2022-05-24 |
| OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | ✓ | 44.9 | 154.9 | 32.5 | 26.6 | | | | | | OFA | 2022-02-07 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ | 44.1 | 151.1 | 32.2 | 26.3 | | | | | | GIT | 2022-05-27 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 43.7 | 145.8 | | | | | | | | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 2023-01-30 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 43.5 | 145.2 | | | | | | | | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 2023-01-30 |
| Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning | ✓ | 42.7 | 143.7 | 30.6 | 24.7 | 61.1 | 83.5 | | | | ExpansionNet v2 (No VL pretraining) | 2022-08-13 |
| Scaling Up Vision-Language Pre-training for Image Captioning | | 42.6 | 145.5 | 31.4 | 25.5 | | | | | | LEMON | 2021-11-24 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ | 42.4 | 144.5 | | | | | | | | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 2023-01-30 |
| GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | ✓ | 42.4 | 144.2 | 30.6 | 24.3 | 60.7 | 84.2 | | | | GRIT (No VL pretraining, base) | 2022-07-20 |
| Prompt Tuning for Generative Multimodal Pretrained Models | ✓ | 41.81 | 141.4 | 31.51 | 24.42 | | | | | | Prompt Tuning | 2022-08-04 |
| Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | ✓ | 41.7 | 140.0 | 30.6 | 24.5 | | | | | | Oscar | 2020-04-13 |
| Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | ✓ | 41.4 | 139.9 | 30.4 | 24.0 | 60.4 | 83.4 | | | | Xmodal-Ctx | 2022-05-09 |
| Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | ✓ | 41.3 | 142.2 | | 24.9 | | | | | | Xmodal-Ctx + OSCAR | 2022-05-09 |
| Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | ✓ | 41.3 | 140.8 | | | | | | | | X-VLM (base) | 2021-11-16 |
| VinVL: Revisiting Visual Representations in Vision-Language Models | ✓ | 41.0 | 140.9 | 31.1 | 25.2 | | | | | | VinVL | 2021-01-02 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ | 40.9 | 143.6 | 33.9 | 24.7 | | | | | | CoCa | 2022-05-04 |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ | 40.6 | 143.3 | 33.4 | 25.4 | | | | | | SimVLM | 2021-08-24 |
| Prismer: A Vision-Language Model with Multi-Task Experts | ✓ | 40.4 | 136.5 | 31.4 | 24.4 | | | | | | Prismer | 2023-03-04 |
| Position-guided Text Prompt for Vision-Language Pre-training | ✓ | 40.1 | 135.0 | 30.4 | 23.7 | | | | | | PTP-BLIP (14M) | 2022-12-19 |
| L-Verse: Bidirectional Generation Between Image and Text | ✓ | 39.9 | | 31.4 | 23.3 | 60.4 | | | | | L-Verse | 2021-11-22 |
| Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | ✓ | 39.7 | 135.9 | 30.0 | 23.7 | 59.5 | 81.5 | | | | Xmodal-Ctx | 2022-05-09 |
| X-Linear Attention Networks for Image Captioning | ✓ | 39.7 | 132.8 | 29.5 | 23.4 | 59.1 | 80.9 | 65.8 | 51.5 | | X-Transformer | 2020-03-31 |
| Visual Commonsense R-CNN | ✓ | 39.5 | | 29.3 | | 59.3 | | | | | AoANet + VC | 2020-02-27 |
| A Better Variant of Self-Critical Sequence Training | ✓ | 39.4 | 129.6 | 28.9 | 22.8 | 58.7 | 80.7 | 65.6 | 51.3 | | Transformer_NSC | 2020-03-22 |
| Meshed-Memory Transformer for Image Captioning | ✓ | 39.1 | 131.2 | 29.2 | 22.6 | 58.6 | 80.8 | | | | Meshed-Memory Transformer | 2019-12-17 |
| LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | ✓ | 38.2 | 126.2 | 29.5 | | | | | | | LaDiC (30 steps) | 2024-04-16 |
| Fine-grained Image Captioning with CLIP Reward | ✓ | 38.2 | 124.9 | 28.7 | | 58.5 | | | | | CLIP Text Encoder (RL w/ CIDEr reward) | 2022-05-26 |
| RefineCap: Concept-Aware Refinement for Image Captioning | | 37.8 | 127.2 | 28.3 | 22.5 | 58.0 | 80.2 | 64.5 | 49.9 | | RefineCap (w/ REINFORCE) | 2021-09-08 |
| Reflective Decoding Network for Image Captioning | | 37.3 | 125.2 | 28.1 | | 57.4 | 80.2 | | | | RDN | 2019-08-30 |
| SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation | ✓ | 37.2 | 121.8 | 28.3 | 21.5 | | | | | | SmallCap (d=16, Large) | 2022-09-30 |
| ClipCap: CLIP Prefix for Image Captioning | ✓ | 33.53 | 113.08 | 27.45 | 21.05 | | | | | | ClipCap (Transformer) | 2021-11-18 |
| ClipCap: CLIP Prefix for Image Captioning | ✓ | 32.15 | 108.35 | 27.1 | 20.12 | | | | | | ClipCap (MLP + GPT2 tuning) | 2021-11-18 |
| Text-Only Training for Image Captioning using Noise-Injected CLIP | ✓ | 26.4 | 91.8 | 25.1 | | | | | | | CapDec | 2022-11-01 |
| From Captions to Visual Concepts and Back | ✓ | 25.7 | | 23.6 | | | | | | | From Captions to Visual Concepts and Back | 2014-11-18 |
| Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | | 16.7 | 58.3 | 19.7 | 13.4 | | | | | | VLKD (ViT-B/16) | 2021-11-16 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ | | 152.5 | | 25.7 | | | | | | VALOR | 2023-04-17 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ | | 149.0 | | 27.0 | | | | | | VAST | 2023-05-29 |
| VirTex: Learning Visual Representations from Textual Annotations | ✓ | | 94.0 | | 18.5 | | | | | | VirTex (ResNet-101) | 2020-06-11 |
| Language Is Not All You Need: Aligning Perception with Language Models | | | 84.7 | | 16.8 | | | | | | KOSMOS-1 (1.6B) (zero-shot) | 2023-02-27 |
| LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | ✓ | | | | 22.4 | 58.7 | | | | | LaDiC | 2024-04-16 |
| FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions | ✓ | | | | | | | | | 78.5 | BLIP-FuseCap | 2023-05-28 |
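All scores follow the usual convention of reporting raw metric values scaled by 100 (e.g., a raw CIDEr of 1.551 appears as 155.1); CLIPScore is a reference-free metric computed separately from the n-gram metrics below. As a rough illustration of where such numbers come from, here is a minimal sketch using the `pycocoevalcap` package (the standard COCO caption evaluation toolkit, `pip install pycocoevalcap`). The image id and captions are made-up placeholders, not output from any model in the table, and the tokenizer, METEOR, and SPICE shell out to the toolkit's bundled Java tools, so a JRE must be on the PATH.

```python
# Minimal sketch: scoring candidate captions against references with the
# COCO caption evaluation toolkit (pip install pycocoevalcap).
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

# References and candidates keyed by image id, in COCO annotation format:
# several human references per image, one generated caption per image.
# These captions are placeholders for illustration only.
refs = {1: [{"caption": "a dog runs across a grassy field"},
            {"caption": "a brown dog running on the grass"}]}
cands = {1: [{"caption": "a dog running through a field"}]}

tokenizer = PTBTokenizer()      # Penn Treebank tokenization, as in the official eval
gts = tokenizer.tokenize(refs)  # -> {1: ["a dog runs across a grassy field", ...]}
res = tokenizer.tokenize(cands)

scorers = [
    (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),  # one score per n-gram order
    (Meteor(), ["METEOR"]),
    (Rouge(), ["ROUGE-L"]),
    (Cider(), ["CIDEr"]),
    (Spice(), ["SPICE"]),
]
for scorer, names in scorers:
    score, _ = scorer.compute_score(gts, res)   # corpus-level score(s), per-image scores
    scores = score if isinstance(score, list) else [score]
    for name, value in zip(names, scores):
        # Leaderboards report these values scaled by 100.
        print(f"{name}: {100 * value:.2f}")
```

Note that with a single image the corpus-level CIDEr is not meaningful (its tf-idf weights are estimated over the whole test set); the leaderboard numbers come from running the same scorers over the full set of test images.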