Paper | Code | BLEU-4 | CIDEr | METEOR | SPICE | Model Name | Release Date |
---|---|---|---|---|---|---|---|
Unified Vision-Language Pre-Training for Image Captioning and VQA | ✓ Link | 30.1 | 67.4 | 23 | 17 | Unified VLP | 2019-09-24 |
Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention | - | 21.3 | 46.4 | 20.0 | - | Cornia et al. | 2017-06-26 |
Deep Visual-Semantic Alignments for Generating Image Descriptions | ✓ Link | 15.7 | 24.7 | 15.3 | - | BRNN | 2014-12-07 |
- | - | - | 67.1 | - | 14.5 | KOSMOS-1 1.6B (zero-shot) | - |
Language Models are General-Purpose Interfaces | ✓ Link | - | 43.3 | - | 11.7 | MetaLM | 2022-06-13 |
A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | ✓ Link | - | 31.0 | - | 10.0 | FewVLM | 2021-10-16 |
Unifying Vision-and-Language Tasks via Text Generation | ✓ Link | - | 2.6 | - | 2.0 | VL-T5 | 2021-02-04 |