Paper | Code | Accuracy (%) | Model | Release Date |
---|---|---|---|---|
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | ✓ Link | 91.00 | OFA | 2022-02-07 |
Prompt Tuning for Generative Multimodal Pretrained Models | ✓ Link | 90.04 | Prompt Tuning | 2022-08-04 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 87.00 | CoCa | 2022-05-04 |
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 86.21 | SimVLM | 2021-08-24 |
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | ✓ Link | 85.00 | SOHO | 2021-04-07 |
How Much Can CLIP Benefit Vision-and-Language Tasks? | ✓ Link | 80.20 | CLIP-ViL | 2021-07-13 |
Large-Scale Adversarial Training for Vision-and-Language Representation Learning | ✓ Link | 80.18 | VILLA-LARGE | 2020-06-11 |
UNITER: UNiversal Image-TExt Representation Learning | ✓ Link | 78.98 | UNITER | 2019-09-25 |
Visual Entailment: A Novel Task for Fine-Grained Image Understanding | ✓ Link | 70.81 | EVE-ROI* | 2019-01-20 |