Paper | Code | Accuracy (%) | Model Name | Release Date |
---|---|---|---|---|
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | ✓ Link | 91.2 | OFA | 2022-02-07 |
Prompt Tuning for Generative Multimodal Pretrained Models | ✓ Link | 90.12 | Prompt Tuning | 2022-08-04 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 87.1 | CoCa | 2022-05-04 |
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ✓ Link | 86.32 | SimVLM | 2021-08-24 |
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | ✓ Link | 84.95 | SOHO | 2021-04-07 |
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks | | 80.32 | MAD (Single Model, Formerly CLIP-TD) | 2022-04-22 |
UNITER: UNiversal Image-TExt Representation Learning | ✓ Link | 78.98 | UNITER (Large) | 2019-09-25 |
Visual Entailment: A Novel Task for Fine-Grained Image Understanding | ✓ Link | 70.47 | EVE-ROI* | 2019-01-20 |