Paper | Code | Human (%) | Accuracy | Model Name | Release Date |
---|---|---|---|---|---|
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | | 68 | | Ground-truth Caption -> GPT3 (Oracle) | 2023-03-13 |
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | | 33 | | Predicted Caption -> GPT3 | 2023-03-13 |
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | | 27 | | BLIP2 FlanT5-XXL (Fine-tuned) | 2023-03-13 |
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | | 15 | | BLIP2 FlanT5-XL (Fine-tuned) | 2023-03-13 |
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | | 0 | | BLIP2 FlanT5-XXL (Zero-shot) | 2023-03-13 |
VLIS: Unimodal Language Models Guide Multimodal Language Generation | ✓ Link | 80 | | VLIS (Lynx) | 2023-10-15 |
VLIS: Unimodal Language Models Guide Multimodal Language Generation | ✓ Link | 73 | | VLIS (LLaVA) | 2023-10-15 |