Paper | Code | Text Score | Image Score | Group Score | Model | Date
--- | --- | --- | --- | --- | --- | ---
A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs | | 75.5 | 58.5 | 52 | GPT-4o + CA | 2025-01-23
The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | | 75.25 | 68.75 | 58.75 | GPT-4V (CoT, pick between two options) | 2023-11-15
The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | | 69.25 | 46.25 | 39.25 | GPT-4V (pick between two options) | 2023-11-15
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 64.25 | 52.5 | 50.75 | MMICL + CoCoT | 2024-01-05 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 58.5 | 49.5 | 44.5 | GPT-4V + CoCoT | 2024-01-05 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 58.25 | 55.25 | 41.5 | OpenFlamingo + CoCoT | 2024-01-05 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 54.5 | 42.5 | 37.75 | GPT-4V | 2024-01-05 |
Equivariant Similarity for Vision-Language Foundation Models | ✓ Link | 51.5 | 32.00 | 27.5 | FIBER (EqSim) | 2023-03-25 |
Equivariant Similarity for Vision-Language Foundation Models | ✓ Link | 51.25 | 26.50 | 23.00 | FIBER (finetuned, Flickr30k) | 2023-03-25 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 51 | 48 | 47.5 | MMICL + CCoT | 2024-01-05 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 47.5 | 47.25 | 39 | OpenFlamingo + DDCoT | 2024-01-05 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 47 | 42.2 | 30.5 | VQ2 | 2023-05-17 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 46.75 | 45 | 36.75 | MMICL + DDCoT | 2024-01-05 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 46.7 | 24.5 | 21.2 | X-VLM 16M | 2023-05-12 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 46.5 | 38 | 28.75 | PaLI (ft SNLI-VE + Synthetic Data) | 2023-05-17 |
Equivariant Similarity for Vision-Language Foundation Models | ✓ Link | 46.25 | 25.75 | 22.25 | FIBER | 2023-03-25 |
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | ✓ Link | 45.50 | 44.99 | 43.00 | MMICL (FLAN-T5-XXL) | 2023-09-14 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 45.00 | 41.50 | 28.70 | PaLI (ft SNLI-VE) | 2023-05-17 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 45 | 25 | 23.75 | Gemini + DDCoT | 2024-01-05 |
Equivariant Similarity for Vision-Language Foundation Models | ✓ Link | 45.0 | 22.75 | 18.75 | METER (EqSim) | 2023-03-25 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 44.0 | 26.7 | 21.5 | X-VLM 4M | 2023-05-12 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 44.00 | 26.00 | 23.50 | BLIP2 (ft COCO) | 2023-05-17 |
Prompting Large Vision-Language Models for Compositional Reasoning | ✓ Link | 43.5 | 28.7 | 18.2 | KeyComp* (GPT-4) | 2024-01-20 |
Equivariant Similarity for Vision-Language Foundation Models | ✓ Link | 43.5 | 20.75 | 14.75 | METER (finetuned, Flickr30k) | 2023-03-25 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 42.8 | 28.5 | 23.3 | BLIP2 (SGVL) | 2023-05-10 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 42.8 | 27.3 | 21.5 | BLIP (SGVL) | 2023-05-10 |
Prompting Large Vision-Language Models for Compositional Reasoning | ✓ Link | 42.7 | 27.8 | 17.4 | KeyComp* (GPT-3.5) | 2024-01-20 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 42.5 | 27.5 | 20 | OpenFlamingo + CCoT | 2024-01-05 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 42.5 | 24.0 | 18.5 | NegBLIP | 2023-05-10 |
Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 42.50 | 19.75 | 16.00 | IAIS large (Flickr30k) | 2022-12-03 |
Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ Link | 42.0 | 35.5 | 22.3 | LLaVA-1.5-CCoT | 2023-11-27 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 42.0 | 23.8 | 19.0 | BLIP2 | 2023-05-10 |
Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 41.75 | 19.75 | 15.50 | IAIS large (COCO) | 2022-12-03 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 41.5 | 26.0 | 20.5 | NegBLIP2 | 2023-05-10 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 40.5 | 25.5 | 19.0 | BLIP (+Graph Text, +Graph Neg) | 2023-05-10 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 40.3 | 20.5 | 16.5 | BLIP (+Graph Text) | 2023-05-10 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 40 | 32.5 | 27.75 | Gemini + CoCoT | 2024-01-05 |
Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 39.25 | 17.75 | 14.25 | CACR base | 2022-12-03 |
Equivariant Similarity for Vision-Language Foundation Models | ✓ Link | 39.25 | 15.75 | 12.00 | METER | 2023-03-25 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 39 | 41.25 | 33.25 | OpenFlamingo | 2024-01-05 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 39.0 | 19.2 | 15.0 | BLIP | 2023-05-10 |
n/a | | 38.00 | 38.00 | 38.00 | GPT-4V (image-caption match, answer yes/no, zero-shot) | |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 38.00 | 14.00 | 10.50 | UNITER large | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 37.75 | 17.75 | 14.50 | VinVL | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 37.00 | 13.25 | 11.00 | ViLLA large | 2022-04-07 |
Revisiting the Role of Language Priors in Vision-Language Models | ✓ Link | 36.5 | 21.5 | 16.8 | BLIP (VisualGPTScore, α-tuned) | 2023-06-02 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 36.5 | 18.5 | 14.5 | BLIP 14M | 2023-05-12 |
ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | | 36.5 | | | ViT-B/16 + BERT base + ViLEM | 2023-01-01 |
Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ Link | 36.0 | 33.3 | 20.1 | LLaVA-1.5 | 2023-11-27 |
Revisiting the Role of Language Priors in Vision-Language Models | ✓ Link | 35.8 | 15.8 | 13.3 | BLIP (ITM) | 2023-06-02 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 35.5 | 15.0 | 11.7 | BLIP 129M | 2023-05-12 |
Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 35.25 | 15.25 | 12.25 | ROSITA (Flickr30k) | 2022-12-03 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 34.75 | 14.00 | 9.25 | ViLT (ViT-B/32) | 2022-04-07 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 34.7 | 15.2 | 12.2 | BLIP 129M (CapFilt/L) | 2023-05-12 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 34.7 | 14.5 | 12.2 | BLIP-ViT/L 129M | 2023-05-12 |
Your Diffusion Model is Secretly a Zero-Shot Classifier | ✓ Link | 34.00 | | | Diffusion Classifier (zero-shot) | 2023-03-28 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 33.2 | 15.7 | 12.2 | PEVL 14M | 2023-05-12 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 32.5 | 16.2 | 12.7 | ALBEF 14M | 2023-05-12 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 32.25 | 20.50 | 14.25 | FLAVA (ITM) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 32.25 | 13.25 | 10.00 | UNITER base | 2022-04-07 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 32.0 | 14.0 | 9.8 | CLIP (SGVL) | 2023-05-10 |
ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | | 31.2 | | | ViT-B/16 + BERT base | 2023-01-01 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 30.75 | 26 | 25 | Gemini | 2024-01-05 |
SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 30.75 | 12.75 | | OCLIP (ViT-H/14) | 2023-11-17 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 30.75 | 10.50 | 8.00 | CLIP (ViT-B/32) | 2022-04-07 |
Simple Token-Level Confidence Improves Caption Correctness | | 30.75 | 10.25 | 7.25 | OFA large (ITM) | 2023-05-11 |
Prompting Large Vision-Language Models for Compositional Reasoning | ✓ Link | 30.3 | 24.6 | 12.4 | KeyComp (GPT-3.5) | 2024-01-20 |
SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 30.25 | 8.0 | | CLIP (ViT-L/14) | 2023-11-17 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 30.00 | 12.00 | 8.00 | ViLLA base | 2022-04-07 |
Going Beyond Nouns With Vision & Language Models Using Synthetic Data | ✓ Link | 30.00 | 11.50 | 9.50 | syn-CLIP | 2023-03-30 |
Going Beyond Nouns With Vision & Language Models Using Synthetic Data | ✓ Link | 30.00 | 10.75 | 8.25 | syn-CyCLIP | 2023-03-30 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 29.5 | 10.5 | 8.0 | NegCLIP | 2023-05-10 |
Simple Token-Level Confidence Improves Caption Correctness | | 29.25 | 27.00 | 17.50 | OFA large (TLC-A) | 2023-05-11 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 29.2 | 15.5 | 11.0 | ALBEF 4M | 2023-05-12 |
SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 29.00 | 13.50 | | LDM-T5 (SelfEval) | 2023-11-17 |
Going Beyond Nouns With Vision & Language Models Using Synthetic Data | ✓ Link | 28.50 | 9.50 | 7.25 | CyCLIP | 2023-03-30 |
SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 28.25 | 12.00 | | PDM-T5 (SelfEval) | 2023-11-17 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 28.25 | 11.50 | 8.25 | CoCa ViT-L/14 (ft COCO) | 2023-05-17
Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ Link | 28.0 | 22.5 | 12.3 | LLaVA-1.5-ZS-CoT | 2023-11-27 |
Revisiting the Role of Language Priors in Vision-Language Models | ✓ Link | 28.0 | 9.0 | 6.5 | BLIP (ITC) | 2023-06-02 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 27.70 | 14.30 | 9.00 | OFA large (ft SNLI-VE) | 2023-05-17 |
Simple Token-Level Confidence Improves Caption Correctness | | 26.75 | 10.75 | 6.50 | OFA base (ITM) | 2023-05-11 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 26.50 | 13.75 | 10.25 | CLIP RN50x64 | 2023-05-17 |
An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ Link | 25.50 | 17.00 | 10.50 | LLaVA-7B (GPTScore) | 2023-08-21 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 25.25 | 13.50 | 9.00 | FLAVA (contrastive) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 25.00 | 25.00 | 16.67 | Random chance | 2022-04-07 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 24.8 | 25.0 | 13.0 | LLaVA | 2023-05-10 |
Simple Token-Level Confidence Improves Caption Correctness | | 24.50 | 23.50 | 13.75 | OFA base (TLC-A) | 2023-05-11 |
An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ Link | 24.50 | 21.75 | 11.50 | MiniGPT-4-7B (GPTScore) | 2023-08-21 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 23.75 | 7.25 | 4.75 | ViLBERT base | 2022-04-07 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 23.3 | 18.0 | 9.5 | MiniGPT-4 | 2023-05-10 |
An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ Link | 23.25 | 18.00 | 9.50 | MiniGPT-4-7B (VisualGPTScore) | 2023-08-21 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 22.75 | 8.00 | 4.00 | VSE++ (COCO, ResNet) | 2022-04-07 |
Simple Token-Level Confidence Improves Caption Correctness | | 22.75 | 7.75 | 4.50 | OFA tiny (ITM) | 2023-05-11 |
SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 22.75 | 7.25 | | LDM-CLIP (SelfEval) | 2023-11-17 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 22.5 | 33 | 20.75 | Gemini + CCoT | 2024-01-05 |
Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ Link | 21.0 | 21.3 | 8.3 | InstructBLIP-CCoT | 2023-11-27 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 20.00 | 5.00 | 3.50 | VSRN (Flickr30k) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 20.00 | 5.00 | 2.75 | VSE++ (Flickr30k, ResNet) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 19.75 | 6.25 | 4.50 | VSE++ (Flickr30k, VGG) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 19.50 | 6.25 | 4.00 | UniT (ITM finetuned) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 19.25 | 7.00 | 4.00 | LXMERT | 2022-04-07 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 19.00 | 12.50 | 11.30 | TIFA | 2023-05-17 |
n/a | | 18.75 | 22.5 | 8.0 | IDEFICS 80B | |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 18.75 | 5.50 | 3.50 | VSE++ (COCO, VGG) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 17.50 | 7.00 | 3.75 | VSRN (COCO) | 2022-04-07 |
SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 17.00 | 14.00 | | PDM-CLIP (SelfEval) | 2023-11-17 |
n/a | | 16.8 | 20.8 | 5.0 | IDEFICS 9B | |
Simple Token-Level Confidence Improves Caption Correctness | | 16.50 | 15.75 | 6.75 | OFA tiny (TLC-A) | 2023-05-11 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 15.50 | 2.50 | 1.50 | VisualBERT base | 2022-04-07 |
An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ Link | 14.00 | 8.00 | 2.75 | MiniGPT-4-7B (BERTScore) | 2023-08-21 |
An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ Link | 13.50 | 5.25 | 2.25 | LLaVA-7B (BERTScore) | 2023-08-21 |
Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ Link | 9.3 | 16.3 | 4.0 | InstructBLIP-ZS-CoT | 2023-11-27 |
Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ Link | 7.0 | 11.5 | 3.3 | InstructBLIP | 2023-11-27 |
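
The score columns follow the Winoground evaluation protocol (see the "Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality" rows above): each example pairs two captions (C0, C1) with two images (I0, I1), and a model assigns a score to each caption–image combination. The text score requires picking the right caption for each image, the image score requires picking the right image for each caption, and the group score requires both. Below is a minimal sketch of the three metrics in Python; the function names and the generic scoring function `s(caption, image)` are illustrative assumptions, not any particular model's API:

```python
def text_score(s, c0, i0, c1, i1):
    # Text score: for each image, the matching caption must score higher
    # than the distractor caption.
    return s(c0, i0) > s(c1, i0) and s(c1, i1) > s(c0, i1)

def image_score(s, c0, i0, c1, i1):
    # Image score: for each caption, the matching image must score higher
    # than the distractor image.
    return s(c0, i0) > s(c0, i1) and s(c1, i1) > s(c1, i0)

def group_score(s, c0, i0, c1, i1):
    # Group score: the example counts only if both directions are correct.
    return text_score(s, c0, i0, c1, i1) and image_score(s, c0, i0, c1, i1)

def evaluate(s, examples):
    # Average each metric over (c0, i0, c1, i1) examples, as percentages.
    n = len(examples)
    return {
        "text": 100 * sum(text_score(s, *ex) for ex in examples) / n,
        "image": 100 * sum(image_score(s, *ex) for ex in examples) / n,
        "group": 100 * sum(group_score(s, *ex) for ex in examples) / n,
    }
```

Under these definitions, a scorer that guesses at random passes each directional test with probability 1/4, and both with probability 1/6 (the two correct pairs must outrank both mismatched pairs), which matches the "Random chance" row above (25.00 / 25.00 / 16.67).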