OpenCodePapers

visual-reasoning-on-winoground

Visual Reasoning
Dataset: Winoground
Results over time
(Chart: Text, Image, and Group Scores plotted against model release dates.)
Leaderboard
| Paper | Code | Text Score | Image Score | Group Score | Model Name | Release Date |
|---|---|---|---|---|---|---|
| A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs | | 75.5 | 58.5 | 52 | GPT-4o + CA | 2025-01-23 |
| The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | | 75.25 | 68.75 | 58.75 | GPT-4V (CoT, pick b/w two options) | 2023-11-15 |
| The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | | 69.25 | 46.25 | 39.25 | GPT-4V (pick b/w two options) | 2023-11-15 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ | 64.25 | 52.5 | 50.75 | MMICL + CoCoT | 2024-01-05 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ | 58.5 | 49.5 | 44.5 | GPT-4V + CoCoT | 2024-01-05 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ | 58.25 | 55.25 | 41.5 | OpenFlamingo + CoCoT | 2024-01-05 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ | 54.5 | 42.5 | 37.75 | GPT-4V | 2024-01-05 |
| Equivariant Similarity for Vision-Language Foundation Models | ✓ | 51.5 | 32.00 | 27.5 | FIBER (EqSim) | 2023-03-25 |
| Equivariant Similarity for Vision-Language Foundation Models | ✓ | 51.25 | 26.50 | 23.00 | FIBER (finetuned, Flickr30k) | 2023-03-25 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ | 51 | 48 | 47.5 | MMICL + CCoT | 2024-01-05 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ | 47.5 | 47.25 | 39 | OpenFlamingo + DDCoT | 2024-01-05 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ | 47 | 42.2 | 30.5 | VQ2 | 2023-05-17 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ | 46.75 | 45 | 36.75 | MMICL + DDCoT | 2024-01-05 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ | 46.7 | 24.5 | 21.2 | X-VLM 16M | 2023-05-12 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ | 46.5 | 38 | 28.75 | PaLI (ft SNLI-VE + Synthetic Data) | 2023-05-17 |
| Equivariant Similarity for Vision-Language Foundation Models | ✓ | 46.25 | 25.75 | 22.25 | FIBER | 2023-03-25 |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | ✓ | 45.50 | 44.99 | 43.00 | MMICL (FLAN-T5-XXL) | 2023-09-14 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ | 45.00 | 41.50 | 28.70 | PaLI (ft SNLI-VE) | 2023-05-17 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ | 45 | 25 | 23.75 | Gemini + DDCoT | 2024-01-05 |
| Equivariant Similarity for Vision-Language Foundation Models | ✓ | 45.0 | 22.75 | 18.75 | METER (EqSim) | 2023-03-25 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ | 44.0 | 26.7 | 21.5 | X-VLM 4M | 2023-05-12 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ | 44.00 | 26.00 | 23.50 | BLIP2 (ft COCO) | 2023-05-17 |
| Prompting Large Vision-Language Models for Compositional Reasoning | ✓ | 43.5 | 28.7 | 18.2 | KeyComp* (GPT-4) | 2024-01-20 |
| Equivariant Similarity for Vision-Language Foundation Models | ✓ | 43.5 | 20.75 | 14.75 | METER (finetuned, Flickr30k) | 2023-03-25 |
| Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 42.8 | 28.5 | 23.3 | BLIP2 (SGVL) | 2023-05-10 |
| Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 42.8 | 27.3 | 21.5 | BLIP (SGVL) | 2023-05-10 |
| Prompting Large Vision-Language Models for Compositional Reasoning | ✓ | 42.7 | 27.8 | 17.4 | KeyComp* (GPT-3.5) | 2024-01-20 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ | 42.5 | 27.5 | 20 | OpenFlamingo + CCoT | 2024-01-05 |
| Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 42.5 | 24.0 | 18.5 | NegBLIP | 2023-05-10 |
| Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 42.50 | 19.75 | 16.00 | IAIS large (Flickr30k) | 2022-12-03 |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ | 42.0 | 35.5 | 22.3 | LLaVA-1.5-CCoT | 2023-11-27 |
| Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 42.0 | 23.8 | 19.0 | BLIP2 | 2023-05-10 |
| Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 41.75 | 19.75 | 15.50 | IAIS large (COCO) | 2022-12-03 |
| Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 41.5 | 26.0 | 20.5 | NegBLIP2 | 2023-05-10 |
| Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 40.5 | 25.5 | 19.0 | BLIP (+Graph Text, +Graph Neg) | 2023-05-10 |
| Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 40.3 | 20.5 | 16.5 | BLIP (+Graph Text) | 2023-05-10 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ | 40 | 32.5 | 27.75 | Gemini + CoCoT | 2024-01-05 |
| Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 39.25 | 17.75 | 14.25 | CACR base | 2022-12-03 |
| Equivariant Similarity for Vision-Language Foundation Models | ✓ | 39.25 | 15.75 | 12.00 | METER | 2023-03-25 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ | 39 | 41.25 | 33.25 | OpenFlamingo | 2024-01-05 |
| Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 39.0 | 19.2 | 15.0 | BLIP | 2023-05-10 |
| | | 38.00 | 38.00 | 38.00 | GPT-4V (image-caption match answer yes/no, zero-shot) | |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 38.00 | 14.00 | 10.50 | UNITER large | 2022-04-07 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 37.75 | 17.75 | 14.50 | VinVL | 2022-04-07 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 37.00 | 13.25 | 11.00 | ViLLA large | 2022-04-07 |
| Revisiting the Role of Language Priors in Vision-Language Models | ✓ | 36.5 | 21.5 | 16.8 | BLIP (VisualGPTScore, α-tuned) | 2023-06-02 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ | 36.5 | 18.5 | 14.5 | BLIP 14M | 2023-05-12 |
| ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | | 36.5 | | | ViT-B/16 + BERT base + ViLEM | 2023-01-01 |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ | 36.0 | 33.3 | 20.1 | LLaVA-1.5 | 2023-11-27 |
| Revisiting the Role of Language Priors in Vision-Language Models | ✓ | 35.8 | 15.8 | 13.3 | BLIP (ITM) | 2023-06-02 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ | 35.5 | 15.0 | 11.7 | BLIP 129M | 2023-05-12 |
| Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 35.25 | 15.25 | 12.25 | ROSITA (Flickr30k) | 2022-12-03 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 34.75 | 14.00 | 9.25 | ViLT (ViT-B/32) | 2022-04-07 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ | 34.7 | 15.2 | 12.2 | BLIP 129M (CapFilt/L) | 2023-05-12 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ | 34.7 | 14.5 | 12.2 | BLIP-ViT/L 129M | 2023-05-12 |
| Your Diffusion Model is Secretly a Zero-Shot Classifier | ✓ | 34.00 | | | Diffusion Classifier (zero-shot) | 2023-03-28 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ | 33.2 | 15.7 | 12.2 | PEVL 14M | 2023-05-12 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ | 32.5 | 16.2 | 12.7 | ALBEF 14M | 2023-05-12 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 32.25 | 20.50 | 14.25 | FLAVA (ITM) | 2022-04-07 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 32.25 | 13.25 | 10.00 | UNITER base | 2022-04-07 |
| Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 32.0 | 14.0 | 9.8 | CLIP (SGVL) | 2023-05-10 |
| ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | | 31.2 | | | ViT-B/16 + BERT base | 2023-01-01 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ | 30.75 | 26 | 25 | Gemini | 2024-01-05 |
| SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 30.75 | 12.75 | | OCLIP (ViT-H/14) | 2023-11-17 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 30.75 | 10.50 | 8.00 | CLIP (ViT-B/32) | 2022-04-07 |
| Simple Token-Level Confidence Improves Caption Correctness | | 30.75 | 10.25 | 7.25 | OFA large (ITM) | 2023-05-11 |
| Prompting Large Vision-Language Models for Compositional Reasoning | ✓ | 30.3 | 24.6 | 12.4 | KeyComp (GPT-3.5) | 2024-01-20 |
| SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 30.25 | 8.0 | | CLIP (ViT-L/14) | 2023-11-17 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 30.00 | 12.00 | 8.00 | ViLLA base | 2022-04-07 |
| Going Beyond Nouns With Vision & Language Models Using Synthetic Data | ✓ | 30.00 | 11.50 | 9.50 | syn-CLIP | 2023-03-30 |
| Going Beyond Nouns With Vision & Language Models Using Synthetic Data | ✓ | 30.00 | 10.75 | 8.25 | syn-CyCLIP | 2023-03-30 |
| Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 29.5 | 10.5 | 8.0 | NegCLIP | 2023-05-10 |
| Simple Token-Level Confidence Improves Caption Correctness | | 29.25 | 27.00 | 17.50 | OFA large (TLC-A) | 2023-05-11 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ | 29.2 | 15.5 | 11.0 | ALBEF 4M | 2023-05-12 |
| SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 29.00 | 13.50 | | LDM-T5 (SelfEval) | 2023-11-17 |
| Going Beyond Nouns With Vision & Language Models Using Synthetic Data | ✓ | 28.50 | 9.50 | 7.25 | CyCLIP | 2023-03-30 |
| SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 28.25 | 12.00 | | PDM-T5 (SelfEval) | 2023-11-17 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ | 28.25 | 11.50 | 8.25 | CoCa ViT-L/14 (ft on COCO) | 2023-05-17 |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ | 28.0 | 22.5 | 12.3 | LLaVA-1.5-ZS-CoT | 2023-11-27 |
| Revisiting the Role of Language Priors in Vision-Language Models | ✓ | 28.0 | 9.0 | 6.5 | BLIP (ITC) | 2023-06-02 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ | 27.70 | 14.30 | 9.00 | OFA large (ft SNLI-VE) | 2023-05-17 |
| Simple Token-Level Confidence Improves Caption Correctness | | 26.75 | 10.75 | 6.50 | OFA base (ITM) | 2023-05-11 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ | 26.50 | 13.75 | 10.25 | CLIP RN50x64 | 2023-05-17 |
| An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ | 25.50 | 17.00 | 10.50 | LLaVA-7B (GPTScore) | 2023-08-21 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 25.25 | 13.50 | 9.00 | FLAVA (contrastive) | 2022-04-07 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 25.00 | 25.00 | 16.67 | Random chance | 2022-04-07 |
| Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 24.8 | 25.0 | 13.0 | LLaVA | 2023-05-10 |
| Simple Token-Level Confidence Improves Caption Correctness | | 24.50 | 23.50 | 13.75 | OFA base (TLC-A) | 2023-05-11 |
| An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ | 24.50 | 21.75 | 11.50 | MiniGPT-4-7B (GPTScore) | 2023-08-21 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 23.75 | 7.25 | 4.75 | ViLBERT base | 2022-04-07 |
| Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 23.3 | 18.0 | 9.5 | MiniGPT-4 | 2023-05-10 |
| An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ | 23.25 | 18.00 | 9.50 | MiniGPT-4-7B (VisualGPTScore) | 2023-08-21 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 22.75 | 8.00 | 4.00 | VSE++ (COCO, ResNet) | 2022-04-07 |
| Simple Token-Level Confidence Improves Caption Correctness | | 22.75 | 7.75 | 4.50 | OFA tiny (ITM) | 2023-05-11 |
| SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 22.75 | 7.25 | | LDM-CLIP (SelfEval) | 2023-11-17 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ | 22.5 | 33 | 20.75 | Gemini + CCoT | 2024-01-05 |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ | 21.0 | 21.3 | 8.3 | InstructBLIP-CCoT | 2023-11-27 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 20.00 | 5.00 | 3.50 | VSRN (Flickr30k) | 2022-04-07 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 20.00 | 5.00 | 2.75 | VSE++ (Flickr30k, ResNet) | 2022-04-07 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 19.75 | 6.25 | 4.50 | VSE++ (Flickr30k, VGG) | 2022-04-07 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 19.50 | 6.25 | 4.00 | UniT (ITM finetuned) | 2022-04-07 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 19.25 | 7.00 | 4.00 | LXMERT | 2022-04-07 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ | 19.00 | 12.50 | 11.30 | TIFA | 2023-05-17 |
| | | 18.75 | 22.5 | 8.0 | IDEFICS 80B | |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 18.75 | 5.50 | 3.50 | VSE++ (COCO, VGG) | 2022-04-07 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 17.50 | 7.00 | 3.75 | VSRN (COCO) | 2022-04-07 |
| SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 17.00 | 14.00 | | PDM-CLIP (SelfEval) | 2023-11-17 |
| | | 16.8 | 20.8 | 5.0 | IDEFICS 9B | |
| Simple Token-Level Confidence Improves Caption Correctness | | 16.50 | 15.75 | 6.75 | OFA tiny (TLC-A) | 2023-05-11 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ | 15.50 | 2.50 | 1.50 | VisualBERT base | 2022-04-07 |
| An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ | 14.00 | 8.00 | 2.75 | MiniGPT-4-7B (BERTScore) | 2023-08-21 |
| An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ | 13.50 | 5.25 | 2.25 | LLaVA-7B (BERTScore) | 2023-08-21 |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ | 9.3 | 16.3 | 4.0 | InstructBLIP-ZS-CoT | 2023-11-27 |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ | 7.0 | 11.5 | 3.3 | InstructBLIP | 2023-11-27 |