Paper | Code | Text Score | Image Score | Group Score | Model | Date
--- | --- | --- | --- | --- | --- | ---
A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs | | 75.5 | 58.5 | 52 | GPT-4o + CA | 2025-01-23
The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | | 75.25 | 68.75 | 58.75 | GPT-4V (CoT, pick between two options) | 2023-11-15
The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | | 69.25 | 46.25 | 39.25 | GPT-4V (pick between two options) | 2023-11-15
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 64.25 | 52.5 | 50.75 | MMICL + CoCoT | 2024-01-05 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 58.5 | 49.5 | 44.5 | GPT-4V + CoCoT | 2024-01-05 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 58.25 | 55.25 | 41.5 | OpenFlamingo + CoCoT | 2024-01-05 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 54.5 | 42.5 | 37.75 | GPT-4V | 2024-01-05 |
Equivariant Similarity for Vision-Language Foundation Models | ✓ Link | 51.5 | 32.00 | 27.5 | FIBER (EqSim) | 2023-03-25 |
Equivariant Similarity for Vision-Language Foundation Models | ✓ Link | 51.25 | 26.50 | 23.00 | FIBER (finetuned, Flickr30k) | 2023-03-25 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 51 | 48 | 47.5 | MMICL + CCoT | 2024-01-05 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 47.5 | 47.25 | 39 | OpenFlamingo + DDCoT | 2024-01-05 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 47 | 42.2 | 30.5 | VQ2 | 2023-05-17 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 46.75 | 45 | 36.75 | MMICL + DDCoT | 2024-01-05 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 46.7 | 24.5 | 21.2 | X-VLM 16M | 2023-05-12 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 46.5 | 38 | 28.75 | PaLI (ft SNLI-VE + Synthetic Data) | 2023-05-17 |
Equivariant Similarity for Vision-Language Foundation Models | ✓ Link | 46.25 | 25.75 | 22.25 | FIBER | 2023-03-25 |
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | ✓ Link | 45.50 | 44.99 | 43.00 | MMICL (FLAN-T5-XXL) | 2023-09-14 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 45.00 | 41.50 | 28.70 | PaLI (ft SNLI-VE) | 2023-05-17 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 45 | 25 | 23.75 | Gemini + DDCoT | 2024-01-05 |
Equivariant Similarity for Vision-Language Foundation Models | ✓ Link | 45.0 | 22.75 | 18.75 | METER (EqSim) | 2023-03-25 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 44.0 | 26.7 | 21.5 | X-VLM 4M | 2023-05-12 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 44.00 | 26.00 | 23.50 | BLIP2 (ft COCO) | 2023-05-17 |
Prompting Large Vision-Language Models for Compositional Reasoning | ✓ Link | 43.5 | 28.7 | 18.2 | KeyComp* (GPT-4) | 2024-01-20 |
Equivariant Similarity for Vision-Language Foundation Models | ✓ Link | 43.5 | 20.75 | 14.75 | METER (finetuned, Flickr30k) | 2023-03-25 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 42.8 | 28.5 | 23.3 | BLIP2 (SGVL) | 2023-05-10 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 42.8 | 27.3 | 21.5 | BLIP (SGVL) | 2023-05-10 |
Prompting Large Vision-Language Models for Compositional Reasoning | ✓ Link | 42.7 | 27.8 | 17.4 | KeyComp* (GPT-3.5) | 2024-01-20 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 42.5 | 27.5 | 20 | OpenFlamingo + CCoT | 2024-01-05 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 42.5 | 24.0 | 18.5 | NegBLIP | 2023-05-10 |
Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 42.50 | 19.75 | 16.00 | IAIS large (Flickr30k) | 2022-12-03 |
Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ Link | 42.0 | 35.5 | 22.3 | LLaVA-1.5-CCoT | 2023-11-27 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 42.0 | 23.8 | 19.0 | BLIP2 | 2023-05-10 |
Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 41.75 | 19.75 | 15.50 | IAIS large (COCO) | 2022-12-03 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 41.5 | 26.0 | 20.5 | NegBLIP2 | 2023-05-10 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 40.5 | 25.5 | 19.0 | BLIP (+Graph Text, +Graph Neg) | 2023-05-10 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 40.3 | 20.5 | 16.5 | BLIP (+Graph Text) | 2023-05-10 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 40 | 32.5 | 27.75 | Gemini + CoCoT | 2024-01-05 |
Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 39.25 | 17.75 | 14.25 | CACR base | 2022-12-03 |
Equivariant Similarity for Vision-Language Foundation Models | ✓ Link | 39.25 | 15.75 | 12.00 | METER | 2023-03-25 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 39 | 41.25 | 33.25 | OpenFlamingo | 2024-01-05 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 39.0 | 19.2 | 15.0 | BLIP | 2023-05-10 |
n/a | | 38.00 | 38.00 | 38.00 | GPT-4V (image-caption match, answer yes/no, zero-shot) | |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 38.00 | 14.00 | 10.50 | UNITER large | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 37.75 | 17.75 | 14.50 | VinVL | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 37.00 | 13.25 | 11.00 | ViLLA large | 2022-04-07 |
Revisiting the Role of Language Priors in Vision-Language Models | ✓ Link | 36.5 | 21.5 | 16.8 | BLIP (VisualGPTScore, α-tuned) | 2023-06-02 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 36.5 | 18.5 | 14.5 | BLIP 14M | 2023-05-12 |
ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | | 36.5 | | | ViT-B/16 + BERT base + ViLEM | 2023-01-01 |
Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ Link | 36.0 | 33.3 | 20.1 | LLaVA-1.5 | 2023-11-27 |
Revisiting the Role of Language Priors in Vision-Language Models | ✓ Link | 35.8 | 15.8 | 13.3 | BLIP (ITM) | 2023-06-02 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 35.5 | 15.0 | 11.7 | BLIP 129M | 2023-05-12 |
Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 35.25 | 15.25 | 12.25 | ROSITA (Flickr30k) | 2022-12-03 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 34.75 | 14.00 | 9.25 | ViLT (ViT-B/32) | 2022-04-07 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 34.7 | 15.2 | 12.2 | BLIP 129M (CapFilt/L) | 2023-05-12 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 34.7 | 14.5 | 12.2 | BLIP-ViT/L 129M | 2023-05-12 |
Your Diffusion Model is Secretly a Zero-Shot Classifier | ✓ Link | 34.00 | | | Diffusion Classifier (zero-shot) | 2023-03-28 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 33.2 | 15.7 | 12.2 | PEVL 14M | 2023-05-12 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 32.5 | 16.2 | 12.7 | ALBEF 14M | 2023-05-12 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 32.25 | 20.50 | 14.25 | FLAVA (ITM) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 32.25 | 13.25 | 10.00 | UNITER base | 2022-04-07 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 32.0 | 14.0 | 9.8 | CLIP (SGVL) | 2023-05-10 |
ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | | 31.2 | | | ViT-B/16 + BERT base | 2023-01-01 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 30.75 | 26 | 25 | Gemini | 2024-01-05 |
SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 30.75 | 12.75 | | OCLIP (ViT-H/14) | 2023-11-17 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 30.75 | 10.50 | 8.00 | CLIP (ViT-B/32) | 2022-04-07 |
Simple Token-Level Confidence Improves Caption Correctness | | 30.75 | 10.25 | 7.25 | OFA large (ITM) | 2023-05-11 |
Prompting Large Vision-Language Models for Compositional Reasoning | ✓ Link | 30.3 | 24.6 | 12.4 | KeyComp (GPT-3.5) | 2024-01-20 |
SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 30.25 | 8.0 | | CLIP (ViT-L/14) | 2023-11-17 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 30.00 | 12.00 | 8.00 | ViLLA base | 2022-04-07 |
Going Beyond Nouns With Vision & Language Models Using Synthetic Data | ✓ Link | 30.00 | 11.50 | 9.50 | syn-CLIP | 2023-03-30 |
Going Beyond Nouns With Vision & Language Models Using Synthetic Data | ✓ Link | 30.00 | 10.75 | 8.25 | syn-CyCLIP | 2023-03-30 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 29.5 | 10.5 | 8.0 | NegCLIP | 2023-05-10 |
Simple Token-Level Confidence Improves Caption Correctness | | 29.25 | 27.00 | 17.50 | OFA large (TLC-A) | 2023-05-11 |
Measuring Progress in Fine-grained Vision-and-Language Understanding | ✓ Link | 29.2 | 15.5 | 11.0 | ALBEF 4M | 2023-05-12 |
SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 29.00 | 13.50 | | LDM-T5 (SelfEval) | 2023-11-17 |
Going Beyond Nouns With Vision & Language Models Using Synthetic Data | ✓ Link | 28.50 | 9.50 | 7.25 | CyCLIP | 2023-03-30 |
SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 28.25 | 12.00 | | PDM-T5 (SelfEval) | 2023-11-17 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 28.25 | 11.50 | 8.25 | CoCa ViT-L/14 (ft COCO) | 2023-05-17
Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ Link | 28.0 | 22.5 | 12.3 | LLaVA-1.5-ZS-CoT | 2023-11-27 |
Revisiting the Role of Language Priors in Vision-Language Models | ✓ Link | 28.0 | 9.0 | 6.5 | BLIP (ITC) | 2023-06-02 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 27.70 | 14.30 | 9.00 | OFA large (ft SNLI-VE) | 2023-05-17 |
Simple Token-Level Confidence Improves Caption Correctness | | 26.75 | 10.75 | 6.50 | OFA base (ITM) | 2023-05-11 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 26.50 | 13.75 | 10.25 | CLIP RN50x64 | 2023-05-17 |
An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ Link | 25.50 | 17.00 | 10.50 | LLaVA-7B (GPTScore) | 2023-08-21 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 25.25 | 13.50 | 9.00 | FLAVA (contrastive) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 25.00 | 25.00 | 16.67 | Random chance | 2022-04-07 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 24.8 | 25.0 | 13.0 | LLaVA | 2023-05-10 |
Simple Token-Level Confidence Improves Caption Correctness | | 24.50 | 23.50 | 13.75 | OFA base (TLC-A) | 2023-05-11 |
An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ Link | 24.50 | 21.75 | 11.50 | MiniGPT-4-7B (GPTScore) | 2023-08-21 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 23.75 | 7.25 | 4.75 | ViLBERT base | 2022-04-07 |
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 23.3 | 18.0 | 9.5 | MiniGPT-4 | 2023-05-10 |
An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ Link | 23.25 | 18.00 | 9.50 | MiniGPT-4-7B (VisualGPTScore) | 2023-08-21 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 22.75 | 8.00 | 4.00 | VSE++ (COCO, ResNet) | 2022-04-07 |
Simple Token-Level Confidence Improves Caption Correctness | | 22.75 | 7.75 | 4.50 | OFA tiny (ITM) | 2023-05-11 |
SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 22.75 | 7.25 | | LDM-CLIP (SelfEval) | 2023-11-17 |
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | ✓ Link | 22.5 | 33 | 20.75 | Gemini + CCoT | 2024-01-05 |
Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ Link | 21.0 | 21.3 | 8.3 | InstructBLIP-CCoT | 2023-11-27 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 20.00 | 5.00 | 3.50 | VSRN (Flickr30k) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 20.00 | 5.00 | 2.75 | VSE++ (Flickr30k, ResNet) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 19.75 | 6.25 | 4.50 | VSE++ (Flickr30k, VGG) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 19.50 | 6.25 | 4.00 | UniT (ITM finetuned) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 19.25 | 7.00 | 4.00 | LXMERT | 2022-04-07 |
What You See is What You Read? Improving Text-Image Alignment Evaluation | ✓ Link | 19.00 | 12.50 | 11.30 | TIFA | 2023-05-17 |
n/a | | 18.75 | 22.5 | 8.0 | IDEFICS 80B | |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 18.75 | 5.50 | 3.50 | VSE++ (COCO, VGG) | 2022-04-07 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 17.50 | 7.00 | 3.75 | VSRN (COCO) | 2022-04-07 |
SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 17.00 | 14.00 | | PDM-CLIP (SelfEval) | 2023-11-17 |
n/a | | 16.8 | 20.8 | 5.0 | IDEFICS 9B | |
Simple Token-Level Confidence Improves Caption Correctness | | 16.50 | 15.75 | 6.75 | OFA tiny (TLC-A) | 2023-05-11 |
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | ✓ Link | 15.50 | 2.50 | 1.50 | VisualBERT base | 2022-04-07 |
An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ Link | 14.00 | 8.00 | 2.75 | MiniGPT-4-7B (BERTScore) | 2023-08-21 |
An Examination of the Compositionality of Large Generative Vision-Language Models | ✓ Link | 13.50 | 5.25 | 2.25 | LLaVA-7B (BERTScore) | 2023-08-21 |
Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ Link | 9.3 | 16.3 | 4.0 | InstructBLIP-ZS-CoT | 2023-11-27 |
Compositional Chain-of-Thought Prompting for Large Multimodal Models | ✓ Link | 7.0 | 11.5 | 3.3 | InstructBLIP | 2023-11-27 |
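
The score columns follow the Winoground evaluation protocol (see the "Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality" rows above): each example pairs two captions (C0, C1) with two images (I0, I1), and a model assigns a score to each caption–image combination. The text score requires picking the right caption for each image, the image score requires picking the right image for each caption, and the group score requires both. Below is a minimal sketch of the three metrics in Python; the function names and the generic scoring function `s(caption, image)` are illustrative assumptions, not any particular model's API:

```python
def text_score(s, c0, i0, c1, i1):
    # Text score: for each image, the matching caption must score higher
    # than the distractor caption.
    return s(c0, i0) > s(c1, i0) and s(c1, i1) > s(c0, i1)

def image_score(s, c0, i0, c1, i1):
    # Image score: for each caption, the matching image must score higher
    # than the distractor image.
    return s(c0, i0) > s(c0, i1) and s(c1, i1) > s(c1, i0)

def group_score(s, c0, i0, c1, i1):
    # Group score: the example counts only if both directions are correct.
    return text_score(s, c0, i0, c1, i1) and image_score(s, c0, i0, c1, i1)

def evaluate(s, examples):
    # Average each metric over (c0, i0, c1, i1) examples, as percentages.
    n = len(examples)
    return {
        "text": 100 * sum(text_score(s, *ex) for ex in examples) / n,
        "image": 100 * sum(image_score(s, *ex) for ex in examples) / n,
        "group": 100 * sum(group_score(s, *ex) for ex in examples) / n,
    }
```

Under these definitions, a scorer that guesses at random passes each directional test with probability 1/4, and both with probability 1/6 (the two correct pairs must outrank both mismatched pairs), which matches the "Random chance" row above (25.00 / 25.00 / 16.67).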