OpenCodePapers
Natural Language Visual Grounding
Results over time
Leaderboard
| Paper | Code | Accuracy (%) | Model | Release Date |
|---|---|---|---|---|
| Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | ✓ | 86.34 | UGround-V1-7B | 2024-10-07 |
| Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction | ✓ | 83.0 | Aguvis-7B | 2024-12-05 |
| OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | ✓ | 82.47 | OS-Atlas-Base-7B | 2024-10-30 |
| Aria-UI: Visual Grounding for GUI Instructions | ✓ | 81.1 | Aria-UI | 2024-12-20 |
| Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction | ✓ | 81.0 | Aguvis-G-7B | 2024-12-05 |
| Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | ✓ | 77.67 | UGround-V1-2B | 2024-10-07 |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent | ✓ | 75.1 | ShowUI | 2024-11-26 |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent | ✓ | 75.0 | ShowUI-G | 2024-11-26 |
| Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | ✓ | 73.3 | UGround | 2024-10-07 |
| OmniParser for Pure Vision Based GUI Agent | ✓ | 73.0 | OmniParser | 2024-08-01 |
| OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | ✓ | 68.0 | OS-Atlas-Base-4B | 2024-10-30 |
| SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents | ✓ | 53.4 | SeeClick | 2024-01-17 |
| CogAgent: A Visual Language Model for GUI Agents | ✓ | 47.4 | CogAgent | 2023-12-14 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ | 42.1 | Qwen2-VL-7B | 2024-09-18 |
| GUICourse: From General Vision Language Models to Versatile GUI Agents | ✓ | 28.6 | Qwen-GUI | 2024-06-17 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | ✓ | 5.7 | MiniGPT-v2 | 2023-10-14 |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | ✓ | 5.2 | Groma | 2024-04-19 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ | 5.2 | Qwen-VL | 2023-08-24 |