OpenCodePapers
Natural Language Visual Grounding
Results over time
Leaderboard
| Paper | Code | Accuracy (%) | Model | Release Date |
|---|---|---|---|---|
| Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | ✓ | 86.34 | UGround-V1-7B | 2024-10-07 |
| Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction | ✓ | 83.0 | Aguvis-7B | 2024-12-05 |
| OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | ✓ | 82.47 | OS-Atlas-Base-7B | 2024-10-30 |
| Aria-UI: Visual Grounding for GUI Instructions | ✓ | 81.1 | Aria-UI | 2024-12-20 |
| Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction | ✓ | 81.0 | Aguvis-G-7B | 2024-12-05 |
| Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | ✓ | 77.67 | UGround-V1-2B | 2024-10-07 |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent | ✓ | 75.1 | ShowUI | 2024-11-26 |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent | ✓ | 75.0 | ShowUI-G | 2024-11-26 |
| Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | ✓ | 73.3 | UGround | 2024-10-07 |
| OmniParser for Pure Vision Based GUI Agent | ✓ | 73.0 | OmniParser | 2024-08-01 |
| OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | ✓ | 68.0 | OS-Atlas-Base-4B | 2024-10-30 |
| SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents | ✓ | 53.4 | SeeClick | 2024-01-17 |
| CogAgent: A Visual Language Model for GUI Agents | ✓ | 47.4 | CogAgent | 2023-12-14 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ | 42.1 | Qwen2-VL-7B | 2024-09-18 |
| GUICourse: From General Vision Language Models to Versatile GUI Agents | ✓ | 28.6 | Qwen-GUI | 2024-06-17 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | ✓ | 5.7 | MiniGPT-v2 | 2023-10-14 |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | ✓ | 5.2 | Groma | 2024-04-19 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ | 5.2 | Qwen-VL | 2023-08-24 |