Paper | Code | Accuracy | Model Name | Release Date |
---|---|---|---|---|
FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering | | 92.15 | LLaVA-OneVision7B w. FOCUS | 2025-06-25 |
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration | ✓ Link | 90.58 | LLaVA-OneVision7B w. ZoomEye | 2024-11-25 |
Instruction-Guided Visual Masking | ✓ Link | 81.2 | IVM-Enhanced GPT4-V | 2024-05-30 |
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | ✓ Link | 75.39 | SEAL | 2023-12-21 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 74.46 | LLaVA-OneVision7B | 2024-08-06 |