OpenCodePapers

visual-question-answering-vqa-on-core-mm

Visual Question Answering (VQA)
Results over time (chart: metric scores plotted against model release dates)
Leaderboard
| Paper | Code | Overall score | Deductive | Abductive | Analogical | Params | Model Name | Release Date |
|---|---|---|---|---|---|---|---|---|
| GPT-4 Technical Report | ✓ Link | 74.44 | 74.86 | 77.88 | 69.86 | | GPT-4V | 2023-03-15 |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | ✓ Link | 39.48 | 42.17 | 49.85 | 20.69 | 16B | SPHINX v2 | 2023-11-13 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | ✓ Link | 37.39 | 37.55 | 44.39 | 30.42 | 16B | Qwen-VL-Chat | 2023-08-24 |
| CogVLM: Visual Expert for Pretrained Language Models | ✓ Link | 37.16 | 36.75 | 47.88 | 28.75 | 17B | CogVLM-Chat | 2023-11-06 |
| Improved Baselines with Visual Instruction Tuning | ✓ Link | 32.62 | 30.94 | 47.91 | 24.31 | 13B | LLaVA-1.5 | 2023-10-05 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | ✓ Link | 30.46 | 28.7 | 46.12 | 22.08 | 7B | LLaMA-Adapter V2 | 2023-04-28 |
| Emu: Generative Pretraining in Multimodality | ✓ Link | 28.24 | 28.9 | 36.57 | 18.19 | 14B | Emu | 2023-07-11 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | ✓ Link | 28.02 | 27.56 | 37.76 | 20.56 | 8B | InstructBLIP | 2023-05-11 |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | ✓ Link | 26.84 | 26.77 | 35.97 | 18.61 | 9B | InternLM-XComposer-VL | 2023-09-26 |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | ✓ Link | 22.69 | 22.49 | 33.64 | 13.33 | 7B | Otter | 2023-05-05 |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | ✓ Link | 20.05 | 23.43 | 20.6 | 7.64 | 7B | mPLUG-Owl2 | 2023-11-07 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ✓ Link | 19.3 | 12.76 | 18.96 | 7.5 | 3B | BLIP-2-OPT2.7B | 2023-01-30 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ✓ Link | 10.43 | 11.02 | 13.28 | 5.69 | 8B | MiniGPT-v2 | 2023-04-20 |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | ✓ Link | 6.82 | 8.88 | 5.3 | 1.11 | 9B | OpenFlamingo-v2 | 2023-08-02 |
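The "Results over time" chart on the original page can be reproduced from the leaderboard table. Below is a minimal sketch, assuming the table has been saved locally as `core_mm_leaderboard.csv` with the column names used above (the filename and CSV workflow are assumptions, not part of the page); it plots each metric against model release date with pandas and matplotlib.

```python
# Sketch only: rebuild the "Results over time" chart from the leaderboard table.
# Assumes the table above was exported to core_mm_leaderboard.csv (hypothetical file).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("core_mm_leaderboard.csv", parse_dates=["Release Date"])
df = df.sort_values("Release Date")

fig, ax = plt.subplots(figsize=(8, 4))
for metric in ["Overall score", "Deductive", "Abductive", "Analogical"]:
    ax.plot(df["Release Date"], df[metric], marker="o", label=metric)

# Label each point with the model name, standing in for the hover tooltips
# of the interactive chart.
for _, row in df.iterrows():
    ax.annotate(row["Model Name"], (row["Release Date"], row["Overall score"]),
                fontsize=7, rotation=30)

ax.set_xlabel("Model release date")
ax.set_ylabel("CORE-MM score")
ax.legend()
fig.tight_layout()
fig.savefig("core_mm_results_over_time.png", dpi=150)
```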