Paper | Code | Answer Exact Match (Question Answering) | Model Name | Release Date |
---|---|---|---|---|
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion | ✓ | 54.6 | CREMA | 2024-02-08 |
Situational Awareness Matters in 3D Vision Language Reasoning | ✓ | 52.6 | Situation3D | 2024-06-11 |
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding | ✓ | 50.7 | Lexicon3D | 2024-09-05 |
Frozen Transformers in Language Models Are Effective Visual Encoder Layers | ✓ | 48.09 | LM4VisualEncoding | 2023-10-19 |
SQA3D: Situated Question Answering in 3D Scenes | ✓ | 47.20 | ScanQA (w/ auxiliary loss) | 2022-10-14 |
SQA3D: Situated Question Answering in 3D Scenes | ✓ | 46.58 | ScanQA | 2022-10-14 |
Deep Modular Co-Attention Networks for Visual Question Answering | ✓ | 43.42 | MCAN | 2019-06-25 |