Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA | ✓ Link | 31.29 | 34.49 | 24.06 | 43.26 | 16.51 | 83.75 | BridgeQA | 2024-02-24 |
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness | - | 30.6 | - | 16.4 | 49.6 | 20.8 | 103.1 | LLaVA-3D | 2024-09-26 |
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding | ✓ Link | 30.1 | - | - | - | - | 102.1 | Video-3D LLM | 2024-11-30 |
Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning | - | 27.2 | - | 12.0 | 40.0 | 16.6 | 80.0 | Scene-LLM | 2024-03-18 |
Towards Learning a Generalist Model for Embodied Navigation | ✓ Link | 26.27 | 39.73 | 13.90 | 40.23 | 16.56 | 80.77 | NaviLLM | 2023-12-04 |
An Embodied Generalist Agent in 3D World | ✓ Link | 24.5 | - | 13.2 | 49.2 | 20.0 | 101.4 | LEO | 2023-11-18 |
ScanQA: 3D Question Answering for Spatial Scene Understanding | ✓ Link | 23.45 | 31.56 | 12.04 | 34.34 | 13.55 | 67.29 | ScanQA | 2021-12-20 |
3D-LLM: Injecting the 3D World into Large Language Models | ✓ Link | 23.2 | 32.6 | 8.4 | 34.8 | 13.5 | 65.6 | 3D-LLM (flamingo) | 2023-07-24 |
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | ✓ Link | 22.4 | - | 10.4 | 35.7 | 13.9 | 69.6 | 3D-VisTA | 2023-08-08 |
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers | ✓ Link | 21.6 | - | 14.3 | 41.6 | 18.0 | 87.7 | Chat-Scene | 2023-12-13 |
ScanQA: 3D Question Answering for Spatial Scene Understanding | ✓ Link | 20.56 | 27.85 | 7.46 | 30.68 | 11.97 | 57.56 | ScanRefer+MCAN | 2021-12-20 |
ScanQA: 3D Question Answering for Spatial Scene Understanding | ✓ Link | 19.71 | 29.46 | 6.08 | 30.97 | 12.07 | 58.23 | VoteNet+MCAN | 2021-12-20 |
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ✓ Link | 19.2 | - | 9.6 | 28.2 | 9.5 | 49.2 | VideoChat2 | 2023-11-28 |
3D-LLM: Injecting the 3D World into Large Language Models | ✓ Link | 19.1 | 38.3 | 11.6 | 35.3 | 14.9 | 69.6 | 3D-LLM (BLIP2-flant5) | 2023-07-24 |
3D-LLM: Injecting the 3D World into Large Language Models | ✓ Link | 19.1 | 37.3 | 10.7 | 34.5 | 14.3 | 67.1 | 3D-LLM (BLIP2-opt) | 2023-07-24 |
LLaVA-OneVision: Easy Visual Task Transfer | ✓ Link | 18.7 | - | 9.8 | 27.8 | 9.1 | 46.2 | LLaVA-NeXT-Video | 2024-08-06 |
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers | ✓ Link | - | - | 14.0 | - | - | 87.6 | Chat-3D v2 | 2023-12-13 |
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | ✓ Link | - | - | 13.5 | 37.3 | 15.9 | 76.8 | LL3DA | 2023-11-30 |