Paper | Code | Visual Matching | Visual Matching-Pick Coke Can | Visual Matching-Move Near | Visual Matching-Open/Close Drawer | Variant Aggregation | Variant Aggregation-Pick Coke Can | Variant Aggregation-Move Near | Variant Aggregation-Open/Close Drawer | ModelName | ReleaseDate |
---|---|---|---|---|---|---|---|---|---|---|---|
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation | ✓ Link | 0.749 | 0.923 | 0.917 | 0.403 | 0.676 | 0.907 | 0.740 | 0.297 | SoFar | 2025-02-18 |
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model | | 0.719 | 0.810 | 0.696 | 0.593 | 0.688 | 0.895 | 0.717 | 0.362 | SpatialVLA | 2025-01-27 |
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy | ✓ Link | 0.687 | 0.837 | 0.760 | 0.463 | 0.652 | 0.855 | 0.730 | 0.370 | Dita-300M | 2025-03-25 |
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | ✓ Link | 0.606 | 0.787 | 0.779 | 0.250 | 0.661 | 0.823 | 0.792 | 0.353 | RT-2-X | 2023-07-28 |
Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models | ✓ Link | 0.563 | 0.727 | 0.663 | 0.268 | 0.463 | 0.683 | 0.560 | 0.085 | RoboVLM | 2024-12-18 |
RT-1: Robotics Transformer for Real-World Control at Scale | ✓ Link | 0.534 | 0.567 | 0.317 | 0.597 | 0.397 | 0.490 | 0.323 | 0.294 | RT-1-X | 2022-12-13 |
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies | | 0.460 | 0.560 | 0.600 | 0.240 | 0.450 | 0.600 | 0.564 | 0.310 | TraceVLA | 2024-12-13 |
OpenVLA: An Open-Source Vision-Language-Action Model | ✓ Link | 0.277 | 0.163 | 0.462 | 0.356 | 0.411 | 0.545 | 0.477 | 0.177 | OpenVLA | 2024-06-13 |
Octo: An Open-Source Generalist Robot Policy | | 0.168 | 0.170 | 0.042 | 0.227 | 0.012 | 0.006 | 0.031 | 0.011 | Octo-Base | 2024-05-20 |