| GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer | ✓ Link | 54.6% | | GeminiFusion (Swin-Large) | 2024-06-03 |
| Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer | | 54.0% | | DiffusionMMS | 2024-09-23 |
| HDBFormer: Efficient RGB-D Semantic Segmentation with A Heterogeneous Dual-Branch Framework | ✓ Link | 53.9% | | HDBFormer | 2025-04-18 |
| GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer | ✓ Link | 53.3% | | GeminiFusion (MiT-B5) | 2024-06-03 |
| DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation | ✓ Link | 53.3% | | DFormerv2-L | 2025-04-07 |
| Multimodal Token Fusion for Vision Transformers | ✓ Link | 53.0% | | TokenFusion (S) | 2022-04-19 |
| Efficient Multimodal Semantic Segmentation via Dual-Prompt Learning | ✓ Link | 52.8% | | DPLNet | 2023-12-01 |
| DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation | ✓ Link | 52.8% | | DFormerv2-B | 2025-04-07 |
| GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer | ✓ Link | 52.7% | | GeminiFusion (MiT-B3) | 2024-06-03 |
| DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation | ✓ Link | 52.5% | | DFormer-L | 2023-09-18 |
| CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers | ✓ Link | 52.4% | | CMX (B5) | 2022-03-09 |
| CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers | ✓ Link | 52.1% | | CMX (B4) | 2022-03-09 |
| DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation | ✓ Link | 51.5% | | DFormerv2-S | 2025-04-07 |
| Multimodal Token Fusion for Vision Transformers | ✓ Link | 51.4% | | TokenFusion (Ti) | 2022-04-19 |
| DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation | ✓ Link | 51.2% | | DFormer-B | 2023-09-18 |
| PanopticNDT: Efficient and Robust Panoptic Mapping | ✓ Link | 50.86% | | EMSANet (2x ResNet-34 NBt1D, PanopticNDT version, finetuned) | 2023-09-24 |
| Deep feature selection-and-fusion for RGB-D semantic segmentation | | 50.6% | | FSFNet | 2021-05-10 |
| Pattern-Structure Diffusion for Multi-Task Learning | | 50.6% | | PSD-ResNet50 | 2020-06-01 |
| DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation | ✓ Link | 50.0% | | DFormer-S | 2023-09-18 |
| CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers | ✓ Link | 49.7% | | CMX (B2) | 2022-03-09 |
| Attention-based Dual Supervised Decoder for RGBD Semantic Segmentation | | 49.6% | | DFormer-L | 2022-01-05 |
| DCANet: Differential Convolution Attention Network for RGB-D Semantic Segmentation | | 49.6% | | DCANet | 2022-10-13 |
| Pixel Difference Convolutional Network for RGB-D Semantic Segmentation | | 49.6% | | CMX (B4) | 2023-02-23 |
| Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation | ✓ Link | 49.4% | | TokenFusion (Ti) | 2020-07-17 |
| AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation | ✓ Link | 49.1% | | AsymFormer | 2023-09-25 |
| Efficient Multi-Task Scene Analysis with RGB-D Transformers | ✓ Link | 48.82% | | EMSANet (2x ResNet-34 NBt1D, PanopticNDT version, finetuned) | 2023-06-08 |
| DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation | ✓ Link | 48.8% | | DFormer-T | 2023-09-18 |
| ShapeConv: Shape-aware Convolutional Layer for Indoor RGB-D Semantic Segmentation | ✓ Link | 48.6% | | ShapeConv | 2021-08-24 |
| Spatial Information Guided Convolution for Real-Time RGBD Semantic Segmentation | ✓ Link | 48.6% | | TokenFusion (S) | 2020-04-09 |
| Efficient Multi-Task RGB-D Scene Analysis for Indoor Environments | ✓ Link | 48.47% | | DPLNet | 2022-07-10 |
| Attention-guided Chained Context Aggregation for Semantic Segmentation | ✓ Link | 48.3% | | DFormer-L | 2020-02-27 |
| Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis | ✓ Link | 48.17% | | CMX (B5) | 2020-11-13 |
| ACNet: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation | ✓ Link | 48.1% | | ACNet | 2019-05-24 |
| RedNet: Residual Encoder-Decoder Network for indoor RGB-D Semantic Segmentation | ✓ Link | 47.8% | | RedNet | 2018-06-04 |
| RDFNet: RGB-D Multi-Level Residual Feature Fusion for Indoor Semantic Segmentation | | 47.7% | | RDFNet | 2017-10-01 |
| Context Contrasted Feature and Gated Multi-Scale Aggregation for Scene Segmentation | ✓ Link | 47.1% | | EMSANet (2x ResNet-34 NBt1D, PanopticNDT version, finetuned) | 2018-06-01 |
| Multi-Modal Attention-based Fusion Model for Semantic Segmentation of RGB-Depth Images | | 47.0% | | FSFNet | 2019-12-25 |
| 3D Graph Neural Networks for RGBD Semantic Segmentation | ✓ Link | 45.9% | | PSD-ResNet50 | 2017-10-01 |
| Self-Supervised Model Adaptation for Multimodal Semantic Segmentation | ✓ Link | 45.73% | | TokenFusion (S) | 2018-08-11 |
| Recurrent Scene Parsing with Perspective Understanding in the Loop | ✓ Link | 45.1% | | DPLNet | 2017-05-20 |
| CI-Net: Contextual Information for Joint Semantic Segmentation and Depth Estimation | | 44.3% | | CI-Net | 2021-07-29 |
| Depth-aware CNN for RGB-D Segmentation | ✓ Link | 42.0% | | Depth-aware CNN | 2018-03-19 |
| Self-Supervised Model Adaptation for Multimodal Semantic Segmentation | ✓ Link | 38.4% | | DPLNet | 2018-08-11 |
| Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation | ✓ Link | | 48.17 | DFormer-L | 2023-04-21 |