OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | | 63.6% | | OmniVec2 | 2024-01-01 |
Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer | | 61.5% | | DiffusionMMS (DAT++-S) | 2024-09-23 |
DepthMatch: Semi-Supervised RGB-D Scene Parsing through Depth-Guided Regularization | | 61.4% | | DepthMatch (DINOv2-S) | 2025-05-26 |
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer | ✓ Link | 60.9% | | GeminiFusion (Swin-Large) | 2024-06-03 |
OmniVec: Learning robust representations with cross modal sharing | | 60.8% | | OmniVec | 2023-11-07 |
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer | ✓ Link | 60.2% | | GeminiFusion (Swin-Large) | 2024-06-03 |
Efficient Multimodal Semantic Segmentation via Dual-Prompt Learning | ✓ Link | 59.3% | | DPLNet | 2023-12-01 |
HDBFormer: Efficient RGB-D Semantic Segmentation with A Heterogeneous Dual-Branch Framework | ✓ Link | 59.3% | | HDBFormer | 2025-04-18 |
PanopticNDT: Efficient and Robust Panoptic Mapping | ✓ Link | 59.02% | | EMSANet (2x ResNet-34 NBt1D, PanopticNDT version, finetuned) | 2023-09-24 |
DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation | ✓ Link | 58.4% | | DFormerv2-L | 2025-04-07 |
SwinMTL: A Shared Architecture for Simultaneous Depth Estimation and Semantic Segmentation from Monocular Camera Images | ✓ Link | 58.14% | | SwinMTL | 2024-03-15 |
PolyMaX: General Dense Prediction with Mask Transformer | ✓ Link | 58.08% | | PolyMaX(ConvNeXt-L) | 2023-11-09 |
HSPFormer: Hierarchical Spatial Perception Transformer for Semantic Segmentation | ✓ Link | 57.8% | | HSPFormer(PVT v2-B4) | 2025-01-16 |
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer | ✓ Link | 57.7% | | GeminiFusion (MiT-B5) | 2024-06-03 |
DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation | ✓ Link | 57.7% | | DFormerv2-B | 2025-04-07 |
DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation | ✓ Link | 57.2% | | DFormer-L | 2023-09-18 |
CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers | ✓ Link | 56.9% | | CMX (B5) | 2022-03-09 |
Delivering Arbitrary-Modal Semantic Segmentation | ✓ Link | 56.9% | | CMNeXt (B4) | 2023-03-02 |
Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 56.8% | | OMNIVORE (Swin-L, finetuned) | 2022-01-20 |
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer | ✓ Link | 56.8% | | GeminiFusion (MiT-B3) | 2024-06-03 |
CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers | ✓ Link | 56.3% | | CMX (B4) | 2022-03-09 |
MultiMAE: Multi-modal Multi-task Masked Autoencoders | ✓ Link | 56.0% | | MultiMAE (ViT-B) | 2022-04-04 |
DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation | ✓ Link | 56.0% | | DFormerv2-S | 2025-04-07 |
Understanding Dark Scenes by Contrasting Multi-Modal Observations | ✓ Link | 55.8% | | SMMCL (SegNeXt-B) | 2023-08-23 |
DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation | ✓ Link | 55.6% | | DFormer-B | 2023-09-18 |
ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer | ✓ Link | 55.5% | | ComPtr (Swin-B) | 2023-07-23 |
AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation | ✓ Link | 55.3% | | AsymFormer | 2023-09-25 |
Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 55.1% | | OMNIVORE (Swin-B, finetuned) | 2022-01-20 |
HAPNet: Toward Superior RGB-Thermal Scene Parsing via Hybrid, Asymmetric, and Progressive Heterogeneous Feature Fusion | ✓ Link | 55.0 | 68.8 | HAPNet | 2024-04-04 |
CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers | ✓ Link | 54.4% | | CMX (B2) | 2022-03-09 |
Multimodal Token Fusion for Vision Transformers | ✓ Link | 54.2% | | TokenFusion (S) | 2022-04-19 |
Understanding Dark Scenes by Contrasting Multi-Modal Observations | ✓ Link | 53.7% | | SMMCL (SegFormer-B2) | 2023-08-23 |
DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation | ✓ Link | 53.6% | | DFormer-S | 2023-09-18 |
InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding | ✓ Link | 53.56% | | InvPT | 2022-03-15 |
HS3: Learning with Proper Task Complexity in Hierarchically Supervised Semantic Segmentation | | 53.5% | | HS3-Fuse (ResNet-101) | 2021-11-03 |
Pixel Difference Convolutional Network for RGB-D Semantic Segmentation | | 53.5% | | PDCNet (ResNet-101) | 2023-02-23 |
Efficient Multi-Task RGB-D Scene Analysis for Indoor Environments | ✓ Link | 53.34% | | EMSANet (2x ResNet-34 NBt1D, finetuned) | 2022-07-10 |
DCANet: Differential Convolution Attention Network for RGB-D Semantic Segmentation | | 53.3% | | DCANet (ResNet-101) | 2022-10-13 |
Multimodal Token Fusion for Vision Transformers | ✓ Link | 53.3% | | TokenFusion (Ti) | 2022-04-19 |
InverseForm: A Loss Function for Structured Boundary-Aware Segmentation | ✓ Link | 53.1% | | InverseForm (ResNet-101) | 2021-04-06 |
Context-Aware Interaction Network for RGB-T Semantic Segmentation | ✓ Link | 52.6% | | CAINet (MobileNet-V2) | 2024-01-03 |
Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction | ✓ Link | 52.5% | | CEN-PSPNet (ResNet-152) | 2021-12-04 |
Attention-based Dual Supervised Decoder for RGBD Semantic Segmentation | | 52.5% | | AMF (ResNet-50) | 2022-01-05 |
Understanding Dark Scenes by Contrasting Multi-Modal Observations | ✓ Link | 52.5% | | SMMCL (ResNet-101) | 2023-08-23 |
Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation | ✓ Link | 52.4% | | SA-Gate | 2020-07-17 |
Warp-Refine Propagation: Semi-Supervised Auto-labeling via Cycle-consistency | | 52.2% | | Warp-Refine | 2021-09-28 |
Deep feature selection-and-fusion for RGB-D semantic segmentation | | 52.0% | | FSFNet | 2021-05-10 |
Variational Context-Deformable ConvNets for Indoor Scene Parsing | | 51.9% | | VCD+ACNet (ResNet-50) | 2020-06-01 |
Optimizing RGB-D semantic segmentation through multi-modal interaction and pooling attention | | 51.9% | | MIPANet (ResNet50) | 2023-11-19 |
DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation | ✓ Link | 51.8% | | DFormer-T | 2023-09-18 |
ShapeConv: Shape-aware Convolutional Layer for Indoor RGB-D Semantic Segmentation | ✓ Link | 51.3% | | ShapeConv (ResNext-101) | 2021-08-24 |
Efficient Multi-Task Scene Analysis with RGB-D Transformers | ✓ Link | 51.26% | | EMSAFormer (SwinV2-T-128-Multi-Aug) | 2023-06-08 |
Depth-Adapted CNNs for RGB-D Semantic Segmentation | | 51.24% | | Z-ACN (ResNet-101) | 2022-06-08 |
Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion | ✓ Link | 51.2% | | AsymFusion (ResNet-152) | 2021-08-11 |
Dynamic Multimodal Fusion | ✓ Link | 51.0% | | DynMM (ResNet-50) | 2022-03-31 |
Spatial Information Guided Convolution for Real-Time RGBD Semantic Segmentation | ✓ Link | 51.0% | | SGNet (ResNet-101) | 2020-04-09 |
Pattern-Structure Diffusion for Multi-Task Learning | | 51.0% | | PSD-ResNet50 | 2020-06-01 |
Malleable 2.5D Convolution: Learning Receptive Fields along the Depth-axis for RGB-D Scene Parsing | ✓ Link | 50.9% | | Malleable 2.5D (ResNet-101) | 2020-07-18 |
Scene Parsing via Integrated Classification Model and Variance-Based Regularization | ✓ Link | 50.70% | | ICM | 2019-06-01 |
Multi-layer Feature Aggregation for Deep Scene Parsing Models | | 50.7% | | SANet | 2020-11-04 |
Variational Context-Deformable ConvNets for Indoor Scene Parsing | | 50.7% | | VCD+RedNet (ResNet-50) | 2020-06-01 |
HaarNet: Large-scale Linear-Morphological Hybrid Network for RGB-D Semantic Segmentation | | 50.7% | | HaarNet | 2023-10-11 |
Pattern-Affinitive Propagation across Depth, Surface Normal and Semantic Segmentation | | 50.4% | | PAP (ResNet-50) | 2019-06-08 |
Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing | ✓ Link | 50.4% | | Cerberus | 2021-11-24 |
Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis | ✓ Link | 50.30% | | ESANet (R34-NBt1D) | 2020-11-13 |
Depth-Adapted CNNs for RGB-D Semantic Segmentation | | 50.05% | | Z-ACN (ResNet-50) | 2022-06-08 |
Malleable 2.5D Convolution: Learning Receptive Fields along the Depth-axis for RGB-D Scene Parsing | ✓ Link | 49.7% | | Malleable 2.5D (ResNet-50) | 2020-07-18 |
MMANet: Margin-aware Distillation and Modality-aware Regularization for Incomplete Multimodal Learning | ✓ Link | 49.62% | | MMANet | 2023-04-17 |
Spatial-information Guided Adaptive Context-aware Network for Efficient RGB-D Semantic Segmentation | ✓ Link | 49.4% | | SGACNet (R34-NBt1D) | 2023-08-11 |
ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer | ✓ Link | 49.2% | | ComPtr (Swin-T) | 2023-07-23 |
Depth-Adapted CNNs for RGB-D Semantic Segmentation | | 49.15% | | Z-ACN (ResNet-34) | 2022-06-08 |
Improving Multi-Modal Learning with Uni-Modal Teachers | | 49.14% | | UMT | 2021-06-21 |
MTI-Net: Multi-Scale Task Interaction Networks for Multi-Task Learning | ✓ Link | 49.0% | | MTI-Net (HRNet-48) | 2020-01-19 |
ShapeConv: Shape-aware Convolutional Layer for Indoor RGB-D Semantic Segmentation | ✓ Link | 49.0% | | ShapeConv (ResNet-101) | 2021-08-24 |
Multimodal Knowledge Expansion | ✓ Link | 48.88% | | MKE | 2021-03-26 |
ShapeConv: Shape-aware Convolutional Layer for Indoor RGB-D Semantic Segmentation | ✓ Link | 48.8% | | ShapeConv (ResNet-50) | 2021-08-24 |
mmFormer: Multimodal Medical Transformer for Incomplete Multimodal Learning of Brain Tumor Segmentation | ✓ Link | 48.45% | | mmFormer | 2022-06-06 |
ACNet: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation | ✓ Link | 48.3% | | ACNet | 2019-05-24 |
Spatial-information Guided Adaptive Context-aware Network for Efficient RGB-D Semantic Segmentation | ✓ Link | 48.2% | | SGACNet (R18-NBt1D) | 2023-08-11 |
Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis | ✓ Link | 48.17% | | ESANet (R18-NBt1D) | 2020-11-13 |
RFNet: Region-Aware Fusion Network for Incomplete Multi-Modal Brain Tumor Segmentation | ✓ Link | 48.13% | | RFNet | 2021-01-01 |
Contrastive Multimodal Fusion with TupleInfoNCE | ✓ Link | 48.1% | | TupleInfoNCE | 2021-07-06 |
Dense Decoder Shortcut Connections for Single-Pass Semantic Segmentation | | 48.1% | | DDSC (ResNet-101) | 2018-06-01 |
Cascaded Feature Network for Semantic Segmentation of RGB-D Images | | 47.7% | | CFN | 2017-10-01 |
Learning Fully Dense Neural Networks for Image Semantic Segmentation | | 47.4% | | FDNet (DenseNet264) | 2019-05-22 |
RedNet: Residual Encoder-Decoder Network for indoor RGB-D Semantic Segmentation | ✓ Link | 47.2% | | RedNet | 2018-06-04 |
Depth-Adapted CNNs for RGB-D Semantic Segmentation | | 47.02% | | Z-ACN (ResNet-18) | 2022-06-08 |
Joint Task-Recursive Learning for Semantic Segmentation and Depth Estimation | | 46.8% | | TRL (ResNet-101) | 2018-09-01 |
RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation | ✓ Link | 46.5% | | RefineNet (ResNet-101) | 2016-11-20 |
Prompt Guided Transformer for Multi-Task Dense Prediction | ✓ Link | 46.43% | | PGT (Swin-S) | 2023-07-28 |
Exploring Relational Context for Multi-Task Dense Prediction | ✓ Link | 46.33% | | ATRC | 2021-04-28 |
Locality-Sensitive Deconvolution Networks With Gated Fusion for RGB-D Indoor Semantic Segmentation | | 45.9% | | LS-DeconvNet | 2017-07-01 |
Variational Context-Deformable ConvNets for Indoor Scene Parsing | | 45.3% | | VCD+DeepLab (VGG16) | 2020-06-01 |
SOSD-Net: Joint Semantic Object Segmentation and Depth Estimation from Monocular images | | 45.0% | | SOSD-Net | 2021-01-19 |
Multi-Modal Attention-based Fusion Model for Semantic Segmentation of RGB-Depth Images | | 44.8% | | MMAF-Net-152 | 2019-12-25 |
Recurrent Scene Parsing with Perspective Understanding in the Loop | ✓ Link | 44.5% | | RecurrentSceneParsing | 2017-05-20 |
Light-Weight RefineNet for Real-Time Semantic Segmentation | ✓ Link | 44.4% | | Light-Weight-RefineNet-152 | 2018-10-08 |
Depth-aware CNN for RGB-D Segmentation | ✓ Link | 43.9% | | Depth-aware CNN | 2018-03-19 |
Light-Weight RefineNet for Real-Time Semantic Segmentation | ✓ Link | 43.6% | | Light-Weight-RefineNet-101 | 2018-10-08 |
Temporally Distributed Networks for Fast Video Semantic Segmentation | ✓ Link | 43.5% | | TD2-PSP50 | 2020-04-03 |
NDDR-CNN: Layerwise Feature Fusing in Multi-Task CNNs by Neural Discriminative Dimensionality Reduction | ✓ Link | 43.3% | | NDDR-CNN | 2018-01-25 |
3D Graph Neural Networks for RGBD Semantic Segmentation | ✓ Link | 43.1% | | 3DGNN | 2017-10-01 |
CI-Net: Contextual Information for Joint Semantic Segmentation and Depth Estimation | | 42.6% | | CI-Net | 2021-07-29 |
Real-Time Joint Semantic Segmentation and Depth Estimation Using Asymmetric Annotations | ✓ Link | 42.0% | | Multi-Task Light-Weight-RefineNet | 2018-09-13 |
Light-Weight RefineNet for Real-Time Semantic Segmentation | ✓ Link | 41.7% | | Light-Weight-RefineNet-50 | 2018-10-08 |
Prompt Guided Transformer for Multi-Task Dense Prediction | ✓ Link | 41.61% | | PGT (Swin-T) | 2023-07-28 |
Multi-Task Meta Learning: learn how to adapt to unseen tasks | ✓ Link | 41.51% | | MTML | 2022-10-13 |
Semantic Segmentation with Reverse Attention | | 41.2% | | RAN | 2017-07-20 |
DenseMTL: Cross-task Attention Mechanism for Dense Multi-task Learning | ✓ Link | 40.84% | | DenseMTL | 2022-06-17 |
STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling | ✓ Link | 40.1% | | STD2P | 2016-04-08 |
Masked Supervised Learning for Semantic Segmentation | ✓ Link | 39.31% | | MaskSup | 2022-10-03 |
HeMIS: Hetero-Modal Image Segmentation | ✓ Link | 37.77% | | HeMIS | 2016-07-18 |
Temporally Distributed Networks for Fast Video Semantic Segmentation | ✓ Link | 37.4% | | TD4-PSP18 | 2020-04-03 |
What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? | ✓ Link | 37.3% | | Bayesian DenseNet | 2017-03-15 |
RGB-based Semantic Segmentation Using Self-Supervised Depth Pre-Training | | 33.49% | | HN-network | 2020-02-06 |
Composite Learning for Robust and Effective Dense Predictions | | 33.48% | | CompL | 2022-10-13 |
Efficient Yet Deep Convolutional Neural Networks for Semantic Segmentation | ✓ Link | 32.3% | | Dilated FCN-2s RGB | 2017-07-26 |
AdaShare: Learning What To Share For Efficient Deep Multi-Task Learning | ✓ Link | 29.6% | | AdaShare | 2019-11-27 |
Toward Edge-Efficient Dense Predictions with Synergistic Multi-Task Neural Architecture Search | | 22.1% | | EDNAS+JAReD | 2022-10-04 |
Cross-stitch Networks for Multi-task Learning | ✓ Link | 19.3% | | Cross-stitch | 2016-04-12 |
Fully Convolutional Networks for Semantic Segmentation | ✓ Link | | 44 | FCN-32s RGB-HHA | 2016-05-20 |