DETRs with Collaborative Hybrid Assignments Training | ✓ Link | 57.1 | 80.2 | 63.4 | 41.6 | 60.1 | 72.0 | Co-DETR | 2022-11-22 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 56.1 | 80.3 | 62.1 | 39.7 | 59.3 | 70.9 | CBNetV2 (EVA02, single-scale) | 2021-07-01 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 55.5 | 80.0 | | 36.3 | 58.0 | 72.4 | EVA | 2022-11-14 |
Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation | ✓ Link | 55.4 | | | | | | FD-SwinV2-G | 2022-05-27 |
Mask Frozen-DETR: High Quality Instance Segmentation with One GPU | | 55.3 | 79.3 | 61.4 | 37.8 | 58.4 | 70.4 | Mask Frozen-DETR | 2023-08-07 |
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | ✓ Link | 54.8 | | | | | | BEiT-3 | 2022-08-22 |
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ Link | 54.7 | | | | | | Mask DINO (SwinL, multi-scale) | 2022-06-06 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 54.5 | | | | | | ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) | 2022-05-17 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 54.5 | | | | | | GLEE-Pro | 2023-12-14 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 54.4 | | | | | | SwinV2-G (HTC++) | 2021-11-18 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 53.3 | | | | | | GLEE-Plus | 2023-12-14 |
End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ Link | 53.0 | | | | | | Soft Teacher + Swin-L (HTC++, multi-scale) | 2021-06-16 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 53.0 | | | | | | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | 2022-05-17 |
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ Link | 52.8 | | | | | | Mask DINO (SwinL, single-scale) | 2022-06-06 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 52.5 | | | | | | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | 2022-05-17 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 52.3 | | | | | | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 2021-07-01 |
Universal Instance Perception as Object Discovery and Retrieval | ✓ Link | 51.8 | 76.2 | 56.7 | 33.3 | 55.9 | 67.5 | UNINEXT-H | 2023-03-12 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 51.6 | | | | | | CBNetV2 (Dual-Swin-L HTC, single-scale) | 2021-07-01 |
Focal Self-attention for Local-Global Interactions in Vision Transformers | ✓ Link | 51.3 | 75.4 | 56.5 | 35.6 | | 64.2 | Focal-L (HTC++, multi-scale) | 2021-07-01 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 51.1 | | | | | | Swin-L (HTC++, multi scale) | 2021-03-25 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 50.5 | 74.9 | 54.9 | 29.1 | 53.8 | 71.2 | Mask2Former (Swin-L, single scale) | 2021-12-02 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 50.2 | | | | | | Swin-L (HTC++, single scale) | 2021-03-25 |
ISTR: End-to-End Instance Segmentation with Transformers | ✓ Link | 49.7 | | | | | | ISTR-SMT (Swin-L, single scale) | 2021-05-03 |
Instances as Queries | ✓ Link | 49.1 | 74.2 | 53.8 | 31.5 | 51.8 | 63.2 | QueryInst (single scale) | 2021-05-05 |
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 49.1 | | | | | | Cascade Eff-B7 NAS-FPN (1280, self-training Copy Paste, single-scale) | 2020-12-13 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 48.8 | | | | | | dBOT ViT-L (CLIP) | 2022-09-08 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 48.8 | | | | | | MogaNet-XL (Cascade Mask R-CNN) | 2022-11-07 |
DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution | ✓ Link | 48.5 | 72.0 | 53.3 | 31.6 | 50.9 | 61.5 | DetectoRS (ResNeXt-101-64x4d, multi-scale) | 2020-06-03 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 48.3 | | | | | | dBOT ViT-L | 2022-09-08 |
DiffusionInst: Diffusion Model for Instance Segmentation | ✓ Link | 48.3 | | | | | | DiffusionInst-SwinL | 2022-12-06 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 48.3 | | | | | | GLEE-Lite | 2023-12-14 |
DiffusionInst: Diffusion Model for Instance Segmentation | ✓ Link | 47.6 | | | | | | DiffusionInst-SwinB | 2022-12-06 |
DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution | ✓ Link | 47.1 | 71.1 | 51.6 | 30.3 | 49.5 | 59.6 | DetectoRS (ResNeXt-101-32x4d, multi-scale) | 2020-06-03 |
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 46.9 | | | | | | Cascade Eff-B7 NAS-FPN (1280) | 2020-12-13 |
SOLQ: Segmenting Objects by Learning Queries | ✓ Link | 46.7 | | | | | | SOLQ (Swin-L, single scale) | 2021-06-04 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 46.3 | | | | | | dBOT ViT-B | 2022-09-08 |
Exploring Target Representations for Masked Autoencoders | ✓ Link | 46.2 | | | | | | dBOT ViT-B (CLIP) | 2022-09-08 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 46.1 | | | | | | Mask R-CNN (SpineNet-190, 1536x1536) | 2019-12-10 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 46.1 | | | | | | MogaNet-L (Cascade Mask R-CNN) | 2022-11-07 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 46 | | | | | | MogaNet-B (Cascade Mask R-CNN) | 2022-11-07 |
A Tri-Layer Plugin to Improve Occluded Detection | ✓ Link | 45.9 | | | | | | Swin-B + Cascade Mask R-CNN (tri-layer modelling) | 2022-10-18 |
Global Context Networks | ✓ Link | 45.4 | 68.9 | 49.6 | | | | GCNet (ResNeXt-101 + DCN + cascade + GC r4) | 2020-12-24 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 45.1 | | | | | | MogaNet-S (Cascade Mask R-CNN) | 2022-11-07 |
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window | | 45.03 | | | | | | gSwin-S | 2022-08-24 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 44.2 | | | | | | iBOT (ViT-B/16) | 2021-11-15 |
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window | | 44.16 | | | | | | gSwin-T | 2022-08-24 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 44.1 | | | | | | MogaNet-L (Mask R-CNN 1x) | 2022-11-07 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 43.5 | | | | | | A2MIM (ViT-B) | 2022-05-27 |
CBNet: A Novel Composite Backbone Network Architecture for Object Detection | ✓ Link | 43.3 | | | | | | Cascade Mask R-CNN (ResNeXt152, CBNet) | 2019-09-09 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 43.2 | | | | | | MogaNet-B (Mask R-CNN 1x) | 2022-11-07 |
ResNeSt: Split-Attention Networks | ✓ Link | 43 | | | | | | ResNeSt101 | 2020-04-19 |
gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window | | 42.87 | | | | | | gSwin-VT | 2022-08-24 |
iBOT: Image BERT Pre-Training with Online Tokenizer | ✓ Link | 42.6 | | | | | | iBOT (ViT-S/16) | 2021-11-15 |
Mask Transfiner for High-Quality Instance Segmentation | ✓ Link | 42.2 | | | | | | Mask Transfiner (ResNet101-FPN) | 2021-11-26 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 42.2 | | | | | | MogaNet-S (Mask R-CNN 1x) | 2022-11-07 |
Path Aggregation Network for Instance Segmentation | ✓ Link | 42.0 | | | | | | PANet | 2018-03-05 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 41.8 | | | 24.4 | 44.4 | 54.3 | CenterMask + VoVNet99 | 2019-11-15 |
SOLOv2: Dynamic and Fast Instance Segmentation | ✓ Link | 41.7 | 63.2 | 45.1 | 18.0 | 45.0 | 61.6 | SOLOv2 (Res-DCN-101-FPN) | 2020-03-23 |
Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers | ✓ Link | 41.7 | | | | | | BCNet (ResNeXt-101 + FPN + FCOS) | 2021-03-23 |
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ✓ Link | 41.5 | | | | | | GCNet (ResNeXt-101 + DCN + cascade + GC r16) | 2019-04-25 |
DiffusionInst: Diffusion Model for Instance Segmentation | ✓ Link | 41.5 | | | | | | DiffusionInst-ResNet101 | 2022-12-06 |
BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation | ✓ Link | 41.3 | 63.1 | 44.6 | 22.7 | 44.1 | 54.5 | BlendMask (ResNet-101 + DCN interval=3) | 2020-01-02 |
Hybrid Task Cascade for Instance Segmentation | ✓ Link | 41.2 | | | | | | HTC + ResNeXt-101-FPN + DCN | 2019-01-22 |
Hybrid Task Cascade for Instance Segmentation | ✓ Link | 41.2 | | | | | | HTC + ResNeXt-101-FPN | 2019-01-22 |
SOLQ: Segmenting Objects by Learning Queries | ✓ Link | 40.9 | | | | | | SOLQ (ResNet101, single scale) | 2021-06-04 |
An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection | ✓ Link | 40.8 | | | | | | VoVNetV1-57 | 2019-04-22 |
K-Net: Towards Unified Image Segmentation | ✓ Link | 40.6 | 63.3 | | 18.8 | 43.3 | 59 | K-Net-N256 (ResNet-101) | 2021-06-28 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 40.6 | 62.3 | 44.1 | 20.1 | 42.8 | 57.0 | CenterMask + VoVNetV2-99 (single-scale) | 2019-11-15 |
SOLO: Segmenting Objects by Locations | ✓ Link | 40.4 | 62.7 | 43.3 | 17.6 | 43.3 | 58.9 | SOLO (Res-DCN-101-FPN) | 2019-12-10 |
SOLO: Segmenting Objects by Locations | ✓ Link | 40.4 | 62.7 | 43.3 | 17.6 | 43.3 | 58.9 | SOLO (ResNet-DCN-101-FPN) | 2019-12-10 |
D2Det: Towards High Quality Object Detection and Instance Segmentation | ✓ Link | 40.2 | 61.5 | 43.7 | 21.7 | 43.0 | 54.0 | D2Det (ResNet-101, single-scale test) | 2020-06-01 |
K-Net: Towards Unified Image Segmentation | ✓ Link | 40.1 | 62.8 | | 18.7 | 42.7 | 58.8 | K-Net (ResNet-101) | 2021-06-28 |
ISTR: End-to-End Instance Segmentation with Transformers | ✓ Link | 39.9 | | | 22.8 | 41.9 | 52.3 | ISTR (ResNet101-FPN-3x, single-scale) | 2021-05-03 |
Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers | ✓ Link | 39.8 | 61.5 | 43.1 | 22.7 | 42.4 | 51.1 | BCNet (ResNet-101-FPN + Faster RCNN) | 2021-03-23 |
An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection | ✓ Link | 39.7 | | | | | | VoVNetV1-39 | 2019-04-22 |
SOLQ: Segmenting Objects by Learning Queries | ✓ Link | 39.7 | | | | | | SOLQ (ResNet50, single scale) | 2021-06-04 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 39.6 | 61.2 | 42.9 | 19.7 | | | CenterMask + X101-32x8d (single-scale) | 2019-11-15 |
Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers | ✓ Link | 39.6 | 61.2 | 42.7 | 22.3 | 42.3 | 51.0 | BCNet (ResNet-101-FPN + FCOS) | 2021-03-23 |
Mask Scoring R-CNN | ✓ Link | 39.6 | | | | | | MS R-CNN + ResNet-101 DCN + FPN | 2019-03-01 |
InstaBoost: Boosting Instance Segmentation via Probability Map Guided Copy-Pasting | ✓ Link | 39.5 | 61.4 | 42.9 | 21.2 | 42.5 | 52.1 | Cascade R-CNN (ResNet-101-FPN, map-guided) | 2019-08-21 |
Commonality-Parsing Network across Shape and Appearance for Partially Supervised Instance Segmentation | ✓ Link | 39.2 | 60.8 | 42.2 | 22.2 | 41.8 | 50.1 | CPMask | 2020-07-24 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 39.1 | | | | | | MogaNet-T (Mask R-CNN 1x) | 2022-11-07 |
PolarMask++: Enhanced Polar Representation for Single-Shot Instance Segmentation and Beyond | ✓ Link | 38.7 | 64.1 | 40.0 | 22.2 | 40.2 | 52.0 | PolarMask++ (ResNeXt-101-DCN) | 2021-05-05 |
ISDA: Position-Aware Instance Segmentation with Deformable Attention | ✓ Link | 38.7 | 62 | 41.1 | 17 | 41.2 | | ISDA (ours) | 2022-02-23 |
ISTR: End-to-End Instance Segmentation with Transformers | ✓ Link | 38.6 | | | 22.1 | 40.4 | 50.6 | ISTR (ResNet50-FPN-3x, single-scale) | 2021-05-03 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 38.3 | | | | | | CenterMask + ResNet-101-FPN | 2019-11-15 |
SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation | ✓ Link | 38.1 | 60.2 | 40.8 | 17.8 | 40.8 | 54.3 | SipMask (ResNet-101, single-scale test) | 2020-07-29 |
MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features | | 38.1 | | | | | | MaskLab+ (ResNet-101, JFT) | 2017-12-13 |
EmbedMask: Embedding Coupling for One-stage Instance Segmentation | ✓ Link | 37.7 | 59.1 | 40.3 | 17.9 | 40.4 | 53 | EmbedMask (ResNet-101-FPN) | 2019-12-04 |
EmbedMask: Embedding Coupling for One-stage Instance Segmentation | ✓ Link | 37.7 | 59.1 | 40.3 | 17.9 | 40.4 | | EmbedMask (R-101-FPN) | 2019-12-04 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 37.6 | | | | | | MogaNet-XT | 2022-11-07 |
TensorMask: A Foundation for Dense Object Segmentation | ✓ Link | 37.3 | | | | | | TensorMask (ResNet-101-FPN) | 2019-03-28 |
Mask R-CNN | ✓ Link | 37.1 | 60.0 | 39.4 | 16.9 | 39.9 | 53.5 | Mask R-CNN (ResNeXt-101-FPN) | 2017-03-20 |
DiffusionInst: Diffusion Model for Instance Segmentation | ✓ Link | 37.1 | | | | | | DiffusionInst-ResNet50 | 2022-12-06 |
VirTex: Learning Visual Representations from Textual Annotations | ✓ Link | 36.9 | 58.4 | 39.7 | | | | VirTex Mask R-CNN (ResNet-50-FPN) | 2020-06-11 |
RDSNet: A New Deep Architecture for Reciprocal Object Detection and Instance Segmentation | ✓ Link | 36.4 | 57.9 | 39.0 | 16.4 | 39.5 | 51.6 | RDSNet (data aug) | 2019-12-11 |
MogaNet: Multi-order Gated Aggregation Network | ✓ Link | 35.8 | | | | | | MogaNet-T | 2022-11-07 |
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN | ✓ Link | 34.9 | | | | | | A2MIM (ResNet-50 2x) | 2022-05-27 |
E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation | ✓ Link | 33.8 | 52.9 | 35.9 | | | | E2EC DLA-34 | 2022-03-08 |
Fully Convolutional Instance-aware Semantic Segmentation | ✓ Link | 33.6 | 54.5 | | | | | FCIS+++ +OHEM | 2016-11-23 |
torchdistill: A Modular, Configuration-Driven Framework for Knowledge Distillation | ✓ Link | 33.6 | | | | | | Mask R-CNN (Bottleneck-injected ResNet-50, FPN) | 2020-11-25 |
PolarMask: Single Shot Instance Segmentation with Polar Representation | ✓ Link | 32.9 | 55.4 | 33.8 | 15.5 | 35.1 | 46.3 | PolarMask (ResNeXt-101-FPN) | 2019-09-29 |
PolarMask: Single Shot Instance Segmentation with Polar Representation | ✓ Link | 30.4 | 51.9 | 31 | 13.4 | 32.4 | 42.8 | PolarMask (ResNet-101-FPN) | 2019-09-29 |
YOLACT: Real-time Instance Segmentation | ✓ Link | 29.8 | | | | | | YOLACT (ResNet-50-FPN) | 2019-04-04 |
Fully Convolutional Instance-aware Semantic Segmentation | ✓ Link | 29.2 | 49.5 | | 7.1 | 31.3 | 50.0 | FCIS +OHEM | 2016-11-23 |
A MultiPath Network for Object Detection | ✓ Link | 25.0 | | | | | | MultiPath Network | 2016-04-07 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | | 80.8 | 62.2 | 41.0 | 58.9 | 70.3 | InternImage-H | 2022-11-10 |
ResNeSt: Split-Attention Networks | ✓ Link | | 70.2 | 51.5 | 30.0 | 49.6 | 60.6 | ResNeSt-200 (multi-scale) | 2020-04-19 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | | 66.2 | 47.4 | 27.2 | | | CenterMask + VoVNetV2-99 (multi-scale) | 2019-11-15 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | | 60.8 | | 19.4 | 41.7 | | CenterMask + VoVNetV2-57 (single-scale) | 2019-11-15 |
Instance-aware Semantic Segmentation via Multi-task Network Cascades | ✓ Link | | 44.3 | | | | | MNC | 2015-12-14 |
ISDA: Position-Aware Instance Segmentation with Deformable Attention | ✓ Link | | | | | | 55.7 | ISDA (ResNet-50) | 2022-02-23 |