OpenCodePapers

Instance Segmentation on COCO minival
Leaderboard
| Paper | Code | mask AP | AP50 | AP75 | APL | APM | APS | GFLOPs | Params (M) | box AP | Model name | Release date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DETRs with Collaborative Hybrid Assignments Training | ✓ | 56.6 | 79.7 | 62.8 | 74.6 | 59.7 | 38.9 | | | | Co-DETR | 2022-11-22 |
| ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions | ✓ | 55.9 | | | | | | | | | ViT-CoMer-L (Mask RCNN, DINOv2) | 2024-03-13 |
| InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ | 55.4 | 80.1 | 61.5 | 74.4 | 58.4 | 37.9 | | | | InternImage-H | 2022-11-10 |
| EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ | 55.0 | 79.4 | 60.9 | 72.0 | 58.4 | 37.6 | | | | EVA | 2022-11-14 |
| Mask Frozen-DETR: High Quality Instance Segmentation with One GPU | | 54.9 | 78.9 | 60.8 | 72.9 | 58.4 | 37.2 | | | | Mask Frozen-DETR | 2023-08-07 |
| Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ | 54.5 | | | | | | | | | Mask DINO (SwinL, multi-scale) | 2022-06-06 |
| Vision Transformer Adapter for Dense Predictions | ✓ | 54.2 | | | | | | | | | ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) | 2022-05-17 |
| General Object Foundation Model for Images and Videos at Scale | ✓ | 54.2 | | | | | | | | | GLEE-Pro | 2023-12-14 |
| Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ | 53.7 | | | | | | | | | SwinV2-G (HTC++) | 2021-11-18 |
| Exploring Plain Vision Transformer Backbones for Object Detection | ✓ | 53.1 | | | | | | | | | ViTDet, ViT-H Cascade (multiscale) | 2022-03-30 |
| General Object Foundation Model for Images and Videos at Scale | ✓ | 53.0 | | | | | | | | | GLEE-Plus | 2023-12-14 |
| Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ | 52.6 | | | | | | | | | Mask DINO (SwinL) | 2022-06-06 |
| End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ | 52.5 | | | | | | | | | Soft Teacher + Swin-L (HTC++, multi-scale) | 2021-06-16 |
| Vision Transformer Adapter for Dense Predictions | ✓ | 52.5 | | | | | | | | | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | 2022-05-17 |
| Vision Transformer Adapter for Dense Predictions | ✓ | 52.2 | | | | | | | | | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | 2022-05-17 |
| Exploring Plain Vision Transformer Backbones for Object Detection | ✓ | 52 | | | | | | | | | ViTDet, ViT-H Cascade | 2022-03-30 |
| End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ | 51.9 | | | | | | | | | Soft Teacher + Swin-L (HTC++, single-scale) | 2021-06-16 |
| CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ | 51.8 | | | | | | | | | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 2021-07-01 |
| Could Giant Pretrained Image Models Extract Universal Representations? | | 51.6 | | | | | | | | | Frozen Backbone, SwinV2-G-ext22K (HTC) | 2022-11-03 |
| CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ | 51 | | | | | | | | | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 2021-07-01 |
| Focal Self-attention for Local-Global Interactions in Vision Transformers | ✓ | 50.9 | | | | | | | | | Focal-L (HTC++, multi-scale) | 2021-07-01 |
| Dilated Neighborhood Attention Transformer | ✓ | 50.8 | 75.0 | | | | | | | | DiNAT-L (single-scale, Mask2Former) | 2022-09-29 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ | 50.5 | | | | | | | | | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | 2021-12-02 |
| Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ | 50.4 | | | | | | | | | Swin-L (HTC++, multi scale) | 2021-03-25 |
| MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 50.3 | | | | | | | | | MOAT-3 (IN-22K pretraining, single-scale) | 2022-10-04 |
| Masked-attention Mask Transformer for Universal Image Segmentation | ✓ | 50.1 | | | | | | | | | Mask2Former (Swin-L) | 2021-12-02 |
| Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ | 49.5 | | | | | | | | | Swin-L (HTC++, single scale) | 2021-03-25 |
| MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 49.3 | | | | | | | | | MOAT-2 (IN-22K pretraining, single-scale) | 2022-10-04 |
| MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 49.0 | | | | | | | | | MOAT-1 (IN-1K pretraining, single-scale) | 2022-10-04 |
| Instances as Queries | ✓ | 48.9 | 74.0 | 53.9 | 68.3 | 52.6 | 30.8 | | | | QueryInst (single scale) | 2021-05-05 |
| Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ | 48.9 | | | | | | | | | Cascade Eff-B7 NAS-FPN (1280, self-training Copy Paste, single-scale) | 2020-12-13 |
| InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ | 48.8 | | | | | | 1782 | 387 | | InternImage-XL | 2022-11-10 |
| X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion | ✓ | 48.8 | | | | | | | | | CenterNet2 (Swin-L w/ X-Paste + Copy-Paste) | 2022-12-07 |
| Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ | 48.6 | | | | | | | | | Hiera-L | 2023-06-01 |
| InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ | 48.5 | | | | | | 1399 | 277 | 56.1 | InternImage-L | 2022-11-10 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ | 48.5 | | | | | | | | | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) | 2021-12-02 |
| General Object Foundation Model for Images and Videos at Scale | ✓ | 48.4 | | | | | | | | | GLEE-Lite | 2023-12-14 |
| MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 47.4 | | | | | | | | | MOAT-0 (IN-1K pretraining, single-scale) | 2022-10-04 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ | 47.1 | | | | | | | | | MViTv2-L (Cascade Mask R-CNN, single-scale) | 2021-12-02 |
| MPViT: Multi-Path Vision Transformer for Dense Prediction | ✓ | 47.0 | | | | | | | | | MPViT-B (Cascade Mask R-CNN, multi-scale, IN1k pre-train) | 2021-12-21 |
| MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 47.0 | | | | | | | | | tiny-MOAT-3 (IN-1K pretraining, single-scale) | 2022-10-04 |
| Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ | 46.8 | | | | | | | | | Cascade Eff-B7 NAS-FPN (1280) | 2020-12-13 |
| ResNeSt: Split-Attention Networks | ✓ | 46.25 | | | | | | | | | ResNeSt-200 (multi-scale) | 2020-04-19 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ | 46.2 | | | | | | | | | MViT-L (Mask R-CNN, single-scale) | 2021-12-02 |
| SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ | 46.1 | | | | | | | | | RetinaNet (SpineNet-190, 1536x1536) | 2019-12-10 |
| MPViT: Multi-Path Vision Transformer for Dense Prediction | ✓ | 45.8 | | | | | | | | | MPViT-B (Cascade R-CNN, single-scale, IN-1K pre-train) | 2021-12-21 |
| Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ | 45.7 | | 49.9 | | | | | | | Mask R-CNN (ViL Base, multi-scale, 3x lr) | 2021-03-29 |
| Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ | 45.1 | 67.2 | 49.3 | | | | | | | Mask R-CNN (ViL Base, 1x lr) | 2021-03-29 |
| MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 45.0 | | | | | | | | | tiny-MOAT-2 (IN-1K pretraining, single-scale) | 2022-10-04 |
| Global Context Networks | ✓ | 44.7 | 67.9 | 48.4 | | | | | | | GCNet (ResNeXt-101 + DCN + cascade + GC r4) | 2020-12-24 |
| MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 44.6 | | | | | | | | | tiny-MOAT-1 (IN-1K pretraining, single-scale) | 2022-10-04 |
| InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ | 44.5 | | | | | | 340 | 69 | 49.7 | InternImage-S | 2022-11-10 |
| ResNeSt: Split-Attention Networks | ✓ | 44.5 | | | | | | | | | ResNeSt-200-DCN (single-scale) | 2020-04-19 |
| ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ | 44.4 | 67.8 | 47.8 | | | | | | | ELSA-S (Cascade Mask RCNN) | 2021-12-23 |
| Bottleneck Transformers for Visual Recognition | ✓ | 44.4 | | | | | | | | | BoTNet 200 (Mask R-CNN, single scale, 72 epochs) | 2021-01-27 |
| DaViT: Dual Attention Vision Transformers | ✓ | 44.3 | | | | | | | | | DaViT-T (Mask R-CNN, 36 epochs) | 2022-04-07 |
| ResNeSt: Split-Attention Networks | ✓ | 44.21 | | | | | | | | | ResNeSt-200 (single-scale) | 2020-04-19 |
| InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ | 43.7 | | | | | | 270 | 49 | 49.1 | InternImage-T | 2022-11-10 |
| Bottleneck Transformers for Visual Recognition | ✓ | 43.7 | | | | | | | | | BoTNet 152 (Mask R-CNN, single scale, 72 epochs) | 2021-01-27 |
| XCiT: Cross-Covariance Image Transformers | ✓ | 43.7 | | | | | | | | | XCiT-M24/8 | 2021-06-17 |
| MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ | 43.3 | | | | | | | | | tiny-MOAT-0 (IN-1K pretraining, single-scale) | 2022-10-04 |
| ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ | 43.0 | 67.3 | 46.4 | | | | | | | ELSA-S (Mask RCNN) | 2021-12-23 |
| XCiT: Cross-Covariance Image Transformers | ✓ | 43.0 | | | | | | | | | XCiT-S24/8 | 2021-06-17 |
| CenterMask: Real-Time Anchor-Free Instance Segmentation | ✓ | 42.5 | | | | | | | | | CenterMask-VoVNetV2-99 (multi-scale) | 2019-11-15 |
| ResNeSt: Split-Attention Networks | ✓ | 41.56 | | | | | | | | | ResNeSt-101 (single-scale) | 2020-04-19 |
| Scaling up Multi-domain Semantic Segmentation with Sentence Embeddings | | 41.4 | | | | | | | | | SIW | 2022-02-04 |
| Res2Net: A New Multi-scale Backbone Architecture | ✓ | 41.3 | | | | | | | | | Res2Net-101+HTC | 2019-04-02 |
| Deep High-Resolution Representation Learning for Human Pose Estimation | ✓ | 41.0 | | | | | | | | | HTC (HRNetV2p-W48) | 2019-02-25 |
| Deep High-Resolution Representation Learning for Visual Recognition | ✓ | 41.0 | | | | | | | | | HTC (HRNetV2p-W48) | 2019-08-20 |
| GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ✓ | 40.9 | | | | | | | | | GCNet (ResNeXt-101 + DCN + cascade + GC r16) | 2019-04-25 |
| Bottleneck Transformers for Visual Recognition | ✓ | 40.7 | | | | | | | | | BoTNet 50 (72 epochs) | 2021-01-27 |
| Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ | 40.4 | 61.3 | 44 | 56.1 | 43.6 | 22.3 | | | | R3-CNN (ResNet-50-FPN, DCN) | 2021-04-03 |
| Non-local Neural Networks | ✓ | 40.3 | | | | | | | | | Mask R-CNN (ResNext-152, +1 NL) | 2017-11-21 |
| Attentive Normalization | ✓ | 40.2 | 63.2 | 43.3 | | | | | | | Mask R-CNN-FPN (AOGNet-40M) | 2019-08-04 |
| Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ | 40.2 | 61.1 | 43.5 | | 42.8 | 22.6 | | | | R3-CNN (ResNet-50-FPN, GC-Net) | 2021-04-03 |
| CenterMask: Real-Time Anchor-Free Instance Segmentation | ✓ | 40.2 | | | | | | | | | CenterMask-VoVNetV2-99-3x | 2019-11-15 |
| Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ | 39.1 | 58.8 | 42.3 | 54.3 | 42.1 | 20.7 | | | | R3-CNN (ResNet-50-FPN, GRoIE) | 2021-04-03 |
| Mask Scoring R-CNN | ✓ | 39.1 | | | | | | | | | Mask Scoring R-CNN (ResNet-101-FPN-DCN) | 2019-03-01 |
| Micro-Batch Training with Batch-Channel Normalization and Weight Standardization | ✓ | 38.34 | 61.07 | 40.82 | 56.08 | 41.73 | 18.32 | | | | Mask R-CNN-FPN (ResNeXt-101, GN+WS) | 2019-03-25 |
| Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ | 38.2 | 58 | 41.4 | 52.8 | 41 | 20.4 | | | | R3-CNN (ResNet-50-FPN) | 2021-04-03 |
| Hybrid Task Cascade for Instance Segmentation | ✓ | 38.2 | | | | | | | | | HTC (ResNet-50) | 2019-01-22 |
| Mask Scoring R-CNN | ✓ | 38.2 | | | | | | | | | Mask Scoring R-CNN (ResNet-101 FPN) | 2019-03-01 |
| Path Aggregation Network for Instance Segmentation | ✓ | 37.8 | | | | | | | | | PANet (ResNet-50) | 2018-03-05 |
| A novel Region of Interest Extraction Layer for Instance Segmentation | ✓ | 37.2 | 59.3 | 39.8 | 51.2 | 41 | 20.2 | | | | GCnet (ResNet-50-FPN, GRoIE) | 2020-04-28 |
| X-volution: On the unification of convolution and self-attention | | 37.2 | | | 53.1 | 40 | 19.2 | | | | Mask R-CNN (FPN, X-volution, SA) | 2021-06-04 |
| Non-local Neural Networks | ✓ | 37.1 | | | | | | | | | Mask R-CNN (ResNet-101, +1 NL) | 2017-11-21 |
| Mask Scoring R-CNN | ✓ | 36.0 | | | | | | | | | Mask Scoring R-CNN (ResNet-50 FPN) | 2019-03-01 |
| A novel Region of Interest Extraction Layer for Instance Segmentation | ✓ | 35.8 | 57.1 | 38.0 | 48.7 | 39 | 19.1 | | | | Mask R-CNN (ResNet-50-FPN, GRoIE) | 2020-04-28 |
| Res2Net: A New Multi-scale Backbone Architecture | ✓ | 35.6 | 57.6 | | 53.7 | 37.9 | 15.7 | | | | Faster R-CNN (Res2Net-50) | 2019-04-02 |
| Non-local Neural Networks | ✓ | 35.5 | | | | | | | | | Mask R-CNN (ResNet-50, +1 NL) | 2017-11-21 |
| Adaptively Connected Neural Networks | ✓ | 35.2 | | | | | | | | | Mask R-CNN (ResNet-50, ACNet) | 2019-04-07 |
| YOLACT: Real-time Instance Segmentation | ✓ | 29.9 | | | | | | | | | YOLACT-550 (ResNet-50) | 2019-04-04 |
| InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ | | | | | | | 501 | 115 | | InternImage-B | 2022-11-10 |