OpenCodePapers

object-detection-on-coco-minival

Object Detection
Leaderboard
Paper | Code | box AP | Other reported values, in column order (AP50, AP75, APS, APM, APL, Params (M)) | Model Name | Release Date
Perception Encoder: The best visual embeddings are not at the output of the network | ✓ Link | 66.0 | 1900 | PE_spatial (DETA) | 2025-04-17
DETRs with Collaborative Hybrid Assignments Training | ✓ Link | 65.9 | 314 | Co-DETR | 2022-11-22
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information | ✓ Link | 65.0 |  | M3I Pre-training (InternImage-H) | 2022-11-17
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 65.0 |  | InternImage-H | 2022-11-10
DETRs with Collaborative Hybrid Assignments Training | ✓ Link | 64.7 | 218 | Co-DETR (Swin-L) | 2022-11-22
A Strong and Reproducible Object Detector with Only Public Datasets | ✓ Link | 64.6 | 81.5 71.4 50.4 68.5 78.5 689 | Focal-Stable-DINO (Focal-Huge, no TTA) | 2023-04-25
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 64.5 | 82.1 70.8 49.4 68.4 78.5 | EVA | 2022-11-14
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions | ✓ Link | 64.3 | 363 | ViT-CoMer | 2024-03-13
Focal Modulation Networks | ✓ Link | 64.2 |  | FocalNet-H (DINO) | 2022-03-22
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 64.2 |  | InternImage-XL | 2022-11-10
CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection |  | 64.1 |  | CP-DETR-L Swin-L(Fine tuning separately in COCO) | 2024-12-13
Reversible Column Networks | ✓ Link | 63.8 |  | RevCol-H(DINO) | 2022-12-22
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | ✓ Link | 63.2 |  | DINO (Swin-L) | 2022-03-07
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection | ✓ Link | 63.0 |  | Grounding DINO | 2023-03-09
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 62.5 |  | SwinV2-G (HTC++) | 2021-11-18
Florence: A New Foundation Model for Computer Vision | ✓ Link | 62 |  | Florence-CoSwin-H | 2021-11-22
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 62.0 |  | GLEE-Pro | 2023-12-14
Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | 61.3 |  | ViTDet, ViT-H Cascade (multiscale) | 2022-03-30
Grounded Language-Image Pre-training | ✓ Link | 60.8 |  | GLIP (Swin-L, multi-scale) | 2021-12-07
End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ Link | 60.7 |  | Soft Teacher + Swin-L (HTC++, multi-scale) | 2021-06-16
Universal Instance Perception as Object Discovery and Retrieval | ✓ Link | 60.6 | 77.5 66.7 45.1 64.8 75.3 | UNINEXT-H | 2023-03-12
Vision Transformer Adapter for Dense Predictions | ✓ Link | 60.5 |  | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | 2022-05-17
Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | 60.4 |  | ViTDet, ViT-H Cascade | 2022-03-30
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 60.4 |  | GLEE-Plus | 2023-12-14
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 60.3 | 78.2 74.2 | DyHead (Swin-L, multi scale, self-training) | 2021-06-15
Vision Transformer Adapter for Dense Predictions | ✓ Link | 60.2 |  | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | 2022-05-17
End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ Link | 60.1 |  | Soft Teacher+Swin-L(HTC++, single scale) | 2021-06-16
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 59.6 |  | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 2021-07-01
Could Giant Pretrained Image Models Extract Universal Representations? |  | 59.3 |  | Frozen Backbone, SwinV2-G-ext22K (HTC) | 2022-11-03
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions | ✓ Link | 59.2 |  | HorNet-L | 2022-07-28
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 59.2 |  | MOAT-3 (IN-22K pretraining, single-scale) | 2022-10-04
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 59.1 |  | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 2021-07-01
Focal Self-attention for Local-Global Interactions in Vision Transformers | ✓ Link | 58.7 | 77.2 73.4 | Focal-L (DyHead, multi-scale) | 2021-07-01
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 58.7 |  | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | 2021-12-02
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 58.5 |  | MOAT-2 (IN-22K pretraining, single-scale) | 2022-10-04
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 58.4 | 76.8 44.5 62.2 73.2 | DyHead (Swin-L, multi scale) | 2021-06-15
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 58 |  | Swin-L (HTC++, multi scale) | 2021-03-25
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 57.7 |  | MOAT-1 (IN-1K pretraining, single-scale) | 2022-10-04
Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality | ✓ Link | 57.4 |  | UM-MAE(HTC++, Swin-L, IN1K) | 2022-05-20
YOLOv6 v3.0: A Full-Scale Reloading | ✓ Link | 57.2 | 74.5 | YOLOv6-L6(46 fps, 1280, V100) | 2023-01-13
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 57.1 |  | Swin-L (HTC++, single scale) | 2021-03-25
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 57.1 |  | TransNeXt-Base (IN-1K pretrain, DINO 1x) | 2023-11-28
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 57.0 |  | Cascade Eff-B7 NAS-FPN (1280, self-training Copy Paste, single-scale) | 2020-12-13
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 56.6 |  | TransNeXt-Small (IN-1K pretrain, DINO 1x) | 2023-11-28
Instances as Queries | ✓ Link | 56.1 | 75.8 61.7 40.2 59.8 71.5 | QueryInst (single scale) | 2021-05-05
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 56.1 |  | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) | 2021-12-02
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 55.9 |  | MOAT-0 (IN-1K pretraining, single-scale) | 2022-10-04
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 55.7 |  | TransNeXt-Tiny (IN-1K pretrain, DINO 1x) | 2023-11-28
Scaled-YOLOv4: Scaling Cross Stage Partial Network | ✓ Link | 55.4 | 73.3 60.7 38.1 59.5 67.4 | YOLOv4-P7 CSP-P7 (single-scale, 16 fps) | 2020-11-16
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 55.2 |  | tiny-MOAT-3 (IN-1K pretraining, single-scale) | 2022-10-04
Understanding The Robustness in Vision Transformers | ✓ Link | 55.1 |  | FAN-L-Hybrid | 2022-04-26
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 55 |  | Hiera-L | 2023-06-01
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 55.0 |  | GLEE-Lite | 2023-12-14
Towards Sustainable Self-supervised Learning | ✓ Link | 54.6 |  | TEC(VIT-B, Mask-RCNN) | 2022-10-20
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 54.5 |  | Cascade Eff-B7 NAS-FPN (1280) | 2020-12-13
Context Autoencoder for Self-Supervised Representation Learning | ✓ Link | 54.5 |  | CAE (ViT-L, Mask R-CNN, 1x schedule) | 2022-02-07
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 54.3 |  | MViTv2-L (Cascade Mask R-CNN, single-scale) | 2021-12-02
Rethinking Pre-training and Self-training | ✓ Link | 54.2 |  | SpineNet-190 (1280, with Self-training on OpenImages, single-scale) | 2020-06-11
Simple Training Strategies and Model Scaling for Object Detection | ✓ Link | 53.6 | 34.5 56.7 70.6 | Cascade RCNN-RS (SpineNet-143L, single scale) | 2021-06-30
USB: Universal-Scale Object Detection Benchmark | ✓ Link | 53.5 | 70.8 58.9 36.9 57.5 68.1 | UniverseNet-20.08d (Res2Net-101, DCN, multi-scale) | 2021-03-25
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 53.3 |  | MAE (ViT-L, Mask R-CNN) | 2021-11-11
Simple Training Strategies and Model Scaling for Object Detection | ✓ Link | 53.1 | 33.9 56.2 70.3 | Cascade RCNN-RS (ResNet-200, single scale) | 2021-06-30
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 53.0 |  | tiny-MOAT-2 (IN-1K pretraining, single-scale) | 2022-10-04
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 52.7 |  | MViT-L (Mask R-CNN, single-scale, IN21k pre-train) | 2021-12-02
ResNeSt: Split-Attention Networks | ✓ Link | 52.47 | 71.00 57.07 36.80 56.36 66.29 | ResNeSt-200 (multi-scale) | 2020-04-19
Active Token Mixer | ✓ Link | 52.3 |  | ActiveMLP-B (Cascade Mask R-CNN) | 2022-03-11
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 52.2 |  | RetinaNet (SpineNet-190, 1536x1536) | 2019-12-10
EfficientDet: Scalable and Efficient Object Detection | ✓ Link | 52.1 |  | EfficientDet-D7 (1536) | 2019-11-20
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 51.9 |  | tiny-MOAT-1 (IN-1K pretraining, single-scale) | 2022-10-04
Global Context Networks | ✓ Link | 51.8 | 70.4 56.1 | GCNet (ResNeXt-101 + DCN + cascade + GC r4) | 2020-12-24
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 51.6 | 70.5 56.0 | ELSA-S (Cascade Mask RCNN) | 2021-12-23
Focal Modulation Networks | ✓ Link | 51.5 | 70.3 56.0 | FocalNet-T (LRF, Cascade Mask R-CNN) | 2022-03-22
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | ✓ Link | 51.3 | 69.1 56 34.5 54.2 65.8 | DINO-5scale (24 epoch) | 2022-03-07
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | ✓ Link | 51.2 | 69 55.8 35 54.3 65.3 | DINO-5scale (36 epoch) | 2022-03-07
ResNeSt: Split-Attention Networks | ✓ Link | 50.91 | 69.53 55.40 32.67 54.66 65.83 | ResNeSt-200-DCN (single-scale) | 2020-04-19
USB: Universal-Scale Object Detection Benchmark | ✓ Link | 50.9 | 69.5 55.4 33.5 55.5 65.8 | UniverseNet-20.08d (Res2Net-101, DCN, single-scale) | 2021-03-25
ResNeSt: Split-Attention Networks | ✓ Link | 50.54 | 68.78 55.17 54.2 63.9 | ResNeSt-200 (single-scale) | 2020-04-19
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 50.5 |  | tiny-MOAT-0 (IN-1K pretraining, single-scale) | 2022-10-04
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 50.3 |  | MAE (ViT-B, Mask R-CNN) | 2021-11-11
PVT v2: Improved Baselines with Pyramid Vision Transformer | ✓ Link | 50.1 | 69.5 54.9 | Sparse R-CNN (PVTv2-B2) | 2021-06-25
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 50.0 |  | Pix2seq (ViT-L) | 2021-09-22
DaViT: Dual Attention Vision Transformers | ✓ Link | 49.9 |  | DaViT-T (Mask R-CNN, 36 epochs) | 2022-04-07
Bottleneck Transformers for Visual Recognition | ✓ Link | 49.7 | 71.3 54.6 | BoTNet 200 (Mask R-CNN, single scale, 72 epochs) | 2021-01-27
Bottleneck Transformers for Visual Recognition | ✓ Link | 49.5 | 71 54.2 | BoTNet 152 (Mask R-CNN, single scale, 72 epochs) | 2021-01-27
DN-DETR: Accelerate DETR Training by Introducing Query DeNoising | ✓ Link | 49.5 | 67.6 53.8 31.3 52.6 65.4 47 | DN-Deformable-DETR-R50++ | 2022-03-02
Recurrent Glimpse-based Decoder for Detection with Transformer | ✓ Link | 49.1 | 67.5 53.1 30 52.6 65 | REGO-Deformable DETR-X101 | 2021-12-09
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 48.6 | 67.8 | CenterMask+VoVNet99 (multi-scale) | 2019-11-15
Rethinking ImageNet Pre-training | ✓ Link | 48.6 | 66.8 52.9 | Mask R-CNN (ResNeXt-152-FPN, cascade) | 2018-11-21
USB: Universal-Scale Object Detection Benchmark | ✓ Link | 48.5 | 67.0 52.6 30.6 52.7 62.7 | UniverseNet-20.08 (Res2Net-50, DCN, single-scale) | 2021-03-25
XCiT: Cross-Covariance Image Transformers | ✓ Link | 48.5 |  | XCiT-M24/8 | 2021-06-17
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 48.3 | 70.4 52.9 | ELSA-S (Mask RCNN) | 2021-12-23
XCiT: Cross-Covariance Image Transformers | ✓ Link | 48.1 |  | XCiT-S24/8 | 2021-06-17
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ✓ Link | 47.9 | 66.9 52.2 | GCNet (ResNeXt-101 + DCN + cascade + GC r16) | 2019-04-25
MAE-DET: Revisiting Maximum Entropy Principle in Zero-Shot NAS for Efficient Object Detection | ✓ Link | 47.8 | 65.5 52.2 30.3 51.9 61.1 | MAE-Det(MAE-Det-L+GFLV2) | 2021-11-26
Res2Net: A New Multi-scale Backbone Architecture | ✓ Link | 47.5 | 66.5 51.3 28.6 51.6 62.1 | Res2Net101+HTC | 2019-04-02
Rethinking ImageNet Pre-training | ✓ Link | 47.4 |  | Mask R-CNN (ResNet-101-FPN, GN, Cascade) | 2018-11-21
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 47.3 |  | Pix2seq (R50-C4) | 2021-09-22
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 47.1 |  | Pix2seq (ViT-B) | 2021-09-22
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 47.0 | 28.8 50.3 62.2 | HTC (HRNetV2p-W48) | 2019-08-20
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 47.0 |  | PatchConvNet-S120 (Mask R-CNN) | 2021-12-27
RepPoints: Point Set Representation for Object Detection | ✓ Link | 46.8 |  | RPDet (ResNeXt-101-DCN, multi-scale) | 2019-04-25
DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR | ✓ Link | 46.6 | 67 50.2 28.1 50.5 64.1 63 | DAB-DETR-DC5-R101 | 2022-01-28
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 46.5 |  | DyHead (ResNet-101) | 2021-06-15
Rethinking ImageNet Pre-training | ✓ Link | 46.4 | 67.1 51.1 | Mask R-CNN (ResNeXt-152-FPN) | 2018-11-21
RepPoints: Point Set Representation for Object Detection | ✓ Link | 46.4 |  | RPDet (ResNet-101-DCN, multi-scale) | 2019-04-25
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 46.4 |  | PatchConvNet-S60 (Mask R-CNN) | 2021-12-27
Deep Residual Learning for Image Recognition | ✓ Link | 46.3 | 64.3 50.5 | Cascade Mask R-CNN (ResNet-50) | 2015-12-10
HoughNet: Integrating near and long-range evidence for bottom-up object detection | ✓ Link | 46.1 | 64.6 50.3 30.0 48.8 59.7 | HoughNet (HG-104, MS) | 2020-07-05
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 46.0 | 27.5 60.1 | Mask R-CNN (HRNetV2p-W48, cascade) | 2019-08-20
Conditional DETR for Fast Training Convergence | ✓ Link | 45.9 | 66.8 49.5 27.2 50.3 63.3 63 | Conditional DETR-DC5-R101 | 2021-08-13
Bottleneck Transformers for Visual Recognition | ✓ Link | 45.9 |  | BoTNet 50 (72 epochs) | 2021-01-27
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals | ✓ Link | 45.6 | 64.6 49.5 28.3 48.3 61.6 | Sparse R-CNN (ResNet-101, learnable proposals, random crop aug, FPN) | 2020-11-25
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 45.6 | 29.2 58.8 | CenterMask+VoVNetV2-99 (single-scale) | 2019-11-15
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 45.3 | 27.0 48.4 59.5 | HTC (HRNetV2p-W32) | 2019-08-20
Anchor DETR: Query Design for Transformer-Based Object Detection | ✓ Link | 45.1 | 65.7 48.8 25.8 49.4 61.6 | Anchor DETR-DC5-R101 | 2021-09-15
Conditional DETR for Fast Training Convergence | ✓ Link | 45.1 | 65.4 48.5 25.3 49 62.2 44 | Conditional DETR-DC5-R50 | 2021-08-13
Non-local Neural Networks | ✓ Link | 45.0 | 67.8 48.9 | Mask R-CNN (ResNeXt-152 + 1 NL) | 2017-11-21
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 45.0 | 63.2 48.6 28.2 48.9 60.4 | Pix2seq (R101-DC5) | 2021-09-22
Attentive Normalization | ✓ Link | 44.9 | 66.2 49.1 | Mask R-CNN-FPN (AOGNet-40M) | 2019-08-04
End-to-End Object Detection with Transformers | ✓ Link | 44.9 | 64.7 47.7 23.7 49.5 62.3 | DETR-DC5 (ResNet-101) | 2020-05-26
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 44.9 | 28.5 57.7 | Mask R-CNN (VoVNetV2-99, single-scale) | 2019-11-15
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 44.8 | 64.3 48.9 26.6 48.3 59.6 | R3-CNN (ResNet-50-FPN, DCN) | 2021-04-03
RepPoints: Point Set Representation for Object Detection | ✓ Link | 44.8 |  | RPDet (ResNet-101-DCN, multi-scale train) | 2019-04-25
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 44.7 | 47.6 29.9 48 58.1 | RetinaNet (ViL-Base, multi-scale, 3x) | 2021-03-29
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 44.6 | 62.7 48.7 26.3 48.1 58.5 | Cascade R-CNN (HRNetV2p-W48) | 2019-08-20
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 44.6 | 27.7 48.3 | CenterMask+VoVNetV2-57 (single-scale) | 2019-11-15
Conditional DETR for Fast Training Convergence | ✓ Link | 44.5 | 65.6 47.5 23.6 48.4 63.6 63 | Conditional DETR-R101 | 2021-08-13
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals | ✓ Link | 44.5 | 63.4 48.2 26.9 47.2 59.5 | Sparse R-CNN (ResNet-50, learnable proposals, random crop aug, FPN) | 2020-11-25
Deep Residual Learning for Image Recognition | ✓ Link | 44.5 | 63.0 48.3 | GFL (ResNet-50) | 2015-12-10
RepPoints: Point Set Representation for Object Detection | ✓ Link | 44.5 |  | RPDet (ResNeXt-101-DCN) | 2019-04-25
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 44.4 | 26.7 57.1 | CenterMask+X101-32x8d (single-scale) | 2019-11-15
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 44.3 | 65.5 47.1 28.9 47.9 58.3 | RetinaNet (ViL-Base) | 2021-03-29
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 44.3 | 64.1 48.4 27 47.1 58.9 | R3-CNN (ResNet-50-FPN, GC-Net) | 2021-04-03
Anchor DETR: Query Design for Transformer-Based Object Detection | ✓ Link | 44.2 | 64.7 47.5 24.7 48.2 60.6 | Anchor DETR-DC5-R50 | 2021-09-15
DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR | ✓ Link | 44.1 | 64.7 47.2 24.1 48.2 62.9 63 | DAB-DETR-R101 | 2022-01-28
End-to-End Object Detection with Transformers | ✓ Link | 44 | 63.9 47.8 27.2 48.1 56 | Faster RCNN-R101-FPN+ | 2020-05-26
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 43.7 | 61.7 47.7 25.6 46.5 57.4 | Cascade R-CNN (HRNetV2p-W32) | 2019-08-20
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals | ✓ Link | 43.5 | 62.1 47.2 26.1 46.3 59.7 | Sparse R-CNN (ResNet-101, FPN) | 2020-11-25
Deep Residual Learning for Image Recognition | ✓ Link | 43.5 | 61.9 47.0 | ATSS (ResNet-50) | 2015-12-10
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions | ✓ Link | 43.4 | 63.6 46.1 26.1 46.0 59.5 | PVT-Large (RetinaNet 3x,MS) | 2021-02-24
Bottom-up Object Detection by Grouping Extreme and Center Points | ✓ Link | 43.3 | 59.6 46.8 25.7 46.6 59.4 | ExtremeNet (Hourglass-104, multi-scale) | 2019-01-23
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 43.2 | 61.0 46.1 26.6 47 58.6 | Pix2seq (R50-DC5) | 2021-09-22
Hybrid Task Cascade for Instance Segmentation | ✓ Link | 43.2 | 59.4 40.7 20.3 40.9 52.3 | HTC (cascade) | 2019-01-22
Micro-Batch Training with Batch-Channel Normalization and Weight Standardization | ✓ Link | 43.12 | 64.15 47.11 25.49 47.19 56.39 | Mask R-CNN-FPN (ResNeXt-101, GN+WS) | 2019-03-25
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 43.1 | 26.6 46.0 | HTC (HRNetV2p-W18) | 2019-08-20
Deformable ConvNets v2: More Deformable, Better Results | ✓ Link | 43.1 |  | Mask R-CNN (ResNet-101, DCNv2) | 2018-11-27
Conditional DETR for Fast Training Convergence | ✓ Link | 43 | 64 45.7 22.7 46.7 61.5 44 | Conditional DETR-R50 | 2021-08-13
HoughNet: Integrating near and long-range evidence for bottom-up object detection | ✓ Link | 43.0 | 62.2 46.9 25.5 47.6 55.8 | HoughNet (HG-104) | 2020-07-05
X-volution: On the unification of convolution and self-attention |  | 42.8 | 64 46.4 26.9 46 55 | Faster R-CNN (FPN, X-volution) | 2021-06-04
Cascade R-CNN: Delving into High Quality Object Detection | ✓ Link | 42.7 | 61.6 46.6 23.8 46.2 57.4 | Cascade R-CNN (ResNet-101-FPN+, cascade) | 2017-12-03
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions | ✓ Link | 42.6 | 63.7 45.4 25.8 46.0 58.4 | PVT-Large (RetinaNet 1x) | 2021-02-24
CornerNet-Lite: Efficient Keypoint Based Object Detection | ✓ Link | 42.6 | 25.5 44.3 58.4 | CornerNet-Saccade (Hourglass-54) | 2019-04-18
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 42.6 |  | Pix2seq (R50) | 2021-09-22
Group Normalization | ✓ Link | 42.3 | 62.8 46.2 | Mask R-CNN (ResNet-101-FPN, GroupNorm, long) | 2018-03-22
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals | ✓ Link | 42.3 | 61.2 45.7 26.7 44.6 57.6 | Sparse R-CNN (ResNet-50, FPN) | 2020-11-25
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 42.3 | 25.0 45.4 | Mask R-CNN (HRNetV2p-W32) | 2019-08-20
Rethinking and Improving Relative Position Encoding for Vision Transformer | ✓ Link | 42.3 |  | DETR-ResNet50 with iRPE-K (300 epochs) | 2021-07-29
Scale-Aware Trident Networks for Object Detection | ✓ Link | 42 | 63.5 45.5 24.9 47 56.9 | TridentNet (ResNet-101) | 2019-01-07
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 42 | 61 46.3 24.5 45.2 55.7 | R3-CNN (ResNet-50-FPN) | 2021-04-03
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 41.8 | 62.8 45.9 44.7 54.6 | Faster R-CNN (HRNetV2p-W48) | 2019-08-20
LIP: Local Importance-based Pooling | ✓ Link | 41.7 | 63.6 45.6 25.2 45.8 | Faster R-CNN (LIP-ResNet-101) | 2019-08-12
Deformable ConvNets v2: More Deformable, Better Results | ✓ Link | 41.7 | 22.2 45.8 58.7 | Faster R-CNN (ResNet-101, DCNv2) | 2018-11-27
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 41.6 | 62.4 | FSAF (ResNeXt-101, anchor-based branches) | 2019-03-02
CornerNet-Lite: Efficient Keypoint Based Object Detection | ✓ Link | 41.4 | 23.8 43.5 57.1 | CornerNet-Saccade (Hourglass-104) | 2019-04-18
Grid R-CNN | ✓ Link | 41.3 | 60.3 44.4 23.4 45.8 54.1 | Grid R-CNN (ResNet-101-FPN) | 2018-11-29
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 41.3 | 59.2 44.9 23.7 44.2 54.1 | Cascade R-CNN (HRNetV2p-W18) | 2019-08-20
CenterNet: Keypoint Triplets for Object Detection | ✓ Link | 41.3 | 59.2 43.9 23.6 43.8 55.8 | CenterNet511 (Hourglass-52) | 2019-04-17
RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free | ✓ Link | 41.1 | 60.2 44.1 | RetinaMask (ResNet-101-FPN) | 2019-01-10
MetaFormer Is Actually What You Need for Vision | ✓ Link | 41.0 | 63.1 44.8 | PoolFormer-S36 (Mask R-CNN) | 2021-11-22
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 40.9 | 61.8 44.8 24.4 43.7 53.3 | Faster R-CNN (HRNetV2p-W32) | 2019-08-20
VirTex: Learning Visual Representations from Textual Annotations | ✓ Link | 40.9 |  | VirTex Mask R-CNN (ResNet-50-FPN) | 2020-06-11
Non-local Neural Networks | ✓ Link | 40.8 | 63.1 44.5 | Mask R-CNN (ResNet-101 + 1 NL) | 2017-11-21
Group Normalization | ✓ Link | 40.8 | 61.6 44.4 | Mask R-CNN (ResNet-50-FPN, GroupNorm, long) | 2018-03-22
RepPoints: Point Set Representation for Object Detection | ✓ Link | 40.8 |  | RPDet (ResNet-50, multi-scale train) | 2019-04-25
Rethinking and Improving Relative Position Encoding for Vision Transformer | ✓ Link | 40.8 |  | DETR-ResNet50 with iRPE-K (150 epochs) | 2021-07-29
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | ✓ Link | 40.7 | 60.7 43.3 | Faster R-CNN+aLRP Loss (ResNet-50, 500 scale) | 2020-09-28
Reducing Label Noise in Anchor-Free Object Detection | ✓ Link | 40.5 | 59.5 44.2 25.4 44.7 52.3 | PPDet (ResNet-101-FPN) | 2020-08-03
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ✓ Link | 40.3 | 62.4 44 24.2 44.4 52.5 | GCNet (ResNet-50-FPN, GRoIE) | 2019-04-25
Group Normalization | ✓ Link | 40.3 | 61 44 | Mask R-CNN (ResNet-50-FPN, GroupNorm) | 2018-03-22
Cascade R-CNN: Delving into High Quality Object Detection | ✓ Link | 40.3 | 59.4 43.7 22.9 43.7 54.1 | Cascade R-CNN (ResNet-50-FPN+) | 2017-12-03
Bottom-up Object Detection by Grouping Extreme and Center Points | ✓ Link | 40.3 | 55.1 43.7 21.6 44.0 56.1 | ExtremeNet (Hourglass-104, single-scale) | 2019-01-23
RepPoints: Point Set Representation for Object Detection | ✓ Link | 40.3 |  | RPDet (ResNet-101) | 2019-04-25
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | ✓ Link | 40.2 | 60.3 42.3 | RetinaNet+aLRP Loss (ResNet-50, 500 scale) | 2020-09-28
Mask R-CNN | ✓ Link | 40.0 |  | Mask R-CNN (ResNet-101-FPN) | 2017-03-20
Feature Pyramid Networks for Object Detection | ✓ Link | 39.8 | 61.3 43.3 22.9 43.3 52.6 | FPN+ | 2016-12-09
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | ✓ Link | 39.7 | 58.8 41.5 | FoveaBox+aLRP Loss (ResNet-50, 500 scale) | 2020-09-28
Grid R-CNN | ✓ Link | 39.6 | 58.3 42.4 22.6 43.8 51.5 | Grid R-CNN (ResNet-50-FPN) | 2018-11-29
Adaptively Connected Neural Networks | ✓ Link | 39.5 |  | Mask R-CNN (ResNet-50, ACNet) | 2019-04-07
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 39.3 | 59.2 | FSAF (ResNet-101, anchor-based branches) | 2019-03-02
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 39.2 | 41.7 51.0 | Mask R-CNN (HRNetV2p-W18) | 2019-08-20
Non-local Neural Networks | ✓ Link | 39.0 | 61.1 41.9 | Mask R-CNN (ResNet-50 + 1 NL) | 2017-11-21
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 38.9 | 58.4 41.5 22.3 43.5 51.7 | FoveaBox (ResNet-101-FPN, 800x800) | 2019-04-08
FCOS: Fully Convolutional One-Stage Object Detection | ✓ Link | 38.6 | 57.4 41.4 22.3 42.5 49.8 | FCOS (ResNet-50-FPN + improvements) | 2019-04-02
RepPoints: Point Set Representation for Object Detection | ✓ Link | 38.6 |  | RPDet (ResNet-50) | 2019-04-25
Libra R-CNN: Towards Balanced Learning for Object Detection | ✓ Link | 38.5 | 59.3 42.0 22.9 42.1 50.5 | Libra R-CNN (ResNet-50 FPN) | 2019-04-04
A novel Region of Interest Extraction Layer for Instance Segmentation | ✓ Link | 38.4 | 59.9 41.7 22.9 42.1 49.7 | Mask R-CNN (ResNet-50-FPN, GRoIE) | 2020-04-28
CornerNet: Detecting Objects as Paired Keypoints | ✓ Link | 38.4 | 53.8 40.9 18.6 40.5 51.8 | CornerNet511 (Hourglass-104) | 2018-08-03
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 38.1 | 57.8 40.5 | FoveaBox+Retina (ResNet-50) | 2019-04-08
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 38.0 | 58.9 41.5 22.6 40.8 49.6 | Faster R-CNN (HRNetV2p-W18) | 2019-08-20
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 38 | 57.8 40.2 19.5 42.2 52.7 | FoveaBox (ResNet-101-FPN, 600x600) | 2019-04-08
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 37.9 | 58.0 | FSAF (ResNet-101) | 2019-03-02
Mask R-CNN | ✓ Link | 37.7 |  | Mask R-CNN (ResNet-50-FPN) | 2017-03-20
A novel Region of Interest Extraction Layer for Instance Segmentation | ✓ Link | 37.5 | 59.2 40.6 22.3 41.5 47.8 | Faster R-CNN (ResNet-50-FPN, GRoIE) | 2020-04-28
Mask R-CNN | ✓ Link | 36.7 | 59.5 38.9 | Mask R-CNN (ResNeXt-101-FPN) | 2017-03-20
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 36.0 | 55.2 37.9 18.6 39.4 50.5 | FoveaBox (ResNet-50-FPN, 600x600) | 2019-04-08
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 35.9 | 55.0 37.9 19.8 39.6 48.2 | FSAF (ResNet-50) | 2019-03-02
Gradient Harmonized Single-stage Detector | ✓ Link | 35.8 | 55.5 38.1 19.6 39.6 46.7 | GHM-C + GHM-R (RetinaNet-FPN-ResNet-50, M=30) | 2018-11-13
Generating Positive Bounding Boxes for Balanced Training of Object Detectors | ✓ Link | 35.6 | 55.3 | Online Fg Bal. Sampling+Hard Negative Mining (ResNet-50) | 2019-09-21
M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network | ✓ Link | 34.1 | 53.7 15.9 39.5 49.3 | M2Det (ResNet-101, 320x320) | 2018-11-12
Res2Net: A New Multi-scale Backbone Architecture | ✓ Link | 33.7 | 53.6 14 38.3 51.1 | Faster R-CNN (Res2Net-50) | 2019-04-02
M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network | ✓ Link | 33.2 | 52.2 15 38.2 49.1 | M2Det (VGG-16, 320x320) | 2018-11-12
SOLQ: Segmenting Objects by Learning Queries | ✓ Link |  | 74.9 61.3 71.9 | SOLQ (Swin-L, single scale) | 2021-06-04
You Only Learn One Representation: Unified Network for Multiple Tasks | ✓ Link |  | 73.5 60.6 40.4 60.1 68.7 | YOLOR-D6 (1280, single-scale, 31 fps) | 2021-05-10
EfficientDet: Scalable and Efficient Object Detection | ✓ Link |  | 73.4 59.0 40.0 58.0 67.9 | EfficientDet-D7x (single-scale) | 2019-11-20
You Only Learn One Representation: Unified Network for Multiple Tasks | ✓ Link |  | 70.6 57.4 37.4 57.3 65.2 | YOLOR-P6 (1280, single-scale, 72 fps) | 2021-05-10
Focal Modulation Networks | ✓ Link |  | 70.1 55.8 | FocalNet-T (SRF, Cascade Mask R-CNN) | 2022-03-22
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link |  | 61.2 45.6 24.4 | R3-CNN (ResNet-50-FPN, GRoIE) | 2021-04-03
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link |  | 26.1 47.9 | Mask R-CNN (HRNetV2p-W32, cascade) | 2019-08-20
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link |  | 42.3 | Shift-T | 2022-01-26
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link |  | 66.3 | DyHead (ResNeXt-64x4d-101-DCN, multi scale) | 2021-06-15