Paper | Code | box AP | AP50 | AP75 | APS | APM | APL | Params (M) | Model | Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Perception Encoder: The best visual embeddings are not at the output of the network | ✓ Link | 66.0 | | | | | | 1900 | PE_spatial (DETA) | 2025-04-17 |
DETRs with Collaborative Hybrid Assignments Training | ✓ Link | 65.9 | | | | | | 314 | Co-DETR | 2022-11-22 |
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information | ✓ Link | 65.0 | | | | | | | M3I Pre-training (InternImage-H) | 2022-11-17 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 65.0 | | | | | | | InternImage-H | 2022-11-10 |
DETRs with Collaborative Hybrid Assignments Training | ✓ Link | 64.7 | | | | | | 218 | Co-DETR (Swin-L) | 2022-11-22 |
A Strong and Reproducible Object Detector with Only Public Datasets | ✓ Link | 64.6 | 81.5 | 71.4 | 50.4 | 68.5 | 78.5 | 689 | Focal-Stable-DINO (Focal-Huge, no TTA) | 2023-04-25 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 64.5 | 82.1 | 70.8 | 49.4 | 68.4 | 78.5 | | EVA | 2022-11-14 |
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions | ✓ Link | 64.3 | | | | | | 363 | ViT-CoMer | 2024-03-13 |
Focal Modulation Networks | ✓ Link | 64.2 | | | | | | | FocalNet-H (DINO) | 2022-03-22 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 64.2 | | | | | | | InternImage-XL | 2022-11-10 |
CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection | | 64.1 | | | | | | | CP-DETR-L Swin-L (fine-tuned separately on COCO) | 2024-12-13 |
Reversible Column Networks | ✓ Link | 63.8 | | | | | | | RevCol-H (DINO) | 2022-12-22 |
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | ✓ Link | 63.2 | | | | | | | DINO (Swin-L) | 2022-03-07 |
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection | ✓ Link | 63.0 | | | | | | | Grounding DINO | 2023-03-09 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 62.5 | | | | | | | SwinV2-G (HTC++) | 2021-11-18 |
Florence: A New Foundation Model for Computer Vision | ✓ Link | 62.0 | | | | | | | Florence-CoSwin-H | 2021-11-22 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 62.0 | | | | | | | GLEE-Pro | 2023-12-14 |
Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | 61.3 | | | | | | | ViTDet, ViT-H Cascade (multi-scale) | 2022-03-30 |
Grounded Language-Image Pre-training | ✓ Link | 60.8 | | | | | | | GLIP (Swin-L, multi-scale) | 2021-12-07 |
End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ Link | 60.7 | | | | | | | Soft Teacher + Swin-L (HTC++, multi-scale) | 2021-06-16 |
Universal Instance Perception as Object Discovery and Retrieval | ✓ Link | 60.6 | 77.5 | 66.7 | 45.1 | 64.8 | 75.3 | | UNINEXT-H | 2023-03-12 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 60.5 | | | | | | | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | 2022-05-17 |
Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | 60.4 | | | | | | | ViTDet, ViT-H Cascade | 2022-03-30 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 60.4 | | | | | | | GLEE-Plus | 2023-12-14 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 60.3 | 78.2 | | | | 74.2 | | DyHead (Swin-L, multi-scale, self-training) | 2021-06-15 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 60.2 | | | | | | | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | 2022-05-17 |
End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ Link | 60.1 | | | | | | | Soft Teacher + Swin-L (HTC++, single-scale) | 2021-06-16 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 59.6 | | | | | | | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 2021-07-01 |
Could Giant Pretrained Image Models Extract Universal Representations? | | 59.3 | | | | | | | Frozen Backbone, SwinV2-G-ext22K (HTC) | 2022-11-03 |
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions | ✓ Link | 59.2 | | | | | | | HorNet-L | 2022-07-28 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 59.2 | | | | | | | MOAT-3 (IN-22K pretraining, single-scale) | 2022-10-04 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 59.1 | | | | | | | CBNetV2 (Dual-Swin-L HTC, single-scale) | 2021-07-01 |
Focal Self-attention for Local-Global Interactions in Vision Transformers | ✓ Link | 58.7 | 77.2 | | | | 73.4 | | Focal-L (DyHead, multi-scale) | 2021-07-01 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 58.7 | | | | | | | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | 2021-12-02 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 58.5 | | | | | | | MOAT-2 (IN-22K pretraining, single-scale) | 2022-10-04 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 58.4 | 76.8 | | 44.5 | 62.2 | 73.2 | | DyHead (Swin-L, multi-scale) | 2021-06-15 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 58.0 | | | | | | | Swin-L (HTC++, multi-scale) | 2021-03-25 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 57.7 | | | | | | | MOAT-1 (IN-1K pretraining, single-scale) | 2022-10-04 |
Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality | ✓ Link | 57.4 | | | | | | | UM-MAE (HTC++, Swin-L, IN1K) | 2022-05-20 |
YOLOv6 v3.0: A Full-Scale Reloading | ✓ Link | 57.2 | 74.5 | | | | | | YOLOv6-L6 (46 fps, 1280, V100) | 2023-01-13 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 57.1 | | | | | | | Swin-L (HTC++, single-scale) | 2021-03-25 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 57.1 | | | | | | | TransNeXt-Base (IN-1K pretrain, DINO 1x) | 2023-11-28 |
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 57.0 | | | | | | | Cascade Eff-B7 NAS-FPN (1280, self-training Copy Paste, single-scale) | 2020-12-13 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 56.6 | | | | | | | TransNeXt-Small (IN-1K pretrain, DINO 1x) | 2023-11-28 |
Instances as Queries | ✓ Link | 56.1 | 75.8 | 61.7 | 40.2 | 59.8 | 71.5 | | QueryInst (single scale) | 2021-05-05 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 56.1 | | | | | | | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) | 2021-12-02 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 55.9 | | | | | | | MOAT-0 (IN-1K pretraining, single-scale) | 2022-10-04 |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers | ✓ Link | 55.7 | | | | | | | TransNeXt-Tiny (IN-1K pretrain, DINO 1x) | 2023-11-28 |
Scaled-YOLOv4: Scaling Cross Stage Partial Network | ✓ Link | 55.4 | 73.3 | 60.7 | 38.1 | 59.5 | 67.4 | | YOLOv4-P7 CSP-P7 (single-scale, 16 fps) | 2020-11-16 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 55.2 | | | | | | | tiny-MOAT-3 (IN-1K pretraining, single-scale) | 2022-10-04 |
Understanding The Robustness in Vision Transformers | ✓ Link | 55.1 | | | | | | | FAN-L-Hybrid | 2022-04-26 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 55.0 | | | | | | | Hiera-L | 2023-06-01 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 55.0 | | | | | | | GLEE-Lite | 2023-12-14 |
Towards Sustainable Self-supervised Learning | ✓ Link | 54.6 | | | | | | | TEC (ViT-B, Mask R-CNN) | 2022-10-20 |
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 54.5 | | | | | | | Cascade Eff-B7 NAS-FPN (1280) | 2020-12-13 |
Context Autoencoder for Self-Supervised Representation Learning | ✓ Link | 54.5 | | | | | | | CAE (ViT-L, Mask R-CNN, 1x schedule) | 2022-02-07 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 54.3 | | | | | | | MViTv2-L (Cascade Mask R-CNN, single-scale) | 2021-12-02 |
Rethinking Pre-training and Self-training | ✓ Link | 54.2 | | | | | | | SpineNet-190 (1280, with Self-training on OpenImages, single-scale) | 2020-06-11 |
Simple Training Strategies and Model Scaling for Object Detection | ✓ Link | 53.6 | | | 34.5 | 56.7 | 70.6 | | Cascade RCNN-RS (SpineNet-143L, single-scale) | 2021-06-30 |
USB: Universal-Scale Object Detection Benchmark | ✓ Link | 53.5 | 70.8 | 58.9 | 36.9 | 57.5 | 68.1 | | UniverseNet-20.08d (Res2Net-101, DCN, multi-scale) | 2021-03-25 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 53.3 | | | | | | | MAE (ViT-L, Mask R-CNN) | 2021-11-11 |
Simple Training Strategies and Model Scaling for Object Detection | ✓ Link | 53.1 | | | 33.9 | 56.2 | 70.3 | | Cascade RCNN-RS (ResNet-200, single-scale) | 2021-06-30 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 53.0 | | | | | | | tiny-MOAT-2 (IN-1K pretraining, single-scale) | 2022-10-04 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 52.7 | | | | | | | MViT-L (Mask R-CNN, single-scale, IN21k pre-train) | 2021-12-02 |
ResNeSt: Split-Attention Networks | ✓ Link | 52.47 | 71.00 | 57.07 | 36.80 | 56.36 | 66.29 | | ResNeSt-200 (multi-scale) | 2020-04-19 |
Active Token Mixer | ✓ Link | 52.3 | | | | | | | ActiveMLP-B (Cascade Mask R-CNN) | 2022-03-11 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 52.2 | | | | | | | RetinaNet (SpineNet-190, 1536x1536) | 2019-12-10 |
EfficientDet: Scalable and Efficient Object Detection | ✓ Link | 52.1 | | | | | | | EfficientDet-D7 (1536) | 2019-11-20 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 51.9 | | | | | | | tiny-MOAT-1 (IN-1K pretraining, single-scale) | 2022-10-04 |
Global Context Networks | ✓ Link | 51.8 | 70.4 | 56.1 | | | | | GCNet (ResNeXt-101 + DCN + cascade + GC r4) | 2020-12-24 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 51.6 | 70.5 | 56.0 | | | | | ELSA-S (Cascade Mask RCNN) | 2021-12-23 |
Focal Modulation Networks | ✓ Link | 51.5 | 70.3 | 56.0 | | | | | FocalNet-T (LRF, Cascade Mask R-CNN) | 2022-03-22 |
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | ✓ Link | 51.3 | 69.1 | 56.0 | 34.5 | 54.2 | 65.8 | | DINO-5scale (24 epochs) | 2022-03-07 |
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | ✓ Link | 51.2 | 69.0 | 55.8 | 35.0 | 54.3 | 65.3 | | DINO-5scale (36 epochs) | 2022-03-07 |
ResNeSt: Split-Attention Networks | ✓ Link | 50.91 | 69.53 | 55.40 | 32.67 | 54.66 | 65.83 | | ResNeSt-200-DCN (single-scale) | 2020-04-19 |
USB: Universal-Scale Object Detection Benchmark | ✓ Link | 50.9 | 69.5 | 55.4 | 33.5 | 55.5 | 65.8 | | UniverseNet-20.08d (Res2Net-101, DCN, single-scale) | 2021-03-25 |
ResNeSt: Split-Attention Networks | ✓ Link | 50.54 | 68.78 | 55.17 | | 54.2 | 63.9 | | ResNeSt-200 (single-scale) | 2020-04-19 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 50.5 | | | | | | | tiny-MOAT-0 (IN-1K pretraining, single-scale) | 2022-10-04 |
Masked Autoencoders Are Scalable Vision Learners | ✓ Link | 50.3 | | | | | | | MAE (ViT-B, Mask R-CNN) | 2021-11-11 |
PVT v2: Improved Baselines with Pyramid Vision Transformer | ✓ Link | 50.1 | 69.5 | 54.9 | | | | | Sparse R-CNN (PVTv2-B2) | 2021-06-25 |
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 50.0 | | | | | | | Pix2seq (ViT-L) | 2021-09-22 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 49.9 | | | | | | | DaViT-T (Mask R-CNN, 36 epochs) | 2022-04-07 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 49.7 | 71.3 | 54.6 | | | | | BoTNet 200 (Mask R-CNN, single-scale, 72 epochs) | 2021-01-27 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 49.5 | 71.0 | 54.2 | | | | | BoTNet 152 (Mask R-CNN, single-scale, 72 epochs) | 2021-01-27 |
DN-DETR: Accelerate DETR Training by Introducing Query DeNoising | ✓ Link | 49.5 | 67.6 | 53.8 | 31.3 | 52.6 | 65.4 | 47 | DN-Deformable-DETR-R50++ | 2022-03-02 |
Recurrent Glimpse-based Decoder for Detection with Transformer | ✓ Link | 49.1 | 67.5 | 53.1 | 30 | 52.6 | 65 | | REGO-Deformable DETR-X101 | 2021-12-09 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 48.6 | 67.8 | | | | | | CenterMask+VoVNet99 (multi-scale) | 2019-11-15 |
Rethinking ImageNet Pre-training | ✓ Link | 48.6 | 66.8 | 52.9 | | | | | Mask R-CNN (ResNeXt-152-FPN, cascade) | 2018-11-21 |
USB: Universal-Scale Object Detection Benchmark | ✓ Link | 48.5 | 67.0 | 52.6 | 30.6 | 52.7 | 62.7 | | UniverseNet-20.08 (Res2Net-50, DCN, single-scale) | 2021-03-25 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 48.5 | | | | | | | XCiT-M24/8 | 2021-06-17 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 48.3 | 70.4 | 52.9 | | | | | ELSA-S (Mask RCNN) | 2021-12-23 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 48.1 | | | | | | | XCiT-S24/8 | 2021-06-17 |
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ✓ Link | 47.9 | 66.9 | 52.2 | | | | | GCNet (ResNeXt-101 + DCN + cascade + GC r16) | 2019-04-25 |
MAE-DET: Revisiting Maximum Entropy Principle in Zero-Shot NAS for Efficient Object Detection | ✓ Link | 47.8 | 65.5 | 52.2 | 30.3 | 51.9 | 61.1 | | MAE-DET (MAE-Det-L + GFLV2) | 2021-11-26 |
Res2Net: A New Multi-scale Backbone Architecture | ✓ Link | 47.5 | 66.5 | 51.3 | 28.6 | 51.6 | 62.1 | | Res2Net101+HTC | 2019-04-02 |
Rethinking ImageNet Pre-training | ✓ Link | 47.4 | | | | | | | Mask R-CNN (ResNet-101-FPN, GN, Cascade) | 2018-11-21 |
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 47.3 | | | | | | | Pix2seq (R50-C4) | 2021-09-22 |
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 47.1 | | | | | | | Pix2seq (ViT-B) | 2021-09-22 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 47.0 | | | 28.8 | 50.3 | 62.2 | | HTC (HRNetV2p-W48) | 2019-08-20 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 47.0 | | | | | | | PatchConvNet-S120 (Mask R-CNN) | 2021-12-27 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 46.8 | | | | | | | RPDet (ResNeXt-101-DCN, multi-scale) | 2019-04-25 |
DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR | ✓ Link | 46.6 | 67.0 | 50.2 | 28.1 | 50.5 | 64.1 | 63 | DAB-DETR-DC5-R101 | 2022-01-28 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 46.5 | | | | | | | DyHead (ResNet-101) | 2021-06-15 |
Rethinking ImageNet Pre-training | ✓ Link | 46.4 | 67.1 | 51.1 | | | | | Mask R-CNN (ResNeXt-152-FPN) | 2018-11-21 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 46.4 | | | | | | | RPDet (ResNet-101-DCN, multi-scale) | 2019-04-25 |
Augmenting Convolutional networks with attention-based aggregation | ✓ Link | 46.4 | | | | | | | PatchConvNet-S60 (Mask R-CNN) | 2021-12-27 |
Deep Residual Learning for Image Recognition | ✓ Link | 46.3 | 64.3 | 50.5 | | | | | Cascade Mask R-CNN (ResNet-50) | 2015-12-10 |
HoughNet: Integrating near and long-range evidence for bottom-up object detection | ✓ Link | 46.1 | 64.6 | 50.3 | 30.0 | 48.8 | 59.7 | | HoughNet (HG-104, MS) | 2020-07-05 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 46.0 | | | 27.5 | | 60.1 | | Mask R-CNN (HRNetV2p-W48, cascade) | 2019-08-20 |
Conditional DETR for Fast Training Convergence | ✓ Link | 45.9 | 66.8 | 49.5 | 27.2 | 50.3 | 63.3 | 63 | Conditional DETR-DC5-R101 | 2021-08-13 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 45.9 | | | | | | | BoTNet 50 (72 epochs) | 2021-01-27 |
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals | ✓ Link | 45.6 | 64.6 | 49.5 | 28.3 | 48.3 | 61.6 | | Sparse R-CNN (ResNet-101, learnable proposals, random crop aug, FPN) | 2020-11-25 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 45.6 | | | 29.2 | | 58.8 | | CenterMask+VoVNetV2-99 (single-scale) | 2019-11-15 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 45.3 | | | 27.0 | 48.4 | 59.5 | | HTC (HRNetV2p-W32) | 2019-08-20 |
Anchor DETR: Query Design for Transformer-Based Object Detection | ✓ Link | 45.1 | 65.7 | 48.8 | 25.8 | 49.4 | 61.6 | | Anchor DETR-DC5-R101 | 2021-09-15 |
Conditional DETR for Fast Training Convergence | ✓ Link | 45.1 | 65.4 | 48.5 | 25.3 | 49.0 | 62.2 | 44 | Conditional DETR-DC5-R50 | 2021-08-13 |
Non-local Neural Networks | ✓ Link | 45.0 | 67.8 | 48.9 | | | | | Mask R-CNN (ResNeXt-152 + 1 NL) | 2017-11-21 |
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 45.0 | 63.2 | 48.6 | 28.2 | 48.9 | 60.4 | | Pix2seq (R101-DC5) | 2021-09-22 |
Attentive Normalization | ✓ Link | 44.9 | 66.2 | 49.1 | | | | | Mask R-CNN-FPN (AOGNet-40M) | 2019-08-04 |
End-to-End Object Detection with Transformers | ✓ Link | 44.9 | 64.7 | 47.7 | 23.7 | 49.5 | 62.3 | | DETR-DC5 (ResNet-101) | 2020-05-26 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 44.9 | | | 28.5 | | 57.7 | | Mask R-CNN (VoVNetV2-99, single-scale) | 2019-11-15 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 44.8 | 64.3 | 48.9 | 26.6 | 48.3 | 59.6 | | R3-CNN (ResNet-50-FPN, DCN) | 2021-04-03 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 44.8 | | | | | | | RPDet (ResNet-101-DCN, multi-scale train) | 2019-04-25 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 44.7 | | 47.6 | 29.9 | 48.0 | 58.1 | | RetinaNet (ViL-Base, multi-scale, 3x) | 2021-03-29 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 44.6 | 62.7 | 48.7 | 26.3 | 48.1 | 58.5 | | Cascade R-CNN (HRNetV2p-W48) | 2019-08-20 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 44.6 | | | 27.7 | 48.3 | | | CenterMask+VoVNetV2-57 (single-scale) | 2019-11-15 |
Conditional DETR for Fast Training Convergence | ✓ Link | 44.5 | 65.6 | 47.5 | 23.6 | 48.4 | 63.6 | 63 | Conditional DETR-R101 | 2021-08-13 |
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals | ✓ Link | 44.5 | 63.4 | 48.2 | 26.9 | 47.2 | 59.5 | | Sparse R-CNN (ResNet-50, learnable proposals, random crop aug, FPN) | 2020-11-25 |
Deep Residual Learning for Image Recognition | ✓ Link | 44.5 | 63.0 | 48.3 | | | | | GFL (ResNet-50) | 2015-12-10 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 44.5 | | | | | | | RPDet (ResNeXt-101-DCN) | 2019-04-25 |
CenterMask : Real-Time Anchor-Free Instance Segmentation | ✓ Link | 44.4 | | | 26.7 | | 57.1 | | CenterMask+X101-32x8d (single-scale) | 2019-11-15 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 44.3 | 65.5 | 47.1 | 28.9 | 47.9 | 58.3 | | RetinaNet (ViL-Base) | 2021-03-29 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 44.3 | 64.1 | 48.4 | 27 | 47.1 | 58.9 | | R3-CNN (ResNet-50-FPN, GC-Net) | 2021-04-03 |
Anchor DETR: Query Design for Transformer-Based Object Detection | ✓ Link | 44.2 | 64.7 | 47.5 | 24.7 | 48.2 | 60.6 | | Anchor DETR-DC5-R50 | 2021-09-15 |
DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR | ✓ Link | 44.1 | 64.7 | 47.2 | 24.1 | 48.2 | 62.9 | 63 | DAB-DETR-R101 | 2022-01-28 |
End-to-End Object Detection with Transformers | ✓ Link | 44.0 | 63.9 | 47.8 | 27.2 | 48.1 | 56.0 | | Faster RCNN-R101-FPN+ | 2020-05-26 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 43.7 | 61.7 | 47.7 | 25.6 | 46.5 | 57.4 | | Cascade R-CNN (HRNetV2p-W32) | 2019-08-20 |
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals | ✓ Link | 43.5 | 62.1 | 47.2 | 26.1 | 46.3 | 59.7 | | Sparse R-CNN (ResNet-101, FPN) | 2020-11-25 |
Deep Residual Learning for Image Recognition | ✓ Link | 43.5 | 61.9 | 47.0 | | | | | ATSS (ResNet-50) | 2015-12-10 |
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions | ✓ Link | 43.4 | 63.6 | 46.1 | 26.1 | 46.0 | 59.5 | | PVT-Large (RetinaNet 3x,MS) | 2021-02-24 |
Bottom-up Object Detection by Grouping Extreme and Center Points | ✓ Link | 43.3 | 59.6 | 46.8 | 25.7 | 46.6 | 59.4 | | ExtremeNet (Hourglass-104, multi-scale) | 2019-01-23 |
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 43.2 | 61.0 | 46.1 | 26.6 | 47.0 | 58.6 | | Pix2seq (R50-DC5) | 2021-09-22 |
Hybrid Task Cascade for Instance Segmentation | ✓ Link | 43.2 | 59.4 | 40.7 | 20.3 | 40.9 | 52.3 | | HTC (cascade) | 2019-01-22 |
Micro-Batch Training with Batch-Channel Normalization and Weight Standardization | ✓ Link | 43.12 | 64.15 | 47.11 | 25.49 | 47.19 | 56.39 | | Mask R-CNN-FPN (ResNeXt-101, GN+WS) | 2019-03-25 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 43.1 | | | 26.6 | 46.0 | | | HTC (HRNetV2p-W18) | 2019-08-20 |
Deformable ConvNets v2: More Deformable, Better Results | ✓ Link | 43.1 | | | | | | | Mask R-CNN (ResNet-101, DCNv2) | 2018-11-27 |
Conditional DETR for Fast Training Convergence | ✓ Link | 43.0 | 64.0 | 45.7 | 22.7 | 46.7 | 61.5 | 44 | Conditional DETR-R50 | 2021-08-13 |
HoughNet: Integrating near and long-range evidence for bottom-up object detection | ✓ Link | 43.0 | 62.2 | 46.9 | 25.5 | 47.6 | 55.8 | | HoughNet (HG-104) | 2020-07-05 |
X-volution: On the unification of convolution and self-attention | | 42.8 | 64.0 | 46.4 | 26.9 | 46.0 | 55.0 | | Faster R-CNN (FPN, X-volution) | 2021-06-04 |
Cascade R-CNN: Delving into High Quality Object Detection | ✓ Link | 42.7 | 61.6 | 46.6 | 23.8 | 46.2 | 57.4 | | Cascade R-CNN (ResNet-101-FPN+, cascade) | 2017-12-03 |
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions | ✓ Link | 42.6 | 63.7 | 45.4 | 25.8 | 46.0 | 58.4 | | PVT-Large (RetinaNet 1x) | 2021-02-24 |
CornerNet-Lite: Efficient Keypoint Based Object Detection | ✓ Link | 42.6 | | | 25.5 | 44.3 | 58.4 | | CornerNet-Saccade (Hourglass-54) | 2019-04-18 |
Pix2seq: A Language Modeling Framework for Object Detection | ✓ Link | 42.6 | | | | | | | Pix2seq (R50) | 2021-09-22 |
Group Normalization | ✓ Link | 42.3 | 62.8 | 46.2 | | | | | Mask R-CNN (ResNet-101-FPN, GroupNorm, long) | 2018-03-22 |
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals | ✓ Link | 42.3 | 61.2 | 45.7 | 26.7 | 44.6 | 57.6 | | Sparse R-CNN (ResNet-50, FPN) | 2020-11-25 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 42.3 | | | 25.0 | 45.4 | | | Mask R-CNN (HRNetV2p-W32) | 2019-08-20 |
Rethinking and Improving Relative Position Encoding for Vision Transformer | ✓ Link | 42.3 | | | | | | | DETR-ResNet50 with iRPE-K (300 epochs) | 2021-07-29 |
Scale-Aware Trident Networks for Object Detection | ✓ Link | 42.0 | 63.5 | 45.5 | 24.9 | 47.0 | 56.9 | | TridentNet (ResNet-101) | 2019-01-07 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 42.0 | 61.0 | 46.3 | 24.5 | 45.2 | 55.7 | | R3-CNN (ResNet-50-FPN) | 2021-04-03 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 41.8 | 62.8 | 45.9 | | 44.7 | 54.6 | | Faster R-CNN (HRNetV2p-W48) | 2019-08-20 |
LIP: Local Importance-based Pooling | ✓ Link | 41.7 | 63.6 | 45.6 | 25.2 | 45.8 | | | Faster R-CNN (LIP-ResNet-101) | 2019-08-12 |
Deformable ConvNets v2: More Deformable, Better Results | ✓ Link | 41.7 | | | 22.2 | 45.8 | 58.7 | | Faster R-CNN (ResNet-101, DCNv2) | 2018-11-27 |
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 41.6 | 62.4 | | | | | | FSAF (ResNeXt-101, anchor-based branches) | 2019-03-02 |
CornerNet-Lite: Efficient Keypoint Based Object Detection | ✓ Link | 41.4 | | | 23.8 | 43.5 | 57.1 | | CornerNet-Saccade (Hourglass-104) | 2019-04-18 |
Grid R-CNN | ✓ Link | 41.3 | 60.3 | 44.4 | 23.4 | 45.8 | 54.1 | | Grid R-CNN (ResNet-101-FPN) | 2018-11-29 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 41.3 | 59.2 | 44.9 | 23.7 | 44.2 | 54.1 | | Cascade R-CNN (HRNetV2p-W18) | 2019-08-20 |
CenterNet: Keypoint Triplets for Object Detection | ✓ Link | 41.3 | 59.2 | 43.9 | 23.6 | 43.8 | 55.8 | | CenterNet511 (Hourglass-52) | 2019-04-17 |
RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free | ✓ Link | 41.1 | 60.2 | 44.1 | | | | | RetinaMask (ResNet-101-FPN) | 2019-01-10 |
MetaFormer Is Actually What You Need for Vision | ✓ Link | 41.0 | 63.1 | 44.8 | | | | | PoolFormer-S36 (Mask R-CNN) | 2021-11-22 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 40.9 | 61.8 | 44.8 | 24.4 | 43.7 | 53.3 | | Faster R-CNN (HRNetV2p-W32) | 2019-08-20 |
VirTex: Learning Visual Representations from Textual Annotations | ✓ Link | 40.9 | | | | | | | VirTex Mask R-CNN (ResNet-50-FPN) | 2020-06-11 |
Non-local Neural Networks | ✓ Link | 40.8 | 63.1 | 44.5 | | | | | Mask R-CNN (ResNet-101 + 1 NL) | 2017-11-21 |
Group Normalization | ✓ Link | 40.8 | 61.6 | 44.4 | | | | | Mask R-CNN (ResNet-50-FPN, GroupNorm, long) | 2018-03-22 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 40.8 | | | | | | | RPDet (ResNet-50, multi-scale train) | 2019-04-25 |
Rethinking and Improving Relative Position Encoding for Vision Transformer | ✓ Link | 40.8 | | | | | | | DETR-ResNet50 with iRPE-K (150 epochs) | 2021-07-29 |
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | ✓ Link | 40.7 | 60.7 | 43.3 | | | | | Faster R-CNN+aLRP Loss (ResNet-50, 500 scale) | 2020-09-28 |
Reducing Label Noise in Anchor-Free Object Detection | ✓ Link | 40.5 | 59.5 | 44.2 | 25.4 | 44.7 | 52.3 | | PPDet (ResNet-101-FPN) | 2020-08-03 |
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ✓ Link | 40.3 | 62.4 | 44.0 | 24.2 | 44.4 | 52.5 | | GCNet (ResNet-50-FPN, GRoIE) | 2019-04-25 |
Group Normalization | ✓ Link | 40.3 | 61.0 | 44.0 | | | | | Mask R-CNN (ResNet-50-FPN, GroupNorm) | 2018-03-22 |
Cascade R-CNN: Delving into High Quality Object Detection | ✓ Link | 40.3 | 59.4 | 43.7 | 22.9 | 43.7 | 54.1 | | Cascade R-CNN (ResNet-50-FPN+) | 2017-12-03 |
Bottom-up Object Detection by Grouping Extreme and Center Points | ✓ Link | 40.3 | 55.1 | 43.7 | 21.6 | 44.0 | 56.1 | | ExtremeNet (Hourglass-104, single-scale) | 2019-01-23 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 40.3 | | | | | | | RPDet (ResNet-101) | 2019-04-25 |
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | ✓ Link | 40.2 | 60.3 | 42.3 | | | | | RetinaNet+aLRP Loss (ResNet-50, 500 scale) | 2020-09-28 |
Mask R-CNN | ✓ Link | 40.0 | | | | | | | Mask R-CNN (ResNet-101-FPN) | 2017-03-20 |
Feature Pyramid Networks for Object Detection | ✓ Link | 39.8 | 61.3 | 43.3 | 22.9 | 43.3 | 52.6 | | FPN+ | 2016-12-09 |
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | ✓ Link | 39.7 | 58.8 | 41.5 | | | | | FoveaBox+aLRP Loss (ResNet-50, 500 scale) | 2020-09-28 |
Grid R-CNN | ✓ Link | 39.6 | 58.3 | 42.4 | 22.6 | 43.8 | 51.5 | | Grid R-CNN (ResNet-50-FPN) | 2018-11-29 |
Adaptively Connected Neural Networks | ✓ Link | 39.5 | | | | | | | Mask R-CNN (ResNet-50, ACNet) | 2019-04-07 |
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 39.3 | 59.2 | | | | | | FSAF (ResNet-101, anchor-based branches) | 2019-03-02 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 39.2 | | | | 41.7 | 51.0 | | Mask R-CNN (HRNetV2p-W18) | 2019-08-20 |
Non-local Neural Networks | ✓ Link | 39.0 | 61.1 | 41.9 | | | | | Mask R-CNN (ResNet-50 + 1 NL) | 2017-11-21 |
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 38.9 | 58.4 | 41.5 | 22.3 | 43.5 | 51.7 | | FoveaBox (ResNet-101-FPN, 800x800) | 2019-04-08 |
FCOS: Fully Convolutional One-Stage Object Detection | ✓ Link | 38.6 | 57.4 | 41.4 | 22.3 | 42.5 | 49.8 | | FCOS (ResNet-50-FPN + improvements) | 2019-04-02 |
RepPoints: Point Set Representation for Object Detection | ✓ Link | 38.6 | | | | | | | RPDet (ResNet-50) | 2019-04-25 |
Libra R-CNN: Towards Balanced Learning for Object Detection | ✓ Link | 38.5 | 59.3 | 42.0 | 22.9 | 42.1 | 50.5 | | Libra R-CNN (ResNet-50 FPN) | 2019-04-04 |
A novel Region of Interest Extraction Layer for Instance Segmentation | ✓ Link | 38.4 | 59.9 | 41.7 | 22.9 | 42.1 | 49.7 | | Mask R-CNN (ResNet-50-FPN, GRoIE) | 2020-04-28 |
CornerNet: Detecting Objects as Paired Keypoints | ✓ Link | 38.4 | 53.8 | 40.9 | 18.6 | 40.5 | 51.8 | | CornerNet511 (Hourglass-104) | 2018-08-03 |
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 38.1 | 57.8 | 40.5 | | | | | FoveaBox+Retina (ResNet-50) | 2019-04-08 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 38.0 | 58.9 | 41.5 | 22.6 | 40.8 | 49.6 | | Faster R-CNN (HRNetV2p-W18) | 2019-08-20 |
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 38.0 | 57.8 | 40.2 | 19.5 | 42.2 | 52.7 | | FoveaBox (ResNet-101-FPN, 600x600) | 2019-04-08 |
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 37.9 | 58.0 | | | | | | FSAF (ResNet-101) | 2019-03-02 |
Mask R-CNN | ✓ Link | 37.7 | | | | | | | Mask R-CNN (ResNet-50-FPN) | 2017-03-20 |
A novel Region of Interest Extraction Layer for Instance Segmentation | ✓ Link | 37.5 | 59.2 | 40.6 | 22.3 | 41.5 | 47.8 | | Faster R-CNN (ResNet-50-FPN, GRoIE) | 2020-04-28 |
Mask R-CNN | ✓ Link | 36.7 | 59.5 | 38.9 | | | | | Mask R-CNN (ResNeXt-101-FPN) | 2017-03-20 |
FoveaBox: Beyond Anchor-based Object Detector | ✓ Link | 36.0 | 55.2 | 37.9 | 18.6 | 39.4 | 50.5 | | FoveaBox (ResNet-50-FPN, 600x600) | 2019-04-08 |
Feature Selective Anchor-Free Module for Single-Shot Object Detection | ✓ Link | 35.9 | 55.0 | 37.9 | 19.8 | 39.6 | 48.2 | | FSAF (ResNet-50) | 2019-03-02 |
Gradient Harmonized Single-stage Detector | ✓ Link | 35.8 | 55.5 | 38.1 | 19.6 | 39.6 | 46.7 | | GHM-C + GHM-R (RetinaNet-FPN-ResNet-50, M=30) | 2018-11-13 |
Generating Positive Bounding Boxes for Balanced Training of Object Detectors | ✓ Link | 35.6 | 55.3 | | | | | | Online Fg Bal. Sampling+Hard Negative Mining (ResNet-50) | 2019-09-21 |
M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network | ✓ Link | 34.1 | 53.7 | | 15.9 | 39.5 | 49.3 | | M2Det (ResNet-101, 320x320) | 2018-11-12 |
Res2Net: A New Multi-scale Backbone Architecture | ✓ Link | 33.7 | 53.6 | | 14.0 | 38.3 | 51.1 | | Faster R-CNN (Res2Net-50) | 2019-04-02 |
M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network | ✓ Link | 33.2 | 52.2 | | 15.0 | 38.2 | 49.1 | | M2Det (VGG-16, 320x320) | 2018-11-12 |
SOLQ: Segmenting Objects by Learning Queries | ✓ Link | | 74.9 | 61.3 | | | 71.9 | | SOLQ (Swin-L, single-scale) | 2021-06-04 |
You Only Learn One Representation: Unified Network for Multiple Tasks | ✓ Link | | 73.5 | 60.6 | 40.4 | 60.1 | 68.7 | | YOLOR-D6 (1280, single-scale, 31 fps) | 2021-05-10 |
EfficientDet: Scalable and Efficient Object Detection | ✓ Link | | 73.4 | 59.0 | 40.0 | 58.0 | 67.9 | | EfficientDet-D7x (single-scale) | 2019-11-20 |
You Only Learn One Representation: Unified Network for Multiple Tasks | ✓ Link | | 70.6 | 57.4 | 37.4 | 57.3 | 65.2 | | YOLOR-P6 (1280, single-scale, 72 fps) | 2021-05-10 |
Focal Modulation Networks | ✓ Link | | 70.1 | 55.8 | | | | | FocalNet-T (SRF, Cascade Mask R-CNN) | 2022-03-22 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | | 61.2 | 45.6 | 24.4 | | | | R3-CNN (ResNet-50-FPN, GRoIE) | 2021-04-03 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | | | | 26.1 | 47.9 | | | Mask R-CNN (HRNetV2p-W32, cascade) | 2019-08-20 |
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism | ✓ Link | | | | | 42.3 | | | Shift-T | 2022-01-26 |
Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | | | | | | 66.3 | | DyHead (ResNeXt-64x4d-101-DCN, multi-scale) | 2021-06-15 |