| EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 57.8 | 28.86 | EVA | 2022-11-14 |
| NMS Strikes Back | ✓ Link | 48.5 | 20.15 | DETA (Swin-L) | 2022-12-12 |
| Grounded Language-Image Pre-training | ✓ Link | 48.0 | 24.89 | GLIP-L (Swin-L) | 2021-12-07 |
| GRiT: A Generative Region-to-text Transformer for Object Understanding | ✓ Link | 42.9 | 15.72 | GRiT (ViT-H) | 2022-12-01 |
| DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | ✓ Link | 42.1 | 15.76 | DINO (Swin-L) | 2022-03-07 |
| CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 39.0 | 12.36 | CBNetV2 (Swin-L) | 2021-07-01 |
| A ConvNet for the 2020s | ✓ Link | 37.5 | 12.68 | ConvNeXt-XL (Cascade Mask R-CNN) | 2022-01-10 |
| InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 37.0 | 11.72 | InternImage-L (Cascade Mask R-CNN) | 2022-11-10 |
| Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 35.3 | 10.00 | DyHead (Swin-L) | 2021-06-15 |
| Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | 34.3 | | ViTDet (ViT-H) | 2022-03-30 |
| Vision Transformer Adapter for Dense Predictions | ✓ Link | 34.25 | 7.79 | ViT-Adapter (BEiTv2-L) | 2022-05-17 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | ✓ Link | 33.7 | 11.43 | FIBER-B (Swin-B) | 2022-06-15 |
| Instances as Queries | ✓ Link | 33.2 | 8.26 | QueryInst (Swin-L) | 2021-05-05 |
| YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications | ✓ Link | 32.5 | 6.73 | YOLOv6-L6 | 2022-09-07 |
| YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors | ✓ Link | 32.0 | 6.42 | YOLOv7-E6E | 2022-07-06 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 30.9 | 5.62 | MViTV2-H (Cascade Mask R-CNN) | 2021-12-02 |
| Robust and Accurate Object Detection via Adversarial Learning | ✓ Link | 30.8 | 7.34 | Det-AdvProp (EfficientNet-B5) | 2021-03-23 |
| YOLOv4: Optimal Speed and Accuracy of Object Detection | ✓ Link | 30.4 | 5.89 | YOLOv4-P6 | 2020-04-23 |
| YOLOX: Exceeding YOLO Series in 2021 | ✓ Link | 30.3 | 7.26 | YOLOX-X | 2021-07-18 |
| Probabilistic two-stage detection | ✓ Link | 29.5 | 4.29 | CenterNet2 (R2-101-DCN) | 2021-03-12 |
| Grounded Language-Image Pre-training | ✓ Link | 29.1 | 8.11 | GLIP-T (Swin-T) | 2021-12-07 |
| EfficientDet: Scalable and Efficient Object Detection | ✓ Link | 28.5 | 5.44 | EfficientDet-D5 (EfficientNet-B5) | 2019-11-20 |
| PVT v2: Improved Baselines with Pyramid Vision Transformer | ✓ Link | 28.2 | 6.85 | PVTv2-B5 (Mask R-CNN) | 2021-06-25 |
| VarifocalNet: An IoU-aware Dense Object Detector | ✓ Link | 28.0 | 5.27 | VFNet (RX-101-64x4d) | 2020-08-31 |
| GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ✓ Link | 26.0 | 4.38 | GCNet (RX-101-32x4d-DCN) | 2019-04-25 |
| Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection | ✓ Link | 25.1 | 2.6 | GFLv2 (R2-101-DCN) | 2020-11-25 |
| RepPoints V2: Verification Meets Regression for Object Detection | ✓ Link | 24.9 | 2.7 | RepPointsV2 (RX-101-64x4d-DCN) | 2020-07-16 |
| USB: Universal-Scale Object Detection Benchmark | ✓ Link | 24.8 | | UniverseNet (R2-101-DCN) | 2021-03-25 |
| YOLOX: Exceeding YOLO Series in 2021 | ✓ Link | 20.6 | 2.48 | YOLOX-S | 2021-07-18 |
| You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection | ✓ Link | 20.0 | 1.05 | YOLOS-B (ViT-B) | 2021-06-01 |
| Dynamic Head: Unifying Object Detection Heads with Attentions | ✓ Link | 19.3 | 0.16 | DyHead (ResNet-50) | 2021-06-15 |
| Hybrid Task Cascade for Instance Segmentation | ✓ Link | 19.1 | 0.08 | HTC (ResNet-50) | 2019-01-22 |
| Deformable DETR: Deformable Transformers for End-to-End Object Detection | ✓ Link | 18.5 | -1.49 | Deformable-DETR (ResNet-50) | 2020-10-08 |
| Cascade R-CNN: High Quality Object Detection and Instance Segmentation | ✓ Link | 18.2 | 0.02 | Cascade R-CNN (ResNet-50) | 2019-06-24 |
| Mask R-CNN | ✓ Link | 17.1 | | Mask R-CNN (ResNet-50) | 2017-03-20 |
| End-to-End Object Detection with Transformers | ✓ Link | 17.1 | -1.82 | DETR (ResNet-50) | 2020-05-26 |
| Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection | ✓ Link | 16.8 | -0.91 | ATSS (ResNet-50) | 2019-12-05 |
| FCOS: Fully Convolutional One-Stage Object Detection | ✓ Link | 16.7 | 0.25 | FCOS (ResNet-50) | 2019-04-02 |
| Focal Loss for Dense Object Detection | ✓ Link | 16.6 | 0.18 | RetinaNet (ResNet-50) | 2017-08-07 |
| Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks | ✓ Link | 16.4 | -0.41 | Faster R-CNN (ResNet-50-FPN) | 2015-06-04 |
| YOLOv3: An Incremental Improvement | ✓ Link | 14.8 | -0.37 | YOLOv3 (DarkNet-53) | 2018-04-08 |
| SSD: Single Shot MultiBox Detector | ✓ Link | 13.6 | 0.36 | SSD (VGG-16) | 2015-12-08 |
| Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | | 7.89 | ViTDet (ViT-H) | 2022-03-30 |
| USB: Universal-Scale Object Detection Benchmark | ✓ Link | | 1.86 | UniverseNet (R2-101-DCN) | 2021-03-25 |
| Mask R-CNN | ✓ Link | | -0.11 | Mask R-CNN (ResNet-50) | 2017-03-20 |