Paper | Code | mask AP | AP50 | AP75 | APL | APM | APS | FLOPs (G) | Params (M) | box AP | Model | Date
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
DETRs with Collaborative Hybrid Assignments Training | ✓ Link | 56.6 | 79.7 | 62.8 | 74.6 | 59.7 | 38.9 | | | | Co-DETR | 2022-11-22 |
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions | ✓ Link | 55.9 | | | | | | | | | ViT-CoMer-L (Mask RCNN, DINOv2) | 2024-03-13 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 55.4 | 80.1 | 61.5 | 74.4 | 58.4 | 37.9 | | | | InternImage-H | 2022-11-10 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 55.0 | 79.4 | 60.9 | 72.0 | 58.4 | 37.6 | | | | EVA | 2022-11-14 |
Mask Frozen-DETR: High Quality Instance Segmentation with One GPU | | 54.9 | 78.9 | 60.8 | 72.9 | 58.4 | 37.2 | | | | Mask Frozen-DETR | 2023-08-07 |
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ Link | 54.5 | | | | | | | | | Mask DINO (SwinL, multi-scale) | 2022-06-06 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 54.2 | | | | | | | | | ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) | 2022-05-17 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 54.2 | | | | | | | | | GLEE-Pro | 2023-12-14 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 53.7 | | | | | | | | | SwinV2-G (HTC++) | 2021-11-18 |
Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | 53.1 | | | | | | | | | ViTDet, ViT-H Cascade (multiscale) | 2022-03-30 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 53.0 | | | | | | | | | GLEE-Plus | 2023-12-14 |
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | ✓ Link | 52.6 | | | | | | | | | Mask DINO (SwinL) | 2022-06-06 |
End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ Link | 52.5 | | | | | | | | | Soft Teacher + Swin-L(HTC++, multi-scale) | 2021-06-16 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 52.5 | | | | | | | | | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | 2022-05-17 |
Vision Transformer Adapter for Dense Predictions | ✓ Link | 52.2 | | | | | | | | | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | 2022-05-17 |
Exploring Plain Vision Transformer Backbones for Object Detection | ✓ Link | 52.0 | | | | | | | | | ViTDet, ViT-H Cascade | 2022-03-30 |
End-to-End Semi-Supervised Object Detection with Soft Teacher | ✓ Link | 51.9 | | | | | | | | | Soft Teacher + Swin-L(HTC++, single-scale) | 2021-06-16 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 51.8 | | | | | | | | | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 2021-07-01 |
Could Giant Pretrained Image Models Extract Universal Representations? | | 51.6 | | | | | | | | | Frozen Backbone, SwinV2-G-ext22K (HTC) | 2022-11-03 |
CBNet: A Composite Backbone Network Architecture for Object Detection | ✓ Link | 51.0 | | | | | | | | | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 2021-07-01 |
Focal Self-attention for Local-Global Interactions in Vision Transformers | ✓ Link | 50.9 | | | | | | | | | Focal-L (HTC++, multi-scale) | 2021-07-01 |
Dilated Neighborhood Attention Transformer | ✓ Link | 50.8 | 75.0 | | | | | | | | DiNAT-L (single-scale, Mask2Former) | 2022-09-29 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 50.5 | | | | | | | | | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | 2021-12-02 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 50.4 | | | | | | | | | Swin-L (HTC++, multi scale) | 2021-03-25 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 50.3 | | | | | | | | | MOAT-3 (IN-22K pretraining, single-scale) | 2022-10-04 |
Masked-attention Mask Transformer for Universal Image Segmentation | ✓ Link | 50.1 | | | | | | | | | Mask2Former (Swin-L) | 2021-12-02 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ✓ Link | 49.5 | | | | | | | | | Swin-L (HTC++, single scale) | 2021-03-25 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 49.3 | | | | | | | | | MOAT-2 (IN-22K pretraining, single-scale) | 2022-10-04 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 49.0 | | | | | | | | | MOAT-1 (IN-1K pretraining, single-scale) | 2022-10-04 |
Instances as Queries | ✓ Link | 48.9 | 74.0 | 53.9 | 68.3 | 52.6 | 30.8 | | | | QueryInst (single scale) | 2021-05-05 |
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 48.9 | | | | | | | | | Cascade Eff-B7 NAS-FPN (1280, self-training Copy Paste, single-scale) | 2020-12-13 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 48.8 | | | | | | 1782 | 387 | | InternImage-XL | 2022-11-10 |
X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion | ✓ Link | 48.8 | | | | | | | | | CenterNet2 (Swin-L w/ X-Paste + Copy-Paste) | 2022-12-07 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 48.6 | | | | | | | | | Hiera-L | 2023-06-01 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 48.5 | | | | | | 1399 | 277 | 56.1 | InternImage-L | 2022-11-10 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 48.5 | | | | | | | | | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) | 2021-12-02 |
General Object Foundation Model for Images and Videos at Scale | ✓ Link | 48.4 | | | | | | | | | GLEE-Lite | 2023-12-14 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 47.4 | | | | | | | | | MOAT-0 (IN-1K pretraining, single-scale) | 2022-10-04 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 47.1 | | | | | | | | | MViTv2-L (Cascade Mask R-CNN, single-scale) | 2021-12-02 |
MPViT: Multi-Path Vision Transformer for Dense Prediction | ✓ Link | 47.0 | | | | | | | | | MPViT-B (Cascade Mask R-CNN, multi-scale, IN1k pre-train) | 2021-12-21 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 47.0 | | | | | | | | | tiny-MOAT-3 (IN-1K pretraining, single-scale) | 2022-10-04 |
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation | ✓ Link | 46.8 | | | | | | | | | Cascade Eff-B7 NAS-FPN (1280) | 2020-12-13 |
ResNeSt: Split-Attention Networks | ✓ Link | 46.25 | | | | | | | | | ResNeSt-200 (multi-scale) | 2020-04-19 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 46.2 | | | | | | | | | MViT-L (Mask R-CNN, single-scale) | 2021-12-02 |
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization | ✓ Link | 46.1 | | | | | | | | | RetinaNet (SpineNet-190, 1536x1536) | 2019-12-10 |
MPViT: Multi-Path Vision Transformer for Dense Prediction | ✓ Link | 45.8 | | | | | | | | | MPViT-B (Cascade R-CNN, single-scale, IN-1K pre-train) | 2021-12-21 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 45.7 | | 49.9 | | | | | | | Mask R-CNN (ViL Base, multi-scale, 3x lr) | 2021-03-29 |
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ✓ Link | 45.1 | 67.2 | 49.3 | | | | | | | Mask R-CNN (ViL Base, 1x lr) | 2021-03-29 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 45.0 | | | | | | | | | tiny-MOAT-2 (IN-1K pretraining, single-scale) | 2022-10-04 |
Global Context Networks | ✓ Link | 44.7 | 67.9 | 48.4 | | | | | | | GCNet (ResNeXt-101 + DCN + cascade + GC r4) | 2020-12-24 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 44.6 | | | | | | | | | tiny-MOAT-1 (IN-1K pretraining, single-scale) | 2022-10-04 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 44.5 | | | | | | 340 | 69 | 49.7 | InternImage-S | 2022-11-10 |
ResNeSt: Split-Attention Networks | ✓ Link | 44.5 | | | | | | | | | ResNeSt-200-DCN (single-scale) | 2020-04-19 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 44.4 | 67.8 | 47.8 | | | | | | | ELSA-S (Cascade Mask RCNN) | 2021-12-23 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 44.4 | | | | | | | | | BoTNet 200 (Mask R-CNN, single scale, 72 epochs) | 2021-01-27 |
DaViT: Dual Attention Vision Transformers | ✓ Link | 44.3 | | | | | | | | | DaViT-T (Mask R-CNN, 36 epochs) | 2022-04-07 |
ResNeSt: Split-Attention Networks | ✓ Link | 44.21 | | | | | | | | | ResNeSt-200 (single-scale) | 2020-04-19 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | 43.7 | | | | | | 270 | 49 | 49.1 | InternImage-T | 2022-11-10 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 43.7 | | | | | | | | | BoTNet 152 (Mask R-CNN, single scale, 72 epochs) | 2021-01-27 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 43.7 | | | | | | | | | XCiT-M24/8 | 2021-06-17 |
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | ✓ Link | 43.3 | | | | | | | | | tiny-MOAT-0 (IN-1K pretraining, single-scale) | 2022-10-04 |
ELSA: Enhanced Local Self-Attention for Vision Transformer | ✓ Link | 43.0 | 67.3 | 46.4 | | | | | | | ELSA-S (Mask RCNN) | 2021-12-23 |
XCiT: Cross-Covariance Image Transformers | ✓ Link | 43.0 | | | | | | | | | XCiT-S24/8 | 2021-06-17 |
CenterMask: Real-Time Anchor-Free Instance Segmentation | ✓ Link | 42.5 | | | | | | | | | CenterMask-VoVNetV2-99 (multi-scale) | 2019-11-15 |
ResNeSt: Split-Attention Networks | ✓ Link | 41.56 | | | | | | | | | ResNeSt-101 (single-scale) | 2020-04-19 |
Scaling up Multi-domain Semantic Segmentation with Sentence Embeddings | | 41.4 | | | | | | | | | SIW | 2022-02-04 |
Res2Net: A New Multi-scale Backbone Architecture | ✓ Link | 41.3 | | | | | | | | | Res2Net-101+HTC | 2019-04-02 |
Deep High-Resolution Representation Learning for Human Pose Estimation | ✓ Link | 41.0 | | | | | | | | | HTC (HRNetV2p-W48) | 2019-02-25 |
Deep High-Resolution Representation Learning for Visual Recognition | ✓ Link | 41.0 | | | | | | | | | HTC (HRNetV2p-W48) | 2019-08-20 |
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ✓ Link | 40.9 | | | | | | | | | GCNet (ResNeXt-101 + DCN + cascade + GC r16) | 2019-04-25 |
Bottleneck Transformers for Visual Recognition | ✓ Link | 40.7 | | | | | | | | | BoTNet 50 (72 epochs) | 2021-01-27 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 40.4 | 61.3 | 44.0 | 56.1 | 43.6 | 22.3 | | | | R3-CNN (ResNet-50-FPN, DCN) | 2021-04-03 |
Non-local Neural Networks | ✓ Link | 40.3 | | | | | | | | | Mask R-CNN (ResNext-152, +1 NL) | 2017-11-21 |
Attentive Normalization | ✓ Link | 40.2 | 63.2 | 43.3 | | | | | | | Mask R-CNN-FPN (AOGNet-40M) | 2019-08-04 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 40.2 | 61.1 | 43.5 | | 42.8 | 22.6 | | | | R3-CNN (ResNet-50-FPN, GC-Net) | 2021-04-03 |
CenterMask: Real-Time Anchor-Free Instance Segmentation | ✓ Link | 40.2 | | | | | | | | | CenterMask-VoVNetV2-99-3x | 2019-11-15 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 39.1 | 58.8 | 42.3 | 54.3 | 42.1 | 20.7 | | | | R3-CNN (ResNet-50-FPN, GRoIE) | 2021-04-03 |
Mask Scoring R-CNN | ✓ Link | 39.1 | | | | | | | | | Mask Scoring R-CNN (ResNet-101-FPN-DCN) | 2019-03-01 |
Micro-Batch Training with Batch-Channel Normalization and Weight Standardization | ✓ Link | 38.34 | 61.07 | 40.82 | 56.08 | 41.73 | 18.32 | | | | Mask R-CNN-FPN (ResNeXt-101, GN+WS) | 2019-03-25 |
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing | ✓ Link | 38.2 | 58.0 | 41.4 | 52.8 | 41.0 | 20.4 | | | | R3-CNN (ResNet-50-FPN) | 2021-04-03 |
Hybrid Task Cascade for Instance Segmentation | ✓ Link | 38.2 | | | | | | | | | HTC (ResNet-50) | 2019-01-22 |
Mask Scoring R-CNN | ✓ Link | 38.2 | | | | | | | | | Mask Scoring R-CNN (ResNet-101 FPN) | 2019-03-01 |
Path Aggregation Network for Instance Segmentation | ✓ Link | 37.8 | | | | | | | | | PANet (ResNet-50) | 2018-03-05 |
A novel Region of Interest Extraction Layer for Instance Segmentation | ✓ Link | 37.2 | 59.3 | 39.8 | 51.2 | 41.0 | 20.2 | | | | GCNet (ResNet-50-FPN, GRoIE) | 2020-04-28 |
X-volution: On the unification of convolution and self-attention | | 37.2 | | | 53.1 | 40.0 | 19.2 | | | | Mask R-CNN (FPN, X-volution, SA) | 2021-06-04 |
Non-local Neural Networks | ✓ Link | 37.1 | | | | | | | | | Mask R-CNN (ResNet-101, +1 NL) | 2017-11-21 |
Mask Scoring R-CNN | ✓ Link | 36.0 | | | | | | | | | Mask Scoring R-CNN (ResNet-50 FPN) | 2019-03-01 |
A novel Region of Interest Extraction Layer for Instance Segmentation | ✓ Link | 35.8 | 57.1 | 38.0 | 48.7 | 39.0 | 19.1 | | | | Mask R-CNN (ResNet-50-FPN, GRoIE) | 2020-04-28 |
Res2Net: A New Multi-scale Backbone Architecture | ✓ Link | 35.6 | 57.6 | | 53.7 | 37.9 | 15.7 | | | | Faster R-CNN (Res2Net-50) | 2019-04-02 |
Non-local Neural Networks | ✓ Link | 35.5 | | | | | | | | | Mask R-CNN (ResNet-50, +1 NL) | 2017-11-21 |
Adaptively Connected Neural Networks | ✓ Link | 35.2 | | | | | | | | | Mask R-CNN (ResNet-50, ACNet) | 2019-04-07 |
YOLACT: Real-time Instance Segmentation | ✓ Link | 29.9 | | | | | | | | | YOLACT-550 (ResNet-50) | 2019-04-04 |
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | ✓ Link | | | | | | | 501 | 115 | | InternImage-B | 2022-11-10 |
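All scores in the table are COCO instance-segmentation mask AP: precision averaged over IoU thresholds 0.50:0.05:0.95, with AP50 and AP75 the single entries at those fixed thresholds (the real evaluation uses pycocotools against val2017 annotations). As a minimal sketch on toy masks, the IoU test that decides whether a predicted mask matches a ground-truth one looks like:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

# COCO mask AP averages over these 10 IoU thresholds;
# AP50 and AP75 are the entries at 0.50 and 0.75.
IOU_THRESHOLDS = np.linspace(0.50, 0.95, 10)

# Toy example: two overlapping 4x4 squares on an 8x8 grid.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[2:6, 3:7] = True

iou = mask_iou(pred, gt)             # intersection 12, union 20 -> 0.6
hits = IOU_THRESHOLDS <= iou + 1e-9  # True at 0.50, 0.55, 0.60
```

At IoU 0.6 this prediction counts as a true positive for AP50 but not AP75, which is why the AP50 column runs well ahead of the overall mask AP throughout the table.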