Paper | Code | mAP | Model | Date |
--- | --- | --- | --- | --- |
On the Benefits of 3D Pose and Tracking for Human Action Recognition | ✓ Link | 45.1 | LART (Hiera-H, K700 PT+FT) | 2023-04-03 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 43.3 | Hiera-H (K700 PT+FT) | 2023-06-01 |
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 42.6 | VideoMAE V2-g | 2023-03-29 |
End-to-End Spatio-Temporal Action Localisation with Video Transformers | | 41.7 | STAR/L | 2023-04-24 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 41.1 | MVD (Kinetics400 pretrain+finetune, ViT-H, 16x4) | 2022-12-08 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 41.01 | InternVideo | 2022-12-06 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 40.1 | MVD (Kinetics400 pretrain, ViT-H, 16x4) | 2022-12-08 |
Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 39.8 | MaskFeat (Kinetics-600 pretrain, MViT-L) | 2021-12-16 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 39.8 | UMT-L (ViT-L/16) | 2023-03-28 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 39.5 | VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) | 2022-03-23 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 39.3 | VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) | 2022-03-23 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 38.7 | MVD (Kinetics400 pretrain+finetune, ViT-L, 16x4) | 2022-12-08 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 37.8 | VideoMAE (K400 pretrain+finetune, ViT-L, 16x4) | 2022-03-23 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 37.7 | MVD (Kinetics400 pretrain, ViT-L, 16x4) | 2022-12-08 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 36.5 | VideoMAE (K400 pretrain, ViT-H, 16x4) | 2022-03-23 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 36.1 | VideoMAE (K700 pretrain, ViT-L, 16x4) | 2022-03-23 |
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition | ✓ Link | 35.4 | MeMViT-24 | 2022-01-20 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 34.4 | MViTv2-L (IN21k, K700) | 2021-12-02 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 34.3 | VideoMAE (K400 pretrain, ViT-L, 16x4) | 2022-03-23 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 34.2 | MVD (Kinetics400 pretrain+finetune, ViT-B, 16x4) | 2022-12-08 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 33.5 | AMD (ViT-B/16) | 2023-11-06 |
Holistic Interaction Transformer Network for Action Detection | ✓ Link | 32.6 | HIT | 2022-10-23 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 31.8 | VideoMAE (K400 pretrain+finetune, ViT-B, 16x4) | 2022-03-23 |
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization | ✓ Link | 31.72 | ACAR-Net, SlowFast R-101 (Kinetics-700 pretraining) | 2020-06-14 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 31.1 | MVD (Kinetics400 pretrain, ViT-B, 16x4) | 2022-12-08 |
Towards Long-Form Video Understanding | ✓ Link | 31.0 | Object Transformer | 2021-06-21 |
Multiscale Vision Transformers | ✓ Link | 28.7 | MViT-B-24, 32x3 (Kinetics-600 pretraining) | 2021-04-22 |
Multiscale Vision Transformers | ✓ Link | 27.5 | MViT-B, 32x3 (Kinetics-600 pretraining) | 2021-04-22 |
SlowFast Networks for Video Recognition | ✓ Link | 27.5 | SlowFast, 16x8 R101+NL (Kinetics-600 pretraining) | 2018-12-10 |
Multiscale Vision Transformers | ✓ Link | 27.3 | MViT-B, 64x3 (Kinetics-400 pretraining) | 2021-04-22 |
SlowFast Networks for Video Recognition | ✓ Link | 27.1 | SlowFast, 8x8 R101+NL (Kinetics-600 pretraining) | 2018-12-10 |
Multiscale Vision Transformers | ✓ Link | 26.8 | MViT-B, 32x3 (Kinetics-400 pretraining) | 2021-04-22 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 26.7 | VideoMAE (K400 pretrain, ViT-B, 16x4) | 2022-03-23 |
Object-Region Video Transformers | ✓ Link | 26.6 | ORViT MViT-B, 16x4 (K400 pretraining) | 2021-10-13 |
Multiscale Vision Transformers | ✓ Link | 26.1 | MViT-B, 16x4 (Kinetics-600 pretraining) | 2021-04-22 |
Multiscale Vision Transformers | ✓ Link | 24.5 | MViT-B, 16x4 (Kinetics-400 pretraining) | 2021-04-22 |
SlowFast Networks for Video Recognition | ✓ Link | 23.8 | SlowFast, 8x8, R101 (Kinetics-400 pretraining) | 2018-12-10 |
SlowFast Networks for Video Recognition | ✓ Link | 21.9 | SlowFast, 4x16, R50 (Kinetics-400 pretraining) | 2018-12-10 |