OpenCodePapers

Action Recognition on AVA v2.2

Action Recognition
Dataset Link
Results over time (interactive chart of mAP against model release date)
Leaderboard
| Paper | Code | mAP | Model Name | Release Date |
| --- | --- | --- | --- | --- |
| On the Benefits of 3D Pose and Tracking for Human Action Recognition | ✓ Link | 45.1 | LART (Hiera-H, K700 PT+FT) | 2023-04-03 |
| Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 43.3 | Hiera-H (K700 PT+FT) | 2023-06-01 |
| VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 42.6 | VideoMAE V2-g | 2023-03-29 |
| End-to-End Spatio-Temporal Action Localisation with Video Transformers | | 41.7 | STAR/L | 2023-04-24 |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 41.1 | MVD (Kinetics400 pretrain+finetune, ViT-H, 16x4) | 2022-12-08 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 41.01 | InternVideo | 2022-12-06 |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 40.1 | MVD (Kinetics400 pretrain, ViT-H, 16x4) | 2022-12-08 |
| Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 39.8 | MaskFeat (Kinetics-600 pretrain, MViT-L) | 2021-12-16 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 39.8 | UMT-L (ViT-L/16) | 2023-03-28 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 39.5 | VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) | 2022-03-23 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 39.3 | VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) | 2022-03-23 |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 38.7 | MVD (Kinetics400 pretrain+finetune, ViT-L, 16x4) | 2022-12-08 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 37.8 | VideoMAE (K400 pretrain+finetune, ViT-L, 16x4) | 2022-03-23 |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 37.7 | MVD (Kinetics400 pretrain, ViT-L, 16x4) | 2022-12-08 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 36.5 | VideoMAE (K400 pretrain, ViT-H, 16x4) | 2022-03-23 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 36.1 | VideoMAE (K700 pretrain, ViT-L, 16x4) | 2022-03-23 |
| MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition | ✓ Link | 35.4 | MeMViT-24 | 2022-01-20 |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 34.4 | MViTv2-L (IN21k, K700) | 2021-12-02 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 34.3 | VideoMAE (K400 pretrain, ViT-L, 16x4) | 2022-03-23 |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 34.2 | MVD (Kinetics400 pretrain+finetune, ViT-B, 16x4) | 2022-12-08 |
| Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 33.5 | AMD (ViT-B/16) | 2023-11-06 |
| Holistic Interaction Transformer Network for Action Detection | ✓ Link | 32.6 | HIT | 2022-10-23 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 31.8 | VideoMAE (K400 pretrain+finetune, ViT-B, 16x4) | 2022-03-23 |
| Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization | ✓ Link | 31.72 | ACAR-Net, SlowFast R-101 (Kinetics-700 pretraining) | 2020-06-14 |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 31.1 | MVD (Kinetics400 pretrain, ViT-B, 16x4) | 2022-12-08 |
| Towards Long-Form Video Understanding | ✓ Link | 31.0 | Object Transformer | 2021-06-21 |
| Multiscale Vision Transformers | ✓ Link | 28.7 | MViT-B-24, 32x3 (Kinetics-600 pretraining) | 2021-04-22 |
| Multiscale Vision Transformers | ✓ Link | 27.5 | MViT-B, 32x3 (Kinetics-600 pretraining) | 2021-04-22 |
| SlowFast Networks for Video Recognition | ✓ Link | 27.5 | SlowFast, 16x8 R101+NL (Kinetics-600 pretraining) | 2018-12-10 |
| Multiscale Vision Transformers | ✓ Link | 27.3 | MViT-B, 64x3 (Kinetics-400 pretraining) | 2021-04-22 |
| SlowFast Networks for Video Recognition | ✓ Link | 27.1 | SlowFast, 8x8 R101+NL (Kinetics-600 pretraining) | 2018-12-10 |
| Multiscale Vision Transformers | ✓ Link | 26.8 | MViT-B, 32x3 (Kinetics-400 pretraining) | 2021-04-22 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 26.7 | VideoMAE (K400 pretrain, ViT-B, 16x4) | 2022-03-23 |
| Object-Region Video Transformers | ✓ Link | 26.6 | ORViT MViT-B, 16x4 (K400 pretraining) | 2021-10-13 |
| Multiscale Vision Transformers | ✓ Link | 26.1 | MViT-B, 16x4 (Kinetics-600 pretraining) | 2021-04-22 |
| Multiscale Vision Transformers | ✓ Link | 24.5 | MViT-B, 16x4 (Kinetics-400 pretraining) | 2021-04-22 |
| SlowFast Networks for Video Recognition | ✓ Link | 23.8 | SlowFast, 8x8, R101 (Kinetics-400 pretraining) | 2018-12-10 |
| SlowFast Networks for Video Recognition | ✓ Link | 21.9 | SlowFast, 4x16, R50 (Kinetics-400 pretraining) | 2018-12-10 |
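The "Results over time" chart above plots each model's mAP against its paper release date. Below is a minimal sketch of how such a plot could be reproduced offline, assuming matplotlib is available; the sample entries are copied from the leaderboard table, and the remaining rows can be appended in the same (model name, mAP, release date) format.

```python
# A rough sketch (not part of the original page) of reproducing the
# "Results over time" chart from the leaderboard rows above.
from datetime import date
import matplotlib.pyplot as plt

# A few representative (model name, mAP, release date) rows copied from the table.
entries = [
    ("LART (Hiera-H, K700 PT+FT)", 45.1, date(2023, 4, 3)),
    ("Hiera-H (K700 PT+FT)", 43.3, date(2023, 6, 1)),
    ("VideoMAE V2-g", 42.6, date(2023, 3, 29)),
    ("InternVideo", 41.01, date(2022, 12, 6)),
    ("MaskFeat (Kinetics-600 pretrain, MViT-L)", 39.8, date(2021, 12, 16)),
    ("MeMViT-24", 35.4, date(2022, 1, 20)),
    ("MViT-B, 64x3 (Kinetics-400 pretraining)", 27.3, date(2021, 4, 22)),
    ("SlowFast, 8x8 R101+NL (Kinetics-600 pretraining)", 27.1, date(2018, 12, 10)),
]

# Scatter mAP against release date and label each point with its model name.
release_dates = [d for _, _, d in entries]
map_scores = [m for _, m, _ in entries]
plt.figure(figsize=(9, 5))
plt.scatter(release_dates, map_scores)
for name, score, released in entries:
    plt.annotate(name, (released, score), fontsize=7,
                 xytext=(4, 4), textcoords="offset points")
plt.xlabel("Release date")
plt.ylabel("mAP")
plt.title("Action Recognition on AVA v2.2: results over time")
plt.tight_layout()
plt.show()
```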