Paper | Code | mAP | Model | Date |
--- | --- | --- | --- | --- |
On the Benefits of 3D Pose and Tracking for Human Action Recognition | ✓ Link | 45.1 | LART (Hiera-H, K700 PT+FT) | 2023-04-03 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 43.3 | Hiera-H (K700 PT+FT) | 2023-06-01 |
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 42.6 | VideoMAE V2-g | 2023-03-29 |
End-to-End Spatio-Temporal Action Localisation with Video Transformers | | 41.7 | STAR/L | 2023-04-24 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 41.1 | MVD (Kinetics400 pretrain+finetune, ViT-H, 16x4) | 2022-12-08 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 41.01 | InternVideo | 2022-12-06 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 40.1 | MVD (Kinetics400 pretrain, ViT-H, 16x4) | 2022-12-08 |
Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 39.8 | MaskFeat (Kinetics-600 pretrain, MViT-L) | 2021-12-16 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 39.8 | UMT-L (ViT-L/16) | 2023-03-28 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 39.5 | VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) | 2022-03-23 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 39.3 | VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) | 2022-03-23 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 38.7 | MVD (Kinetics400 pretrain+finetune, ViT-L, 16x4) | 2022-12-08 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 37.8 | VideoMAE (K400 pretrain+finetune, ViT-L, 16x4) | 2022-03-23 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 37.7 | MVD (Kinetics400 pretrain, ViT-L, 16x4) | 2022-12-08 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 36.5 | VideoMAE (K400 pretrain, ViT-H, 16x4) | 2022-03-23 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 36.1 | VideoMAE (K700 pretrain, ViT-L, 16x4) | 2022-03-23 |
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition | ✓ Link | 35.4 | MeMViT-24 | 2022-01-20 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 34.4 | MViTv2-L (IN21k, K700) | 2021-12-02 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 34.3 | VideoMAE (K400 pretrain, ViT-L, 16x4) | 2022-03-23 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 34.2 | MVD (Kinetics400 pretrain+finetune, ViT-B, 16x4) | 2022-12-08 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 33.5 | AMD (ViT-B/16) | 2023-11-06 |
Holistic Interaction Transformer Network for Action Detection | ✓ Link | 32.6 | HIT | 2022-10-23 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 31.8 | VideoMAE (K400 pretrain+finetune, ViT-B, 16x4) | 2022-03-23 |
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization | ✓ Link | 31.72 | ACAR-Net, SlowFast R-101 (Kinetics-700 pretraining) | 2020-06-14 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 31.1 | MVD (Kinetics400 pretrain, ViT-B, 16x4) | 2022-12-08 |
Towards Long-Form Video Understanding | ✓ Link | 31.0 | Object Transformer | 2021-06-21 |
Multiscale Vision Transformers | ✓ Link | 28.7 | MViT-B-24, 32x3 (Kinetics-600 pretraining) | 2021-04-22 |
Multiscale Vision Transformers | ✓ Link | 27.5 | MViT-B, 32x3 (Kinetics-600 pretraining) | 2021-04-22 |
SlowFast Networks for Video Recognition | ✓ Link | 27.5 | SlowFast, 16x8 R101+NL (Kinetics-600 pretraining) | 2018-12-10 |
Multiscale Vision Transformers | ✓ Link | 27.3 | MViT-B, 64x3 (Kinetics-400 pretraining) | 2021-04-22 |
SlowFast Networks for Video Recognition | ✓ Link | 27.1 | SlowFast, 8x8 R101+NL (Kinetics-600 pretraining) | 2018-12-10 |
Multiscale Vision Transformers | ✓ Link | 26.8 | MViT-B, 32x3 (Kinetics-400 pretraining) | 2021-04-22 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 26.7 | VideoMAE (K400 pretrain, ViT-B, 16x4) | 2022-03-23 |
Object-Region Video Transformers | ✓ Link | 26.6 | ORViT MViT-B, 16x4 (K400 pretraining) | 2021-10-13 |
Multiscale Vision Transformers | ✓ Link | 26.1 | MViT-B, 16x4 (Kinetics-600 pretraining) | 2021-04-22 |
Multiscale Vision Transformers | ✓ Link | 24.5 | MViT-B, 16x4 (Kinetics-400 pretraining) | 2021-04-22 |
SlowFast Networks for Video Recognition | ✓ Link | 23.8 | SlowFast, 8x8, R101 (Kinetics-400 pretraining) | 2018-12-10 |
SlowFast Networks for Video Recognition | ✓ Link | 21.9 | SlowFast, 4x16, R50 (Kinetics-400 pretraining) | 2018-12-10 |