Paper | Code | Top-1 (%) | Top-5 (%) | Params (M) | GFLOPs x views | Model | Date |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 77.3 | 95.7 | 633 | 1192x6 | MVD (Kinetics400 pretrain, ViT-H, 16 frame) | 2022-12-08 |
DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification | ✓ Link | 77.2 | 96.3 | | | DejaVid | 2025-01-01 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 77.2 | | | | InternVideo | 2022-12-06 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 77.1 | | | | InternVideo2-1B | 2024-03-22 |
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 77.0 | 95.9 | 1013 | 2544x6 | VideoMAE V2-g | 2023-03-29 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 76.7 | 95.5 | 305 | 597x6 | MVD (Kinetics400 pretrain, ViT-L, 16 frame) | 2022-12-08 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 76.5 | | | | Hiera-L (no extra data) | 2023-06-01 |
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 76.1 | 95.2 | | | TubeViT-L | 2022-12-06 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 75.4 | 95.2 | 305 | 1436x3 | VideoMAE (no extra data, ViT-L, 32x2) | 2022-03-23 |
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | ✓ Link | 75.2 | 94.0 | | | Side4Video (EVA ViT-E/14) | 2023-11-27 |
Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 75.0 | 95.0 | 218 | 2828x3 | MaskFeat (Kinetics600 pretrain, MViT-L) | 2021-12-16 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 74.7 | 94.9 | 311 | 276x6 | MAR (50% mask, ViT-L, 16x4) | 2022-07-24 |
What Can Simple Arithmetic Operations Do for Temporal Modeling? | ✓ Link | 74.6 | 94.4 | | | ATM | 2023-07-18 |
The effectiveness of MAE pre-pretraining for billion-scale pretraining | ✓ Link | 74.4 | | | | MAWS (ViT-L) | 2023-03-23 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 74.3 | 94.6 | 305 | 597x6 | VideoMAE (no extra data, ViT-L, 16 frame) | 2022-03-23 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 73.8 | 94.4 | 311 | 131x6 | MAR (75% mask, ViT-L, 16x4) | 2022-07-24 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 73.7 | 94.0 | 87 | 180x6 | MVD (Kinetics400 pretrain, ViT-B, 16 frame) | 2022-12-08 |
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders | ✓ Link | 73.7 | | | | ViC-MAE (ViT-L) | 2023-03-21 |
Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 73.6 | | | | TAdaFormer-L/14 | 2023-08-10 |
TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning | ✓ Link | 73.4 | 93.8 | | | TDS-CLIP-ViT-L/14 (8 frames) | 2024-08-20 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 73.3 | 94.1 | 213.1 | | MViTv2-L (IN-21K + Kinetics400 pretrain) | 2021-12-02 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 73.3 | 94.0 | 87 | 180x6 | AMD (ViT-B/16) | 2023-11-06 |
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | ✓ Link | 73.0 | 94.5 | | 5154 | UniFormerV2-L | 2022-09-22 |
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning | ✓ Link | 72.3 | 93.9 | | 8248 | ST-Adapter (ViT-L, CLIP) | 2022-06-27 |
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video | ✓ Link | 72.2 | 93.0 | | | ZeroI2V ViT-L/14 | 2023-10-02 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 72.1 | | | 225x3 | MViT-B (IN-21K + Kinetics400 pretrain) | 2021-12-02 |
CAST: Cross-Attention in Space and Time for Video Action Recognition | ✓ Link | 71.6 | | | | CAST (ViT-B/16) | 2023-11-30 |
Learning Correlation Structures for Vision Transformers | | 71.5 | | | | StructVit-B-4-1 | 2024-04-05 |
Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 71.4 | 93.5 | | | OMNIVORE (Swin-B, IN-21K+ Kinetics400 pretrain) | 2022-01-20 |
BEVT: BERT Pretraining of Video Transformers | ✓ Link | 71.4 | | 89 | 321x3 | BEVT (IN-1K + Kinetics400 pretrain) | 2021-12-02 |
UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ Link | 71.2 | 92.8 | 50.1 | 259x3 | UniFormer-B (IN-1K + Kinetics400 pretrain) | 2021-09-29 |
Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 71.1 | | | | TAdaConvNeXtV2-B | 2023-08-10 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 71.0 | 92.8 | 94 | 86x6 | MAR (50% mask, ViT-B, 16x4) | 2022-07-24 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 70.9 | 92.8 | 22 | 57x6 | MVD (Kinetics400 pretrain, ViT-S, 16 frame) | 2022-12-08 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 70.9 | 92.5 | | | CoVeR (JFT-3B) | 2021-12-14 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 70.8 | 92.4 | 87 | 180x6 | VideoMAE (no extra data, ViT-B, 16 frame) | 2022-03-23 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 70.2 | 92.5 | 22 | 57x6 | AMD (ViT-S/16) | 2023-11-06 |
Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ Link | 70.2 | 91.8 | | | ILA (ViT-L/14) | 2023-04-20 |
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning | ✓ Link | 70.1 | 92.8 | 68.5 | 197x3 | MorphMLP-B (IN-1K) | 2021-11-24 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 69.8 | 91.9 | | | CoVeR (JFT-300M) | 2021-12-14 |
Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition | ✓ Link | 69.8 | | | | TPS | 2022-07-27 |
Stand-Alone Inter-Frame Attention in Video Models | ✓ Link | 69.8 | | | | SIFA | 2022-06-14 |
Video Swin Transformer | ✓ Link | 69.6 | 92.7 | 89 | 321x3 | Swin-B (IN-21K + Kinetics400 pretrain) | 2021-06-24 |
TDN: Temporal Difference Networks for Efficient Action Recognition | ✓ Link | 69.6 | 92.2 | | 198x3 | TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 2020-12-18 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 69.5 | 91.9 | 94 | 41x6 | MAR (75% mask, ViT-B, 16x4) | 2022-07-24 |
Object-Region Video Transformers | ✓ Link | 69.5 | 91.5 | N/A | N/A | ORViT Mformer-L (ORViT blocks) | 2021-10-13 |
UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ Link | 69.4 | 92.1 | 21.4 | 41.8x3 | UniFormer-S (IN-1K + Kinetics600 pretrain) | 2021-09-29 |
Mutual Modality Learning for Video Action Classification | ✓ Link | 69.02 | 92.70 | | | MML (ensemble) | 2020-11-04 |
Multiscale Vision Transformers | ✓ Link | 68.7 | 91.5 | 53.2 | 236x3 | MViT-B-24, 32x3 | 2021-04-22 |
Multiview Transformers for Video Recognition | ✓ Link | 68.5 | 90.4 | | | MTV-B | 2022-01-12 |
MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing | | 68.5 | | | | MLP-3D | 2022-06-13 |
TDN: Temporal Difference Networks for Efficient Action Recognition | ✓ Link | 68.2 | 91.6 | | 198x1 | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 2020-12-18 |
Multi-scale Motion-Aware Module for Video Action Recognition | | 68.2 | | | | MSMA (8+16 frames) | 2023-02-19 |
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ Link | 68.1 | 91.2 | N/A | 1181x3 | Mformer-L | 2021-06-09 |
VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | ✓ Link | 68.1 | | | | VIMPAC | 2021-06-21 |
Object-Region Video Transformers | ✓ Link | 67.9 | 90.5 | N/A | N/A | ORViT Mformer (ORViT blocks) | 2021-10-13 |
Multiscale Vision Transformers | ✓ Link | 67.8 | 91.3 | 36.6 | 170x3 | MViT-B, 32x3(Kinetics600 pretrain) | 2021-04-22 |
Group Contextualization for Video Recognition | ✓ Link | 67.8 | 91.2 | 27.4 | 110.1 | GC-TDN Ensemble (R50, 8+16) | 2022-03-18 |
CT-Net: Channel Tensorization Network for Video Classification | ✓ Link | 67.8 | 91.1 | 83.8 | 280 | CT-Net Ensemble (R50, 8+12+16+24) | 2021-06-03 |
Motion-driven Visual Tempo Learning for Video-based Action Recognition | ✓ Link | 67.8 | | | | TCM (Ensemble) | 2022-02-24 |
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ Link | 67.7 | 91.1 | | | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | 2021-02-14 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | 67.7 | 91.1 | | | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 2021-11-02 |
Global Temporal Difference Network for Action Recognition | | 67.6 | | | | GTDNet | 2022-11-23 |
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ Link | 67.4 | 91.0 | | | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | 2021-02-14 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 67.35 | 90.50 | 5.8 | 20.9x6 | VoV3D-L (32 frames, Kinetics pretrained, single) | 2020-12-01 |
SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition | | 67.3 | 91.0 | | | PLAR | 2023-05-21 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | 67.3 | 90.8 | | | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
Space-time Mixing Attention for Video Transformer | ✓ Link | 67.2 | 90.8 | N/A | 850x1 | X-Vit (x16) | 2021-06-10 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 67.2 | 89.8 | | | TAda2D-En (ResNet-50, 8+16 frames) | 2021-10-12 |
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ Link | 67.1 | 90.6 | N/A | 958.8x3 | Mformer-HR | 2021-06-09 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 67.1 | 90.4 | | | TAdaConvNeXt-T | 2021-10-12 |
Action Recognition With Motion Diversification and Dynamic Selection | | 67.1 | | | | MoDS (8+16 frames) | 2022-07-15 |
Spatial-Temporal Pyramid Graph Reasoning for Action Recognition | | 67.0 | | | | STPG (8+16 frames) | 2022-08-09 |
Mutual Modality Learning for Video Action Classification | ✓ Link | 66.83 | 91.30 | | | MML (single) | 2020-11-04 |
Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ Link | 66.8 | 90.3 | | | ILA (ViT-B/16) | 2023-04-20 |
TSM: Temporal Shift Module for Efficient Video Understanding | ✓ Link | 66.6 | 91.3 | | | TSM (RGB + Flow) | 2018-11-20 |
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 66.6 | 90.6 | | | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | 2020-07-20 |
PAN: Towards Fast Action Recognition via Learning Persistence of Appearance | ✓ Link | 66.5 | 90.6 | | | PAN ResNet101 (RGB only, no Flow) | 2020-08-08 |
Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention | | 66.5 | 90.4 | | | TSM+W3 (16 frames, RGB ResNet-50) | 2020-04-02 |
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ Link | 66.5 | 90.1 | | | Mformer | 2021-06-09 |
MVFNet: Multi-View Fusion Network for Efficient Video Recognition | ✓ Link | 66.3 | | | | MVFNet-ResNet50 (center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 2020-12-13 |
Multiscale Vision Transformers | ✓ Link | 66.2 | 90.2 | | | MViT-B, 16x4 | 2021-04-22 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | 66.0 | 89.8 | | | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 65.8 | 89.5 | 5.8 | 20.9x6 | VoV3D-L (32 frames, from scratch, single) | 2020-12-01 |
Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition | ✓ Link | 65.7 | 89.8 | | 18.3 | E3D-L | 2023-03-05 |
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ Link | 65.7 | 89.8 | | | SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | 2021-02-14 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 65.6 | 89.2 | | | TAda2D (ResNet-50, 16 frames) | 2021-10-12 |
ViViT: A Video Vision Transformer | ✓ Link | 65.4 | 89.8 | | | ViViT-L/16x2 Fact. encoder | 2021-03-29 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 65.24 | 89.48 | 3.3 | 11.5x6 | VoV3D-M (32 frames, Kinetics pretrained, single) | 2020-12-01 |
More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation | ✓ Link | 65.2 | | | | bLVNet | 2019-12-02 |
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition | ✓ Link | 64.94 | 87.9 | | | DirecFormer | 2022-03-19 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | 64.8 | 89.1 | | | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 64.7 | 89.4 | | | MSNet-R50 (16 frames, ImageNet pretrained) | 2020-07-20 |
Action Keypoint Network for Efficient Video Recognition | | 64.3 | | | | AK-Net | 2022-01-17 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 64.2 | 88.8 | 3.3 | 11.5x6 | VoV3D-M (32 frames, from scratch, single) | 2020-12-01 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 64.1 | 88.6 | 5.8 | 9.3x6 | VoV3D-L (16 frames, from scratch, single) | 2020-12-01 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 64.0 | 88.0 | | | TAda2D (ResNet-50, 8 frames) | 2021-10-12 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 63.5 | 89.0 | 4.8 | 10.3x1 | MoViNet-A2 | 2021-03-21 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 63.2 | 88.2 | 3.3 | 5.7x6 | VoV3D-M (16 frames, from scratch, single) | 2020-12-01 |
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 63.0 | 88.4 | | | MSNet-R50 (8 frames, ImageNet pretrained) | 2020-07-20 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 62.7 | 89.0 | 4.6 | 6.0x1 | MoViNet-A1 | 2021-03-21 |
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | | 62.5 | 86.2 | | | OmniVL | 2022-09-15 |
Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 62.5 | | | | TimeSformer-HR | 2021-02-09 |
Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 62.3 | | | | TimeSformer-L | 2021-02-09 |
Temporal Reasoning Graph for Activity Recognition | | 62.2 | 90.3 | | | TRG (ResNet-50) | 2019-08-27 |
Temporal Pyramid Network for Action Recognition | ✓ Link | 62.0 | | | | TPN (TSM-50) | 2020-04-07 |
A Multigrid Method for Efficiently Training Video Models | ✓ Link | 61.7 | | | | Multigrid | 2019-12-02 |
SlowFast Networks for Video Recognition | ✓ Link | 61.7 | | | | SlowFast | 2018-12-10 |
Temporal Reasoning Graph for Activity Recognition | | 61.3 | 91.4 | | | TRG (Inception-V3) | 2019-08-27 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 61.3 | 88.2 | 3.1 | 2.7x1 | MoViNet-A0 | 2021-03-21 |
Cooperative Cross-Stream Network for Discriminative Action Representation | | 61.2 | 89.3 | | | CCS + two-stream + TRN | 2019-08-27 |
VidTr: Video Transformer Without Convolutions | | 60.2 | | | | VidTr-L | 2021-04-23 |
Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 59.5 | | | | TimeSformer | 2021-02-09 |
Self-supervised Video Transformer | ✓ Link | 59.2 | | | | SVT | 2021-12-02 |
Few-Shot Video Classification via Temporal Alignment | | 52.3 | | | | TAM (5-shot) | 2019-06-27 |
The "something something" video database for learning and evaluating visual common sense | ✓ Link | 51.33 | 80.46 | | | model3D_1 with left-right augmentation and fps jitter | 2017-06-13 |
Attention Distillation for Learning Video Representations | | 49.9 | 79.1 | | | Prob-Distill | 2019-04-05 |
Comparative Analysis of CNN-based Spatiotemporal Reasoning in Videos | ✓ Link | 47.73 | | | | STM + TRNMultiscale | 2019-09-11 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | | | 2131 | 13321 | InternVideo2-6B | 2024-03-22 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | | 93.4 | 51.1 | | MViTv2-B (IN-21K + Kinetics400 pretrain) | 2021-12-02 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | | 91.1 | | | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 2021-11-02 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | | | 5.3 | 23.7x1 | MoViNet-A3 | 2021-03-21 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | | | | 2828x3 | MViT-L (IN-21K + Kinetics400 pretrain) | 2021-12-02 |
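For working with the table programmatically, here is a minimal Python sketch that parses these pipe-delimited rows into records and ranks them by Top-1 accuracy. The column names (`paper`, `top1`, etc.) are assumptions for illustration, not labels given by the source; rows with a missing Top-1 score are simply dropped from the ranking.

```python
# Minimal sketch: parse pipe-delimited leaderboard rows into records
# and rank them by Top-1 accuracy. Column names are assumed, not part
# of the original table.
COLUMNS = ["paper", "code", "top1", "top5", "params_m", "flops", "model", "date"]

def parse_rows(text):
    """Turn each 'a | b | c |' line into a dict keyed by COLUMNS."""
    records = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if parts and parts[-1] == "":  # rows end with a trailing '|'
            parts.pop()
        rec = dict(zip(COLUMNS, parts))
        # Missing metrics are empty strings; convert Top-1 to float when present.
        rec["top1"] = float(rec["top1"]) if rec.get("top1") else None
        records.append(rec)
    return records

def rank_by_top1(records):
    """Sort scored entries best-first; unscored entries are excluded."""
    scored = [r for r in records if r["top1"] is not None]
    return sorted(scored, key=lambda r: r["top1"], reverse=True)

if __name__ == "__main__":
    sample = (
        "Hiera | ✓ Link | 76.5 | | | | Hiera-L (no extra data) | 2023-06-01 |\n"
        "VideoMAE V2 | ✓ Link | 77.0 | 95.9 | 1013 | 2544x6 | VideoMAE V2-g | 2023-03-29 |"
    )
    for rec in rank_by_top1(parse_rows(sample)):
        print(rec["model"], rec["top1"])
```

The same split-and-zip approach extends to the other numeric columns (Top-5, params, FLOPs) if needed, though FLOPs entries like `2544x6` would need the `x views` suffix parsed separately.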