action-recognition-in-videos-on-something-1

Action Recognition In Videos

Leaderboard

Paper	Code	Top-1 Accuracy (%)	Top-5 Accuracy (%)	Params	GFLOPs	Model Name	Release Date
InternVideo: General Video Foundation Models via Generative and Discriminative Learning	✓ Link	70.0				InternVideo	2022-12-06
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking	✓ Link	68.7	91.9			VideoMAE V2-g	2023-03-29
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	✓ Link	67.3	88.8			Side4Video (EVA ViT-E/14)	2023-11-27
What Can Simple Arithmetic Operations Do for Temporal Modeling?	✓ Link	65.6	88.6			ATM	2023-07-18
Temporally-Adaptive Models for Efficient Video Understanding	✓ Link	63.7				TAdaFormer-L/14	2023-08-10
TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning	✓ Link	63.0	87.8			TDS-CLIP-ViT-L/14(8frames)	2024-08-20
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer	✓ Link	62.7	88.0			UniFormerV2-L	2022-09-22
Learning Correlation Structures for Vision Transformers		61.3				StructViT-B-4-1	2024-04-05
UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning	✓ Link	60.9	87.3	50.1M	259x3	UniFormer-B (IN-1K + Kinetics400)	2021-09-29
Temporally-Adaptive Models for Efficient Video Understanding	✓ Link	60.7				TAdaConvNeXtV2-B	2023-08-10
Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition	✓ Link	58.3				TPS	2022-07-27
Multi-scale Motion-Aware Module for Video Action Recognition		57.9				MSMA (8+16frames)	2023-02-19
UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning	✓ Link	57.6	84.9	21.4M	41.8x3	UniFormer-B (IN-1K + Kinetics600)	2021-09-29
Stand-Alone Inter-Frame Attention in Video Models	✓ Link	57.3				SIFA	2022-06-14
EAN: Event Adaptive Network for Enhanced Action Recognition	✓ Link	57.2	83.9			EAN ResNet50 (single clip, center crop,8+16 ensemble, with sparse Transformer)	2021-07-22
Motion-driven Visual Tempo Learning for Video-based Action Recognition	✓ Link	57.2				TCM (Ensemble)	2022-02-24
Busy-Quiet Video Disentangling for Video Classification	✓ Link	57.1	84.2			BQNEn (ImageNet + K400 pretrained)	2021-03-29
TDN: Temporal Difference Networks for Efficient Action Recognition	✓ Link	56.8	84.1			TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)	2020-12-18
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition	✓ Link	56.6	84.4			SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips)	2021-02-14
CT-Net: Channel Tensorization Network for Video Classification	✓ Link	56.6				CT-Net Ensemble (R50, 8+12+16+24)	2021-06-03
Action Recognition With Motion Diversification and Dynamic Selection		56.6				MoDS (8+16frames)	2022-07-15
MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing		56.5				MLP-3D	2022-06-13
Relational Self-Attention: What's Missing in Attention for Video Understanding	✓ Link	56.1	82.8			RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips)	2021-11-02
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition	✓ Link	55.8	83.9			SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip)	2021-02-14
Relational Self-Attention: What's Missing in Attention for Video Understanding	✓ Link	55.5	82.6			RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip)	2021-11-02
PAN: Towards Fast Action Recognition via Learning Persistence of Appearance	✓ Link	55.3	82.8			PAN ResNet101 (RGB only, no Flow)	2020-08-08
Gate-Shift Networks for Video Action Recognition	✓ Link	55.16				GSM Ensemble InceptionV3 (ImageNet pretrained)	2019-12-01
MotionSqueeze: Neural Motion Feature Learning for Video Understanding	✓ Link	55.1				MSNet-R50En (ensemble)	2020-07-20
AE-Net: Adjoint Enhancement Network for Efficient Action Recognition in Video Understanding		55.0				AE-Net (8+16frames)	2022-07-21
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	✓ Link	54.59	82.30	5.8M	20.9x6	VoV3D-L (32frames, Kinetics pretrained, single)	2020-12-01
MotionSqueeze: Neural Motion Feature Learning for Video Understanding	✓ Link	54.4	83.8			MSNet-R50En (8+16 ensemble, ImageNet pretrained)	2020-07-20
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition	✓ Link	54.3	82.9			SELFYNet-TSM-R50 (16 frames, ImageNet pretrained)	2021-02-14
Region-based Non-local Operation for Video Classification	✓ Link	54.1	82.2			RNL+TSM Ensemble(R50+R101, ImageNet pretrained)	2020-07-17
MVFNet: Multi-View Fusion Network for Efficient Video Recognition	✓ Link	54.0				MVFNet-R50EN	2020-12-13
Relational Self-Attention: What's Missing in Attention for Video Understanding	✓ Link	54.0	81.1			RSANet-R50 (16 frames, ImageNet pretrained, a single clip)	2021-11-02
Spatial-Temporal Pyramid Graph Reasoning for Action Recognition		53.5				STPG (8+16frames)	2022-08-09
Action recognition with spatial-temporal discriminative filter banks		53.4				GB + DF + LB (ResNet152, ImageNet pretrained)	2019-08-20
Video Classification with Channel-Separated Convolutional Networks	✓ Link	53.3				ip-CSN-152 (IG-65M pretraining)	2019-04-04
MARS: Motion-Augmented RGB Stream for Action Recognition	✓ Link	53.0				MARS+RGB+Flow (64 frames, Kinetics pretrained)	2019-06-01
Region-based Non-local Operation for Video Classification	✓ Link	52.7	81.5			RNL+TSM Ensemble(ResNet50, ImageNet pretrained)	2020-07-17
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	✓ Link	52.68	80.43	3.3M	11.5x6	VoV3D-M (32frames, Kinetics pretrained, single)	2020-12-01
Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention		52.6	81.3			TSM+W3 (16 frames, ResNet50)	2020-04-02
Action Keypoint Network for Efficient Video Recognition		52.5				AK-Net	2022-01-17
Video Classification with Channel-Separated Convolutional Networks	✓ Link	52.1				ir-CSN-152 (IG-65M pretraining)	2019-04-04
MotionSqueeze: Neural Motion Feature Learning for Video Understanding	✓ Link	52.1	82.3			MSNet-R50 (16 frames, ImageNet pretrained)	2020-07-20
Relational Self-Attention: What's Missing in Attention for Video Understanding	✓ Link	51.9	79.6			RSANet-R50 (8 frames, ImageNet pretrained, a single clip)	2021-11-02
Gate-Shift Networks for Video Action Recognition	✓ Link	51.68				GSM InceptionV3 (16 frames, ImageNet pretrained)	2019-12-01
Video Classification with Channel-Separated Convolutional Networks	✓ Link	51.6				R(2+1)D-152 (IG-65M pretraining)	2019-04-04
MotionSqueeze: Neural Motion Feature Learning for Video Understanding	✓ Link	50.9	80.3			MSNet-R50 (8 frames, ImageNet pretrained)	2020-07-20
TSM: Temporal Shift Module for Efficient Video Understanding	✓ Link	50.7				TSM (RGB + Flow)	2018-11-20
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	✓ Link	50.6	78.7	5.8M	20.9x6	VoV3D-L (32frames, from scratch, single)	2020-12-01
Moments in Time Dataset: one million videos for event understanding	✓ Link	50				ResNet50 I3D (Moments pretrained)	2018-01-09
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	✓ Link	49.8	78.0	3.3M	11.5x6	VoV3D-M (32frames, from scratch, single)	2020-12-01
TSM: Temporal Shift Module for Efficient Video Understanding	✓ Link	49.7	78.5			TSMEn	2018-11-20
Temporal Reasoning Graph for Activity Recognition		49.7				TRG (Inception-V3)	2019-08-27
Temporal Reasoning Graph for Activity Recognition		49.5	86.1			TRG (ResNet-50)	2019-08-27
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	✓ Link	49.5	78.0	5.8M	9.3x6	VoV3D-L (16frames, from scratch, single)	2020-12-01
Video Classification with Channel-Separated Convolutional Networks	✓ Link	49.3				ir-CSN-152	2019-04-04
Recurrent Space-time Graph Neural Networks	✓ Link	49.2				RSTG (Kinetics pretrained)	2019-04-11
Moments in Time Dataset: one million videos for event understanding	✓ Link	48.6				ResNet50 I3D (Kinetics pretrained)	2018-01-09
Video Classification with Channel-Separated Convolutional Networks	✓ Link	48.4				ir-CSN-101	2019-04-04
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification	✓ Link	48.2	78.7			S3D-G (ImageNet pretrained)	2017-12-13
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	✓ Link	48.1	76.9	3.3M	5.7x6	VoV3D-M (16frames, from scratch, single)	2020-12-01
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification	✓ Link	47.3	78.1			S3D	2017-12-13
TSM: Temporal Shift Module for Efficient Video Understanding	✓ Link	47.2	77.1			TSM	2018-11-20
ECO: Efficient Convolutional Network for Online Video Understanding	✓ Link	46.4				ECO-Net (ImageNet pretrained)	2018-04-24
ECO: Efficient Convolutional Network for Online Video Understanding	✓ Link	46.4				ECO-Net	2018-04-24
Videos as Space-Time Region Graphs		46.1				NL I3D + GCN	2018-06-05
Non-local Neural Networks	✓ Link	44.4				NL I3D	2017-11-21
Motion Feature Network: Fixed Motion Filter for Action Recognition		43.9				Motion Feature Net	2018-07-26
Temporal Relational Reasoning in Videos	✓ Link	42.01				2-Stream TRN	2017-11-22
Hierarchical Feature Aggregation Networks for Video Action Recognition		41.97				HF-TSN (ImageNet pretraining)	2019-05-29
MARS: Motion-Augmented RGB Stream for Action Recognition	✓ Link	40.4				MARS+RGB+Flow (16 frames, Kinetics pretrained)	2019-06-01
Temporal Relational Reasoning in Videos	✓ Link	34.4				M-TRN	2017-11-22

OpenCodePapers