Paper | Code | Top-1 Acc (%) | Top-5 Acc (%) | GFLOPs x views | | Params (M) | | Model | Date
OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | | 93.6 | | | | | | OmniVec2 | 2024-01-01
Enhancing Video Transformers for Action Understanding with VLM-aided Training | | 93.4 | | | | | | FTP-UniFormerV2-L/14 | 2024-03-24 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 92.1 | | | | | | InternVideo2-6B | 2024-03-22 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 91.6 | | | | | | InternVideo2-1B | 2024-03-22 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 91.1 | | | | | | InternVideo | 2022-12-06 |
OmniVec: Learning robust representations with cross modal sharing | | 91.1 | | | | | | OmniVec | 2023-11-07 |
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 90.9 | 98.9 | 176400x4x3 | | 632 | | TubeViT-H (ImageNet-1k) | 2022-12-06 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 90.6 | 98.7 | 1434x3x4 | | 304 | | UMT-L (ViT-L/16) | 2023-03-28
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 90.2 | 98.6 | 95300x4x3 | | 307 | | TubeViT-L (ImageNet-1k) | 2022-12-06
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | ✓ Link | 90.0 | 98.4 | 75300x3x2 | | 354 | | UniFormerV2-L (ViT-L, 336) | 2022-09-22 |
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 90.0 | 98.4 | | | | | VideoMAE V2-g (64x266x266) | 2023-03-29 |
Make Your Training Flexible: Towards Deployment-Efficient Video Models | ✓ Link | 90.0 | | 440x3x4 | | 97 | | FluxViT-B | 2025-03-18 |
Multiview Transformers for Video Recognition | ✓ Link | 89.9 | 98.3 | 735700x4x3 | | | | MTV-H (WTS 60M) | 2022-01-12 |
Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 89.9 | | | | | | TAdaFormer-L/14 | 2023-08-10 |
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | ✓ Link | 89.7 | | | | | | EVA | 2022-11-14 |
AM Flow: Adapters for Temporal Processing in Action Recognition | | 89.6 | | | | | | AM/12 ViT-B Dinov2 | 2024-11-04 |
What Can Simple Arithmetic Operations Do for Temporal Modeling? | ✓ Link | 89.4 | 98.3 | | | | | ATM | 2023-07-18 |
DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification | ✓ Link | 89.1 | 98.2 | | | | | DejaVid | 2025-01-01 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 88.9 | | | | | | CoCa (finetuned) | 2022-05-04 |
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models | ✓ Link | 88.7 | 98.4 | | | | | BIKE (CLIP ViT-L/14) | 2022-12-31 |
Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ Link | 88.7 | 97.8 | | | | | ILA (ViT-L/14) | 2023-04-20 |
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | ✓ Link | 88.6 | 98.2 | | | | | Side4Video (EVA, ViT-E/14) | 2023-11-27 |
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | ✓ Link | 88.6 | 97.6 | 8700x3x4 | | 86 | | TubeViT-B (ImageNet-1k) | 2022-12-06
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 88.5 | 98.1 | | | | | VideoMAE V2-g | 2023-03-29 |
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | ✓ Link | 88.1 | 97.8 | | | | | ONE-PEACE | 2023-05-18 |
Make Your Training Flexible: Towards Deployment-Efficient Video Models | ✓ Link | 88.0 | | 154x3x4 | | 24 | | FluxViT-S | 2025-03-18 |
CoCa: Contrastive Captioners are Image-Text Foundation Models | ✓ Link | 88.0 | | | | | | CoCa (frozen) | 2022-05-04 |
Scaling Vision Transformers to 22 Billion Parameters | ✓ Link | 88.0 | | | | | | ViT-22B | 2023-02-10 |
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition | ✓ Link | 87.8 | 97.6 | | | | | Text4Vis (CLIP ViT-L/14) | 2022-07-04 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | ✓ Link | 87.8 | | | | | | Hiera-H (no extra data) | 2023-06-01 |
Frozen CLIP Models are Efficient Video Learners | ✓ Link | 87.7 | 97.8 | | | | | EVL (CLIP ViT-L/14@336px, frozen, 32 frames) | 2022-08-06 |
Dual-path Adaptation from Image to Video Transformers | ✓ Link | 87.7 | 97.8 | | | | | DualPath w/ ViT-L/14 | 2023-03-17 |
Expanding Language-Image Pretrained Models for General Video Recognition | ✓ Link | 87.7 | 97.4 | | | | | X-CLIP (ViT-L/14, CLIP) | 2022-08-04
AIM: Adapting Image Models for Efficient Video Action Recognition | ✓ Link | 87.5 | 97.7 | | | | | AIM (CLIP ViT-L/14, 32x224) | 2023-02-06 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 87.4 | 97.6 | | | | | VideoMAE (no extra data, ViT-H, 32x320x320) | 2022-03-23 |
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning | ✓ Link | 87.2 | 97.6 | | | | | ST-Adapter (ViT-L, CLIP) | 2022-06-27 |
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video | ✓ Link | 87.2 | 97.6 | | | | | ZeroI2V ViT-L/14 | 2023-10-02 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 87.2 | 97.5 | | | | | CoVeR (JFT-3B) | 2021-12-14 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 87.2 | 97.4 | | | | | MVD (K400 pretrain, ViT-H, 16x224x224) | 2022-12-08 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 87.1 | 97.7 | | | | | mPLUG-2 | 2023-02-01 |
Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 87.0 | 97.4 | | | | | MaskFeat (K600, MViT-L) | 2021-12-16 |
VicTR: Video-conditioned Text Representations for Activity Recognition | | 87.0 | | | | | | VicTR (ViT-L/14) | 2023-04-05 |
Swin Transformer V2: Scaling Up Capacity and Resolution | ✓ Link | 86.8 | | | | | | Video-SwinV2-G (ImageNet-22k and external 70M pretrain) | 2021-11-18 |
Masked Feature Prediction for Self-Supervised Visual Pre-Training | ✓ Link | 86.7 | 97.3 | | | | | MaskFeat (no extra data, MViT-L) | 2021-12-16 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 86.6 | 97.1 | | | | | VideoMAE (no extra data, ViT-H) | 2022-03-23 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 86.4 | 97.0 | | | | | MVD (K400 pretrain, ViT-L, 16x224x224) | 2022-12-08 |
Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 86.4 | | | | | | TAdaConvNeXtV2-B | 2023-08-10 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 86.3 | 97.2 | | | | | CoVeR (JFT-300M) | 2021-12-14 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 86.1 | 97.3 | | | | | VideoMAE (no extra data, ViT-L, 32x320x320) | 2022-03-23 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | 86.1 | 97.0 | | | | | MViTv2-L (ImageNet-21k pretrain) | 2021-12-02 |
Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ Link | 85.7 | 97.2 | | | | | ILA (ViT-B/16) | 2023-04-20 |
Dual-path Adaptation from Image to Video Transformers | ✓ Link | 85.4 | 97.1 | | | | | DualPath w/ ViT-B/16 | 2023-03-17 |
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | ✓ Link | 85.4 | | | | | | TokenLearner 16at18 (L/10) | 2021-06-21 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 85.3 | 96.3 | | | | | MAR (50% mask, ViT-L, 16x4) | 2022-07-24 |
CAST: Cross-Attention in Space and Time for Video Action Recognition | ✓ Link | 85.3 | | | | | | CAST (ViT-B/16) | 2023-11-30
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 85.2 | 96.8 | | | | | VideoMAE (no extra data, ViT-L, 16x4) | 2022-03-23 |
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders | ✓ Link | 85.1 | | | | | | ViC-MAE (ViT-L) | 2023-03-21 |
VideoMamba: State Space Model for Efficient Video Understanding | ✓ Link | 85.0 | | | | | | VideoMamba-M800 | 2024-03-11 |
Video Swin Transformer | ✓ Link | 84.9 | 96.7 | | | | | Swin-L (384x384, ImageNet-21k pretrain) | 2021-06-24 |
ViViT: A Video Vision Transformer | ✓ Link | 84.9 | 95.8 | | | | | ViViT-H/16x2 (JFT) | 2021-03-29 |
Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 84.1 | 96.1 | | | | | OMNIVORE (Swin-L) | 2022-01-20 |
Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 84.0 | 96.2 | | | | | OMNIVORE (Swin-B) | 2022-01-20 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 83.9 | 96.0 | | | | | MAR (75% mask, ViT-L, 16x4) | 2022-07-24 |
ActionCLIP: A New Paradigm for Video Action Recognition | ✓ Link | 83.8 | 97.1 | | | | | ActionCLIP (CLIP-pretrained) | 2021-09-17 |
Omni-sourced Webly-supervised Learning for Video Recognition | ✓ Link | 83.6 | | | | | | OmniSource irCSN-152 (IG-Kinetics-65M pretrain) | 2020-03-29 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 83.4 | 95.8 | | | | | MVD (K400 pretrain, ViT-B, 16x224x224) | 2022-12-08 |
Learning Correlation Structures for Vision Transformers | | 83.4 | | | | | | StructViT-B-4-1 | 2024-04-05 |
Video Swin Transformer | ✓ Link | 83.1 | 95.9 | | | | | Swin-L (ImageNet-21k pretrain) | 2021-06-24 |
Stand-Alone Inter-Frame Attention in Video Models | ✓ Link | 83.1 | | | | | | SIFA | 2022-06-14 |
UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ Link | 82.9 | 94.5 | 259x4 | | | | UniFormer-B (ImageNet-1K) | 2021-09-29 |
Large-scale weakly-supervised pre-training for video action recognition | ✓ Link | 82.8 | | | | | | irCSN-152 (IG-Kinetics-65M pretrain) | 2019-05-02 |
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition | ✓ Link | 82.75 | 94.86 | | | | | DirecFormer | 2022-03-19 |
Video Swin Transformer | ✓ Link | 82.7 | 95.5 | | | | | Swin-B (ImageNet-21k pretrain) | 2021-06-24 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 82.6 | | | | | | ir-CSN-152 (IG-65M pretraining) | 2019-04-04 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 82.5 | 95.3 | | | | | ip-CSN-152 (IG-65M pretraining) | 2019-04-04 |
Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition | ✓ Link | 82.5 | | | | | | TPS | 2022-07-27 |
Implicit Temporal Modeling with Learnable Alignment for Video Recognition | ✓ Link | 82.4 | 95.8 | | | | | ILA (ViT-B/32) | 2023-04-20 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 82.2 | 95.3 | 180x15 | | 87 | | AMD (ViT-B/16) | 2023-11-06
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | ✓ Link | 82.1 | 95.5 | | | | | VATT-Large | 2021-04-22 |
AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders | ✓ Link | 81.7 | 95.2 | | | | | AdaMAE | 2022-11-16 |
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | ✓ Link | 81.5 | 95.1 | | | | | VideoMAE (no extra data, ViT-B, 16x4) | 2022-03-23 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 81.5 | | 386x1 | | | | MoViNet-A6 | 2021-03-21 |
MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing | | 81.4 | | | | | | MLP-3D | 2022-06-13 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 81.3 | 95.1 | | | | | R[2+1]D-152 (IG-65M pretraining) | 2019-04-04 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 81.2 | 95.2 | | | | | LGD-3D Two-stream (ResNet-101) | 2019-06-13 |
Multiscale Vision Transformers | ✓ Link | 81.2 | 95.1 | | | | | MViT-B, 64x3 | 2021-04-22 |
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ Link | 81.1 | 95.2 | | | | | Motionformer-HR | 2021-06-09 |
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | ✓ Link | 81.0 | 94.8 | | | | | MVD (K400 pretrain, ViT-S, 16x224x224) | 2022-12-08 |
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 81.0 | 94.4 | | | | | MAR (50% mask, ViT-B, 16x4) | 2022-07-24 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 80.9 | 94.9 | 281x1 | | | | MoViNet-A5 | 2021-03-21 |
Attention Bottlenecks for Multimodal Fusion | ✓ Link | 80.8 | 94.6 | | | | | MBT (AV) | 2021-06-30 |
Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 80.7 | 94.7 | 7140x3 | | 121.4 | | TimeSformer-L | 2021-02-09 |
Video Swin Transformer | ✓ Link | 80.6 | 94.6 | | | | | Swin-B (ImageNet-1k pretrain) | 2021-06-24 |
Video Swin Transformer | ✓ Link | 80.6 | 94.5 | | | | | Swin-S (ImageNet-1k pretrain) | 2021-06-24 |
VidTr: Video Transformer Without Convolutions | | 80.5 | 94.6 | | | | | En-VidTr-L | 2021-04-23 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 80.5 | 94.5 | 105x1 | | | | MoViNet-A4 | 2021-03-21 |
Omni-sourced Webly-supervised Learning for Video Recognition | ✓ Link | 80.5 | 94.4 | | | | | OmniSource SlowOnly R101 8x8(ImageNet pretrain) | 2020-03-29 |
An Image is Worth 16x16 Words, What is a Video Worth? | ✓ Link | 80.5 | | 1040x1 | | | | STAM (64 Frames) | 2021-03-25 |
X3D: Expanding Architectures for Efficient Video Recognition | ✓ Link | 80.4 | 94.6 | | | | | X3D-XXL | 2020-04-09 |
Revisiting 3D ResNets for Video Recognition | ✓ Link | 80.4 | 94.4 | | | | | R3D-RS-200 | 2021-09-03 |
Omni-sourced Webly-supervised Learning for Video Recognition | ✓ Link | 80.4 | 94.4 | | | | | OmniSource SlowOnly R101 8x8 (Scratch) | 2020-03-29 |
Multiscale Vision Transformers | ✓ Link | 80.2 | 94.4 | | | | | MViT-B, 32x3 | 2021-04-22 |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models | | 80.1 | 94.5 | 57x15 | | 22 | | AMD (ViT-S/16) | 2023-11-06
SlowFast Networks for Video Recognition | ✓ Link | 79.8 | 93.9 | | | | | SlowFast 16x8 (ResNet-101 + NL) | 2018-12-10
CT-Net: Channel Tensorization Network for Video Classification | ✓ Link | 79.8 | | | | | | CT-Net Ensemble | 2021-06-03 |
Video Transformer Network | ✓ Link | 79.8 | 94.2 | | | | | ViT-B-VTN+ (ImageNet-21K pretrain) | 2021-02-01
Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 79.7 | 94.4 | | | | | TimeSformer-HR | 2021-02-09 |
VidTr: Video Transformer Without Convolutions | | 79.7 | 94.2 | | | | | En-VidTr-M | 2021-04-23 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 79.4 | 94.4 | | | | | LGD-3D RGB (ResNet-101) | 2019-06-13 |
TDN: Temporal Difference Networks for Efficient Action Recognition | ✓ Link | 79.4 | 94.4 | | | | | TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 2020-12-18 |
VidTr: Video Transformer Without Convolutions | | 79.4 | 94.0 | | | | | En-VidTr-S | 2021-04-23
MAR: Masked Autoencoders for Efficient Action Recognition | ✓ Link | 79.4 | 93.7 | | | | | MAR (75% mask, ViT-B, 16x4) | 2022-07-24 |
An Image is Worth 16x16 Words, What is a Video Worth? | ✓ Link | 79.3 | | 270x1 | | | | STAM (16 Frames) | 2021-03-25 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 79.2 | 93.8 | | | | | ip-CSN-152 (Sports-1M pretraining) | 2019-04-04 |
Video Modeling with Correlation Networks | | 79.2 | | | | | | CorrNet | 2019-06-07 |
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 79.1 | 94.5 | | | | | OmniVL | 2022-09-15
X3D: Expanding Architectures for Efficient Video Recognition | ✓ Link | 79.1 | 93.9 | | | | | X3D-XL | 2020-04-09 |
MVFNet: Multi-View Fusion Network for Efficient Video Recognition | ✓ Link | 79.1 | 93.8 | | | | | MVFNet-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 2020-12-13 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 79.1 | 93.7 | | | | | TAdaConvNeXt-T | 2021-10-12 |
SlowFast Networks for Video Recognition | ✓ Link | 78.9 | 93.5 | | | | | SlowFast 16x8 (ResNet-101) | 2018-12-10 |
What Makes Training Multi-Modal Classification Networks Hard? | ✓ Link | 78.9 | | | | | | G-Blend (Sports-1M pretrain) | 2019-05-29 |
Video Swin Transformer | ✓ Link | 78.8 | 93.6 | | | | | Swin-T (ImageNet-1k pretrain) | 2021-06-24 |
Action recognition with spatial-temporal discriminative filter banks | | 78.8 | | | | | | GB + DF + LB (ResNet 152, ImageNet pretrained) | 2019-08-20 |
Video Transformer Network | ✓ Link | 78.6 | 93.7 | | | | | ViT-B-VTN (3 layers, ImageNet pretrain) | 2021-02-01 |
Multiscale Vision Transformers | ✓ Link | 78.4 | 93.5 | | | | | MViT-B, 16x4 | 2021-04-22 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 78.2 | 93.8 | 56.9x1 | | | | MoViNet-A3 | 2021-03-21 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 78.2 | 93.5 | | | | | TAda2D-En (ResNet-50, 8+16 frames) | 2021-10-12 |
Self-supervised Video Transformer | ✓ Link | 78.1 | | | | | | SVT | 2021-12-02 |
Is Space-Time Attention All You Need for Video Understanding? | ✓ Link | 78.0 | 93.7 | | | | | TimeSformer | 2021-02-09
SlowFast Networks for Video Recognition | ✓ Link | 77.9 | 93.2 | | | | | SlowFast 8x8 (ResNet-101) | 2018-12-10 |
Representation Flow for Action Recognition | ✓ Link | 77.9 | | | | | | RepFlow-50 ([2+1]D CNN, FcF, Non-local block) | 2018-10-02 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 77.8 | 92.8 | | | | | ip-CSN-152 | 2019-04-04 |
Non-local Neural Networks | ✓ Link | 77.7 | 93.3 | | | | | I3D + NL | 2017-11-21 |
What Makes Training Multi-Modal Classification Networks Hard? | ✓ Link | 77.7 | | | | | | G-Blend | 2019-05-29 |
Large Scale Holistic Video Understanding | ✓ Link | 77.6 | | | | | | HATNet (32 frames) | 2019-04-25 |
X3D: Expanding Architectures for Efficient Video Recognition | ✓ Link | 77.5 | 92.9 | | | | | X3D-L | 2020-04-09 |
Collaborative Spatiotemporal Feature Learning for Video Action Recognition | ✓ Link | 77.5 | | | | | | CoST ResNet-101 (ImageNet pretrain) | 2019-06-01 |
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 77.4 | 93.1 | | | | | TAda2D (ResNet-50, 16 frames) | 2021-10-12 |
Evolving Space-Time Neural Architectures for Videos | | 77.4 | | | | | | EvaNet | 2018-11-26 |
Region-based Non-local Operation for Video Classification | ✓ Link | 77.4 | | | | | | RNL+TSM Ensemble (ResNet-50, 8+16 frames) | 2020-07-17
VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | ✓ Link | 77.4 | | | | | | VIMPAC | 2021-06-21 |
Busy-Quiet Video Disentangling for Video Classification | ✓ Link | 77.3 | 93.2 | | | | | BQN (ResNet-50) | 2021-03-29 |
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 77.2 | 93.0 | | | | | S3D-G (RGB+Flow, ImageNet pretrained) | 2017-12-13
SlowFast Networks for Video Recognition | ✓ Link | 77.0 | 92.6 | | | | | SlowFast 8x8 (ResNet-50) | 2018-12-10
TAda! Temporally-Adaptive Convolutions for Video Understanding | ✓ Link | 76.7 | 92.6 | | | | | TAda2D (ResNet-50, 8 frames) | 2021-10-12 |
D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 76.5 | | | | | | D3D+S3D-G (RGB + RGB) | 2018-12-19 |
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 76.4 | | | | | | MSNet-R50 (16 frames, ImageNet pretrained) | 2020-07-20 |
Global Textual Relation Embedding for Relational Understanding | ✓ Link | 76.1 | | | | | | GloRe | 2019-06-03 |
X3D: Expanding Architectures for Efficient Video Recognition | ✓ Link | 76.0 | 92.3 | | | | | X3D-M | 2020-04-09
Multiscale Vision Transformers | ✓ Link | 76.0 | 92.1 | | | | | MViT-S | 2021-04-22
Two-Stream Video Classification with Cross-Modality Attention | | 75.98 | | | | | | CMA iter1 (16 frames) | 2019-08-01 |
D3D: Distilled 3D Networks for Video Action Recognition | ✓ Link | 75.9 | | | | | | D3D (RGB) | 2018-12-19 |
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution | ✓ Link | 75.7 | | | | | | Oct-I3D + NL | 2019-04-10 |
SlowFast Networks for Video Recognition | ✓ Link | 75.6 | 92.1 | | | | | SlowFast 4x16 (ResNet-50) | 2018-12-10 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 75.4 | 91.9 | | | | | R[2+1]D-Flow (Sports-1M pretrain) | 2017-11-30 |
FASTER Recurrent Networks for Efficient Video Classification | | 75.1 | | | | | | FASTER32 | 2019-06-10 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 75.0 | 92.3 | 10.3x1 | | | | MoViNet-A2 | 2021-03-21 |
MARS: Motion-Augmented RGB Stream for Action Recognition | ✓ Link | 74.9 | | | | | | MARS+RGB+Flow (64 frames) | 2019-06-01 |
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 74.7 | 93.4 | | | | | S3D-G (RGB, ImageNet pretrained) | 2017-12-13 |
TSM: Temporal Shift Module for Efficient Video Understanding | ✓ Link | 74.7 | | | | | | TSM | 2018-11-20 |
$A^2$-Nets: Double Attention Networks | | 74.6 | 91.5 | | | | | A2 Net | 2018-10-27 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 74.3 | 91.4 | | | | | R[2+1]D-RGB (Sports-1M pretrain) | 2017-11-30 |
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition | ✓ Link | 73.9 | 91.1 | | | | | TSN | 2016-08-02 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 73.9 | 90.9 | | | | | R[2+1]D-Two-Stream | 2017-11-30 |
ConvNet Architecture Search for Spatiotemporal Feature Learning | ✓ Link | 73.9 | | | | | | TSN | 2017-08-16 |
STM: SpatioTemporal and Motion Encoding for Action Recognition | | 73.7 | | | | | | STM (ResNet-50) | 2019-08-07 |
More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation | ✓ Link | 73.5 | 91.2 | | | | | bLVNet-TAM | 2019-12-02
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 73.05 | | 6.90x1 | | 32.45 | | Co Slow_64 | 2021-05-31 |
Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification | | 73.0 | 90.9 | | | | | Inception-ResNet | 2017-08-12 |
Multi-Fiber Networks for Video Recognition | | 72.8 | 90.4 | | | | | MFNet | 2018-07-30 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 72.7 | 91.2 | 6.0x1 | | | | MoViNet-A1 | 2021-03-21 |
Appearance-and-Relation Networks for Video Classification | ✓ Link | 72.4 | 90.4 | | | | | ARTNet | 2017-11-24 |
Learning Spatio-Temporal Representation with Local and Global Diffusion | | 72.3 | 90.9 | | | | | LGD-3D Flow (ResNet-101) | 2019-06-13 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 72.0 | 90.0 | | | | | R[2+1]D-RGB | 2017-11-30
FASTER Recurrent Networks for Efficient Video Classification | | 71.7 | | | | | | FASTER16 w/o sp | 2019-06-10 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 71.61 | | 1.25x1 | | 6.15 | | Co X3D-L_64 | 2021-05-31 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 71.1 | 89.3 | | | | | I3D | 2017-05-22 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 71.03 | | 0.33x1 | | 3.79 | | Co X3D-M_64 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 69.29 | | 19.17x1 | | 6.15 | | X3D-L | 2021-05-31 |
MARS: Motion-Augmented RGB Stream for Action Recognition | ✓ Link | 68.9 | | | | | | MARS+RGB+Flow (16 frames) | 2019-06-01 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 68.45 | | 66.25x1 | | 66.25 | | SlowFast-8×8-R50 | 2021-05-31 |
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 68.0 | 87.6 | | | | | S3D-G (Flow, ImageNet pretrained) | 2017-12-13
A Closer Look at Spatiotemporal Convolutions for Action Recognition | ✓ Link | 67.5 | 87.2 | | | | | R[2+1]D-Flow | 2017-11-30 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 67.42 | | 54.87x1 | | 32.45 | | Slow-8x8-R50 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 67.33 | | 0.17x1 | | 3.79 | | Co X3D-S_64 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 67.24 | | 4.97x1 | | 3.79 | | X3D-M | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 67.06 | | 36.46x1 | | 34.48 | | SlowFast-4×16-R50 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 65.90 | | 6.90x1 | | 32.45 | | Co Slow_8 | 2021-05-31 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 65.8 | 87.4 | 2.7x1 | | | | MoViNet-A0 | 2021-03-21 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 64.71 | | 2.06x1 | | 3.79 | | X3D-S | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 63.98 | | 28.61x1 | | 28.04 | | I3D-R50 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 63.03 | | 1.25x1 | | 6.15 | | Co X3D-L_16 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 62.80 | | 0.33x1 | | 3.79 | | Co X3D-M_16 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 60.18 | | 0.17x1 | | 3.79 | | Co X3D-S_13 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 59.58 | | 5.68x1 | | 28.04 | | Co I3D_8 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 59.52 | | 40.71x1 | | 31.51 | | R(2+1)D-18_16 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 59.37 | | 0.64x1 | | 3.79 | | X3D-XS | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 56.86 | | 5.68x1 | | 28.04 | | Co I3D_64 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 53.52 | | 20.35x1 | | 31.51 | | R(2+1)D-18_8 | 2021-05-31 |
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos | ✓ Link | 53.40 | | 4.71x1 | | 12.80 | | RCU_8 | 2021-05-31 |
ViViT: A Video Vision Transformer | ✓ Link | | 94.7 | | | | | ViViT-L/16x2 320 | 2021-03-29 |
Video Transformer Network | ✓ Link | | 93.4 | | | | | ViT-B-VTN (1 layer, ImageNet pretrain) | 2021-02-01 |
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | ✓ Link | | | 225x5 | | | | MViT-B (train from scratch) | 2021-12-02 |
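The rows above are plain pipe-separated records, so they can be loaded, filtered, and re-ranked with a few lines of code. The following is a minimal sketch under stated assumptions: it relies on the column order given in the header row (Paper, Code, Top-1, Top-5, GFLOPs x views, an unlabeled blank column, Params (M), another blank column, Model, Date) and on a hypothetical input file named leaderboard.txt; it is not tied to any particular leaderboard export tool.

```python
# Minimal sketch for parsing the pipe-separated leaderboard rows in this section.
# Assumes the column order shown in the header row and a hypothetical file
# "leaderboard.txt" containing the rows; blank or non-numeric cells become None.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    paper: str
    top1: Optional[float]      # Top-1 accuracy (%), None if not reported
    top5: Optional[float]      # Top-5 accuracy (%), None if not reported
    flops: str                 # e.g. "7140x3": GFLOPs with number of views, kept as text
    params_m: Optional[float]  # parameter count in millions, None if not reported
    model: str
    date: str


def _num(s: str) -> Optional[float]:
    """Parse a numeric cell; blank or non-numeric cells (e.g. the header) become None."""
    try:
        return float(s)
    except ValueError:
        return None


def parse_row(line: str) -> Entry:
    cols = [c.strip() for c in line.split("|")]
    return Entry(
        paper=cols[0],
        top1=_num(cols[2]),
        top5=_num(cols[3]),
        flops=cols[4],
        params_m=_num(cols[6]),
        model=cols[8],
        date=cols[9],
    )


if __name__ == "__main__":
    with open("leaderboard.txt") as f:  # hypothetical file name
        entries = [parse_row(line) for line in f if line.count("|") == 9]

    # Rank by Top-1 accuracy; rows without a Top-1 score (including the header) sort last.
    ranked = sorted(
        entries,
        key=lambda e: e.top1 if e.top1 is not None else -1.0,
        reverse=True,
    )
    for e in ranked[:5]:
        print(f"{e.top1}\t{e.model}\t({e.date})")
```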