Action Classification on Kinetics-600

Leaderboard

Paper	Code	Top-1 Accuracy (%)	Top-5 Accuracy (%)	GFLOPs	Model Name	Release Date
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	✓ Link	91.9			InternVideo2-6B	2024-03-22
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning	✓ Link	91.8	98.9		TubeViT-H	2022-12-06
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	✓ Link	91.6			InternVideo2-1B	2024-03-22
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning	✓ Link	91.5	98.7		TubeViT-L	2022-12-06
InternVideo: General Video Foundation Models via Generative and Discriminative Learning	✓ Link	91.3			InternVideo-T	2022-12-06
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound		91.1	97.1		🍷MerlotReserve-Large (+Audio)	2022-01-07
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning	✓ Link	90.9	97.3		TubeViT-B	2022-12-06
Unmasked Teacher: Towards Training-Efficient Video Foundation Models	✓ Link	90.5	98.8		UMT-L (ViT-L/16)	2023-03-28
Multiview Transformers for Video Recognition	✓ Link	90.3	98.5		MTV-H (WTS 60M)	2022-01-12
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer	✓ Link	90.1	98.5		UniFormerV2-L	2022-09-22
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking	✓ Link	89.9	98.5		VideoMAE V2-g (64x266x266)	2023-03-29
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale	✓ Link	89.8			EVA	2022-11-14
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	✓ Link	89.8	98.3		mPLUG-2	2023-02-01
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound		89.7	96.6		🍷MerlotReserve-Base (+Audio)	2022-01-07
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound		89.4	96.3		🍷MerlotReserve-Large (no Audio)	2022-01-07
CoCa: Contrastive Captioners are Image-Text Foundation Models	✓ Link	89.4			CoCa (finetuned)	2022-05-04
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking	✓ Link	88.8	98.2		VideoMAE V2-g	2023-03-29
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles	✓ Link	88.8			Hiera-H (no extra data)	2023-06-01
CoCa: Contrastive Captioners are Image-Text Foundation Models	✓ Link	88.5			CoCa (frozen)	2022-05-04
Masked Feature Prediction for Self-Supervised Visual Pre-Training	✓ Link	88.3	98.0		MaskFeat (no extra data, MViT-L)	2021-12-16
Expanding Language-Image Pretrained Models for General Video Recognition	✓ Link	88.3	97.7		X-CLIP (ViT-L/14, CLIP)	2022-08-04
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound		88.1	95.8		🍷MerlotReserve-Base (no Audio)	2022-01-07
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection	✓ Link	87.9	97.9		MViTv2-L (ImageNet-21k pretrain)	2021-12-02
Co-training Transformer with Videos and Images Improves Action Recognition		87.9	97.8		CoVeR (JFT-3B)	2021-12-14
Florence: A New Foundation Model for Computer Vision	✓ Link	87.8	97.9		Florence (curated FLD-900M pretrain)	2021-11-22
Co-training Transformer with Videos and Images Improves Action Recognition		86.8	97.3		CoVeR (JFT-300M)	2021-12-14
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?	✓ Link	86.3	97.0		TokenLearner 16at18 w. Fuser (L/10)	2021-06-21
Video Swin Transformer	✓ Link	86.1	97.3		Swin-L (384x384, ImageNet-21k pretrain)	2021-06-24
ViViT: A Video Vision Transformer	✓ Link	85.8	96.5		ViViT-H/16x2 (JFT)	2021-03-29
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection	✓ Link	85.5			MViTv2-L (train from scratch)	2021-12-02
UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning	✓ Link	84.8	96.7	259x4	UniFormer-B (ImageNet-1K)	2021-09-29
Space-time Mixing Attention for Video Transformer	✓ Link	84.5	96.3		XViT (x16)	2021-06-10
MoViNets: Mobile Video Networks for Efficient Video Recognition	✓ Link	84.3	96.4	281x1	MoViNet-A5 (AutoAugment)	2021-03-21
ViViT: A Video Vision Transformer	✓ Link	84.3	95.6		ViViT-L/16x2	2021-03-29
Video Swin Transformer	✓ Link	84.0	96.5		Swin-B (ImageNet-21k pretrain)	2021-06-24
Multiscale Vision Transformers	✓ Link	83.8	96.3		MViT-B-24, 32x3	2021-04-22
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text	✓ Link	83.6	96.6		VATT-Large	2021-04-22
MoViNets: Mobile Video Networks for Efficient Video Recognition	✓ Link	83.5	96.5	386x1	MoViNet-A6	2021-03-21
Multiscale Vision Transformers	✓ Link	83.4	96.3		MViT-B, 32x3	2021-04-22
Learning Spatio-Temporal Representation with Local and Global Diffusion		83.1	96.2		LGD-3D Two-stream	2019-06-13
Revisiting 3D ResNets for Video Recognition	✓ Link	83.1			R3D-RS-200	2021-09-03
ViViT: A Video Vision Transformer	✓ Link	83.0	95.7		ViViT-L/16x2 (320x320)	2021-03-29
MoViNets: Mobile Video Networks for Efficient Video Recognition	✓ Link	82.7	95.7	281x1	MoViNet-A5	2021-03-21
Multiscale Vision Transformers	✓ Link	82.1	95.7		MViT-B, 16x4	2021-04-22
PERF-Net: Pose Empowered RGB-Flow Net		82.0	95.7		PERF-Net (distilled ResNet50-G)	2020-09-28
SlowFast Networks for Video Recognition	✓ Link	81.8	95.1		SlowFast 16x8 (ResNet-101 + NL)	2018-12-10
Learning Spatio-Temporal Representation with Local and Global Diffusion		81.5	95.6		LGD-3D RGB	2019-06-13
MoViNets: Mobile Video Networks for Efficient Video Recognition	✓ Link	81.2	94.9	105x1	MoViNet-A4	2021-03-21
SlowFast Networks for Video Recognition	✓ Link	81.1	95.1		SlowFast 16x8 (ResNet-101)	2018-12-10
MoViNets: Mobile Video Networks for Efficient Video Recognition	✓ Link	80.8	80.8	56.9x1	MoViNet-A3	2021-03-21
SlowFast Networks for Video Recognition	✓ Link	80.4	94.8		SlowFast 8x8 (ResNet-101)	2018-12-10
SlowFast Networks for Video Recognition	✓ Link	79.9	94.5		SlowFast 8x8 (ResNet-50)	2018-12-10
D3D: Distilled 3D Networks for Video Action Recognition	✓ Link	79.1			D3D+S3D-G	2018-12-19
SlowFast Networks for Video Recognition	✓ Link	78.8	94.0		SlowFast 4x16 (ResNet-50)	2018-12-10
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification	✓ Link	78.6			S3D-G (RGB+Flow)	2017-12-13
D3D: Distilled 3D Networks for Video Action Recognition	✓ Link	77.9			D3D	2018-12-19
MoViNets: Mobile Video Networks for Efficient Video Recognition	✓ Link	77.5	93.4	10.3x1	MoViNet-A2	2021-03-21
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification	✓ Link	76.6			S3D-G (RGB)	2017-12-13
MoViNets: Mobile Video Networks for Efficient Video Recognition	✓ Link	76.0	92.6	6.0x1	MoViNet-A1	2021-03-21
Learning Spatio-Temporal Representation with Local and Global Diffusion		75.0	92.4		LGD-3D Flow	2019-06-13
A Short Note about Kinetics-600	✓ Link	73.6			I3D (RGB)	2018-08-03
MoViNets: Mobile Video Networks for Efficient Video Recognition	✓ Link	71.5	90.4	2.7x1	MoViNet-A0	2021-03-21
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification	✓ Link	69.7			S3D-G (Flow)	2017-12-13
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection	✓ Link		97.2		MViTv2-B (train from scratch)	2021-12-02
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection	✓ Link			206x5	MViT-L (train from scratch)	2021-12-02
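
The GFLOPs column uses entries such as 259x4 or 2.7x1. A minimal sketch of reading the tab-separated rows and turning that cell into a single number follows. It assumes the value before the "x" is the cost of one inference view and the value after it is the number of views, which the page itself does not state. The names `SAMPLE_ROWS`, `to_float`, `total_gflops`, and `load_rows` are hypothetical and introduced only for illustration.

```python
import csv
import io

# Three rows copied from the leaderboard, tab-separated with blank cells preserved.
SAMPLE_ROWS = (
    "UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning"
    "\t✓ Link\t84.8\t96.7\t259x4\tUniFormer-B (ImageNet-1K)\t2021-09-29\n"
    "MoViNets: Mobile Video Networks for Efficient Video Recognition"
    "\t✓ Link\t71.5\t90.4\t2.7x1\tMoViNet-A0\t2021-03-21\n"
    "InternVideo2: Scaling Foundation Models for Multimodal Video Understanding"
    "\t✓ Link\t91.9\t\t\tInternVideo2-6B\t2024-03-22\n"
)

# Column positions, assumed from the order of the header row.
TOP1, GFLOPS, MODEL = 2, 4, 5


def to_float(cell):
    """Convert an accuracy cell to float; blank cells become None."""
    cell = cell.strip().rstrip("%")
    return float(cell) if cell else None


def total_gflops(cell):
    """Return cost * views for a cell like '259x4', or None for a blank cell.

    Assumption: the number before 'x' is GFLOPs per view and the number
    after it is the count of inference views.
    """
    cell = cell.strip()
    if not cell:
        return None
    cost, _, views = cell.partition("x")
    return float(cost) * (int(views) if views else 1)


def load_rows(text):
    """Parse tab-separated leaderboard rows into (model, top1, total_gflops)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(r[MODEL], to_float(r[TOP1]), total_gflops(r[GFLOPS])) for r in reader if r]


if __name__ == "__main__":
    for model, top1, cost in load_rows(SAMPLE_ROWS):
        print(model, top1, cost)
```

Rows are indexed by position rather than by header name, so the sketch does not depend on the exact header wording. Most entries leave the GFLOPs cell blank, so any cost comparison built this way covers only the few models that report it.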