Action Recognition on EPIC-KITCHENS-100

Action Recognition

Leaderboard

Paper	Code	Action@1 (%)	Verb@1 (%)	Noun@1 (%)	GFLOPs x views	Model	Release date
LLaVAction: evaluating and training multi-modal large language models for action recognition	✓ Link	58.3	76	69		LLaVAction	2025-03-24
TIM: A Time Interval Machine for Audio-Visual Action Recognition	✓ Link	56.4	76.2	66.4		TIM	2024-04-08
Training a Large Video Model on a Single Machine in a Day	✓ Link	54.4	73.0	65.4		Avion (ViT-L)	2023-09-28
M&M Mix: A Multimodal Multiview Transformer Ensemble		53.6	72.0	66.3		M&M (WTS 60M)	2022-06-20
Extending Video Masked Autoencoders to 128 frames		52.1	75.0	61.8		LVMAE	2024-11-20
Temporally-Adaptive Models for Efficient Video Understanding	✓ Link	51.8	71.7	64.1		TAdaFormer-L/14	2023-08-10
Learning Video Representations from Large Language Models	✓ Link	51	72	62.9		LaViLa (TimeSformer-L)	2022-12-08
Multiview Transformers for Video Recognition	✓ Link	50.5	69.9	63.9		MTV-B (WTS 60M)	2022-01-12
Omnivore: A Single Model for Many Visual Modalities	✓ Link	49.9	69.5	61.7		OMNIVORE (Swin-B, finetuned)	2022-01-20
CAST: Cross-Attention in Space and Time for Video Action Recognition	✓ Link	49.3	72.5	60.9		CAST (ViT-B/16)	2023-11-30
Temporally-Adaptive Models for Efficient Video Understanding	✓ Link	48.9	71.0	60.2		TAdaConvNeXtV2-S	2023-08-10
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition	✓ Link	48.4	71.4	60.3		MeMViT-24	2022-01-20
Multiscale Multimodal Transformer for Multimodal Action Recognition		47.8	70.1	61.0		MMT	2022-09-22
MoViNets: Mobile Video Networks for Efficient Video Recognition	✓ Link	47.7	72.2	57.3	117x1	MoViNet-A6	2021-03-21
AVT: Audio-Video Transformer for Multimodal Action Recognition		47.2	70.4	59.3		AVT	2022-09-22
Object-Region Video Transformers	✓ Link	45.7	68.4	58.7		ORViT Mformer-L (ORViT blocks)	2021-10-13
Technical Report: Temporal Aggregate Representations	✓ Link	45.26	66	53.35		TempAgg	2021-06-06
MoViNets: Mobile Video Networks for Efficient Video Recognition	✓ Link	44.5	69.1	55.1	74.9x1	MoViNet-A5	2021-03-21
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers	✓ Link	44.5	67.0	58.5		Mformer-HR	2021-06-09
Gate-Shift-Fuse for Video Action Recognition	✓ Link	44.48	69.06	53.18		GSF	2022-03-16
MoViNets: Mobile Video Networks for Efficient Video Recognition	✓ Link	44.4	68.8	56.2	42.2x1	MoViNet-A4	2021-03-21
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers	✓ Link	44.1	67.1	57.6		Mformer-L	2021-06-09
ViViT: A Video Vision Transformer	✓ Link	44.0	66.4	56.8		ViViT-L/16x2 Fact. encoder	2021-03-29
Attention Bottlenecks for Multimodal Fusion	✓ Link	43.4	64.8	58		MBT	2021-06-30
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers	✓ Link	43.1	66.7	56.5		Mformer	2021-06-09
MoViNets: Mobile Video Networks for Efficient Video Recognition	✓ Link	41.2	67.1	52.3	7.59x1	MoViNet-A2	2021-03-21
Rescaling Egocentric Vision	✓ Link	37.39				TSM	2020-06-23
Rescaling Egocentric Vision	✓ Link	36.81				SlowFast	2020-06-23
MoViNets: Mobile Video Networks for Efficient Video Recognition	✓ Link	36.8	64.8	47.4	1.74x1	MoViNet-A0	2021-03-21
Rescaling Egocentric Vision	✓ Link	35.55				TBN	2020-06-23
Rescaling Egocentric Vision	✓ Link	35.28				TRN	2020-06-23
Rescaling Egocentric Vision	✓ Link	33.57				TSN	2020-06-23
