Paper | Code | Action@1 | Verb@1 | Noun@1 | GFLOPs | Model | Date |
LLaVAction: evaluating and training multi-modal large language models for action recognition | ✓ Link | 58.3 | 76 | 69 | | LLaVAction | 2025-03-24 |
TIM: A Time Interval Machine for Audio-Visual Action Recognition | ✓ Link | 56.4 | 76.2 | 66.4 | | TIM | 2024-04-08 |
Training a Large Video Model on a Single Machine in a Day | ✓ Link | 54.4 | 73.0 | 65.4 | | Avion (ViT-L) | 2023-09-28 |
M&M Mix: A Multimodal Multiview Transformer Ensemble | | 53.6 | 72.0 | 66.3 | | M&M (WTS 60M) | 2022-06-20 |
Extending Video Masked Autoencoders to 128 frames | | 52.1 | 75.0 | 61.8 | | LVMAE | 2024-11-20 |
Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 51.8 | 71.7 | 64.1 | | TAdaFormer-L/14 | 2023-08-10 |
Learning Video Representations from Large Language Models | ✓ Link | 51 | 72 | 62.9 | | LaViLa (TimeSformer-L) | 2022-12-08 |
Multiview Transformers for Video Recognition | ✓ Link | 50.5 | 69.9 | 63.9 | | MTV-B (WTS 60M) | 2022-01-12 |
Omnivore: A Single Model for Many Visual Modalities | ✓ Link | 49.9 | 69.5 | 61.7 | | OMNIVORE (Swin-B, finetuned) | 2022-01-20 |
CAST: Cross-Attention in Space and Time for Video Action Recognition | ✓ Link | 49.3 | 72.5 | 60.9 | | CAST (ViT-B/16) | 2023-11-30 |
Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 48.9 | 71.0 | 60.2 | | TAdaConvNeXtV2-S | 2023-08-10 |
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition | ✓ Link | 48.4 | 71.4 | 60.3 | | MeMViT-24 | 2022-01-20 |
Multiscale Multimodal Transformer for Multimodal Action Recognition | | 47.8 | 70.1 | 61.0 | | MMT | 2022-09-22 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 47.7 | 72.2 | 57.3 | 117x1 | MoViNet-A6 | 2021-03-21 |
AVT: Audio-Video Transformer for Multimodal Action Recognition | | 47.2 | 70.4 | 59.3 | | AVT | 2022-09-22 |
Object-Region Video Transformers | ✓ Link | 45.7 | 68.4 | 58.7 | | ORViT Mformer-L (ORViT blocks) | 2021-10-13 |
Technical Report: Temporal Aggregate Representations | ✓ Link | 45.26 | 66 | 53.35 | | TempAgg | 2021-06-06 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 44.5 | 69.1 | 55.1 | 74.9x1 | MoViNet-A5 | 2021-03-21 |
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ Link | 44.5 | 67.0 | 58.5 | | Mformer-HR | 2021-06-09 |
Gate-Shift-Fuse for Video Action Recognition | ✓ Link | 44.48 | 69.06 | 53.18 | | GSF | 2022-03-16 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 44.4 | 68.8 | 56.2 | 42.2x1 | MoViNet-A4 | 2021-03-21 |
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ Link | 44.1 | 67.1 | 57.6 | | Mformer-L | 2021-06-09 |
ViViT: A Video Vision Transformer | ✓ Link | 44.0 | 66.4 | 56.8 | | ViViT-L/16x2 Fact. encoder | 2021-03-29 |
Attention Bottlenecks for Multimodal Fusion | ✓ Link | 43.4 | 64.8 | 58 | | MBT | 2021-06-30 |
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | ✓ Link | 43.1 | 66.7 | 56.5 | | Mformer | 2021-06-09 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 41.2 | 67.1 | 52.3 | 7.59x1 | MoViNet-A2 | 2021-03-21 |
Rescaling Egocentric Vision | ✓ Link | 37.39 | | | | TSM | 2020-06-23 |
Rescaling Egocentric Vision | ✓ Link | 36.81 | | | | SlowFast | 2020-06-23 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 36.8 | 64.8 | 47.4 | 1.74x1 | MoViNet-A0 | 2021-03-21 |
Rescaling Egocentric Vision | ✓ Link | 35.55 | | | | TBN | 2020-06-23 |
Rescaling Egocentric Vision | ✓ Link | 35.28 | | | | TRN | 2020-06-23 |
Rescaling Egocentric Vision | ✓ Link | 33.57 | | | | TSN | 2020-06-23 |