Paper | Code | Top-1 Accuracy (%) | Top-5 Accuracy (%) | Params | GFLOPs | Model | Date |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 70.0 | | | | InternVideo | 2022-12-06 |
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | ✓ Link | 68.7 | 91.9 | | | VideoMAE V2-g | 2023-03-29 |
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | ✓ Link | 67.3 | 88.8 | | | Side4Video (EVA ViT-E/14) | 2023-11-27 |
What Can Simple Arithmetic Operations Do for Temporal Modeling? | ✓ Link | 65.6 | 88.6 | | | ATM | 2023-07-18 |
Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 63.7 | | | | TAdaFormer-L/14 | 2023-08-10 |
TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning | ✓ Link | 63.0 | 87.8 | | | TDS-CLIP-ViT-L/14(8frames) | 2024-08-20 |
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | ✓ Link | 62.7 | 88.0 | | | UniFormerV2-L | 2022-09-22 |
Learning Correlation Structures for Vision Transformers | | 61.3 | | | | StructViT-B-4-1 | 2024-04-05 |
UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ Link | 60.9 | 87.3 | 50.1M | 259x3 | UniFormer-B (IN-1K + Kinetics400) | 2021-09-29 |
Temporally-Adaptive Models for Efficient Video Understanding | ✓ Link | 60.7 | | | | TAdaConvNeXtV2-B | 2023-08-10 |
Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition | ✓ Link | 58.3 | | | | TPS | 2022-07-27 |
Multi-scale Motion-Aware Module for Video Action Recognition | | 57.9 | | | | MSMA (8+16frames) | 2023-02-19 |
UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | ✓ Link | 57.6 | 84.9 | 21.4M | 41.8x3 | UniFormer-B (IN-1K + Kinetics600) | 2021-09-29 |
Stand-Alone Inter-Frame Attention in Video Models | ✓ Link | 57.3 | | | | SIFA | 2022-06-14 |
EAN: Event Adaptive Network for Enhanced Action Recognition | ✓ Link | 57.2 | 83.9 | | | EAN ResNet50 (single clip, center crop,8+16 ensemble, with sparse Transformer) | 2021-07-22 |
Motion-driven Visual Tempo Learning for Video-based Action Recognition | ✓ Link | 57.2 | | | | TCM (Ensemble) | 2022-02-24 |
Busy-Quiet Video Disentangling for Video Classification | ✓ Link | 57.1 | 84.2 | | | BQNEn (ImageNet + K400 pretrained) | 2021-03-29 |
TDN: Temporal Difference Networks for Efficient Action Recognition | ✓ Link | 56.8 | 84.1 | | | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 2020-12-18 |
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ Link | 56.6 | 84.4 | | | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | 2021-02-14 |
CT-Net: Channel Tensorization Network for Video Classification | ✓ Link | 56.6 | | | | CT-Net Ensemble (R50, 8+12+16+24) | 2021-06-03 |
Action Recognition With Motion Diversification and Dynamic Selection | | 56.6 | | | | MoDS (8+16frames) | 2022-07-15 |
MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing | | 56.5 | | | | MLP-3D | 2022-06-13 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | 56.1 | 82.8 | | | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | 2021-11-02 |
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ Link | 55.8 | 83.9 | | | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | 2021-02-14 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | 55.5 | 82.6 | | | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
PAN: Towards Fast Action Recognition via Learning Persistence of Appearance | ✓ Link | 55.3 | 82.8 | | | PAN ResNet101 (RGB only, no Flow) | 2020-08-08 |
Gate-Shift Networks for Video Action Recognition | ✓ Link | 55.16 | | | | GSM Ensemble InceptionV3 (ImageNet pretrained) | 2019-12-01 |
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 55.1 | | | | MSNet-R50En (ensemble) | 2020-07-20 |
AE-Net: Adjoint Enhancement Network for Efficient Action Recognition in Video Understanding | | 55.0 | | | | AE-Net (8+16frames) | 2022-07-21 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 54.59 | 82.30 | 5.8M | 20.9x6 | VoV3D-L (32frames, Kinetics pretrained, single) | 2020-12-01 |
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 54.4 | 83.8 | | | MSNet-R50En (8+16 ensemble, ImageNet pretrained) | 2020-07-20 |
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | ✓ Link | 54.3 | 82.9 | | | SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | 2021-02-14 |
Region-based Non-local Operation for Video Classification | ✓ Link | 54.1 | 82.2 | | | RNL+TSM Ensemble(R50+R101, ImageNet pretrained) | 2020-07-17 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | 54.0 | 81.1 | | | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
MVFNet: Multi-View Fusion Network for Efficient Video Recognition | ✓ Link | 54.0 | | | | MVFNet-R50EN | 2020-12-13 |
Spatial-Temporal Pyramid Graph Reasoning for Action Recognition | | 53.5 | | | | STPG (8+16frames) | 2022-08-09 |
Action recognition with spatial-temporal discriminative filter banks | | 53.4 | | | | GB + DF + LB (ResNet152, ImageNet pretrained) | 2019-08-20 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 53.3 | | | | ip-CSN-152 (IG-65M pretraining) | 2019-04-04 |
MARS: Motion-Augmented RGB Stream for Action Recognition | ✓ Link | 53.0 | | | | MARS+RGB+Flow (64 frames, Kinetics pretrained) | 2019-06-01 |
Region-based Non-local Operation for Video Classification | ✓ Link | 52.7 | 81.5 | | | RNL+TSM Ensemble(ResNet50, ImageNet pretrained) | 2020-07-17 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 52.68 | 80.43 | 3.3M | 11.5x6 | VoV3D-M (32frames, Kinetics pretrained, single) | 2020-12-01 |
Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention | | 52.6 | 81.3 | | | TSM+W3 (16 frames, ResNet50) | 2020-04-02 |
Action Keypoint Network for Efficient Video Recognition | | 52.5 | | | | AK-Net | 2022-01-17 |
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 52.1 | 82.3 | | | MSNet-R50 (16 frames, ImageNet pretrained) | 2020-07-20 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 52.1 | | | | ir-CSN-152 (IG-65M pretraining) | 2019-04-04 |
Relational Self-Attention: What's Missing in Attention for Video Understanding | ✓ Link | 51.9 | 79.6 | | | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | 2021-11-02 |
Gate-Shift Networks for Video Action Recognition | ✓ Link | 51.68 | | | | GSM InceptionV3 (16 frames, ImageNet pretrained) | 2019-12-01 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 51.6 | | | | R(2+1)D-152 (IG-65M pretraining) | 2019-04-04 |
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | ✓ Link | 50.9 | 80.3 | | | MSNet-R50 (8 frames, ImageNet pretrained) | 2020-07-20 |
TSM: Temporal Shift Module for Efficient Video Understanding | ✓ Link | 50.7 | | | | TSM (RGB + Flow) | 2018-11-20 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 50.6 | 78.7 | 5.8M | 20.9x6 | VoV3D-L (32frames, from scratch, single) | 2020-12-01 |
Moments in Time Dataset: one million videos for event understanding | ✓ Link | 50 | | | | ResNet50 I3D (Moments pretrained) | 2018-01-09 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 49.8 | 78.0 | 3.3M | 11.5x6 | VoV3D-M (32frames, from scratch, single) | 2020-12-01 |
TSM: Temporal Shift Module for Efficient Video Understanding | ✓ Link | 49.7 | 78.5 | | | TSMEn | 2018-11-20 |
Temporal Reasoning Graph for Activity Recognition | | 49.7 | | | | TRG (Inception-V3) | 2019-08-27 |
Temporal Reasoning Graph for Activity Recognition | | 49.5 | 86.1 | | | TRG (ResNet-50) | 2019-08-27 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 49.5 | 78.0 | 5.8M | 9.3x6 | VoV3D-L (16frames, from scratch, single) | 2020-12-01 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 49.3 | | | | ir-CSN-152 | 2019-04-04 |
Recurrent Space-time Graph Neural Networks | ✓ Link | 49.2 | | | | RSTG (Kinetics pretrained) | 2019-04-11 |
Moments in Time Dataset: one million videos for event understanding | ✓ Link | 48.6 | | | | ResNet50 I3D (Kinetics pretrained) | 2018-01-09 |
Video Classification with Channel-Separated Convolutional Networks | ✓ Link | 48.4 | | | | ir-CSN-101 | 2019-04-04 |
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 48.2 | 78.7 | | | S3D-G (ImageNet pretrained) | 2017-12-13 |
Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | ✓ Link | 48.1 | 76.9 | 3.3M | 5.7x6 | VoV3D-M (16frames, from scratch, single) | 2020-12-01 |
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | ✓ Link | 47.3 | 78.1 | | | S3D | 2017-12-13 |
TSM: Temporal Shift Module for Efficient Video Understanding | ✓ Link | 47.2 | 77.1 | | | TSM | 2018-11-20 |
ECO: Efficient Convolutional Network for Online Video Understanding | ✓ Link | 46.4 | | | | ECO-Net (ImageNet pretrained) | 2018-04-24 |
ECO: Efficient Convolutional Network for Online Video Understanding | ✓ Link | 46.4 | | | | ECO-Net | 2018-04-24 |
Videos as Space-Time Region Graphs | | 46.1 | | | | NL I3D + GCN | 2018-06-05 |
Non-local Neural Networks | ✓ Link | 44.4 | | | | NL I3D | 2017-11-21 |
Motion Feature Network: Fixed Motion Filter for Action Recognition | | 43.9 | | | | Motion Feature Net | 2018-07-26 |
Temporal Relational Reasoning in Videos | ✓ Link | 42.01 | | | | 2-Stream TRN | 2017-11-22 |
Hierarchical Feature Aggregation Networks for Video Action Recognition | | 41.97 | | | | HF-TSN (ImageNet pretraining) | 2019-05-29 |
MARS: Motion-Augmented RGB Stream for Action Recognition | ✓ Link | 40.4 | | | | MARS+RGB+Flow (16 frames, Kinetics pretrained) | 2019-06-01 |
Temporal Relational Reasoning in Videos | ✓ Link | 34.4 | | | | M-TRN | 2017-11-22 |