OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | | 53.1 | | OmniVec2 | 2024-01-01 |
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | ✓ Link | 50.9 | | InternVideo2-1B | 2024-03-22 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 48.7 | 78.2 | UMT-L (ViT-L/16) | 2023-03-28 |
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | ✓ Link | 47.8 | 76.9 | UniFormerV2-L | 2022-09-22 |
Multiview Transformers for Video Recognition | ✓ Link | 47.2 | 75.7 | MTV-H (WTS 60M) | 2022-01-12 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 46.1 | 75.4 | CoVeR (JFT-3B) | 2021-12-14 |
Co-training Transformer with Videos and Images Improves Action Recognition | | 45.0 | 73.9 | CoVeR (JFT-300M) | 2021-12-14 |
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | ✓ Link | 41.1 | 67.7 | VATT-Large | 2021-04-22 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 40.2 | | MoViNet-A6 | 2021-03-21 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 39.1 | | MoViNet-A5 | 2021-03-21 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 37.9 | | MoViNet-A4 | 2021-03-21 |
Video Transformer Network | ✓ Link | 37.4 | 65.4 | VTN | 2021-02-01 |
Attention Bottlenecks for Multimodal Fusion | ✓ Link | 37.3 | 61.2 | MBT (AV) | 2021-06-30 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 35.6 | | MoViNet-A3 | 2021-03-21 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 34.3 | | MoViNet-A2 | 2021-03-21 |
AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures | ✓ Link | 34.27 | 62.71 | AssembleNet | 2019-05-30 |
Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 33.56 | 58.49 | SRTG r3d-101 | 2020-06-15 |
Collaborative Spatiotemporal Feature Learning for Video Action Recognition | ✓ Link | 32.4 | 60.0 | CoST (ResNet-101, 32 frames) | 2019-06-01 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 32.0 | | MoViNet-A1 | 2021-03-21 |
Evolving Space-Time Neural Architectures for Videos | | 31.8 | | EvaNet | 2018-11-26 |
Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 31.60 | 56.80 | SRTG r(2+1)d-50 | 2020-06-15 |
Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 30.72 | 55.65 | SRTG r3d-50 | 2020-06-15 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | ✓ Link | 29.51 | 56.06 | I3D | 2017-05-22 |
Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 28.97 | 54.18 | SRTG r(2+1)d-34 | 2020-06-15 |
Learn to cycle: Time-consistent feature discovery for action recognition | ✓ Link | 28.55 | 52.35 | SRTG r3d-34 | 2020-06-15 |
Temporal Relational Reasoning in Videos | ✓ Link | 28.27 | 53.87 | TRN-Multiscale | 2017-11-22 |
MoViNets: Mobile Video Networks for Efficient Video Recognition | ✓ Link | 27.5 | | MoViNet-A0 | 2021-03-21 |
ViViT: A Video Vision Transformer | ✓ Link | | 64.9 | ViViT-L/16x2 | 2021-03-29 |
Temporal Segment Networks for Action Recognition in Videos | ✓ Link | | 50.10 | TSN-2Stream | 2017-05-08 |