VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 0.496 | VLAB | 2023-05-22 |
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | ✓ Link | 0.495 | MaMMUT | 2023-03-29 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 0.480 | mPLUG-2 | 2023-02-01 |
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | | 0.478 | MuLTI | 2023-03-10 |
Flamingo: a Visual Language Model for Few-Shot Learning | ✓ Link | 0.474 | Flamingo | 2022-04-29 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 0.471 | InternVideo | 2022-12-06 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 0.471 | UMT-L (ViT-L/16) | 2023-03-28 |
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 0.470 | FrozenBiLM+ | 2023-08-18 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 0.470 | vid-TLDR (UMT-L) | 2024-03-20 |
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 0.463 | VideoCoCa | 2022-12-09 |
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | ✓ Link | 0.462 | HBI | 2023-03-25 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 0.459 | HiTeA | 2022-12-30 |
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | ✓ Link | 0.458 | EMCL-Net | 2022-11-21 |
Video Question Answering with Iterative Video-Text Co-Tokenization | | 0.457 | Co-Tokenization | 2022-08-01 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 0.455 | X2-VLM (large) | 2022-11-22 |
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 0.45 | X2-VLM (base) | 2022-11-22 |
All in One: Exploring Unified Video-Language Pre-training | ✓ Link | 0.443 | All-in-one-B | 2022-03-14 |
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 0.441 | OmniVL | 2022-09-15 |
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 0.441 | Clover | 2022-07-16 |
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ Link | 0.440 | AIO+MIF | 2023-07-09 |
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ Link | 0.438 | AIO+MDF | 2023-07-09 |
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ Link | 0.423 | GIT+MDF | 2023-07-09 |
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ Link | 0.421 | ALPRO | 2021-12-17 |
Lightweight Recurrent Cross-modal Encoder for Video Question Answering | ✓ Link | 0.42 | LRCE | 2023-06-30 |
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 0.418 | JustAsk+ | 2023-08-18 |
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 0.395 | All-in-one+ | 2023-08-18 |
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | ✓ Link | 0.374 | CLIPBERT | 2021-02-11 |
Hierarchical Conditional Relation Networks for Video Question Answering | ✓ Link | 0.356 | HCRN | 2020-02-25 |
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | ✓ Link | 0.355 | DualVGR | 2021-07-10 |
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering | ✓ Link | 0.33 | HMEMA | 2019-04-08 |
Motion-Appearance Co-Memory Networks for Video Question Answering | | 0.32 | Co-Mem | 2018-03-29 |
Flamingo: a Visual Language Model for Few-Shot Learning | ✓ Link | 0.310 | Flamingo (32-shot) | 2022-04-29 |
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | ✓ Link | 0.309 | ST-VQA | 2017-04-14 |
Flamingo: a Visual Language Model for Few-Shot Learning | ✓ Link | 0.174 | Flamingo (0-shot) | 2022-04-29 |