Paper | Code | Accuracy | Model | Date
--- | --- | --- | --- | ---
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 0.61 | VLAB | 2023-05-22
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ Link | 0.606 | MA-LMM | 2024-04-08 |
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | ✓ Link | 0.602 | MaMMUT | 2023-03-29
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | ✓ Link | 0.60 | VALOR | 2023-04-17 |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | ✓ Link | 0.60 | VAST | 2023-05-29 |
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | ✓ Link | 0.60 | COSA | 2023-06-15 |
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | ✓ Link | 0.581 | mPLUG-2 | 2023-02-01 |
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 0.569 | VideoCoCa | 2022-12-09 |
GIT: A Generative Image-to-text Transformer for Vision and Language | ✓ Link | 0.568 | GIT | 2022-05-27 |
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 0.558 | FrozenBiLM+ | 2023-08-18 |
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 0.556 | HiTeA | 2022-12-30 |
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | ✓ Link | 0.555 | InternVideo | 2022-12-06 |
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | ✓ Link | 0.552 | UMT-L (ViT-L/16) | 2023-03-28 |
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | ✓ Link | 0.549 | vid-TLDR (UMT-L) | 2024-03-20 |
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | ✓ Link | 0.547 | VIOLETv2 | 2022-09-04 |
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | | 0.547 | MuLTI | 2023-03-10 |
X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 0.546 | X2-VLM (large) | 2022-11-22
X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | ✓ Link | 0.528 | X2-VLM (base) | 2022-11-22
Clover: Towards A Unified Video-Language Alignment and Fusion Model | ✓ Link | 0.524 | Clover | 2022-07-16 |
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | ✓ Link | 0.517 | VIOLET + MELTR | 2023-03-23 |
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 0.510 | OmniVL | 2022-09-15
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 0.495 | VIOLET+ | 2023-08-18 |
Video Question Answering with Iterative Video-Text Co-Tokenization | | 0.486 | Co-Tokenization | 2022-08-01
All in One: Exploring Unified Video-Language Pre-training | ✓ Link | 0.483 | All-in-one-B | 2022-03-14 |
Lightweight Recurrent Cross-modal Encoder for Video Question Answering | ✓ Link | 0.478 | LRCE | 2023-06-30 |
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 0.477 | JustAsk+ | 2023-08-18 |
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ Link | 0.469 | GIT+MDF | 2023-07-09
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | ✓ Link | 0.467 | AIO+MIF | 2023-07-09
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | ✓ Link | 0.459 | ALPRO | 2021-12-17 |
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | ✓ Link | 0.438 | All-in-one+ | 2023-08-18 |
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | ✓ Link | 0.390 | DualVGR | 2021-07-10 |
Hierarchical Conditional Relation Networks for Video Question Answering | ✓ Link | 0.361 | HCRN | 2020-02-25 |
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | ✓ Link | 0.351 | SSML | 2020-03-06 |
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering | ✓ Link | 0.337 | HMEMA | 2019-04-08 |
Motion-Appearance Co-Memory Networks for Video Question Answering | | 0.317 | Co-Mem | 2018-03-29 |
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | ✓ Link | 0.313 | ST-VQA | 2017-04-14 |
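
The table can also be consumed programmatically. Below is a minimal sketch (my own, not from any listed paper) that parses rows in the Paper | Code | Accuracy | Model | Date layout above into records and orders them by accuracy; the function name `parse_leaderboard` and the dict keys are illustrative assumptions.

```python
def parse_leaderboard(text: str) -> list[dict]:
    """Parse the pipe-separated leaderboard above into sorted records."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        if len(cells) != 5:
            continue  # not a 5-column table row
        paper, code, acc, model, date = cells
        try:
            accuracy = float(acc)  # e.g. "0.606"
        except ValueError:
            continue  # skip the header row and the "---" separator
        rows.append({
            "paper": paper,
            "has_code": "✓" in code,  # checkmark marks a released implementation
            "accuracy": accuracy,
            "model": model,
            "date": date,
        })
    # Best-performing entries first, matching the table's ordering.
    return sorted(rows, key=lambda r: r["accuracy"], reverse=True)
```

For example, `parse_leaderboard(table_text)[0]["model"]` returns `VLAB`, the current top entry.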