Combining Global and Local Attention with Positional Encoding for Video Summarization | ✓ Link | 57.1 | | | | PGL-SUM (maximum learning capacity) | 2021-12-01 |
Combining Global and Local Attention with Positional Encoding for Video Summarization | ✓ Link | 55.6 | | | | PGL-SUM | 2021-12-01 |
Align and Attend: Multimodal Summarization with Dual Contrastive Losses | ✓ Link | 55.0 | | 0.108 | 0.129 | A2Summ | 2023-03-13 |
Joint Video Summarization and Moment Localization by Cross-Task Sample Transfer | | 54.5 | 56.9 | 0.101 | 0.119 | iPTNet | 2022-01-01 |
Query Twice: Dual Mixture Attention Meta Learning for Video Summarization | | 54.3 | | 0.063 | 0.089 | DMASum | 2020-08-19 |
CLIP-It! Language-Guided Video Summarization | ✓ Link | 54.2 | 56.4 | | | CLIP-It | 2021-07-01 |
Relational Reasoning Over Spatial-Temporal Graphs for Video Summarization | | 53.4 | 54.8 | 0.211 | 0.234 | RR-STG | 2022-04-06 |
Supervised Video Summarization via Multiple Feature Sets with Parallel Attention | ✓ Link | 53.4 | | 0.200 | 0.230 | MSVA | 2021-04-23 |
Supervised Video Summarization via Multiple Feature Sets with Parallel Attention | ✓ Link | 51.6 | | | | MC-VSA [DBLP:journals/corr/abs-2006-01410] | 2021-04-23 |
Progressive Video Summarization via Multimodal Self-supervised Learning | ✓ Link | 50.7 | | 0.192 | 0.257 | SSPVS(+Text) | 2022-01-07 |
Video Joint Modelling Based on Hierarchical Transformer for Co-summarization | ✓ Link | 50.6 | 51.7 | 0.106 | 0.108 | VJMHT | 2021-12-27 |
DSNet: A Flexible Detect-to-Summarize Network for Video Summarization | ✓ Link | 50.2 | 50.7 | | | DSNet | 2020-12-01 |
Progressive Video Summarization via Multimodal Self-supervised Learning | ✓ Link | 48.7 | 50.4 | 0.178 | 0.240 | SSPVS | 2022-01-07 |
Discriminative Feature Learning for Unsupervised Video Summarization | ✓ Link | 48.6 | 48.7 | | | CSNet | 2018-11-24 |
Supervised Video Summarization via Multiple Feature Sets with Parallel Attention | ✓ Link | 48.0 | | 0.160 | 0.170 | VASNet [DBLP:conf/accv/FajtlSAMR18] | 2021-04-23 |
Supervised Video Summarization via Multiple Feature Sets with Parallel Attention | ✓ Link | 44.9 | | | | re-SEQ2SEQ [DBLP:conf/eccv/ZhangGS18] | 2021-04-23 |
Supervised Video Summarization via Multiple Feature Sets with Parallel Attention | ✓ Link | 44.4 | | | | M-AVS [DBLP:journals/corr/abs-1708-09545] | 2021-04-23 |
Hierarchical Multimodal Transformer to Summarize Videos | | 44.1 | 44.8 | 0.079 | 0.080 | HMT | 2021-09-22 |
Supervised Video Summarization via Multiple Feature Sets with Parallel Attention | ✓ Link | 43.1 | | | | MAVS [DBLP:conf/mm/FengLKZ18] | 2021-04-23 |
Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward | ✓ Link | 42.1 | 43.9 | | | DR-DSN | 2017-12-29 |
CSTA: CNN-based Spatiotemporal Attention for Video Summarization | ✓ Link | | | 0.246 | 0.274 | CSTA | 2024-05-20 |