Paper | Code | Bit per Character (BPC) | Params | Model | Date |
--- | --- | --- | --- | --- | --- |
Language Models are Unsupervised Multitask Learners | ✓ Link | 0.93 | 1542M | GPT-2 (48 layers, h=1600) | 2019-02-14 |
Dynamic Evaluation of Transformer Language Models | ✓ Link | 0.940 | 277M | Transformer-XL (24 layers, RMS dynamic eval, decay) | 2019-04-17 |
Focus Your Attention (with Adaptive IIR Filters) | | 0.940 | 22M | Focus | 2023-05-24 |
Not All Memories are Created Equal: Learning to Forget by Expiring | ✓ Link | 0.95 | 208M | Expire-Span (24 layers) | 2021-05-13 |
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 0.95 | 195M | SRU++ Large | 2021-02-24 |
Addressing Some Limitations of Transformers with Feedback Memory | ✓ Link | 0.96 | 77M | Feedback Transformer | 2020-02-21 |
Improving Transformer Models by Reordering their Sublayers | ✓ Link | 0.968 | 209M | Sandwich Transformer (adaptive span) | 2019-11-10 |
Compressive Transformers for Long-Range Sequence Modelling | ✓ Link | 0.97 | 277M | Compressive Transformer (24 layers) | 2019-11-13 |
Long-Short Transformer: Efficient Transformers for Language and Vision | ✓ Link | 0.97 | 110M | Transformer-LS (large) | 2021-07-05 |
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 0.97 | 108M | SRU++ Base | 2021-02-24 |
Adaptive Attention Span in Transformers | ✓ Link | 0.98 | 209M | Transformer (24 layers, 8k adaptive span) | 2019-05-19 |
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 0.99 | 277M | Transformer-XL (24 layers) | 2019-01-09 |
Longformer: The Long-Document Transformer | ✓ Link | 0.99 | 102M | Longformer (30 layers, h=512) | 2020-04-10 |
Generating Long Sequences with Sparse Transformers | ✓ Link | 0.99 | 95M | Sparse Transformer (30 layers, fixed attn) | 2019-04-23 |
Efficient Content-Based Sparse Attention with Routing Transformers | ✓ Link | 0.99 | | Routing Transformer (12 layers) | 2020-03-12 |
Long-Short Transformer: Efficient Transformers for Language and Vision | ✓ Link | 0.99 | | Transformer-LS (small) | 2021-07-05 |
Hierarchical Transformers Are More Efficient Language Models | ✓ Link | 0.997 | | Hourglass | 2021-10-26 |
Longformer: The Long-Document Transformer | ✓ Link | 1.00 | 41M | Longformer (12 layers, h=512) | 2020-04-10 |
Augmenting Self-attention with Persistent Memory | ✓ Link | 1.01 | 39M | All-attention network (18 layers) | 2019-07-02 |
Adaptive Attention Span in Transformers | ✓ Link | 1.02 | 39M | Transformer (12 layers, 8k adaptive span) | 2019-05-19 |
BP-Transformer: Modelling Long-Range Context via Binary Partitioning | ✓ Link | 1.02 | 38M | BP-Transformer (12 layers) | 2019-11-11 |
The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | ✓ Link | 1.024 | | Transformer+SSA | 2023-06-02 |
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 1.03 | 88M | Transformer-XL (18 layers) | 2019-01-09 |
Memory-efficient Stochastic methods for Memory-based Transformers | ✓ Link | 1.033 | 41M | Skip Cross-Head Transformer-XL | 2023-11-14 |
Character-Level Language Modeling with Deeper Self-Attention | ✓ Link | 1.06 | 235M | Transformer (64 layers) | 2018-08-09 |
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 1.06 | 41M | Transformer-XL (12 layers) | 2019-01-09 |
Single Headed Attention RNN: Stop Thinking With Your Head | ✓ Link | 1.068 | 54M | SHA-RNN (4 layers, h=1024, attention head per layer) | 2019-11-26 |
Single Headed Attention RNN: Stop Thinking With Your Head | ✓ Link | 1.076 | 52M | SHA-RNN (4 layers, h=1024, single attention head) | 2019-11-26 |
Character-Level Language Modeling with Deeper Self-Attention | ✓ Link | 1.11 | 44M | 64-layer Character Transformer Model | 2018-08-09 |
Mogrifier LSTM | ✓ Link | 1.146 | 48M | Mogrifier LSTM | 2019-09-04 |
Mogrifier LSTM | ✓ Link | 1.195 | 48M | LSTM | 2019-09-04 |
Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding | | 1.22 | | Cluster-Former (#C=512) | 2020-09-13 |
An Analysis of Neural Language Modeling at Multiple Scales | ✓ Link | 1.232 | 47M | AWD-LSTM (3 layers) | 2018-03-22 |
Multiplicative LSTM for sequence modelling | ✓ Link | 1.24 | 46M | Large mLSTM | 2016-09-26 |
Fast-Slow Recurrent Neural Networks | ✓ Link | 1.25 | 47M | Large FS-LSTM-4 | 2017-05-24 |
Recurrent Highway Networks | ✓ Link | 1.27 | 46M | Recurrent Highway Networks | 2016-07-12 |
Neural Machine Translation in Linear Time | ✓ Link | 1.31 | | ByteNet | 2016-10-31 |
Hierarchical Multiscale Recurrent Neural Networks | ✓ Link | 1.32 | 35M | LN HM-LSTM | 2016-09-06 |
Single Headed Attention RNN: Stop Thinking With Your Head | ✓ Link | 1.33 | 51M | SHA-LSTM (4 layers, h=1024, no attention head) | 2019-11-26 |
HyperNetworks | ✓ Link | 1.34 | 27M | Hypernetworks | 2016-09-27 |
Generating Sequences With Recurrent Neural Networks | ✓ Link | 1.67 | | LSTM (7 layers) | 2013-08-04 |
Augmenting Self-attention with Persistent Memory | ✓ Link | | 114M | All-attention network (36 layers) | 2019-07-02 |
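
The metric column reports bits per character (BPC), the average negative log-likelihood of the next character measured in base 2, so lower is better. As a point of reference, the minimal sketch below (plain Python; the function name and example numbers are illustrative, not taken from any of the papers above) converts a framework-reported cross-entropy in nats into a BPC figure comparable to the values in this table.

```python
import math

def bits_per_character(total_nll_nats: float, num_characters: int) -> float:
    """Convert a summed negative log-likelihood (in nats, as most deep learning
    frameworks report cross-entropy) into bits per character:
    BPC = NLL / (N * ln 2)."""
    return total_nll_nats / (num_characters * math.log(2))

# Illustrative example: an average cross-entropy of 0.69 nats per character
# corresponds to roughly 1.0 BPC, in the range of the mid-table entries above.
avg_loss_nats = 0.69
num_chars = 1_000
print(f"{bits_per_character(avg_loss_nats * num_chars, num_chars):.3f} BPC")  # ~0.995
```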