| Paper | Code | Bit per Character (BPC) | Number of params | Model | Date |
|---|---|---|---|---|---|
| Language Models are Unsupervised Multitask Learners | ✓ Link | 0.98 | 1542M | GPT-2 | 2019-02-14 |
| Focus Your Attention (with Adaptive IIR Filters) | | 0.98 | 22M | Focus | 2023-05-24 |
| Dynamic Evaluation of Transformer Language Models | ✓ Link | 1.038 | 277M | Transformer-XL + RMS dynamic eval + decay | 2019-04-17 |
| Adaptive Attention Span in Transformers | ✓ Link | 1.07 | 209M | 24L Transformer + 8K adaptive span | 2019-05-19 |
| Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 1.08 | 277M | Transformer-XL - 24 layers | 2019-01-09 |
| Augmenting Self-attention with Persistent Memory | ✓ Link | 1.08 | 114M | All-attention network - 36 layers | 2019-07-02 |
| Long-Short Transformer: Efficient Transformers for Language and Vision | ✓ Link | 1.09 | | Transformer-LS (small) | 2021-07-05 |
| Adaptive Attention Span in Transformers | ✓ Link | 1.11 | 38M | 12L Transformer + 8K adaptive span | 2019-05-19 |
| Augmenting Self-attention with Persistent Memory | ✓ Link | 1.11 | 38M | All-attention network - 18 layers | 2019-07-02 |
| BP-Transformer: Modelling Long-Range Context via Binary Partitioning | ✓ Link | 1.11 | | BP-Transformer - 12 Layers | 2019-11-11 |
| Character-Level Language Modeling with Deeper Self-Attention | ✓ Link | 1.13 | 235M | 64-layer Character Transformer Model | 2018-08-09 |
| Recurrent Highway Networks with Grouped Auxiliary Memory | ✓ Link | 1.157 | 44.7M | GAM-RHN-10 | 2019-12-13 |
| Character-Level Language Modeling with Deeper Self-Attention | ✓ Link | 1.18 | 44M | 12-layer Character Transformer Model | 2018-08-09 |
| Pay Attention when Required | ✓ Link | 1.18 | | PAR Transformer 24B | 2020-09-09 |
| Dynamic Evaluation of Neural Sequence Models | ✓ Link | 1.19 | 45M | mLSTM + dynamic eval | 2017-09-21 |
| Discrete Flows: Invertible Generative Models of Discrete Data | ✓ Link | 1.23 | | Bipartite flows (8 flows) | 2019-05-24 |
| Recurrent Highway Networks | ✓ Link | 1.27 | 46M | Large RHN | 2016-07-12 |
| Multiplicative LSTM for sequence modelling | ✓ Link | 1.27 | 45M | Large mLSTM +emb +WN +VD | 2016-09-26 |
| Hierarchical Multiscale Recurrent Neural Networks | ✓ Link | 1.29 | 35M | LayerNorm HM-LSTM | 2016-09-06 |
| Recurrent Batch Normalization | ✓ Link | 1.36 | 16M | BN LSTM | 2016-03-30 |
| Multiplicative LSTM for sequence modelling | ✓ Link | 1.40 | 45M | Unregularised mLSTM | 2016-09-26 |
| Bayesian Flow Networks | ✓ Link | 1.41 | | BFN | 2023-08-14 |
| Architectural Complexity Measures of Recurrent Neural Networks | | 1.49 | | td-LSTM-large | 2016-02-26 |
| Architectural Complexity Measures of Recurrent Neural Networks | | 1.63 | | td-LSTM (Zhang et al., 2016) | 2016-02-26 |
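The metric column is bits per character (BPC), where lower is better. Since most frameworks report mean cross-entropy in nats, a minimal sketch of the standard conversion (the function name `nats_to_bpc` is ours, for illustration):

```python
import math

def nats_to_bpc(mean_nll_nats: float) -> float:
    """Convert mean per-character cross-entropy (in nats) to bits per character."""
    return mean_nll_nats / math.log(2)

# Example: a per-character loss of 0.75 nats is ~1.08 BPC,
# matching the Transformer-XL (24-layer) entry above.
print(f"{nats_to_bpc(0.75):.2f}")  # -> 1.08
```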