| Paper | Code | Test perplexity | Validation perplexity | Params | Model | Date |
|---|---|---|---|---|---|---|
| Language Models are Few-Shot Learners | ✓ Link | 20.5 | | 175,000M | GPT-3 (Zero-Shot) | 2020-05-28 |
| Language Models with Transformers | ✓ Link | 31.3 | 36.1 | 395M | BERT-Large-CAS | 2019-04-20 |
| Language Models are Unsupervised Multitask Learners | ✓ Link | 35.76 | | 1542M | GPT-2 | 2019-02-14 |
| Mogrifier LSTM | ✓ Link | 44.9 | 44.8 | 24M | Mogrifier LSTM + dynamic eval | 2019-09-04 |
| Improving Neural Language Modeling via Adversarial Training | ✓ Link | 46.01 | 46.63 | 22M | adversarial + AWD-LSTM-MoS + dynamic eval | 2019-06-10 |
| Gradual Learning of Recurrent Neural Networks | ✓ Link | 46.34 | 46.64 | 26M | GL-LWGC + AWD-MoS-LSTM + dynamic eval | 2017-08-29 |
| FRAGE: Frequency-Agnostic Word Representation | ✓ Link | 46.54 | 47.38 | 22M | FRAGE + AWD-LSTM-MoS + dynamic eval | 2018-09-18 |
| Direct Output Connection for a High-Rank Language Model | ✓ Link | 47.17 | 48.63 | 185M | AWD-LSTM-DOC x5 | 2018-08-30 |
| Improved Language Modeling by Decoding the Past | | 47.3 | 48.0 | 22M | Past Decode Regularization + AWD-LSTM-MoS + dynamic eval | 2018-08-14 |
| Advancing State of the Art in Language Modeling | ✓ Link | 47.31 | 48.92 | | Ensemble of All | 2023-11-28 |
| Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | ✓ Link | 47.69 | 48.33 | 22M | AWD-LSTM-MoS + dynamic eval | 2017-11-10 |
| Deep Residual Output Layers for Neural Language Generation | ✓ Link | 49.4 | 49.5 | 24M | AWD-LSTM-DRILL + dynamic eval | 2019-05-14 |
| Deep Independently Recurrent Neural Network (IndRNN) | ✓ Link | 50.97 | | | Dense IndRNN+dynamic eval | 2019-10-11 |
| Dynamic Evaluation of Neural Sequence Models | ✓ Link | 51.1 | 51.6 | 24M | AWD-LSTM + dynamic eval | 2017-09-21 |
| Partially Shuffling the Training Data to Improve Language Models | ✓ Link | 52.0 | 53.79 | 23M | AWD-LSTM-DOC + Partial Shuffle | 2019-03-11 |
| Direct Output Connection for a High-Rank Language Model | ✓ Link | 52.38 | 54.12 | 23M | AWD-LSTM-DOC | 2018-08-30 |
| Regularizing and Optimizing LSTM Language Models | ✓ Link | 52.8 | 53.9 | 24M | AWD-LSTM + continuous cache pointer | 2017-08-07 |
| Partially Shuffling the Training Data to Improve Language Models | ✓ Link | 53.92 | 55.89 | 22M | AWD-LSTM-MoS + Partial Shuffle | 2019-03-11 |
| Trellis Networks for Sequence Modeling | ✓ Link | 54.19 | | | Trellis Network | 2018-10-15 |
| Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | ✓ Link | 54.44 | 56.54 | 22M | AWD-LSTM-MoS | 2017-11-10 |
| Learning Associative Inference Using Fast Weight Memory | ✓ Link | 54.48 | 56.76 | 24M | AWD-FWM (Schlag et al., 2020) | 2020-11-16 |
| Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 54.55 | 56.72 | 24M | Transformer-XL | 2019-01-09 |
| AutoDropout: Learning Dropout Patterns to Regularize Deep Networks | ✓ Link | 54.9 | 58.1 | | Transformer-XL + AutoDropout | 2021-01-05 |
| Pushing the bounds of dropout | ✓ Link | 55.3 | 57.1 | 24M | 2-layer skip-LSTM + dropout tuning | 2018-05-23 |
| Deep Residual Output Layers for Neural Language Generation | ✓ Link | 55.7 | 58.2 | 24M | AWD-LSTM-DRILL | 2019-05-14 |
| DARTS: Differentiable Architecture Search | ✓ Link | 56.1 | 58.3 | 23M | Differentiable NAS | 2018-06-24 |
| Deep Independently Recurrent Neural Network (IndRNN) | ✓ Link | 56.37 | | | Dense IndRNN | 2019-10-11 |
| Fraternal Dropout | ✓ Link | 56.8 | 58.9 | 24M | AWD-LSTM 3-layer with Fraternal dropout | 2017-10-31 |
| Deep Equilibrium Models | ✓ Link | 57.1 | | 24M | DEQ-TrellisNet | 2019-09-03 |
| Regularizing and Optimizing LSTM Language Models | ✓ Link | 57.3 | 60.0 | 24M | AWD-LSTM | 2017-08-07 |
| Efficient Neural Architecture Search via Parameter Sharing | ✓ Link | 58.6 | 60.8 | 24M | Efficient NAS | 2018-02-09 |
| Neural Architecture Search with Reinforcement Learning | ✓ Link | 64.0 | | 25M | NAS-RL | 2016-11-05 |
| Recurrent Highway Networks | ✓ Link | 65.4 | 67.9 | 23M | Recurrent highway networks | 2016-07-12 |
| Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling | ✓ Link | 66.0 | 68.1 | | Inan et al. (2016) - Variational RHN | 2016-11-04 |
| A Theoretically Grounded Application of Dropout in Recurrent Neural Networks | ✓ Link | 75.2 | 77.9 | | Gal & Ghahramani (2016) - Variational LSTM (large) | 2015-12-16 |
| Recurrent Neural Network Regularization | ✓ Link | 78.4 | 82.2 | | Zaremba et al. (2014) - LSTM (large) | 2014-09-08 |
| An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling | ✓ Link | 78.93 | | | LSTM (Bai et al., 2018) | 2018-03-04 |
| A Theoretically Grounded Application of Dropout in Recurrent Neural Networks | ✓ Link | 79.7 | 81.9 | | Gal & Ghahramani (2016) - Variational LSTM (medium) | 2015-12-16 |
| Recurrent Neural Network Regularization | ✓ Link | 82.7 | 86.2 | | Zaremba et al. (2014) - LSTM (medium) | 2014-09-08 |
| R-Transformer: Recurrent Neural Network Enhanced Transformer | ✓ Link | 84.38 | | | R-Transformer | 2019-07-12 |
| An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling | ✓ Link | 92.48 | | | GRU (Bai et al., 2018) | 2018-03-04 |
| Seq-U-Net: A One-Dimensional Causal U-Net for Efficient Sequence Modelling | ✓ Link | 107.95 | | 14.9M | Seq-U-Net | 2019-11-14 |
| Seq-U-Net: A One-Dimensional Causal U-Net for Efficient Sequence Modelling | ✓ Link | 108.47 | | 14.7M | TCN | 2019-11-14 |
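For reference, the test and validation perplexity columns are the exponential of the average per-word negative log-likelihood on the respective PTB split. A minimal Python sketch of the conversion follows; the loss value and token count below are illustrative assumptions, not figures taken from any row (PTB's test set is roughly 82k words):

```python
import math

def perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """Perplexity = exp(average negative log-likelihood per token, in nats)."""
    return math.exp(total_nll_nats / num_tokens)

# Illustrative only: an average cross-entropy of ~3.967 nats/word
# corresponds to perplexity ~52.8, in the range of the
# "AWD-LSTM + continuous cache pointer" row above.
avg_loss = 3.967
num_tokens = 82_000  # approximate PTB test-set size (assumption)
print(perplexity(avg_loss * num_tokens, num_tokens))  # ~52.8
```

Note that models in this table use the same word-level vocabulary (10k types), so their perplexities are directly comparable; perplexities computed over different vocabularies or tokenizations (e.g., subword models) are not.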