Paper | Code | Test perplexity | Validation perplexity | Params | Model | Date |
--- | --- | --- | --- | --- | --- | --- |
Language Models are Few-Shot Learners | ✓ Link | 20.5 | | 175000M | GPT-3 (Zero-Shot) | 2020-05-28 |
Language Models with Transformers | ✓ Link | 31.3 | 36.1 | 395M | BERT-Large-CAS | 2019-04-20 |
Language Models are Unsupervised Multitask Learners | ✓ Link | 35.76 | | 1542M | GPT-2 | 2019-02-14 |
Mogrifier LSTM | ✓ Link | 44.9 | 44.8 | 24M | Mogrifier LSTM + dynamic eval | 2019-09-04 |
Improving Neural Language Modeling via Adversarial Training | ✓ Link | 46.01 | 46.63 | 22M | adversarial + AWD-LSTM-MoS + dynamic eval | 2019-06-10 |
Gradual Learning of Recurrent Neural Networks | ✓ Link | 46.34 | 46.64 | 26M | GL-LWGC + AWD-MoS-LSTM + dynamic eval | 2017-08-29 |
FRAGE: Frequency-Agnostic Word Representation | ✓ Link | 46.54 | 47.38 | 22M | FRAGE + AWD-LSTM-MoS + dynamic eval | 2018-09-18 |
Direct Output Connection for a High-Rank Language Model | ✓ Link | 47.17 | 48.63 | 185M | AWD-LSTM-DOC x5 | 2018-08-30 |
Improved Language Modeling by Decoding the Past | | 47.3 | 48.0 | 22M | Past Decode Reg. + AWD-LSTM-MoS + dynamic eval | 2018-08-14 |
Advancing State of the Art in Language Modeling | ✓ Link | 47.31 | 48.92 | | Ensemble of All | 2023-11-28 |
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | ✓ Link | 47.69 | 48.33 | 22M | AWD-LSTM-MoS + dynamic eval | 2017-11-10 |
Deep Residual Output Layers for Neural Language Generation | ✓ Link | 49.4 | 49.5 | 24M | AWD-LSTM-DRILL + dynamic eval | 2019-05-14 |
Deep Independently Recurrent Neural Network (IndRNN) | ✓ Link | 50.97 | | | Dense IndRNN + dynamic eval | 2019-10-11 |
Dynamic Evaluation of Neural Sequence Models | ✓ Link | 51.1 | 51.6 | 24M | AWD-LSTM + dynamic eval | 2017-09-21 |
Partially Shuffling the Training Data to Improve Language Models | ✓ Link | 52.0 | 53.79 | 23M | AWD-LSTM-DOC + Partial Shuffle | 2019-03-11 |
Direct Output Connection for a High-Rank Language Model | ✓ Link | 52.38 | 54.12 | 23M | AWD-LSTM-DOC | 2018-08-30 |
Regularizing and Optimizing LSTM Language Models | ✓ Link | 52.8 | 53.9 | 24M | AWD-LSTM + continuous cache pointer | 2017-08-07 |
Partially Shuffling the Training Data to Improve Language Models | ✓ Link | 53.92 | 55.89 | 22M | AWD-LSTM-MoS + Partial Shuffle | 2019-03-11 |
Trellis Networks for Sequence Modeling | ✓ Link | 54.19 | | | Trellis Network | 2018-10-15 |
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | ✓ Link | 54.44 | 56.54 | 22M | AWD-LSTM-MoS | 2017-11-10 |
Learning Associative Inference Using Fast Weight Memory | ✓ Link | 54.48 | 56.76 | 24M | AWD-FWM Schlag et al. (2020) | 2020-11-16 |
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 54.55 | 56.72 | 24M | Transformer-XL | 2019-01-09 |
AutoDropout: Learning Dropout Patterns to Regularize Deep Networks | ✓ Link | 54.9 | 58.1 | | Transformer-XL + AutoDropout | 2021-01-05 |
Pushing the bounds of dropout | ✓ Link | 55.3 | 57.1 | 24M | 2-layer skip-LSTM + dropout tuning | 2018-05-23 |
Deep Residual Output Layers for Neural Language Generation | ✓ Link | 55.7 | 58.2 | 24M | AWD-LSTM-DRILL | 2019-05-14 |
DARTS: Differentiable Architecture Search | ✓ Link | 56.1 | 58.3 | 23M | Differentiable NAS | 2018-06-24 |
Deep Independently Recurrent Neural Network (IndRNN) | ✓ Link | 56.37 | | | Dense IndRNN | 2019-10-11 |
Fraternal Dropout | ✓ Link | 56.8 | 58.9 | 24M | AWD-LSTM 3-layer with Fraternal dropout | 2017-10-31 |
Deep Equilibrium Models | ✓ Link | 57.1 | | 24M | DEQ-TrellisNet | 2019-09-03 |
Regularizing and Optimizing LSTM Language Models | ✓ Link | 57.3 | 60.0 | 24M | AWD-LSTM | 2017-08-07 |
Efficient Neural Architecture Search via Parameter Sharing | ✓ Link | 58.6 | 60.8 | 24M | Efficient NAS | 2018-02-09 |
Neural Architecture Search with Reinforcement Learning | ✓ Link | 64.0 | | 25M | NAS-RL | 2016-11-05 |
Recurrent Highway Networks | ✓ Link | 65.4 | 67.9 | 23M | Recurrent highway networks | 2016-07-12 |
Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling | ✓ Link | 66.0 | 68.1 | | Inan et al. (2016) - Variational RHN | 2016-11-04 |
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks | ✓ Link | 75.2 | 77.9 | | Gal & Ghahramani (2016) - Variational LSTM (large) | 2015-12-16 |
Recurrent Neural Network Regularization | ✓ Link | 78.4 | 82.2 | | Zaremba et al. (2014) - LSTM (large) | 2014-09-08 |
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling | ✓ Link | 78.93 | | | LSTM (Bai et al., 2018) | 2018-03-04 |
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks | ✓ Link | 79.7 | 81.9 | | Gal & Ghahramani (2016) - Variational LSTM (medium) | 2015-12-16 |
Recurrent Neural Network Regularization | ✓ Link | 82.7 | 86.2 | | Zaremba et al. (2014) - LSTM (medium) | 2014-09-08 |
R-Transformer: Recurrent Neural Network Enhanced Transformer | ✓ Link | 84.38 | | | R-Transformer | 2019-07-12 |
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling | ✓ Link | 92.48 | | | GRU (Bai et al., 2018) | 2018-03-04 |
Seq-U-Net: A One-Dimensional Causal U-Net for Efficient Sequence Modelling | ✓ Link | 107.95 | | 14.9M | Seq-U-Net | 2019-11-14 |
Seq-U-Net: A One-Dimensional Causal U-Net for Efficient Sequence Modelling | ✓ Link | 108.47 | | 14.7M | TCN | 2019-11-14 |
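
For context, the perplexity columns above are the exponentiated mean per-token negative log-likelihood of the evaluated text (lower is better). A minimal sketch of the metric, with made-up token NLL values purely for illustration:

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean negative log-likelihood).
    NLLs must be in nats (natural log) for exp() to be the right inverse."""
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical per-token NLLs for a 5-token sequence (illustrative values only):
nlls = [3.2, 4.1, 2.8, 3.9, 3.5]
print(perplexity(nlls))  # mean NLL = 3.5, so perplexity = e**3.5 ≈ 33.1
```

Entries marked "+ dynamic eval" use dynamic evaluation (Krause et al., 2017, listed above): the model's parameters are additionally updated by gradient steps on the evaluated sequence as it is consumed, so the reported perplexity reflects a test-time-adapted model.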