Paper | Code | Test perplexity | Validation perplexity | Params | Model | Date |
--- | --- | --- | --- | --- | --- | --- |
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ✓ Link | 8.21 | | | SparseGPT (175B, 50% Sparsity) | 2023-01-02 |
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ✓ Link | 8.34 | | | OPT-175B | 2023-01-02 |
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ✓ Link | 8.45 | | | SparseGPT (175B, 4:8 Sparsity) | 2023-01-02 |
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ✓ Link | 8.73 | | | SparseGPT (175B, 2:4 Sparsity) | 2023-01-02 |
Hydra: A System for Large Multi-Model Deep Learning | ✓ Link | 15.17 | 15.69 | 1542M | GPT-2 (fine-tuned) | 2021-10-16 |
Language Models are Unsupervised Multitask Learners | ✓ Link | 18.34 | | 1542M | GPT-2 | 2019-02-14 |
Language Models are Unsupervised Multitask Learners | ✓ Link | 19.93 | | 762M | GPT-2 (large) | 2019-02-14 |
Language Models are Unsupervised Multitask Learners | ✓ Link | 22.76 | | 345M | GPT-2 (medium) | 2019-02-14 |
Language Models are Unsupervised Multitask Learners | ✓ Link | 29.41 | | 117M | GPT-2 (small) | 2019-02-14 |
Language Models with Transformers | ✓ Link | 34.1 | 37.7 | 395M | BERT-Large-CAS | 2019-04-20 |
Mogrifier LSTM | ✓ Link | 38.6 | 40.2 | 35M | Mogrifier LSTM + dynamic eval | 2019-09-04 |
Improving Neural Language Modeling via Adversarial Training | ✓ Link | 38.65 | 40.27 | 35M | adversarial + AWD-LSTM-MoS + dynamic eval | 2019-06-10 |
FRAGE: Frequency-Agnostic Word Representation | ✓ Link | 39.14 | 40.85 | 35M | FRAGE + AWD-LSTM-MoS + dynamic eval | 2018-09-18 |
Improved Language Modeling by Decoding the Past | | 40.3 | 42.0 | 35M | Past Decode Reg. + AWD-LSTM-MoS + dynamic eval | 2018-08-14 |
Gradual Learning of Recurrent Neural Networks | ✓ Link | 40.46 | 42.19 | 38M | GL-LWGC + AWD-MoS-LSTM + dynamic eval | 2017-08-29 |
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | ✓ Link | 40.68 | 42.41 | 35M | AWD-LSTM-MoS + dynamic eval | 2017-11-10 |
Deep Residual Output Layers for Neural Language Generation | ✓ Link | 42.0 | 43.9 | 34M | AWD-LSTM-DRILL + dynamic eval | 2019-05-14 |
Dynamic Evaluation of Neural Sequence Models | ✓ Link | 44.3 | 46.4 | 33M | AWD-LSTM + dynamic eval | 2017-09-21 |
Regularizing and Optimizing LSTM Language Models | ✓ Link | 52.0 | 53.8 | 33M | AWD-LSTM + continuous cache pointer | 2017-08-07 |
Direct Output Connection for a High-Rank Language Model | ✓ Link | 53.09 | 54.19 | 185M | AWD-LSTM-DOC x5 | 2018-08-30 |
Advancing State of the Art in Language Modeling | ✓ Link | 53.73 | 55.4 | | Ensemble of All | 2023-11-28 |
Mogrifier LSTM | ✓ Link | 55.1 | 57.3 | 35M | Mogrifier LSTM | 2019-09-04 |
Partially Shuffling the Training Data to Improve Language Models | ✓ Link | 57.85 | 60.16 | 37M | AWD-LSTM-DOC + Partial Shuffle | 2019-03-11 |
Direct Output Connection for a High-Rank Language Model | ✓ Link | 58.03 | 60.29 | 37M | AWD-LSTM-DOC | 2018-08-30 |
Partially Shuffling the Training Data to Improve Language Models | ✓ Link | 59.98 | 62.38 | 35M | AWD-LSTM-MoS + Partial Shuffle | 2019-03-11 |
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | ✓ Link | 61.45 | 63.88 | 35M | AWD-LSTM-MoS | 2017-11-10 |
Learning Associative Inference Using Fast Weight Memory | ✓ Link | 61.65 | 54.48 | 37M | AWD-FWM Schlag et al. (2020) | 2020-11-16 |
Deep Residual Output Layers for Neural Language Generation | ✓ Link | 61.9 | 64.9 | 34M | AWD-LSTM-DRILL | 2019-05-14 |
Fraternal Dropout | ✓ Link | 64.1 | 66.8 | 34M | AWD-LSTM 3-layer with Fraternal dropout | 2017-10-31 |
Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | ✓ Link | 64.73 | 67.47 | 33M | AWD-LSTM + ATOI | 2019-09-18 |
Regularizing and Optimizing LSTM Language Models | ✓ Link | 65.8 | 68.6 | 33M | AWD-LSTM | 2017-08-07 |
On the State of the Art of Evaluation in Neural Language Models | ✓ Link | 65.9 | 69.3 | 24M | Melis et al. (2017) - 1-layer LSTM (tied) | 2017-07-18 |
Improving Neural Language Models with a Continuous Cache | ✓ Link | 68.9 | | | Grave et al. (2016) - LSTM + continuous cache pointer | 2016-12-13 |
Efficient recurrent architectures through activity sparsity and sparse back-propagation through time | ✓ Link | 68.9 | | | EGRU | 2022-06-13 |
Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling | ✓ Link | 87.0 | 91.5 | | Inan et al. (2016) - Variational LSTM (tied) (h=650) + augmented loss | 2016-11-04 |
Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling | ✓ Link | 87.7 | 92.3 | | Inan et al. (2016) - Variational LSTM (tied) (h=650) | 2016-11-04 |
Improving Neural Language Models with a Continuous Cache | ✓ Link | 99.3 | | | Grave et al. (2016) - LSTM | 2016-12-13 |
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ✓ Link | 234.77 | | | OPT-175B (50% Sparsity) | 2023-01-02 |
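The Test and Validation columns above are perplexities: the exponential of the average per-token negative log-likelihood on the corresponding split, so lower is better. Below is a minimal sketch of how such a figure can be estimated for an off-the-shelf GPT-2 with a sliding-window evaluation. It assumes the Hugging Face `transformers` and `datasets` packages and the `wikitext-2-raw-v1` test split, none of which is stated in the table; published leaderboard numbers also depend on each paper's tokenization and normalization choices, so this sketch will not reproduce them exactly.

```python
# Minimal sketch (not the evaluation code behind the table above):
# sliding-window perplexity of a pretrained GPT-2 on a held-out corpus.
# Assumes Hugging Face `transformers` and `datasets`; dataset/model names are illustrative.
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
seq_len = encodings.input_ids.size(1)

max_length = model.config.n_positions  # 1024 for GPT-2
stride = 512                           # overlapping windows give scored tokens left context
nll_sum, n_tokens, prev_end = 0.0, 0, 0

for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end                 # only score tokens not scored in a prior window
    input_ids = encodings.input_ids[:, begin:end].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100          # mask already-scored tokens; they serve as context

    with torch.no_grad():
        # loss is the mean negative log-likelihood over the unmasked target positions
        loss = model(input_ids, labels=target_ids).loss

    nll_sum += loss.item() * trg_len         # approximate total NLL contributed by this window
    n_tokens += trg_len
    prev_end = end
    if end == seq_len:
        break

print(f"perplexity = {torch.exp(torch.tensor(nll_sum / n_tokens)).item():.2f}")
```

Choosing a stride smaller than the model's context length trades extra compute for more left context per scored token, which lowers the measured perplexity toward the value a full-context evaluation would give.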