Paper | Code | Test perplexity (PPL) | Number of params | Validation perplexity | Model | Release date
--- | --- | --- | --- | --- | --- | ---
Simple and Effective Masked Diffusion Language Models | ✓ Link | 20.09 | 110M | | MDLM (AR baseline) | 2024-06-11
OmniNet: Omnidirectional Representations from Transformers | ✓ Link | 21.5 | 100M | | OmniNetT (Large) | 2021-03-01 |
OmniNet: Omnidirectional Representations from Transformers | ✓ Link | 21.6 | 100M | | OmniNetP (Large) | 2021-03-01 |
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 21.8 | 0.8B | | Transformer-XL Large | 2019-01-09 |
OmniNet: Omnidirectional Representations from Transformers | ✓ Link | 22.0 | | | OmniNetB (Large) | 2021-03-01
Simple and Effective Masked Diffusion Language Models | ✓ Link | 23.00 | 110M | | MDLM | 2024-06-11 |
Adaptive Input Representations for Neural Language Modeling | ✓ Link | 23.02 | 1.0B | 22.92 | Adaptive Input Very Large | 2018-09-28 |
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 23.5 | 0.46B | | Transformer-XL Base | 2019-01-09 |
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 23.5 | 465M | | SRU++ Large | 2021-02-24 |
Exploring the Limits of Language Modeling | ✓ Link | 23.7 | 43B | | 10 LSTM+CNN inputs + SNM10-SKIP (ensemble) | 2016-02-07 |
Adaptive Input Representations for Neural Language Modeling | ✓ Link | 23.91 | 0.46B | 23.83 | Adaptive Input Large | 2018-09-28 |
Mesh-TensorFlow: Deep Learning for Supercomputers | ✓ Link | 24.0 | 4.9B | | Mesh Tensorflow | 2018-11-05 |
| | | 25.06 | | | Cohere Large | |
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 25.1 | 328M | | SRU++ | 2021-02-24 |
Pay Less Attention with Lightweight and Dynamic Convolutions | ✓ Link | 26.67 | 0.34B | | DynamicConv | 2019-01-29 |
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ✓ Link | 28.0 | 5B | | High-Budget MoE | 2017-01-23 |
The Evolved Transformer | ✓ Link | 28.6 | | | Evolved Transformer Big | 2019-01-30 |
Exploring the Limits of Language Modeling | ✓ Link | 30.0 | 1.04B | | LSTM-8192-1024 + CNN Input | 2016-02-07 |
Exploring the Limits of Language Modeling | ✓ Link | 30.6 | 1.8B | | LSTM-8192-1024 | 2016-02-07 |
Language Modeling with Gated Convolutional Networks | ✓ Link | 31.9 | | | GCNN-14 bottleneck | 2016-12-23 |
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ✓ Link | 34.1 | 5B | | Low-Budget MoE | 2017-01-23 |
Factorization tricks for LSTM networks | ✓ Link | 36.0 | | | BIG G-LSTM-2 | 2017-03-31 |
Language Models are Unsupervised Multitask Learners | ✓ Link | 42.16 | 1.54B | | GPT-2 | 2019-02-14 |
One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling | ✓ Link | 51.3 | 20B | | RNN-1024 + 9 Gram | 2013-12-11 |
Skip-gram Language Modeling Using Sparse Non-negative Matrix Probability Estimation | | 52.9 | 33B | | Sparse Non-Negative | 2014-12-03 |
H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences | ✓ Link | | 53M | 23.95 | H-Transformer-1D Nr=16 (Base) | 2021-07-25 |
H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences | ✓ Link | | 144M | 20.25 | H-Transformer-1D Nr=16 (Large) | 2021-07-25 |
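The test and validation figures above are corpus-level perplexities. As a rough illustration (not taken from any of the listed papers), the sketch below shows how such a number is typically derived from summed per-token negative log-likelihoods; the `corpus_perplexity` helper and the numbers in the example are hypothetical.

```python
# Illustrative sketch only: corpus-level perplexity, as reported in the table,
# computed as exp of the average per-token negative log-likelihood (in nats)
# over the evaluation set. All batch values below are made up.
import math
from typing import Iterable, Tuple

def corpus_perplexity(batches: Iterable[Tuple[float, int]]) -> float:
    """batches yields (sum of token NLLs in nats, number of scored tokens)."""
    total_nll = 0.0
    total_tokens = 0
    for nll_sum, n_tokens in batches:
        total_nll += nll_sum
        total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)

# Hypothetical example: two evaluation batches of 1,000 scored tokens each.
print(round(corpus_perplexity([(3104.5, 1000), (2998.2, 1000)]), 2))  # ~21.14
```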