OpenCodePapers

Language Modelling on One Billion Word

Language Modelling
Leaderboard
| Paper | Code | PPL | Number of params | Validation perplexity | Model Name | Release Date |
|---|---|---|---|---|---|---|
| Simple and Effective Masked Diffusion Language Models | ✓ Link | 20.09 | 110M | | MDLM (AR baseline) | 2024-06-11 |
| OmniNet: Omnidirectional Representations from Transformers | ✓ Link | 21.5 | 100M | | OmniNetT (Large) | 2021-03-01 |
| OmniNet: Omnidirectional Representations from Transformers | ✓ Link | 21.6 | 100M | | OmniNetP (Large) | 2021-03-01 |
| Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 21.8 | 0.8B | | Transformer-XL Large | 2019-01-09 |
| OmniNet: Omnidirectional Representations from Transformers | ✓ Link | 22 | | | OmniNetB (Large) | 2021-03-01 |
| Simple and Effective Masked Diffusion Language Models | ✓ Link | 23.00 | 110M | | MDLM | 2024-06-11 |
| Adaptive Input Representations for Neural Language Modeling | ✓ Link | 23.02 | 1.0B | 22.92 | Adaptive Input Very Large | 2018-09-28 |
| Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 23.5 | 0.46B | | Transformer-XL Base | 2019-01-09 |
| When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 23.5 | 465M | | SRU++ Large | 2021-02-24 |
| Exploring the Limits of Language Modeling | ✓ Link | 23.7 | 43B | | 10 LSTM+CNN inputs + SNM10-SKIP (ensemble) | 2016-02-07 |
| Adaptive Input Representations for Neural Language Modeling | ✓ Link | 23.91 | 0.46B | 23.83 | Adaptive Input Large | 2018-09-28 |
| Mesh-TensorFlow: Deep Learning for Supercomputers | ✓ Link | 24.0 | 4.9B | | Mesh Tensorflow | 2018-11-05 |
| | | 25.06 | | | Cohere Large | |
| When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 25.1 | 328M | | SRU++ | 2021-02-24 |
| Pay Less Attention with Lightweight and Dynamic Convolutions | ✓ Link | 26.67 | 0.34B | | DynamicConv | 2019-01-29 |
| Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ✓ Link | 28.0 | 5B | | High-Budget MoE | 2017-01-23 |
| The Evolved Transformer | ✓ Link | 28.6 | | | Evolved Transformer Big | 2019-01-30 |
| Exploring the Limits of Language Modeling | ✓ Link | 30.0 | 1.04B | | LSTM-8192-1024 + CNN Input | 2016-02-07 |
| Exploring the Limits of Language Modeling | ✓ Link | 30.6 | 1.8B | | LSTM-8192-1024 | 2016-02-07 |
| Language Modeling with Gated Convolutional Networks | ✓ Link | 31.9 | | | GCNN-14 bottleneck | 2016-12-23 |
| Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ✓ Link | 34.1 | 5B | | Low-Budget MoE | 2017-01-23 |
| Factorization tricks for LSTM networks | ✓ Link | 36.0 | | | BIG G-LSTM-2 | 2017-03-31 |
| Language Models are Unsupervised Multitask Learners | ✓ Link | 42.16 | 1.54B | | GPT-2 | 2019-02-14 |
| One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling | ✓ Link | 51.3 | 20B | | RNN-1024 + 9 Gram | 2013-12-11 |
| Skip-gram Language Modeling Using Sparse Non-negative Matrix Probability Estimation | | 52.9 | 33B | | Sparse Non-Negative | 2014-12-03 |
| H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences | ✓ Link | | 53M | 23.95 | H-Transformer-1D Nr=16 (Base) | 2021-07-25 |
| H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences | ✓ Link | | 144M | 20.25 | H-Transformer-1D Nr=16 (Large) | 2021-07-25 |
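Both metric columns (PPL and validation perplexity) are the standard exponentiated average negative log-likelihood per token, evaluated on the benchmark's test or held-out split respectively. The snippet below is a minimal sketch of that computation on made-up per-token probabilities; it is not the evaluation script used by any of the listed papers, and real evaluations differ in tokenization and context handling.

```python
import math

def perplexity(log_probs):
    """Corpus-level perplexity: exp of the mean per-token negative log-likelihood.

    perplexity = exp(-(1/N) * sum_i log p(w_i | context_i))
    """
    n = len(log_probs)
    avg_nll = -sum(log_probs) / n
    return math.exp(avg_nll)

# Toy example: hypothetical model probabilities for four tokens (natural log).
token_log_probs = [math.log(p) for p in (0.2, 0.05, 0.5, 0.1)]
print(round(perplexity(token_log_probs), 2))  # ~6.69 on this toy input
```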