OpenCodePapers

Language Modelling on WikiText-103

Language Modelling
Dataset Link
Results over time
Leaderboard
| Paper | Code | Test perplexity | Validation perplexity | Number of params | Model name | Release date |
|---|---|---|---|---|---|---|
| Improving language models by retrieving from trillions of tokens | ✓ Link | 2.4 | | 7532M | RETRO (7.5B) | 2021-12-08 |
| Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 10.6 | | 2700M | Hybrid H3 (2.7B) | 2022-12-28 |
| Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | ✓ Link | 10.81 | | 8300M | Megatron-LM | 2019-09-17 |
| GLM: General Language Model Pretraining with Autoregressive Blank Infilling | ✓ Link | 11.33 | | 10000M | GLM-XXLarge (bidirectional) | 2021-03-18 |
| GLM: General Language Model Pretraining with Autoregressive Blank Infilling | ✓ Link | 12.22 | | 10000M | GLM-XXLarge (unidirectional) | 2021-03-18 |
| Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 12.5 | | 1300M | Hybrid H3 (1.3B) | 2022-12-28 |
| Advancing State of the Art in Language Modeling | ✓ Link | 13.29 | 13.11 | | Ensemble of All | 2023-11-28 |
| GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling | ✓ Link | 13.4 | | 125M | GateLoop (125M) | 2023-11-03 |
| You can't pick your neighbors, or can you? When and how to rely on retrieval in the $k$NN-LM | ✓ Link | 15.5 | 15.72 | 247M | kNN-LM w/ Adaptive Coefficient | 2022-10-28 |
| Generalization through Memorization: Nearest Neighbor Language Models | ✓ Link | 15.79 | 15.81 | 247M | kNN-LM w/ Continuous Cache | 2019-11-01 |
| Efficient Content-Based Sparse Attention with Routing Transformers | ✓ Link | 15.8 | | | Routing Transformer | 2020-03-12 |
| Generalization through Memorization: Nearest Neighbor Language Models | ✓ Link | 16.12 | 16.06 | 247M | kNN-LM | 2019-11-01 |
| Dynamic Evaluation of Transformer Language Models | ✓ Link | 16.4 | 15.8 | 257M | Transformer-XL (RMS dynamic eval) | 2019-04-17 |
| $\infty$-former: Infinite Memory Transformer | ✓ Link | 16.61 | | | ∞-former (Sticky memories + initialized GPT-2 Small) | 2021-09-01 |
| $\infty$-former: Infinite Memory Transformer | ✓ Link | 16.64 | | | ∞-former (initialized GPT-2 Small) | 2021-09-01 |
| Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 16.9 | | 355M | Hybrid H3 (355M) | 2022-12-28 |
| Dynamic Evaluation of Transformer Language Models | ✓ Link | 17.0 | 16.3 | 257M | Transformer-XL (SGD dynamic eval) | 2019-04-17 |
| Compressive Transformers for Long-Range Sequence Modelling | ✓ Link | 17.1 | 16.0 | | Compressive Transformer (18L, M=1024) | 2019-11-13 |
| When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 17.1 | 16.4 | 234M | SRU++ Large | 2021-02-24 |
| Segatron: Segment-Aware Transformer for Language Modeling and Understanding | ✓ Link | 17.1 | | 257M | SegaTransformer-XL | 2020-04-30 |
| The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | ✓ Link | 17.18 | 16.54 | | Transformer+SSA+Self-ensemble | 2023-06-02 |
| Improving Neural Language Models by Segmenting, Attending, and Predicting the Future | ✓ Link | 17.4 | | 257M | Transformer-XL Large + Phrase Induction | 2019-06-04 |
| Language Models are Unsupervised Multitask Learners | ✓ Link | 17.48 | | 1542M | GPT-2 Full | 2019-02-14 |
| Shortformer: Better Language Modeling using Shorter Inputs | ✓ Link | 17.56 | 16.89 | 247M | Staged Training | 2020-12-31 |
| The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | ✓ Link | 17.60 | 16.91 | | Transformer+SSA | 2023-06-02 |
| Improving Transformer Models by Reordering their Sublayers | ✓ Link | 17.96 | | 247M | Sandwich Transformer | 2019-11-10 |
| Differentiable Model Compression via Pseudo Quantization Noise | ✓ Link | 18.0 | | | DIFFQ (λ=1, g=16) | 2021-04-20 |
| Mega: Moving Average Equipped Gated Attention | ✓ Link | 18.07 | | 252M | Mega | 2022-09-21 |
| Shortformer: Better Language Modeling using Shorter Inputs | ✓ Link | 18.15 | 17.47 | 247M | Shortformer | 2020-12-31 |
| Addressing Some Limitations of Transformers with Feedback Memory | ✓ Link | 18.2 | 17.5 | 139M | Feedback Transformer (8 layers) | 2020-02-21 |
| When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 18.3 | 17.5 | 148M | SRU++ Base | 2021-02-24 |
| Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 18.3 | 18.2 | 257M | Transformer-XL Large | 2019-01-09 |
| Pay Attention when Required | ✓ Link | 18.4 | | | PAR Transformer Large | 2020-09-09 |
| General-purpose, long-context autoregressive modeling with Perceiver AR | ✓ Link | 18.4 | | | Perceiver AR 358M | 2022-02-15 |
| Hyena Hierarchy: Towards Larger Convolutional Language Models | ✓ Link | 18.5 | | | Hyena-3-slim | 2023-02-21 |
| Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 18.5 | | | Hybrid H3 125M | 2022-12-28 |
| Hyena Hierarchy: Towards Larger Convolutional Language Models | ✓ Link | 18.6 | | | Hyena-3 | 2023-02-21 |
| Adaptive Input Representations for Neural Language Modeling | ✓ Link | 18.70 | 17.97 | 247M | Transformer (Adaptive inputs) | 2018-09-28 |
| Finetuning Pretrained Transformers into RNNs | ✓ Link | 19.6 | 19 | | T2R + Pretrain | 2021-03-24 |
| Subformer: A Parameter Reduced Transformer | | 20.39 | | 96M | Subformer | 2021-01-01 |
| Language Models with Transformers | ✓ Link | 20.4 | 19.6 | 395M | BERT-Large-CAS | 2019-04-20 |
| Augmenting Self-attention with Persistent Memory | ✓ Link | 20.6 | 19.7 | 133M | All-attention network (36 layers) | 2019-07-02 |
| Efficiently Modeling Long Sequences with Structured State Spaces | ✓ Link | 21.28 | | 249M | S4 | 2021-10-31 |
| Language Models are Unsupervised Multitask Learners | ✓ Link | 22.05 | | 774M | GPT-2 Large | 2019-02-14 |
| Addressing Some Limitations of Transformers with Feedback Memory | ✓ Link | 22.4 | 21.4 | 44M | Feedback Transformer (4 layers) | 2020-02-21 |
| Pay Attention when Required | ✓ Link | 22.7 | | | PAR Transformer Base | 2020-09-09 |
| Memory-efficient Stochastic methods for Memory-based Transformers | ✓ Link | 22.91 | 21.87 | 122M | Skip Cross-Head Transformer-XL | 2023-11-14 |
| Deep Equilibrium Models | ✓ Link | 23.2 | | 110M | DEQ-Transformer (medium, adaptive embed) | 2019-09-03 |
| Time-aware Large Kernel Convolutions | ✓ Link | 23.3 | | 240M | TaLK Convolutions | 2020-02-08 |
| Random Feature Attention | | 23.5 | 22 | | Rfa-Gate-Gaussian-Stateful (Big) | 2021-03-03 |
| Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 23.7 | | 125M | Hybrid H3 (125M) | 2022-12-28 |
| Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 24.0 | 23.1 | 151M | Transformer-XL Standard | 2019-01-09 |
| DeLighT: Deep and Light-weight Transformer | ✓ Link | 24.14 | | 99M | DeLighT | 2020-08-03 |
| $\infty$-former: Infinite Memory Transformer | ✓ Link | 24.22 | | | ∞-former (Sticky memories) | 2021-09-01 |
| Revisiting Simple Neural Probabilistic Language Models | ✓ Link | 25.2 | 24.1 | 148M | Transformer-N | 2021-04-08 |
| Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | ✓ Link | 25.6 | | | Linear Attention 125M | 2020-06-29 |
| FNetAR: Mixing Tokens with Autoregressive Fourier Transforms | ✓ Link | 25.81 | | 144.4M | FNetAR Medium | 2021-07-22 |
| Reformer: The Efficient Transformer | ✓ Link | 26.0 | | | Reformer 125M | 2020-01-13 |
| Language Models are Unsupervised Multitask Learners | ✓ Link | 26.37 | | 355M | GPT-2 Medium | 2019-02-14 |
| Rethinking Attention with Performers | ✓ Link | 26.8 | | | Performer 125M | 2020-09-30 |
| Improving Neural Language Modeling via Adversarial Training | ✓ Link | 28.0 | 27.2 | | AdvSoft (+ 4 layer QRNN + dynamic eval) | 2019-06-10 |
| Deep Equilibrium Models | ✓ Link | 29.0 | | 180M | DEQ-TrellisNet | 2019-09-03 |
| Trellis Networks for Sequence Modeling | ✓ Link | 29.19 | | | Trellis Network | 2018-10-15 |
| Fast Parametric Learning with Activation Memorization | | 29.2 | 29.0 | | LSTM (Hebbian, Cache, MbPA) | 2018-03-27 |
| Fast Parametric Learning with Activation Memorization | | 29.7 | 29.9 | | LSTM (Hebbian, Cache) | 2018-03-27 |
| Random Feature Attention | | 30.5 | 29.4 | | Rfa-Gate-Gaussian-Stateful (Small) | 2021-03-03 |
| Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation | ✓ Link | 31.0 | | | Primal.+Trans. | 2023-05-31 |
| Relational recurrent neural networks | ✓ Link | 31.6 | 30.8 | | LSTM (RMC) | 2018-06-05 |
| Deep Equilibrium Models | ✓ Link | 32.4 | | 138M | DEQ-Transformer (small) | 2019-09-03 |
| Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | ✓ Link | 32.85 | 31.92 | | AWD-LSTM-MoS + ATOI | 2019-09-18 |
| An Analysis of Neural Language Modeling at Multiple Scales | ✓ Link | 33.0 | 32.0 | 151M | 4 layer QRNN | 2018-03-22 |
| Fast Parametric Learning with Activation Memorization | | 34.3 | 34.1 | | LSTM (Hebbian) | 2018-03-27 |
| Fast Parametric Learning with Activation Memorization | | 36.4 | 36.0 | | LSTM | 2018-03-27 |
| Language Modeling with Gated Convolutional Networks | ✓ Link | 37.2 | | | GCNN-8 | 2016-12-23 |
| Language Models are Unsupervised Multitask Learners | ✓ Link | 37.50 | | 124M | GPT-2 Small | 2019-02-14 |
| Improving Neural Language Models with a Continuous Cache | ✓ Link | 40.8 | | | Neural cache model (size = 2,000) | 2016-12-13 |
| Improving Neural Language Models with a Continuous Cache | ✓ Link | 44.8 | | | Neural cache model (size = 100) | 2016-12-13 |
| Language Modeling with Gated Convolutional Networks | ✓ Link | 44.9 | | | GCNN-8 | 2016-12-23 |
| An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling | ✓ Link | 45.19 | | | TCN | 2018-03-04 |
| Convolutional Sequence Modeling Revisited | | 45.2 | | | Temporal CNN | 2018-01-01 |
| Improving Neural Language Models with a Continuous Cache | ✓ Link | 48.7 | | | LSTM | 2016-12-13 |
| On the adequacy of untuned warmup for adaptive optimization | ✓ Link | | 19.5 | | Transformer (Adaptive inputs) | 2019-10-09 |
| How much complexity does an RNN architecture need to learn syntax-sensitive dependencies? | ✓ Link | | 52.73 | | LSTM | 2020-05-17 |
| How much complexity does an RNN architecture need to learn syntax-sensitive dependencies? | ✓ Link | | 53.78 | | GRU | 2020-05-17 |
| How much complexity does an RNN architecture need to learn syntax-sensitive dependencies? | ✓ Link | | 76.67 | | Decay RNN | 2020-05-17 |
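
The test and validation perplexities above are the exponential of the average per-token negative log-likelihood over the corresponding WikiText-103 split. The sketch below is a minimal, illustrative evaluation loop, assuming the Hugging Face `transformers` and `datasets` packages and using GPT-2 Small only as a stand-in model; tokenization, context length, and sliding-window stride all affect the result, and most leaderboard entries report word-level perplexity with their own vocabularies, so this script will not exactly reproduce the figures in the table.

```python
# Minimal sketch: sliding-window perplexity of a causal LM on the WikiText-103 test split.
# Assumes `pip install torch transformers datasets`; GPT-2 Small is a stand-in, not a leaderboard entry.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

# Concatenate the raw test split into a single token stream.
test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.to(device)

max_len, stride = 1024, 512  # context window and sliding-window stride
nll_sum, n_tokens, prev_end = 0.0, 0, 0
with torch.no_grad():
    for begin in range(0, ids.size(1), stride):
        end = min(begin + max_len, ids.size(1))
        target_len = end - prev_end          # tokens not already scored in an earlier window
        input_ids = ids[:, begin:end]
        labels = input_ids.clone()
        labels[:, :-target_len] = -100       # mask the overlapping prefix from the loss
        out = model(input_ids, labels=labels)  # mean NLL over the unmasked targets
        n_scored = target_len - 1 if begin == 0 else target_len
        nll_sum += out.loss.item() * n_scored
        n_tokens += n_scored
        prev_end = end
        if end == ids.size(1):
            break

print(f"token-level perplexity: {math.exp(nll_sum / n_tokens):.2f}")
```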