Paper | Link | Test perplexity | Validation perplexity | Params | Model | Date
Improving language models by retrieving from trillions of tokens | ✓ Link | 2.4 | | 7532M | RETRO (7.5B) | 2021-12-08
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 10.6 | | 2700M | Hybrid H3 (2.7B) | 2022-12-28 |
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | ✓ Link | 10.81 | | 8300M | Megatron-LM | 2019-09-17 |
GLM: General Language Model Pretraining with Autoregressive Blank Infilling | ✓ Link | 11.33 | | 10000M | GLM-XXLarge (bidirectional) | 2021-03-18 |
GLM: General Language Model Pretraining with Autoregressive Blank Infilling | ✓ Link | 12.22 | | 10000M | GLM-XXLarge (unidirectional) | 2021-03-18 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 12.5 | | 1300M | Hybrid H3 (1.3B) | 2022-12-28 |
Advancing State of the Art in Language Modeling | ✓ Link | 13.29 | 13.11 | | Ensemble of All | 2023-11-28 |
GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling | ✓ Link | 13.4 | | 125M | GateLoop (125M) | 2023-11-03 |
You can't pick your neighbors, or can you? When and how to rely on retrieval in the $k$NN-LM | ✓ Link | 15.5 | 15.72 | 247M | kNN-LM w/ Adaptive Coefficient | 2022-10-28 |
Generalization through Memorization: Nearest Neighbor Language Models | ✓ Link | 15.79 | 15.81 | 247M | kNN-LM w/ Continuous Cache | 2019-11-01 |
Efficient Content-Based Sparse Attention with Routing Transformers | ✓ Link | 15.8 | | | Routing Transformer | 2020-03-12 |
Generalization through Memorization: Nearest Neighbor Language Models | ✓ Link | 16.12 | 16.06 | 247M | kNN-LM | 2019-11-01 |
Dynamic Evaluation of Transformer Language Models | ✓ Link | 16.4 | 15.8 | 257M | Transformer-XL (RMS dynamic eval) | 2019-04-17 |
$\infty$-former: Infinite Memory Transformer | ✓ Link | 16.61 | | | ∞-former (Sticky memories + initialized GPT-2 Small) | 2021-09-01 |
$\infty$-former: Infinite Memory Transformer | ✓ Link | 16.64 | | | ∞-former (initialized GPT-2 Small) | 2021-09-01 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 16.9 | | 355M | Hybrid H3 (355M) | 2022-12-28 |
Dynamic Evaluation of Transformer Language Models | ✓ Link | 17.0 | 16.3 | 257M | Transformer-XL (SGD dynamic eval) | 2019-04-17 |
Compressive Transformers for Long-Range Sequence Modelling | ✓ Link | 17.1 | 16.0 | | Compressive Transformer (18L, M=1024) | 2019-11-13 |
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 17.1 | 16.4 | 234M | SRU++ Large | 2021-02-24 |
Segatron: Segment-Aware Transformer for Language Modeling and Understanding | ✓ Link | 17.1 | | 257M | SegaTransformer-XL | 2020-04-30 |
The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | ✓ Link | 17.18 | 16.54 | | Transformer+SSA+Self-ensemble | 2023-06-02 |
Improving Neural Language Models by Segmenting, Attending, and Predicting the Future | ✓ Link | 17.4 | | 257M | Transformer-XL Large + Phrase Induction | 2019-06-04 |
Language Models are Unsupervised Multitask Learners | ✓ Link | 17.48 | | 1542M | GPT-2 Full | 2019-02-14 |
Shortformer: Better Language Modeling using Shorter Inputs | ✓ Link | 17.56 | 16.89 | 247M | Staged Training | 2020-12-31 |
The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | ✓ Link | 17.60 | 16.91 | | Transformer+SSA | 2023-06-02 |
Improving Transformer Models by Reordering their Sublayers | ✓ Link | 17.96 | | 247M | Sandwich Transformer | 2019-11-10 |
Differentiable Model Compression via Pseudo Quantization Noise | ✓ Link | 18.0 | | | DIFFQ (λ=1, g=16) | 2021-04-20 |
Mega: Moving Average Equipped Gated Attention | ✓ Link | 18.07 | | 252M | Mega | 2022-09-21 |
Shortformer: Better Language Modeling using Shorter Inputs | ✓ Link | 18.15 | 17.47 | 247M | Shortformer | 2020-12-31 |
Addressing Some Limitations of Transformers with Feedback Memory | ✓ Link | 18.2 | 17.5 | 139M | Feedback Transformer (8 layers) | 2020-02-21 |
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 18.3 | 17.5 | 148M | SRU++ Base | 2021-02-24 |
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 18.3 | 18.2 | 257M | Transformer-XL Large | 2019-01-09 |
Pay Attention when Required | ✓ Link | 18.4 | | | PAR Transformer Large | 2020-09-09 |
General-purpose, long-context autoregressive modeling with Perceiver AR | ✓ Link | 18.4 | | | Perceiver AR 358M | 2022-02-15 |
Hyena Hierarchy: Towards Larger Convolutional Language Models | ✓ Link | 18.5 | | | Hyena-3-slim | 2023-02-21 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 18.5 | | | Hybrid H3 125M | 2022-12-28 |
Hyena Hierarchy: Towards Larger Convolutional Language Models | ✓ Link | 18.6 | | | Hyena-3 | 2023-02-21 |
Adaptive Input Representations for Neural Language Modeling | ✓ Link | 18.70 | 17.97 | 247M | Transformer (Adaptive inputs) | 2018-09-28 |
Finetuning Pretrained Transformers into RNNs | ✓ Link | 19.6 | 19.0 | | T2R + Pretrain | 2021-03-24
Subformer: A Parameter Reduced Transformer | | 20.39 | | 96M | Subformer | 2021-01-01 |
Language Models with Transformers | ✓ Link | 20.4 | 19.6 | 395M | BERT-Large-CAS | 2019-04-20 |
Augmenting Self-attention with Persistent Memory | ✓ Link | 20.6 | 19.7 | 133M | All-attention network (36 layers) | 2019-07-02 |
Efficiently Modeling Long Sequences with Structured State Spaces | ✓ Link | 21.28 | | 249M | S4 | 2021-10-31 |
Language Models are Unsupervised Multitask Learners | ✓ Link | 22.05 | | 774M | GPT-2 Large | 2019-02-14 |
Addressing Some Limitations of Transformers with Feedback Memory | ✓ Link | 22.4 | 21.4 | 44M | Feedback Transformer (4 layers) | 2020-02-21 |
Pay Attention when Required | ✓ Link | 22.7 | | | PAR Transformer Base | 2020-09-09 |
Memory-efficient Stochastic methods for Memory-based Transformers | ✓ Link | 22.91 | 21.87 | 122M | Skip Cross-Head Transformer-XL | 2023-11-14 |
Deep Equilibrium Models | ✓ Link | 23.2 | | 110M | DEQ-Transformer (medium, adaptive embed) | 2019-09-03 |
Time-aware Large Kernel Convolutions | ✓ Link | 23.3 | | 240M | TaLK Convolutions | 2020-02-08 |
Random Feature Attention | | 23.5 | 22.0 | | Rfa-Gate-Gaussian-Stateful (Big) | 2021-03-03
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 23.7 | | 125M | Hybrid H3 (125M) | 2022-12-28 |
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 24.0 | 23.1 | 151M | Transformer-XL Standard | 2019-01-09 |
DeLighT: Deep and Light-weight Transformer | ✓ Link | 24.14 | | 99M | DeLighT | 2020-08-03 |
$\infty$-former: Infinite Memory Transformer | ✓ Link | 24.22 | | | ∞-former (Sticky memories) | 2021-09-01 |
Revisiting Simple Neural Probabilistic Language Models | ✓ Link | 25.2 | 24.1 | 148M | Transformer-N | 2021-04-08 |
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | ✓ Link | 25.6 | | | Linear Attention 125M | 2020-06-29 |
FNetAR: Mixing Tokens with Autoregressive Fourier Transforms | ✓ Link | 25.81 | | 144.4M | FNetAR Medium | 2021-07-22 |
Reformer: The Efficient Transformer | ✓ Link | 26.0 | | | Reformer 125M | 2020-01-13 |
Language Models are Unsupervised Multitask Learners | ✓ Link | 26.37 | | 355M | GPT-2 Medium | 2019-02-14 |
Rethinking Attention with Performers | ✓ Link | 26.8 | | | Performer 125M | 2020-09-30 |
Improving Neural Language Modeling via Adversarial Training | ✓ Link | 28.0 | 27.2 | | AdvSoft (+ 4 layer QRNN + dynamic eval) | 2019-06-10 |
Deep Equilibrium Models | ✓ Link | 29.0 | | 180M | DEQ-TrellisNet | 2019-09-03 |
Trellis Networks for Sequence Modeling | ✓ Link | 29.19 | | | Trellis Network | 2018-10-15 |
Fast Parametric Learning with Activation Memorization | | 29.2 | 29.0 | | LSTM (Hebbian, Cache, MbPA) | 2018-03-27 |
Fast Parametric Learning with Activation Memorization | | 29.7 | 29.9 | | LSTM (Hebbian, Cache) | 2018-03-27 |
Random Feature Attention | | 30.5 | 29.4 | | Rfa-Gate-Gaussian-Stateful (Small) | 2021-03-03 |
Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation | ✓ Link | 31.0 | | | Primal.+Trans. | 2023-05-31 |
Relational recurrent neural networks | ✓ Link | 31.6 | 30.8 | | LSTM (RMC) | 2018-06-05 |
Deep Equilibrium Models | ✓ Link | 32.4 | | 138M | DEQ-Transformer (small) | 2019-09-03 |
Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | ✓ Link | 32.85 | 31.92 | | AWD-LSTM-MoS + ATOI | 2019-09-18 |
An Analysis of Neural Language Modeling at Multiple Scales | ✓ Link | 33.0 | 32.0 | 151M | 4 layer QRNN | 2018-03-22 |
Fast Parametric Learning with Activation Memorization | | 34.3 | 34.1 | | LSTM (Hebbian) | 2018-03-27 |
Fast Parametric Learning with Activation Memorization | | 36.4 | 36.0 | | LSTM | 2018-03-27 |
Language Modeling with Gated Convolutional Networks | ✓ Link | 37.2 | | | GCNN-14 | 2016-12-23
Language Models are Unsupervised Multitask Learners | ✓ Link | 37.50 | | 124M | GPT-2 Small | 2019-02-14 |
Improving Neural Language Models with a Continuous Cache | ✓ Link | 40.8 | | | Neural cache model (size = 2,000) | 2016-12-13 |
Improving Neural Language Models with a Continuous Cache | ✓ Link | 44.8 | | | Neural cache model (size = 100) | 2016-12-13 |
Language Modeling with Gated Convolutional Networks | ✓ Link | 44.9 | | | GCNN-8 | 2016-12-23 |
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling | ✓ Link | 45.19 | | | TCN | 2018-03-04 |
Convolutional Sequence Modeling Revisited | | 45.2 | | | Temporal CNN | 2018-01-01
Improving Neural Language Models with a Continuous Cache | ✓ Link | 48.7 | | | LSTM | 2016-12-13 |
On the adequacy of untuned warmup for adaptive optimization | ✓ Link | | 19.5 | | Transformer (Adaptive inputs) | 2019-10-09 |
How much complexity does an RNN architecture need to learn syntax-sensitive dependencies? | ✓ Link | | 52.73 | | LSTM | 2020-05-17 |
How much complexity does an RNN architecture need to learn syntax-sensitive dependencies? | ✓ Link | | 53.78 | | GRU | 2020-05-17 |
How much complexity does an RNN architecture need to learn syntax-sensitive dependencies? | ✓ Link | | 76.67 | | Decay RNN | 2020-05-17 |
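Both score columns report perplexity (lower is better): the exponential of the average per-token negative log-likelihood on the held-out split, so the rankings above compare how sharply each model predicts the next token. A minimal sketch of the metric, assuming per-token log-probabilities are already available (the `perplexity` helper and the toy numbers below are illustrative, not taken from any listed paper):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-likelihood per token).

    `token_logprobs` holds the natural-log probability the model assigned
    to each token of the evaluation split (test or validation).
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Toy check: a model that assigns every token probability 1/20
# scores a perplexity of ~20.
print(perplexity([math.log(1 / 20)] * 1000))  # -> ~20.0
```

Papers can still differ in tokenization and evaluation details (e.g., sliding-window context, dynamic evaluation, or retrieval at test time), so perplexities in the table are only strictly comparable when those setups match.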