OpenCodePapers

Language Modelling on enwik8
[Interactive chart: results over time]
Leaderboard
| Paper | Code | Bit per Character (BPC) | Number of params | Model Name | Release Date |
| --- | --- | --- | --- | --- | --- |
| Language Models are Unsupervised Multitask Learners | ✓ Link | 0.93 | 1542M | GPT-2 (48 layers, h=1600) | 2019-02-14 |
| Dynamic Evaluation of Transformer Language Models | ✓ Link | 0.940 | 277M | Transformer-XL (24 layers, RMS dynamic eval, decay) | 2019-04-17 |
| Focus Your Attention (with Adaptive IIR Filters) | | 0.940 | 22M | Focus | 2023-05-24 |
| Not All Memories are Created Equal: Learning to Forget by Expiring | ✓ Link | 0.95 | 208M | Expire-Span (24 layers) | 2021-05-13 |
| When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 0.95 | 195M | SRU++ Large | 2021-02-24 |
| Addressing Some Limitations of Transformers with Feedback Memory | ✓ Link | 0.96 | 77M | Feedback Transformer | 2020-02-21 |
| Improving Transformer Models by Reordering their Sublayers | ✓ Link | 0.968 | 209M | Sandwich Transformer (adaptive span) | 2019-11-10 |
| Compressive Transformers for Long-Range Sequence Modelling | ✓ Link | 0.97 | 277M | Compressive Transformer (24 layers) | 2019-11-13 |
| Long-Short Transformer: Efficient Transformers for Language and Vision | ✓ Link | 0.97 | 110M | Transformer-LS (large) | 2021-07-05 |
| When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 0.97 | 108M | SRU++ Base | 2021-02-24 |
| Adaptive Attention Span in Transformers | ✓ Link | 0.98 | 209M | Transformer (24 layers, 8k adaptive span) | 2019-05-19 |
| Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 0.99 | 277M | Transformer-XL (24 layers) | 2019-01-09 |
| Longformer: The Long-Document Transformer | ✓ Link | 0.99 | 102M | Longformer (30 layers, h=512) | 2020-04-10 |
| Generating Long Sequences with Sparse Transformers | ✓ Link | 0.99 | 95M | Sparse Transformer (30 layers, fixed attn) | 2019-04-23 |
| Efficient Content-Based Sparse Attention with Routing Transformers | ✓ Link | 0.99 | | Routing Transformer (12 layers) | 2020-03-12 |
| Long-Short Transformer: Efficient Transformers for Language and Vision | ✓ Link | 0.99 | | Transformer-LS (small) | 2021-07-05 |
| Hierarchical Transformers Are More Efficient Language Models | ✓ Link | 0.997 | | Hourglass | 2021-10-26 |
| Longformer: The Long-Document Transformer | ✓ Link | 1.00 | 41M | Longformer (12 layers, h=512) | 2020-04-10 |
| Augmenting Self-attention with Persistent Memory | ✓ Link | 1.01 | 39M | All-attention network (18 layers) | 2019-07-02 |
| Adaptive Attention Span in Transformers | ✓ Link | 1.02 | 39M | Transformer (12 layers, 8k adaptive span) | 2019-05-19 |
| BP-Transformer: Modelling Long-Range Context via Binary Partitioning | ✓ Link | 1.02 | 38M | BP-Transformer (12 layers) | 2019-11-11 |
| The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | ✓ Link | 1.024 | | Transformer+SSA | 2023-06-02 |
| Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 1.03 | 88M | Transformer-XL (18 layers) | 2019-01-09 |
| Memory-efficient Stochastic methods for Memory-based Transformers | ✓ Link | 1.033 | 41M | Skip Cross-Head Transformer-XL | 2023-11-14 |
| Character-Level Language Modeling with Deeper Self-Attention | ✓ Link | 1.06 | 235M | Transformer (64 layers) | 2018-08-09 |
| Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 1.06 | 41M | Transformer-XL (12 layers) | 2019-01-09 |
| Single Headed Attention RNN: Stop Thinking With Your Head | ✓ Link | 1.068 | 54M | SHA-RNN (4 layers, h=1024, attention head per layer) | 2019-11-26 |
| Single Headed Attention RNN: Stop Thinking With Your Head | ✓ Link | 1.076 | 52M | SHA-RNN (4 layers, h=1024, single attention head) | 2019-11-26 |
| Character-Level Language Modeling with Deeper Self-Attention | ✓ Link | 1.11 | 44M | 64-layer Character Transformer Model | 2018-08-09 |
| Mogrifier LSTM | ✓ Link | 1.146 | 48M | Mogrifier LSTM | 2019-09-04 |
| Mogrifier LSTM | ✓ Link | 1.195 | 48M | LSTM | 2019-09-04 |
| Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding | | 1.22 | | Cluster-Former (#C=512) | 2020-09-13 |
| An Analysis of Neural Language Modeling at Multiple Scales | ✓ Link | 1.232 | 47M | AWD-LSTM (3 layers) | 2018-03-22 |
| Multiplicative LSTM for sequence modelling | ✓ Link | 1.24 | 46M | Large mLSTM | 2016-09-26 |
| Fast-Slow Recurrent Neural Networks | ✓ Link | 1.25 | 47M | Large FS-LSTM-4 | 2017-05-24 |
| Recurrent Highway Networks | ✓ Link | 1.27 | 46M | Recurrent Highway Networks | 2016-07-12 |
| Neural Machine Translation in Linear Time | ✓ Link | 1.31 | | ByteNet | 2016-10-31 |
| Hierarchical Multiscale Recurrent Neural Networks | ✓ Link | 1.32 | 35M | LN HM-LSTM | 2016-09-06 |
| Single Headed Attention RNN: Stop Thinking With Your Head | ✓ Link | 1.33 | 51M | SHA-LSTM (4 layers, h=1024, no attention head) | 2019-11-26 |
| HyperNetworks | ✓ Link | 1.34 | 27M | Hypernetworks | 2016-09-27 |
| Generating Sequences With Recurrent Neural Networks | ✓ Link | 1.67 | | LSTM (7 layers) | 2013-08-04 |
| Augmenting Self-attention with Persistent Memory | ✓ Link | | 114M | All-attention network (36 layers) | 2019-07-02 |
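
The Bit per Character (BPC) column above is the average negative log2-likelihood the model assigns to each character of the evaluation text, i.e. character-level cross-entropy expressed in bits (nats divided by ln 2); lower is better. A minimal sketch of that conversion, using a hypothetical `bits_per_character` helper (not part of any listed codebase):

```python
import math

def bits_per_character(char_log_probs):
    """Average negative log2-likelihood per character.

    char_log_probs: natural-log probabilities a model assigned to each
    ground-truth character of the evaluation text (one float per character).
    """
    total_bits = -sum(lp / math.log(2) for lp in char_log_probs)
    return total_bits / len(char_log_probs)

# Toy check: assigning probability 0.5 to every character gives exactly 1 BPC.
print(bits_per_character([math.log(0.5)] * 4))  # -> 1.0
```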