Paper | Link | Test perplexity | Validation perplexity | Params | Model | Date
Improving language models by retrieving from trillions of tokens | ✓ Link | 2.4 | | 7532M | RETRO (7.5B) | 2021-12-08
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 10.6 | | 2700M | Hybrid H3 (2.7B) | 2022-12-28 |
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | ✓ Link | 10.81 | | 8300M | Megatron-LM | 2019-09-17 |
GLM: General Language Model Pretraining with Autoregressive Blank Infilling | ✓ Link | 11.33 | | 10000M | GLM-XXLarge (bidirectional) | 2021-03-18 |
GLM: General Language Model Pretraining with Autoregressive Blank Infilling | ✓ Link | 12.22 | | 10000M | GLM-XXLarge (unidirectional) | 2021-03-18 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 12.5 | | 1300M | Hybrid H3 (1.3B) | 2022-12-28 |
Advancing State of the Art in Language Modeling | ✓ Link | 13.29 | 13.11 | | Ensemble of All | 2023-11-28 |
GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling | ✓ Link | 13.4 | | 125M | GateLoop (125M) | 2023-11-03 |
You can't pick your neighbors, or can you? When and how to rely on retrieval in the $k$NN-LM | ✓ Link | 15.5 | 15.72 | 247M | kNN-LM w/ Adaptive Coefficient | 2022-10-28 |
Generalization through Memorization: Nearest Neighbor Language Models | ✓ Link | 15.79 | 15.81 | 247M | kNN-LM w/ Continuous Cache | 2019-11-01 |
Efficient Content-Based Sparse Attention with Routing Transformers | ✓ Link | 15.8 | | | Routing Transformer | 2020-03-12 |
Generalization through Memorization: Nearest Neighbor Language Models | ✓ Link | 16.12 | 16.06 | 247M | kNN-LM | 2019-11-01 |
Dynamic Evaluation of Transformer Language Models | ✓ Link | 16.4 | 15.8 | 257M | Transformer-XL (RMS dynamic eval) | 2019-04-17 |
$\infty$-former: Infinite Memory Transformer | ✓ Link | 16.61 | | | ∞-former (Sticky memories + initialized GPT-2 Small) | 2021-09-01 |
$\infty$-former: Infinite Memory Transformer | ✓ Link | 16.64 | | | ∞-former (initialized GPT-2 Small) | 2021-09-01 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 16.9 | | 355M | Hybrid H3 (355M) | 2022-12-28 |
Dynamic Evaluation of Transformer Language Models | ✓ Link | 17.0 | 16.3 | 257M | Transformer-XL (SGD dynamic eval) | 2019-04-17 |
Compressive Transformers for Long-Range Sequence Modelling | ✓ Link | 17.1 | 16.0 | | Compressive Transformer (18L, M=1024) | 2019-11-13 |
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 17.1 | 16.4 | 234M | SRU++ Large | 2021-02-24 |
Segatron: Segment-Aware Transformer for Language Modeling and Understanding | ✓ Link | 17.1 | | 257M | SegaTransformer-XL | 2020-04-30 |
The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | ✓ Link | 17.18 | 16.54 | | Transformer+SSA+Self-ensemble | 2023-06-02 |
Improving Neural Language Models by Segmenting, Attending, and Predicting the Future | ✓ Link | 17.4 | | 257M | Transformer-XL Large + Phrase Induction | 2019-06-04 |
Language Models are Unsupervised Multitask Learners | ✓ Link | 17.48 | | 1542M | GPT-2 Full | 2019-02-14 |
Shortformer: Better Language Modeling using Shorter Inputs | ✓ Link | 17.56 | 16.89 | 247M | Staged Training | 2020-12-31 |
The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | ✓ Link | 17.60 | 16.91 | | Transformer+SSA | 2023-06-02 |
Improving Transformer Models by Reordering their Sublayers | ✓ Link | 17.96 | | 247M | Sandwich Transformer | 2019-11-10 |
Differentiable Model Compression via Pseudo Quantization Noise | ✓ Link | 18.0 | | | DIFFQ (λ=1, g=16) | 2021-04-20 |
Mega: Moving Average Equipped Gated Attention | ✓ Link | 18.07 | | 252M | Mega | 2022-09-21 |
Shortformer: Better Language Modeling using Shorter Inputs | ✓ Link | 18.15 | 17.47 | 247M | Shortformer | 2020-12-31 |
Addressing Some Limitations of Transformers with Feedback Memory | ✓ Link | 18.2 | 17.5 | 139M | Feedback Transformer (8 layers) | 2020-02-21 |
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | ✓ Link | 18.3 | 17.5 | 148M | SRU++ Base | 2021-02-24 |
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 18.3 | 18.2 | 257M | Transformer-XL Large | 2019-01-09 |
Pay Attention when Required | ✓ Link | 18.4 | | | PAR Transformer Large | 2020-09-09 |
General-purpose, long-context autoregressive modeling with Perceiver AR | ✓ Link | 18.4 | | | Perceiver AR 358M | 2022-02-15 |
Hyena Hierarchy: Towards Larger Convolutional Language Models | ✓ Link | 18.5 | | | Hyena-3-slim | 2023-02-21 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 18.5 | | | Hybrid H3 125M | 2022-12-28 |
Hyena Hierarchy: Towards Larger Convolutional Language Models | ✓ Link | 18.6 | | | Hyena-3 | 2023-02-21 |
Adaptive Input Representations for Neural Language Modeling | ✓ Link | 18.70 | 17.97 | 247M | Transformer (Adaptive inputs) | 2018-09-28 |
Finetuning Pretrained Transformers into RNNs | ✓ Link | 19.6 | 19.0 | | T2R + Pretrain | 2021-03-24
Subformer: A Parameter Reduced Transformer | | 20.39 | | 96M | Subformer | 2021-01-01 |
Language Models with Transformers | ✓ Link | 20.4 | 19.6 | 395M | BERT-Large-CAS | 2019-04-20 |
Augmenting Self-attention with Persistent Memory | ✓ Link | 20.6 | 19.7 | 133M | All-attention network (36 layers) | 2019-07-02 |
Efficiently Modeling Long Sequences with Structured State Spaces | ✓ Link | 21.28 | | 249M | S4 | 2021-10-31 |
Language Models are Unsupervised Multitask Learners | ✓ Link | 22.05 | | 774M | GPT-2 Large | 2019-02-14 |
Addressing Some Limitations of Transformers with Feedback Memory | ✓ Link | 22.4 | 21.4 | 44M | Feedback Transformer (4 layers) | 2020-02-21 |
Pay Attention when Required | ✓ Link | 22.7 | | | PAR Transformer Base | 2020-09-09 |
Memory-efficient Stochastic methods for Memory-based Transformers | ✓ Link | 22.91 | 21.87 | 122M | Skip Cross-Head Transformer-XL | 2023-11-14 |
Deep Equilibrium Models | ✓ Link | 23.2 | | 110M | DEQ-Transformer (medium, adaptive embed) | 2019-09-03 |
Time-aware Large Kernel Convolutions | ✓ Link | 23.3 | | 240M | TaLK Convolutions | 2020-02-08 |
Random Feature Attention | | 23.5 | 22.0 | | Rfa-Gate-Gaussian-Stateful (Big) | 2021-03-03
Hungry Hungry Hippos: Towards Language Modeling with State Space Models | ✓ Link | 23.7 | | 125M | Hybrid H3 (125M) | 2022-12-28 |
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | ✓ Link | 24.0 | 23.1 | 151M | Transformer-XL Standard | 2019-01-09 |
DeLighT: Deep and Light-weight Transformer | ✓ Link | 24.14 | | 99M | DeLighT | 2020-08-03 |
$\infty$-former: Infinite Memory Transformer | ✓ Link | 24.22 | | | ∞-former (Sticky memories) | 2021-09-01 |
Revisiting Simple Neural Probabilistic Language Models | ✓ Link | 25.2 | 24.1 | 148M | Transformer-N | 2021-04-08 |
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | ✓ Link | 25.6 | | | Linear Attention 125M | 2020-06-29 |
FNetAR: Mixing Tokens with Autoregressive Fourier Transforms | ✓ Link | 25.81 | | 144.4M | FNetAR Medium | 2021-07-22 |
Reformer: The Efficient Transformer | ✓ Link | 26.0 | | | Reformer 125M | 2020-01-13 |
Language Models are Unsupervised Multitask Learners | ✓ Link | 26.37 | | 355M | GPT-2 Medium | 2019-02-14 |
Rethinking Attention with Performers | ✓ Link | 26.8 | | | Performer 125M | 2020-09-30 |
Improving Neural Language Modeling via Adversarial Training | ✓ Link | 28.0 | 27.2 | | AdvSoft (+ 4 layer QRNN + dynamic eval) | 2019-06-10 |
Deep Equilibrium Models | ✓ Link | 29.0 | | 180M | DEQ-TrellisNet | 2019-09-03 |
Trellis Networks for Sequence Modeling | ✓ Link | 29.19 | | | Trellis Network | 2018-10-15 |
Fast Parametric Learning with Activation Memorization | | 29.2 | 29.0 | | LSTM (Hebbian, Cache, MbPA) | 2018-03-27 |
Fast Parametric Learning with Activation Memorization | | 29.7 | 29.9 | | LSTM (Hebbian, Cache) | 2018-03-27 |
Random Feature Attention | | 30.5 | 29.4 | | Rfa-Gate-Gaussian-Stateful (Small) | 2021-03-03 |
Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation | ✓ Link | 31.0 | | | Primal.+Trans. | 2023-05-31 |
Relational recurrent neural networks | ✓ Link | 31.6 | 30.8 | | LSTM (RMC) | 2018-06-05 |
Deep Equilibrium Models | ✓ Link | 32.4 | | 138M | DEQ-Transformer (small) | 2019-09-03 |
Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | ✓ Link | 32.85 | 31.92 | | AWD-LSTM-MoS + ATOI | 2019-09-18 |
An Analysis of Neural Language Modeling at Multiple Scales | ✓ Link | 33.0 | 32.0 | 151M | 4 layer QRNN | 2018-03-22 |
Fast Parametric Learning with Activation Memorization | | 34.3 | 34.1 | | LSTM (Hebbian) | 2018-03-27 |
Fast Parametric Learning with Activation Memorization | | 36.4 | 36.0 | | LSTM | 2018-03-27 |
Language Modeling with Gated Convolutional Networks | ✓ Link | 37.2 | | | GCNN-14 | 2016-12-23
Language Models are Unsupervised Multitask Learners | ✓ Link | 37.50 | | 124M | GPT-2 Small | 2019-02-14 |
Improving Neural Language Models with a Continuous Cache | ✓ Link | 40.8 | | | Neural cache model (size = 2,000) | 2016-12-13 |
Improving Neural Language Models with a Continuous Cache | ✓ Link | 44.8 | | | Neural cache model (size = 100) | 2016-12-13 |
Language Modeling with Gated Convolutional Networks | ✓ Link | 44.9 | | | GCNN-8 | 2016-12-23 |
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling | ✓ Link | 45.19 | | | TCN | 2018-03-04 |
Convolutional Sequence Modeling Revisited | | 45.2 | | | Temporal CNN | 2018-01-01
Improving Neural Language Models with a Continuous Cache | ✓ Link | 48.7 | | | LSTM | 2016-12-13 |
On the adequacy of untuned warmup for adaptive optimization | ✓ Link | | 19.5 | | Transformer (Adaptive inputs) | 2019-10-09 |
How much complexity does an RNN architecture need to learn syntax-sensitive dependencies? | ✓ Link | | 52.73 | | LSTM | 2020-05-17 |
How much complexity does an RNN architecture need to learn syntax-sensitive dependencies? | ✓ Link | | 53.78 | | GRU | 2020-05-17 |
How much complexity does an RNN architecture need to learn syntax-sensitive dependencies? | ✓ Link | | 76.67 | | Decay RNN | 2020-05-17 |
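Both score columns report perplexity (lower is better): the exponential of the average per-token negative log-likelihood on the held-out split, so the rankings above compare how sharply each model predicts the next token. A minimal sketch of the metric, assuming per-token log-probabilities are already available (the `perplexity` helper and the toy numbers below are illustrative, not taken from any listed paper):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-likelihood per token).

    `token_logprobs` holds the natural-log probability the model assigned
    to each token of the evaluation split (test or validation).
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Toy check: a model that assigns every token probability 1/20
# scores a perplexity of ~20.
print(perplexity([math.log(1 / 20)] * 1000))  # -> ~20.0
```

Papers can still differ in tokenization and evaluation details (e.g., sliding-window context, dynamic evaluation, or retrieval at test time), so perplexities in the table are only strictly comparable when those setups match.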