| Paper | Link | Accuracy | F1 | Model | Date |
|---|---|---|---|---|---|
| SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | 93.7% | 91.7 | MT-DNN-SMART | 2019-11-08 |
| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | ✓ Link | 93.4% | | ALBERT | 2019-09-26 |
| RoBERTa: A Robustly Optimized BERT Pretraining Approach | ✓ Link | 92.3% | | RoBERTa (ensemble) | 2019-07-26 |
| StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding | | 91.5% | 93.6% | StructBERT-RoBERTa ensemble | 2019-08-13 |
| Learning to Encode Position for Transformer with Continuous Dynamical Model | ✓ Link | 91.4% | | FLOATER-large | 2020-03-13 |
| SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | 91.3% | | SMART | 2019-11-08 |
| LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | ✓ Link | 91.0% | | RoBERTa-large 355M (MLP quantized vector-wise, fine-tuned) | 2022-08-15 |
| SpanBERT: Improving Pre-training by Representing and Predicting Spans | ✓ Link | 90.9% | | SpanBERT | 2019-07-24 |
| XLNet: Generalized Autoregressive Pretraining for Language Understanding | ✓ Link | 90.8% | | XLNet (single model) | 2019-06-19 |
| AutoBERT-Zero: Evolving BERT Backbone from Scratch | | 90.7% | | AutoBERT-Zero (Large) | 2021-07-15 |
| CLEAR: Contrastive Learning for Sentence Representation | | 90.6% | | MLM + del-word + reorder | 2020-12-31 |
| AutoBERT-Zero: Evolving BERT Backbone from Scratch | | 90.5% | | AutoBERT-Zero (Base) | 2021-07-15 |
| A Statistical Framework for Low-bitwidth Training of Deep Neural Networks | ✓ Link | 90.4% | | PSQ (Chen et al., 2020) | 2020-10-27 |
| DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | ✓ Link | 90.2% | | DistilBERT 66M | 2019-10-02 |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 90.0% | 91.9 | T5-11B | 2019-10-23 |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 89.9% | 92.4 | T5-Large | 2019-10-23 |
| Q8BERT: Quantized 8Bit BERT | ✓ Link | 89.7% | | Q8BERT (Zafrir et al., 2019) | 2019-10-14 |
| ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators | | 89.6% | | ELECTRA | |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 89.2% | 92.5 | T5-3B | 2019-10-23 |
| MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices | ✓ Link | 88.8% | | MobileBERT | 2020-04-06 |
| ERNIE: Enhanced Language Representation with Informative Entities | ✓ Link | 88.2% | | ERNIE | 2019-05-17 |
| Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT | | 88.2% | | Q-BERT (Shen et al., 2020) | 2019-09-12 |
| FNet: Mixing Tokens with Fourier Transforms | ✓ Link | 88% | | FNet-Large | 2021-05-09 |
| SqueezeBERT: What can computer vision teach NLP about efficient neural networks? | ✓ Link | 87.8% | | SqueezeBERT | 2020-06-19 |
| Charformer: Fast Character Transformers via Gradient-based Subword Tokenization | ✓ Link | 87.5% | 91.4 | Charformer-Tall | 2021-06-23 |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 87.5% | 90.7 | T5-Base | 2019-10-23 |
| How to Train BERT with an Academic Budget | ✓ Link | 87.5% | | 24hBERT | 2021-04-15 |
| ERNIE 2.0: A Continual Pre-training Framework for Language Understanding | ✓ Link | 87.4% | | ERNIE 2.0 Large | 2019-07-29 |
| TinyBERT: Distilling BERT for Natural Language Understanding | ✓ Link | 87.3% | | TinyBERT-6 67M | 2019-09-23 |
| RealFormer: Transformer Likes Residual Attention | ✓ Link | 87.01% | 90.91% | RealFormer | 2020-12-21 |
| SubRegWeigh: Effective and Efficient Annotation Weighing with Subword Regularization | ✓ Link | 86.82% | | RoBERTa + SubRegWeigh (K-means) | 2024-09-10 |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 86.6% | 89.7 | T5-Small | 2019-10-23 |
| TinyBERT: Distilling BERT for Natural Language Understanding | ✓ Link | 86.4% | | TinyBERT-4 14.5M | 2019-09-23 |
| ERNIE 2.0: A Continual Pre-training Framework for Language Understanding | ✓ Link | 86.1% | | ERNIE 2.0 Base | 2019-07-29 |
| Discriminative Improvements to Distributional Sentence Similarity | | 80.4% | 85.9% | TF-KLD | 2013-10-01 |
| Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning | ✓ Link | 78.6% | 84.4% | GenSen | 2018-03-30 |
| Supervised Learning of Universal Sentence Representations from Natural Language Inference Data | ✓ Link | 76.2% | 83.1% | InferSent | 2017-05-05 |
| Big Bird: Transformers for Longer Sequences | ✓ Link | | 91.5 | BigBird | 2020-07-28 |
| Entailment as Few-Shot Learner | ✓ Link | | 91.0 | RoBERTa-large 355M + Entailment as Few-shot Learner | 2021-04-29 |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | ✓ Link | | 89.3 | BERT-LARGE | 2018-10-11 |
| Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention | ✓ Link | | 88.1% | Nyströmformer | 2021-02-07 |
| Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning | ✓ Link | | | BERT-Base | 2020-12-22 |
| Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning | ✓ Link | | | BERT-Large | 2020-12-22 |
| SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | | | SMART-BERT | 2019-11-08 |
| SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | | | SMART-RoBERTa | 2019-11-08 |
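
The Accuracy/F1 pairs above follow the GLUE MRPC convention for paraphrase identification. As a minimal illustration only (not any paper's official evaluation script), the sketch below scores a fine-tuned sentence-pair classifier on the MRPC validation split with Hugging Face `datasets` and `transformers`; the checkpoint name `textattack/roberta-base-MRPC` is an assumed example and can be swapped for any MRPC-fine-tuned model.

```python
# Minimal sketch: compute MRPC-style Accuracy and F1 for a fine-tuned
# sentence-pair classifier. Checkpoint name below is an assumption, not
# an official artifact from any paper in the table.
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "textattack/roberta-base-MRPC"  # example checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

# GLUE MRPC validation split: sentence pairs labeled 1 (paraphrase) / 0 (not).
data = load_dataset("glue", "mrpc", split="validation")

preds, labels = [], []
with torch.no_grad():
    for start in range(0, len(data), 32):
        batch = data[start:start + 32]  # dict of lists for this slice
        enc = tokenizer(batch["sentence1"], batch["sentence2"],
                        padding=True, truncation=True, return_tensors="pt")
        logits = model(**enc).logits
        preds.extend(logits.argmax(dim=-1).tolist())
        labels.extend(batch["label"])

# Same two metrics as the table columns (F1 uses label 1 as positive class).
print(f"Accuracy: {accuracy_score(labels, preds):.4f}")
print(f"F1:       {f1_score(labels, preds):.4f}")
```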