Paper | Code | Matched | Mismatched | Accuracy | Dev Matched | Dev Mismatched | Model | Release Date |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
– | | 92.6 | 92.4 | | | | Turing NLR v5 XXL 5.4B (fine-tuned) | |
First Train to Generate, then Generate to Train: UnitedSynT5 for Few-Shot NLI | | 92.6 | | | | | UnitedSynT5 (3B) | 2024-12-12 |
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | 92.0 | 91.7 | | | | T5 | 2019-11-08 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 92.0 | | | | | T5-XXL 11B (fine-tuned) | 2019-10-23 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 91.4 | 91.2 | | | | T5-3B | 2019-10-23 |
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | ✓ Link | 91.3 | | | | | ALBERT | 2019-09-26 |
DeBERTa: Decoding-enhanced BERT with Disentangled Attention | ✓ Link | 91.1 | 91.1 | | | | DeBERTa (large) | 2020-06-05 |
StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding | | 91.1 | 90.7 | | | | Adv-RoBERTa ensemble | 2019-08-13 |
RoBERTa: A Robustly Optimized BERT Pretraining Approach | ✓ Link | 90.8 | | | | | RoBERTa | 2019-07-26 |
XLNet: Generalized Autoregressive Pretraining for Language Understanding | ✓ Link | 90.8 | | | | | XLNet (single model) | 2019-06-19 |
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | ✓ Link | 90.2 | | | | | RoBERTa-large 355M (MLP quantized vector-wise, fine-tuned) | 2022-08-15 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 89.9 | | | | | T5-Large | 2019-10-23 |
A Statistical Framework for Low-bitwidth Training of Deep Neural Networks | ✓ Link | 89.9 | | | | | PSQ (Chen et al., 2020) | 2020-10-27 |
First Train to Generate, then Generate to Train: UnitedSynT5 for Few-Shot NLI | | 89.8 | | | | | UnitedSynT5 (335M) | 2024-12-12 |
ERNIE 2.0: A Continual Pre-training Framework for Language Understanding | ✓ Link | 88.7 | 88.8 | | | | ERNIE 2.0 Large | 2019-07-29 |
SpanBERT: Improving Pre-training by Representing and Predicting Spans | ✓ Link | 88.1 | | | | | SpanBERT | 2019-07-24 |
FNet: Mixing Tokens with Fourier Transforms | ✓ Link | 88.0 | 88.0 | | | | BERT-Large | 2021-05-09 |
Adversarial Self-Attention for Language Understanding | ✓ Link | 88.0 | | | | | ASA + RoBERTa | 2022-06-25 |
Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding | ✓ Link | 87.9 | 87.4 | | | | MT-DNN-ensemble | 2019-04-20 |
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT | | 87.8 | | | | | Q-BERT (Shen et al., 2020) | 2019-09-12 |
Training Complex Models with Multi-Task Weak Supervision | ✓ Link | 87.6 | 87.2 | | | | Snorkel MeTaL (ensemble) | 2018-10-05 |
Big Bird: Transformers for Longer Sequences | ✓ Link | 87.5 | | | | | BigBird | 2020-07-28 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 87.1 | 86.2 | | | | T5-Base | 2019-10-23 |
Multi-Task Deep Neural Networks for Natural Language Understanding | ✓ Link | 86.7 | 86.0 | | | | MT-DNN | 2019-01-31 |
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | ✓ Link | 86.7 | 85.9 | | | | BERT-LARGE | 2018-10-11 |
RealFormer: Transformer Likes Residual Attention | ✓ Link | 86.28 | 86.34 | | | | RealFormer | 2020-12-21 |
Pay Attention to MLPs | ✓ Link | 86.2 | 86.5 | | | | gMLP-large | 2021-05-17 |
ERNIE 2.0: A Continual Pre-training Framework for Language Understanding | ✓ Link | 86.1 | 85.5 | | | | ERNIE 2.0 Base | 2019-07-29 |
Q8BERT: Quantized 8Bit BERT | ✓ Link | 85.6 | | | | | Q8BERT (Zafrir et al., 2019) | 2019-10-14 |
Adversarial Self-Attention for Language Understanding | ✓ Link | 85.0 | | | | | ASA + BERT-base | 2022-06-25 |
TinyBERT: Distilling BERT for Natural Language Understanding | ✓ Link | 84.6 | 83.2 | | | | TinyBERT-6 67M | 2019-09-23 |
Not all layers are equally as important: Every Layer Counts BERT | | 84.4 | 84.5 | | | | ELC-BERT-base 98M (zero init) | 2023-11-03 |
How to Train BERT with an Academic Budget | ✓ Link | 84.4 | 83.8 | | | | 24hBERT | 2021-04-15 |
ERNIE: Enhanced Language Representation with Informative Entities | ✓ Link | 84.0 | 83.2 | | | | ERNIE | 2019-05-17 |
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization | ✓ Link | 83.7 | 84.4 | | | | Charformer-Tall | 2021-06-23 |
Not all layers are equally as important: Every Layer Counts BERT | | 83.0 | 83.4 | | | | LTG-BERT-base 98M | 2023-11-03 |
TinyBERT: Distilling BERT for Natural Language Understanding | ✓ Link | 82.5 | 81.8 | | | | TinyBERT-4 14.5M | 2019-09-23 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 82.4 | 82.3 | | | | T5-Small | 2019-10-23 |
What Do Questions Exactly Ask? MFAE: Duplicate Question Identification with Multi-Fusion Asking Emphasis | ✓ Link | 82.31 | 81.43 | | | | MFAE | 2020-05-07 |
Improving Language Understanding by Generative Pre-Training | ✓ Link | 82.1 | 81.4 | | | | Finetuned Transformer LM | 2018-06-11 |
SqueezeBERT: What can computer vision teach NLP about efficient neural networks? | ✓ Link | 82.0 | 81.1 | | | | SqueezeBERT | 2020-06-19 |
Generative Pretrained Structured Transformers: Unsupervised Syntactic Language Models at Scale | ✓ Link | 81.8 | 82.0 | | | | GPST (unsupervised generative syntactic LM) | 2024-03-13 |
Not all layers are equally as important: Every Layer Counts BERT | | 79.2 | 79.9 | | | | ELC-BERT-small 24M | 2023-11-03 |
Not all layers are equally as important: Every Layer Counts BERT | | 78.0 | 78.8 | | | | LTG-BERT-small 24M | 2023-11-03 |
FNet: Mixing Tokens with Fourier Transforms | ✓ Link | 78.0 | 76.0 | | | | FNet-Large | 2021-05-09 |
Attention Boosted Sequential Inference Model | | 73.9 | 73.9 | | | | aESIM | 2018-12-05 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 72.4 | 72.0 | | | | T5-Large 738M | 2023-04-27 |
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding | ✓ Link | 72.2 | 72.1 | | | | Multi-task BiLSTM + Attn | 2018-04-20 |
Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News | ✓ Link | 71.4 | 72.2 | | | | Stacked Bi-LSTMs (shortcut connections, max-pooling) | 2018-11-02 |
Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning | ✓ Link | 71.4 | 71.3 | | | | GenSen | 2018-03-30 |
Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News | ✓ Link | 70.7 | 71.1 | | | | Bi-LSTM sentence encoder (max-pooling) | 2018-11-02 |
Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News | ✓ Link | 70.7 | 70.5 | | | | Stacked Bi-LSTMs (shortcut connections, max-pooling, attention) | 2018-11-02 |
Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms | ✓ Link | 68.2 | 67.7 | | | | SWEM-max | 2018-05-24 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 67.5 | 69.3 | | | | LaMini-GPT 1.5B | 2023-04-27 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 61.4 | 61.0 | | | | LaMini-F-T5 783M | 2023-04-27 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 54.7 | 55.8 | | | | LaMini-T5 738M | 2023-04-27 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 36.5 | 37.0 | | | | GPT-2-XL 1.5B | 2023-04-27 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | | 91.7 | | | | T5-11B | 2019-10-23 |
RoBERTa: A Robustly Optimized BERT Pretraining Approach | ✓ Link | | 90.2 | | | | RoBERTa (ensemble) | 2019-07-26 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | | 89.6 | | | | T5-Large 770M | 2019-10-23 |
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | | | 85.7 | | | MT-DNN-SMARTv0 | 2019-11-08 |
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | | | 85.7 | | | MT-DNN-SMART | 2019-11-08 |
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | | | 85.6 | | | SMART+BERT-BASE | 2019-11-08 |
LM-CPPF: Paraphrasing-Guided Data Augmentation for Contrastive Prompt-Based Few-Shot Fine-Tuning | ✓ Link | | | 68.4 | | | LM-CPPF RoBERTa-base | 2023-05-29 |
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | | | | 91.1 | 91.3 | SMART-RoBERTa | 2019-11-08 |
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ✓ Link | | | | 85.6 | 86.0 | SMART-BERT | 2019-11-08 |
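
The Matched and Mismatched columns report accuracy on MultiNLI's two evaluation sections (matched: genres seen during training; mismatched: held-out genres); test-set scores are computed against hidden labels, typically via the GLUE evaluation server, while the Dev columns use the public validation splits. Below is a minimal sketch of how validation-split accuracy can be measured with Hugging Face `datasets` and `transformers`; the `roberta-large-mnli` checkpoint is an illustrative stand-in, not one of the leaderboard entries above.

```python
# Minimal sketch: MultiNLI validation (dev) accuracy for an off-the-shelf
# MNLI checkpoint. `roberta-large-mnli` is an illustrative choice only.
from datasets import load_dataset
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

# Label ids in the `multi_nli` dataset: 0=entailment, 1=neutral, 2=contradiction.
LABEL_TO_ID = {"ENTAILMENT": 0, "NEUTRAL": 1, "CONTRADICTION": 2}

def dev_accuracy(split: str) -> float:
    data = load_dataset("multi_nli", split=split)
    pairs = [{"text": p, "text_pair": h}
             for p, h in zip(data["premise"], data["hypothesis"])]
    preds = nli(pairs, batch_size=32, truncation=True)
    hits = sum(LABEL_TO_ID[pred["label"]] == gold
               for pred, gold in zip(preds, data["label"]))
    return hits / len(data)

print("matched dev accuracy:   ", dev_accuracy("validation_matched"))
print("mismatched dev accuracy:", dev_accuracy("validation_mismatched"))
```

The test-set Matched/Mismatched figures in the table cannot be reproduced this way, since the MultiNLI test labels are not publicly released.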