| Paper | Code | EM | F1 | Model | Released |
| --- | --- | --- | --- | --- | --- |
| Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE | | 95.9 | 96.4 | Turing NLR v5 XXL 5.4B (fine-tuned) | 2022-12-04 |
| ST-MoE: Designing Stable and Transferable Sparse Expert Models | ✓ Link | 95.1 | | ST-MoE-32B 269B (fine-tuned) | 2022-02-17 |
| DeBERTa: Decoding-enhanced BERT with Disentangled Attention | ✓ Link | 94.1 | 94.5 | DeBERTa-1.5B | 2020-06-05 |
| PaLM: Scaling Language Modeling with Pathways | ✓ Link | 94.0 | 94.6 | PaLM 540B (fine-tuned) | 2022-04-05 |
| Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE | | 93.9 | 94.4 | Vega v2 6B (fine-tuned) | 2022-12-04 |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | 93.4 | | T5-XXL 11B (fine-tuned) | 2019-10-23 |
| Integrating a Heterogeneous Graph with Entity-aware Self-attention using Relative Position Labels for Reading Comprehension Model | | 91.7 | 92.2 | GESA 500M | 2023-07-19 |
| LUKE-Graph: A Transformer-based Approach with Gated Relational Graph Attention for Cloze-style Reading Comprehension | | 91.2 | 91.5 | LUKE-Graph | 2023-03-12 |
| | | 90.640 | 91.209 | LUKE (single model) | |
| LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention | ✓ Link | 90.6 | 91.2 | LUKE 483M | 2020-10-02 |
| KELM: Knowledge Enhanced Pre-Trained Language Representations with Message Passing on Hierarchical Relational Graphs | ✓ Link | 89.1 | 89.6 | KELM (fine-tuned RoBERTa-large based single model) | 2021-09-09 |
| ST-MoE: Designing Stable and Transferable Sparse Expert Models | ✓ Link | 88.9 | | ST-MoE-L 4.1B (fine-tuned) | 2022-02-17 |
| Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 85.1 | | FLAN 137B (prompt-tuned) | 2021-09-03 |
| | | 83.090 | 83.737 | XLNet + MTL + Verifier (ensemble) | |
| Language Models are Few-Shot Learners | ✓ Link | 82.1 | | GPT-3 Large 760M (zero-shot) | 2020-05-28 |
| | | 81.780 | 82.584 | CSRLM (single model) | |
| Pingan Smart Health and SJTU at COIN - Shared Task: utilizing Pre-trained Language Models and Common-sense Knowledge in Machine Reading Tasks | | 81.5 | 82.7 | XLNet + Verifier | 2019-11-01 |
| | | 81.460 | 82.664 | XLNet + MTL + Verifier (single model) | |
| Efficient Language Modeling with Sparse all-MLP | | 79.9 | | Switch Transformer 9B | 2022-03-14 |
| | | 79.480 | 80.038 | SKG-NET (single model) | |
| KELM: Knowledge Enhanced Pre-Trained Language Representations with Message Passing on Hierarchical Relational Graphs | ✓ Link | 76.2 | 76.7 | KELM (fine-tuned BERT-large based single model) | 2021-09-09 |
| Efficient Language Modeling with Sparse all-MLP | | 73.4 | | sMLP – deterministic 9.4B (zero-shot) | 2022-03-14 |
| Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 72.5 | | FLAN 137B (zero-shot) | 2021-09-03 |
| Efficient Language Modeling with Sparse all-MLP | | 72.4 | | GShard 9B | 2022-03-14 |
| | | 72.240 | 72.778 | SKG-BERT (single model) | |
| | | 71.600 | 73.620 | KT-NET (single model) | |
| | | 69.490 | 71.138 | DCReader+BERT (single model) | |
| Efficient Language Modeling with Sparse all-MLP | | 67.2 | | HASH Layers 10B (zero-shot) | 2022-03-14 |
| | | 60.800 | 62.986 | GraphBert (single) | |
| Efficient Language Modeling with Sparse all-MLP | | 60.7 | | Base Layers 10B (zero-shot) | 2022-03-14 |
| | | 59.860 | 61.885 | GraphBert-WordNet (single) | |
| | | 59.410 | 61.515 | GraphBert-NELL (single) | |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | ✓ Link | 54.040 | 56.065 | BERT-Base (single model) | 2018-10-11 |
| ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension | | 45.4 | 46.7 | DocQA + ELMo | 2018-10-30 |
| N-Grammer: Augmenting Transformers with latent n-grams | ✓ Link | 28.9 | 29.9 | N-Grammer 343M | 2022-07-13 |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | ✓ Link | | 94.1 | T5-11B | 2019-10-23 |
| PaLM 2 Technical Report | ✓ Link | | 93.8 | PaLM 2-L (one-shot) | 2023-05-17 |
| PaLM 2 Technical Report | ✓ Link | | 92.4 | PaLM 2-M (one-shot) | 2023-05-17 |
| PaLM 2 Technical Report | ✓ Link | | 92.1 | PaLM 2-S (one-shot) | 2023-05-17 |
| Large Language Models are Zero-Shot Reasoners | ✓ Link | | 90.2 | GPT-3 175B (one-shot) | 2022-05-24 |
| AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model | ✓ Link | | 88.4 | AlexaTM 20B | 2022-08-02 |
| BloombergGPT: A Large Language Model for Finance | ✓ Link | | 82.8 | BloombergGPT 50B (one-shot) | 2023-03-30 |
| BloombergGPT: A Large Language Model for Finance | ✓ Link | | 82.5 | OPT 66B (one-shot) | 2023-03-30 |
| BloombergGPT: A Large Language Model for Finance | ✓ Link | | 78 | BLOOM 176B (one-shot) | 2023-03-30 |
| BloombergGPT: A Large Language Model for Finance | ✓ Link | | 67.9 | GPT-NeoX 20B (one-shot) | 2023-03-30 |
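
For reference, the EM and F1 columns above follow the ReCoRD/SQuAD-style scoring convention: exact match after answer normalization, and token-level overlap F1. The sketch below is a minimal, illustrative implementation of that scoring under those assumptions; the function names are ours rather than the official evaluation script's, and the official script additionally takes the maximum score over all gold answers for a query before averaging over the test set.

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> float:
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Per-example scores are averaged over the test set (and, in the official
# scripts, maximized over multiple gold answers) to produce leaderboard numbers.
print(exact_match("the U.S.", "US"))                       # 1.0
print(f1_score("President Barack Obama", "Barack Obama"))  # 0.8
```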