Paper | Code | Accuracy | Model | Date |
--- | --- | --- | --- | --- |
ST-MoE: Designing Stable and Transferable Sparse Expert Models | ✓ Link | 96.1 | ST-MoE-32B 269B (fine-tuned) | 2022-02-17 |
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark | ✓ Link | 91.3 | Unicorn 11B (fine-tuned) | 2021-03-24 |
Task Compass: Scaling Multi-task Pre-training with Task Prefix | ✓ Link | 90.5 | CompassMTL 567M with Tailor | 2022-10-12 |
Task Compass: Scaling Multi-task Pre-training with Task Prefix | ✓ Link | 89.6 | CompassMTL 567M | 2022-10-12 |
UnifiedQA: Crossing Format Boundaries With a Single QA System | ✓ Link | 89.4 | UnifiedQA 11B (fine-tuned) | 2020-05-02 |
The Claude 3 Model Family: Opus, Sonnet, Haiku | | 88.5 | Claude 3 Opus (5-shot) | 2024-03-04 |
GPT-4 Technical Report | ✓ Link | 87.5 | GPT-4 (5-shot) | 2023-03-15 |
Task Compass: Scaling Multi-task Pre-training with Task Prefix | ✓ Link | 87.0 | ExDeBERTa 567M | 2022-10-12 |
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | ✓ Link | 86.3 | LLaMA-2 13B + MixLoRA | 2024-04-22 |
Mixture-of-Subspaces in Low-Rank Adaptation | ✓ Link | 85.8 | LLaMA3 8B+MoSLoRA | 2024-06-16 |
PaLM 2 Technical Report | ✓ Link | 83.0 | PaLM 2-L (1-shot) | 2023-05-17 |
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | ✓ Link | 82.1 | LLaMA-3 8B + MixLoRA | 2024-04-22 |
ST-MoE: Designing Stable and Transferable Sparse Expert Models | ✓ Link | 81.7 | ST-MoE-L 4.1B (fine-tuned) | 2022-02-17 |
GPT-4 Technical Report | ✓ Link | 81.6 | GPT-3.5 (5-shot) | 2023-03-15 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 81.1 | PaLM 540B (0-shot) | 2022-04-05 |
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ Link | 80.9 | Camelidae-8×34B | 2024-01-05 |
PaLM 2 Technical Report | ✓ Link | 79.2 | PaLM 2-M (1-shot) | 2023-05-17 |
WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 79.1 | RoBERTa-Winogrande 355M (fine-tuned) | 2019-07-24 |
PaLM 2 Technical Report | ✓ Link | 77.9 | PaLM 2-S (1-shot) | 2023-05-17 |
Mixtral of Experts | ✓ Link | 77.2 | Mixtral 8x7B (0-shot) | 2024-01-08 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 77.0 | PaLM 62B (0-shot) | 2022-04-05 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 77.0 | PaLM-cont 62B (0-shot) | 2022-04-05 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 77.0 | LLaMA 65B (0-shot) | 2023-02-27 |
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | ✓ Link | 76.8 | LLaMA-2 7B + MixLoRA | 2024-04-22 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 76.0 | LLaMA 33B (0-shot) | 2023-02-27 |
Mistral 7B | ✓ Link | 75.3 | Mistral 7B (0-shot) | 2023-10-10 |
The Claude 3 Model Family: Opus, Sonnet, Haiku | | 75.1 | Claude 3 Sonnet (5-shot) | 2024-03-04 |
Training Compute-Optimal Large Language Models | ✓ Link | 74.9 | Chinchilla 70B (0-shot) | 2022-03-29 |
The Claude 3 Model Family: Opus, Sonnet, Haiku | | 74.2 | Claude 3 Haiku (5-shot) | 2024-03-04 |
Mixtral of Experts | ✓ Link | 74.2 | Mistral 7B (0-shot) | 2024-01-08 |
Textbooks Are All You Need II: phi-1.5 technical report | ✓ Link | 74.0 | phi-1.5-web 1.3B (0-shot) | 2023-09-11 |
UnifiedQA: Crossing Format Boundaries With a Single QA System | ✓ Link | 73.3 | UnifiedQA 406M (fine-tuned) | 2020-05-02 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 73.0 | LLaMA 13B (0-shot) | 2023-02-27 |
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 72.8 | FLAN 137B (few-shot, k=16) | 2021-09-03 |
Generative Data Augmentation for Commonsense Reasoning | ✓ Link | 71.4 | G-DAUG-Combo + RoBERTa-Large | 2020-04-24 |
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 71.2 | FLAN 137B (0-shot) | 2021-09-03 |
N/A | | 70.8 | RWKV v5 Eagle 7B | |
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | ✓ Link | 70.6 | Branch-Train-MiX 4x7B (sampling top-1 expert) | 2024-03-12 |
Language Models are Few-Shot Learners | ✓ Link | 70.2 | GPT-3 175B (0-shot) | 2020-05-28 |
Scaling Language Models: Methods, Analysis & Insights from Training Gopher | ✓ Link | 70.1 | Gopher 280B (0-shot) | 2021-12-08 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 70.1 | LLaMA 7B (0-shot) | 2023-02-27 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 67.0 | BLOOM 176B (1-shot) | 2023-03-30 |
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | ✓ Link | 66.6 | Pythia 12B (5-shot) | 2023-04-03 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 66.1 | OPT 66B (1-shot) | 2023-03-30 |
WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 64.9 | BERT-Winogrande 345M (fine-tuned) | 2019-07-24 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 64.1 | BloombergGPT (1-shot) | 2023-03-30 |
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | ✓ Link | 63.9 | Pythia 12B (0-shot) | 2023-04-03 |
Exploring the Benefits of Training Expert Language Models over Instruction Tuning | ✓ Link | 61.6 | RoE-3B | 2023-02-07 |
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | ✓ Link | 60.9 | Pythia 6.9B (0-shot) | 2023-04-03 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 60.6 | GPT-NeoX (1-shot) | 2023-03-30 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 59.9 | FLAN-T5-Large 783M | 2023-04-27 |
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | ✓ Link | 59.4 | Pythia 2.8B (0-shot) | 2023-04-03 |
WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 58.9 | RoBERTa-DPR 355M (0-shot) | 2019-07-24 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 58.7 | ALBERT-xxlarge 235M | 2021-04-16 |
Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners | ✓ Link | 58.56 | Flipped-3B | 2022-10-06 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 58.3 | GPT-2-XL 1.5B | 2023-04-27 |
The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning | ✓ Link | 57.5 | T0-3B (CoT fine-tuned) | 2023-05-23 |
Language Models are Few-Shot Learners | ✓ Link | 57.4 | GPT-3 Large 760M (0-shot) | 2020-05-28 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 56.3 | RoBERTa-base 125M | 2021-04-16 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 56.0 | LaMini-F-T5 783M | 2023-04-27 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 56.0 | LaMini-GPT 1.5B | 2023-04-27 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 55.6 | BERT-large 345M | 2021-04-16 |
Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models | | 55.3 | KiC-770M | 2022-10-28 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 55.2 | T5-Large 738M | 2023-04-27 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 54.9 | LaMini-T5 738M | 2023-04-27 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 54.9 | RoBERTa-large 355M | 2021-04-16 |
Efficient Language Modeling with Sparse all-MLP | | 54.3 | sMLP – deterministic 9.4B (0-shot) | 2022-03-14 |
Efficient Language Modeling with Sparse all-MLP | | 53.4 | Switch Transformer 9B (0-shot) | 2022-03-14 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 53.1 | BERT-base 110M | 2021-04-16 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 52.8 | ALBERT-base 11M | 2021-04-16 |
WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 51.9 | BERT-large 345M (0-shot) | 2019-07-24 |
Efficient Language Modeling with Sparse all-MLP | | 51.7 | HASH Layers 10B (0-shot) | 2022-03-14 |
Efficient Language Modeling with Sparse all-MLP | | 51.1 | Gshard 9B (0-shot) | 2022-03-14 |
Efficient Language Modeling with Sparse all-MLP | | 51.0 | Base Layers 10B (0-shot) | 2022-03-14 |
WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 51.0 | BERT-DPR 345M (0-shot) | 2019-07-24 |
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | | 50.0 | Random baseline | 2021-04-16 |
WinoGrande: An Adversarial Winograd Schema Challenge at Scale | ✓ Link | 50.0 | RoBERTa-large 355M (0-shot) | 2019-07-24 |
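
For reference, the (0-shot) entries above are typically produced by likelihood scoring: fill the WinoGrande blank with each candidate option and pick the option that makes the sentence more probable under the model. The sketch below illustrates that protocol only. It assumes the Hugging Face `datasets` and `transformers` libraries, the `winogrande`/`winogrande_xl` dataset id, and GPT-2 as a stand-in causal LM; it is not the evaluation harness used by any paper in the table, and each row follows its own prompt format, shot count, and fine-tuning setup.

```python
# Minimal sketch of zero-shot WinoGrande option scoring.
# Assumptions (not from the table above): Hugging Face `datasets`/`transformers`,
# the "winogrande" dataset id with the "winogrande_xl" config, GPT-2 as the model.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_logprob(text: str) -> float:
    """Sum of token log-probabilities of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits            # (1, T, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]                      # next-token targets
    return logprobs.gather(2, targets.unsqueeze(-1)).sum().item()

ds = load_dataset("winogrande", "winogrande_xl", split="validation")
subset = ds.select(range(100))                # small subset for illustration
correct = 0
for ex in subset:
    # Fill the blank "_" with each candidate and keep the more likely sentence.
    scores = [
        sequence_logprob(ex["sentence"].replace("_", opt))
        for opt in (ex["option1"], ex["option2"])
    ]
    pred = "1" if scores[0] > scores[1] else "2"
    correct += int(pred == ex["answer"])
print(f"accuracy on subset: {correct / len(subset):.3f}")
```

Harnesses differ in details such as whether they score the full sentence or only the suffix after the blank, and in few-shot prompt construction, so numbers from a sketch like this are not directly comparable to the table entries.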