Paper | Code | Accuracy (%) | Model | Date |
--- | --- | --- | --- | --- |
Task Compass: Scaling Multi-task Pre-training with Task Prefix | ✓ Link | 96.1 | CompassMTL 567M with Tailor | 2022-10-12 |
Task Compass: Scaling Multi-task Pre-training with Task Prefix | ✓ Link | 95.6 | CompassMTL 567M | 2022-10-12 |
Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering | ✓ Link | 95.6 | DeBERTa-Large 304M (classification-based) | 2022-10-29 |
GPT-4 Technical Report | ✓ Link | 95.3 | GPT-4 (10-shot) | 2023-03-15 |
Mixture-of-Subspaces in Low-Rank Adaptation | ✓ Link | 95.0 | LLaMA3+MoSLoRA | 2024-06-16 |
Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering | ✓ Link | 94.7 | DeBERTa-Large 304M | 2022-10-29 |
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | ✓ Link | 94.7 | LLaMA-2 13B + MixLoRA | 2024-04-22 |
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark | ✓ Link | 93.9 | Unicorn 11B (fine-tuned) | 2021-03-24 |
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | ✓ Link | 93.3 | LLaMA-3 8B + MixLoRA | 2024-04-22 |
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | ✓ Link | 93.1 | LLaMA-2 7B + MixLoRA | 2024-04-22 |
DeBERTa: Decoding-enhanced BERT with Disentangled Attention | ✓ Link | 93.0 | DeBERTa++ | 2020-06-05 |
DiscoSense: Commonsense Reasoning with Discourse Connectives | ✓ Link | 91.5 | ELECTRA-Large 335M (fine-tuned on DiscoSense and HellaSwag) | 2022-10-22 |
- | | 89.0 | DBRX Instruct 132B (10-shot) | |
- | | 88.3 | TheBloke/llama-2-70b-Guanaco-QLoRA-fp16 (10-shot) | |
- | | 88.0 | ALBERT-XXL 235M | |
PaLM 2 Technical Report | ✓ Link | 87.4 | PaLM 2-L (1-shot) | 2023-05-17 |
DiscoSense: Commonsense Reasoning with Discourse Connectives | ✓ Link | 86.9 | ELECTRA-Large 335M (fine-tuned on HellaSwag) | 2022-10-22 |
PaLM 2 Technical Report | ✓ Link | 86.7 | PaLM 2-M (1-shot) | 2023-05-17 |
Muppet: Massive Multi-task Representations with Pre-Finetuning | ✓ Link | 86.4 | MUPPET Roberta Large | 2021-01-26 |
Stay on topic with Classifier-Free Guidance | | 86.3 | LLaMA 65B + CFG (0-shot) | 2023-06-30 |
The Falcon Series of Open Language Models | | 85.9 | Falcon-180B (0-shot) | 2023-11-28 |
PaLM 2 Technical Report | ✓ Link | 85.6 | PaLM 2-S (1-shot) | 2023-05-17 |
GPT-4 Technical Report | ✓ Link | 85.5 | GPT-3.5 (10-shot) | 2023-03-15 |
RoBERTa: A Robustly Optimized BERT Pretraining Approach | ✓ Link | 85.5 | RoBERTa-Large Ensemble | 2019-07-26 |
Stay on topic with Classifier-Free Guidance | | 85.3 | LLaMA 30B + CFG (0-shot) | 2023-06-30 |
Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ Link | 85.3 | LLaMA 2 70B (0-shot) | 2023-07-18 |
Towards Generalizable Neuro-Symbolic Systems for Commonsense Question Answering | | 85.0 | HyKAS+CSKG | 2019-10-30 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 84.2 | LLaMA 65B (0-shot) | 2023-02-27 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 83.8 | PaLM-540B (Few-Shot) | 2022-04-05 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 83.6 | PaLM-540B (1-shot) | 2022-04-05 |
Task Compass: Scaling Multi-task Pre-training with Task Prefix | ✓ Link | 83.6 | ExDeBERTa 567M | 2022-10-12 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 83.4 | PaLM-540B (0-shot) | 2022-04-05 |
Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ Link | 83.3 | LLaMA 2 34B (0-shot) | 2023-07-18 |
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ Link | 83.2 | Camelidae-8×34B (10-shot) | 2024-01-05 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 82.8 | LLaMA 33B (0-shot) | 2023-02-27 |
The Falcon Series of Open Language Models | | 82.7 | Falcon-40B (0-shot) | 2023-11-28 |
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model | ✓ Link | 82.4 | Megatron-Turing NLG 530B (Few-Shot) | 2022-01-28 |
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ Link | 82.3 | Qwen2idae-16x14B (10-shot) | 2024-01-05 |
Stay on topic with Classifier-Free Guidance | | 82.1 | LLaMA 13B + CFG (0-shot) | 2023-06-30 |
RoBERTa: A Robustly Optimized BERT Pretraining Approach | ✓ Link | 81.7 | RoBERTa-Large 355M | 2019-07-26 |
Mistral 7B | ✓ Link | 81.3 | Mistral 7B (0-shot) | 2023-10-10 |
Training Compute-Optimal Large Language Models | ✓ Link | 80.8 | Chinchilla 70B (0-shot) | 2022-03-29 |
Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ Link | 80.7 | LLaMA 2 13B (0-shot) | 2023-07-18 |
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model | ✓ Link | 80.2 | Megatron-Turing NLG 530B (1-shot) | 2022-01-28 |
Language Models are Few-Shot Learners | ✓ Link | 79.3 | GPT-3 175B (few-shot, k=32) | 2020-05-28 |
Scaling Language Models: Methods, Analysis & Insights from Training Gopher | ✓ Link | 79.2 | Gopher 280B (0-shot) | 2021-12-08 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 79.2 | LLaMA 13B (0-shot) | 2023-02-27 |
Language Models are Few-Shot Learners | ✓ Link | 78.9 | GPT-3 (0-shot) | 2020-05-28 |
Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ Link | 77.2 | LLaMA 2 7B (0-shot) | 2023-07-18 |
The Falcon Series of Open Language Models | | 76.3 | Falcon-7B (0-shot) | 2023-11-28 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 76.1 | LLaMA 7B (0-shot) | 2023-02-27 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 73.9 | BloombergGPT 50B (1-shot) | 2023-03-30 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 73.5 | OPT 66B (1-shot) | 2023-03-30 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 73.2 | BLOOM 176B (1-shot) | 2023-03-30 |
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | ✓ Link | 70.8 | Sheared-LLaMA-2.7B (50B) | 2023-10-10 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 68.4 | GPT-NeoX 20B (1-shot) | 2023-03-30 |
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | ✓ Link | 67.6 | Open-LLaMA-3B-v2 | 2023-10-10 |
Mamba: Linear-Time Sequence Modeling with Selective State Spaces | ✓ Link | 66.1 | Mamba-2.8B | 2023-12-01 |
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | ✓ Link | 60.7 | Sheared-LLaMA-1.3B (50B) | 2023-10-10 |
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 59.2 | FLAN 137B (3-shot) | 2021-09-03 |
Mamba: Linear-Time Sequence Modeling with Selective State Spaces | ✓ Link | 59.1 | Mamba-1.4B | 2023-12-01 |
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 56.7 | FLAN 137B (0-shot) | 2021-09-03 |
Efficient Language Modeling with Sparse all-MLP | | 54.5 | sMLP – deterministic 9.4B (0-shot) | 2022-03-14 |
Efficient Language Modeling with Sparse all-MLP | | 52.5 | Switch Transformer 9B | 2022-03-14 |
Language Models are Few-Shot Learners | ✓ Link | 51.0 | GPT-3 Large 760M (0-shot) | 2020-05-28 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 50.9 | GPT-2-XL 1.5B | 2023-04-27 |
LLM in a flash: Efficient Large Language Model Inference with Limited Memory | | 50.3 | OPT-6.7B | 2023-12-12 |
LLM in a flash: Efficient Large Language Model Inference with Limited Memory | | 49.8 | LLM in a Flash (OPT-6.7B with Predictor) | 2023-12-12 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 48.7 | FLAN-T5-Large 783M | 2023-04-27 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 48.3 | LaMini-GPT 1.5B | 2023-04-27 |
HellaSwag: Can a Machine Really Finish Your Sentence? | ✓ Link | 47.3 | BERT-Large 340M | 2019-05-19 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 43.7 | LaMini-F-T5 783M | 2023-04-27 |
HellaSwag: Can a Machine Really Finish Your Sentence? | ✓ Link | 41.7 | GPT-1 117M | 2019-05-19 |
Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners | ✓ Link | 41.6 | Flipped-3B | 2022-10-06 |
The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning | ✓ Link | 41.1 | T0-3B (CoT fine-tuned) | 2023-05-23 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 40.6 | LaMini-T5 738M | 2023-04-27 |
HellaSwag: Can a Machine Really Finish Your Sentence? | ✓ Link | 40.5 | BERT-Base 110M | 2019-05-19 |
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | ✓ Link | 38.9 | T5-Large 738M | 2023-04-27 |
Efficient Language Modeling with Sparse all-MLP | | 38.0 | Gshard 9B | 2022-03-14 |
HellaSwag: Can a Machine Really Finish Your Sentence? | ✓ Link | 36.2 | LSTM + BERT-Base | 2019-05-19 |
Exploring the Benefits of Training Expert Language Models over Instruction Tuning | ✓ Link | 34.6 | RoE-3B | 2023-02-07 |
HellaSwag: Can a Machine Really Finish Your Sentence? | ✓ Link | 33.3 | ESIM + ELMo | 2019-05-19 |
Efficient Language Modeling with Sparse all-MLP | | 33.0 | HASH Layers 10B (0-shot) | 2022-03-14 |
HellaSwag: Can a Machine Really Finish Your Sentence? | ✓ Link | 31.7 | LSTM + GloVe | 2019-05-19 |
HellaSwag: Can a Machine Really Finish Your Sentence? | ✓ Link | 31.6 | fastText | 2019-05-19 |
HellaSwag: Can a Machine Really Finish Your Sentence? | ✓ Link | 31.4 | LSTM + ELMo | 2019-05-19 |
Efficient Language Modeling with Sparse all-MLP | | 30.2 | Base Layers 10B (0-shot) | 2022-03-14 |
Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models | | 29.6 | KiC-770M | 2022-10-28 |
HellaSwag: Can a Machine Really Finish Your Sentence? | ✓ Link | 25.0 | Random chance baseline | 2019-05-19 |
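The entries above report multiple-choice accuracy on HellaSwag under the shot settings noted in parentheses. Below is a minimal sketch (not taken from any of the listed papers) of how 0-shot accuracy is commonly computed: each candidate ending is scored by its summed token log-likelihood under a causal LM, and the highest-scoring ending is chosen. The dataset fields (`ctx`, `endings`, `label`) follow the HuggingFace `hellaswag` release, and `gpt2` is used purely as an illustrative model, not one from the leaderboard.

```python
# Minimal sketch of 0-shot HellaSwag scoring: pick the ending with the highest
# summed token log-likelihood under a causal LM. The dataset fields (ctx,
# endings, label) are from the HuggingFace "hellaswag" release; "gpt2" is an
# illustrative stand-in, not one of the leaderboard models above.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def ending_logprob(ctx: str, ending: str) -> float:
    """Sum of log-probs of the ending's tokens, conditioned on the context."""
    ctx_ids = tok(ctx, return_tensors="pt").input_ids
    full_ids = tok(ctx + " " + ending, return_tensors="pt").input_ids
    logits = model(full_ids).logits            # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]                  # token t is predicted at position t-1
    start = ctx_ids.shape[1] - 1               # first ending token in the shifted view
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()

val = load_dataset("hellaswag", split="validation").select(range(200))  # small slice
correct = 0
for ex in val:
    scores = [ending_logprob(ex["ctx"], e) for e in ex["endings"]]
    correct += int(scores.index(max(scores)) == int(ex["label"]))
print(f"accuracy on {len(val)} validation examples: {correct / len(val):.3f}")
```

Individual papers differ in details such as length normalization, prompt formatting, and few-shot example selection, which can shift reported scores by several points.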