Paper | Code | Score | Model | Date |
--- | --- | --- | --- | --- |
ST-MoE: Designing Stable and Transferable Sparse Expert Models | ✓ Link | 95.2 | ST-MoE-32B 269B (fine-tuned) | 2022-02-17 |
Mixture-of-Subspaces in Low-Rank Adaptation | ✓ Link | 90.5 | LLaMA 3 8B+MoSLoRA (fine-tuned) | 2024-06-16 |
PaLM 2 Technical Report | ✓ Link | 89.7 | PaLM 2-L (1-shot) | 2023-05-17 |
PaLM 2 Technical Report | ✓ Link | 88.0 | PaLM 2-M (1-shot) | 2023-05-17 |
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | ✓ Link | 86.5 | LLaMA-3 8B + MixLoRA | 2024-04-22 |
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ Link | 86.2 | Camelidae-8x34B | 2024-01-05 |
PaLM 2 Technical Report | ✓ Link | 85.6 | PaLM 2-S (1-shot) | 2023-05-17 |
Stay on topic with Classifier-Free Guidance | | 84.2 | LLaMA 65B + CFG (0-shot) | 2023-06-30 |
Galactica: A Large Language Model for Science | ✓ Link | 83.8 | GAL 120B (0-shot) | 2022-11-16 |
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | ✓ Link | 83.5 | LLaMA-2 13B + MixLoRA | 2024-04-22 |
Stay on topic with Classifier-Free Guidance | | 83.2 | LLaMA 30B + CFG (0-shot) | 2023-06-30 |
Mixtral of Experts | ✓ Link | 83.1 | Mixtral 8x7B (0-shot) | 2024-01-08 |
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 80.7 | FLAN 137B (few-shot, k=14) | 2021-09-03 |
Mixtral of Experts | ✓ Link | 80.5 | Mistral 7B (0-shot) | 2024-01-08 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 80.0 | LLaMA 33B (0-shot) | 2023-02-27 |
Mistral 7B | ✓ Link | 80.0 | Mistral 7B (0-shot) | 2023-10-10 |
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 79.6 | FLAN 137B (0-shot) | 2021-09-03 |
Stay on topic with Classifier-Free Guidance | | 79.1 | LLaMA 13B + CFG (0-shot) | 2023-06-30 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 78.9 | LLaMA 65B (0-shot) | 2023-02-27 |
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | ✓ Link | 77.7 | LLaMA-2 7B + MixLoRA | 2024-04-22 |
Textbooks Are All You Need II: phi-1.5 technical report | ✓ Link | 76.1 | phi-1.5-web 1.3B (0-shot) | 2023-09-11 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 75.93 | BLOOM 176B (1-shot) | 2023-03-30 |
ST-MoE: Designing Stable and Transferable Sparse Expert Models | ✓ Link | 75.4 | ST-MoE-L 4.1B (fine-tuned) | 2022-02-17 |
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | | 74.8 | GLaM 64B/64E (5-shot) | 2021-12-13 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 74.8 | LLaMA 13B (0-shot) | 2023-02-27 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 73.99 | BloombergGPT 50B (1-shot) | 2023-03-30 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 72.8 | LLaMA 7B (0-shot) | 2023-02-27 |
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | ✓ Link | 71.5 | Pythia 12B (5-shot) | 2023-04-03 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 71.25 | OPT 66B (1-shot) | 2023-03-30 |
Language Models are Few-Shot Learners | ✓ Link | 71.2 | GPT-3 175B (1-shot) | 2020-05-28 |
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ✓ Link | 71.04 | OPT 175B | 2023-01-02 |
BloombergGPT: A Large Language Model for Finance | ✓ Link | 70.79 | GPT-NeoX 20B (1-shot) | 2023-03-30 |
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | ✓ Link | 70.2 | Pythia 12B (0-shot) | 2023-04-03 |
UL2: Unifying Language Learning Paradigms | ✓ Link | 69.8 | UL2 20B (chain-of-thought + self-consistency) | 2022-05-10 |
Mamba: Linear-Time Sequence Modeling with Selective State Spaces | ✓ Link | 69.7 | Mamba-2.8B (0-shot) | 2023-12-01 |
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ✓ Link | 69.65 | SparseGPT 175B (50% sparsity) | 2023-01-02 |
Galactica: A Large Language Model for Science | ✓ Link | 68.8 | GPT-3 (0-shot) | 2022-11-16 |
Language Models are Few-Shot Learners | ✓ Link | 68.8 | GPT-3 175B (0-shot) | 2020-05-28 |
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ✓ Link | 68.35 | SparseGPT 175B (4:8 sparsity) | 2023-01-02 |
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | | 68.0 | GLaM 64B/64E (0-shot) | 2021-12-13 |
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ✓ Link | 67.08 | SparseGPT 175B (2:4 sparsity) | 2023-01-02 |
Stay on topic with Classifier-Free Guidance | | 58.9 | LLaMA 7B + CFG (0-shot) | 2023-06-30 |
Galactica: A Large Language Model for Science | ✓ Link | 40.7 | BLOOM (5-shot) | 2022-11-16 |
UL2: Unifying Language Learning Paradigms | ✓ Link | 38.4 | UL2 20B (chain-of-thought) | 2022-05-10 |
Galactica: A Large Language Model for Science | ✓ Link | 37.4 | OPT (5-shot) | 2022-11-16 |
UL2: Unifying Language Learning Paradigms | ✓ Link | 32.2 | UL2 20B (0-shot) | 2022-05-10 |
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ✓ Link | 28.03 | OPT 175B (50% sparsity) | 2023-01-02 |