| Paper | Code | Score | Model | Date |
| --- | --- | --- | --- | --- |
| GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data | | 87.0 | GPT-4o (300B) | 2024-10-03 |
| Llama 3 Meets MoE: Efficient Upcycling | ✓ Link | 86.6 | Llama 3.1 405B | 2024-12-13 |
| Llama 3 Meets MoE: Efficient Upcycling | ✓ Link | 86.0 | Llama 3.1 70B | 2024-12-13 |
| | | 83.7 | Gemini Ultra (5-shot) | |
| The Claude 3 Model Family: Opus, Sonnet, Haiku | | 79.0 | Claude 3 Sonnet (5-shot) | 2024-03-04 |
| | | 77.5 | Qwen1.5 72B (5-shot) | |
| The Claude 3 Model Family: Opus, Sonnet, Haiku | | 75.2 | Claude 3 Haiku (5-shot) | 2024-03-04 |
| The Llama 3 Herd of Models | ✓ Link | 73.7 | DBRX Instruct 132B (5-shot) | 2024-07-31 |
| Scaling Instruction-Finetuned Language Models | ✓ Link | 73.5 | Flan-PaLM 540B (5-shot) | 2022-10-20 |
| The Llama 3 Herd of Models | ✓ Link | 73.0 | Llama 3.1 8B (CoT) | 2024-07-31 |
| Mixtral of Experts | ✓ Link | 70.6 | Mixtral 8x7B (5-shot) | 2024-01-08 |
| GPT-4 Technical Report | ✓ Link | 70.0 | GPT-3.5 Turbo | 2023-03-15 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 68.9 | LLaMA 65B (fine-tuned) | 2023-02-27 |
| Training Compute-Optimal Large Language Models | ✓ Link | 67.5 | Chinchilla 70B (5-shot) | 2022-03-29 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 63.4 | LLaMA 65B (5-shot) | 2023-02-27 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ Link | 62.6 | LLaMA 2 34B (5-shot) | 2023-07-18 |
| Mixtral of Experts | ✓ Link | 62.5 | Mistral 7B (5-shot) | 2024-01-08 |
| Mistral 7B | ✓ Link | 60.1 | Mistral 7B (5-shot) | 2023-10-10 |
| Scaling Instruction-Finetuned Language Models | ✓ Link | 59.5 | GPT-3 Davinci 175B (CoT) | 2022-10-20 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 57.8 | LLaMA 33B (5-shot) | 2023-02-27 |
| The Falcon Series of Open Language Models | | 57.0 | Falcon 40B | 2023-11-28 |
| | | 56.7 | Qwen 7B (5-shot) | |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ Link | 54.8 | LLaMA 2 13B (5-shot) | 2023-07-18 |
| Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | ✓ Link | 53.2 | Branch-Train-MiX 4x7B (sampling top-1 experts) | 2024-03-12 |
| Galactica: A Large Language Model for Science | ✓ Link | 52.6 | GAL 120B (zero-shot) | 2022-11-16 |
| Atlas: Few-shot Learning with Retrieval Augmented Language Models | ✓ Link | 47.9 | Atlas (5-shot) | 2022-08-05 |
| Scaling Instruction-Finetuned Language Models | ✓ Link | 45.5 | Flan-T5-XL 3B (CoT) | 2022-10-20 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ Link | 45.3 | LLaMA 2 7B (5-shot) | 2023-07-18 |
| Scaling Instruction-Finetuned Language Models | ✓ Link | 45.1 | Flan-T5-Large 780M | 2022-10-20 |
| GLM-130B: An Open Bilingual Pre-trained Model | ✓ Link | 44.8 | GLM-130B | 2022-10-05 |
| Scaling Instruction-Finetuned Language Models | ✓ Link | 40.5 | Flan-T5-Large 780M (CoT) | 2022-10-20 |
| Scaling Instruction-Finetuned Language Models | ✓ Link | 39.7 | GPT-3 Davinci 175B (5-shot) | 2022-10-20 |
| BloombergGPT: A Large Language Model for Finance | ✓ Link | 39.2 | BloombergGPT 50B (5-shot) | 2023-03-30 |
| UL2: Unifying Language Learning Paradigms | ✓ Link | 39.2 | UL2 20B (5-shot) | 2022-05-10 |
| BloombergGPT: A Large Language Model for Finance | ✓ Link | 39.1 | BLOOM 176B (5-shot) | 2023-03-30 |
| Textbooks Are All You Need II: phi-1.5 technical report | ✓ Link | 37.9 | phi-1.5-web 1.3B | 2023-09-11 |
| BloombergGPT: A Large Language Model for Finance | ✓ Link | 36.0 | OPT 66B (5-shot) | 2023-03-30 |
| Scaling Instruction-Finetuned Language Models | ✓ Link | 35.9 | Flan-T5-Base 250M | 2022-10-20 |
| Scaling Instruction-Finetuned Language Models | ✓ Link | 33.7 | Flan-T5-Base 250M (CoT) | 2022-10-20 |
| GPT-NeoX-20B: An Open-Source Autoregressive Language Model | ✓ Link | 33.6 | GPT-NeoX 20B (5-shot) | 2022-04-14 |
| | | 31.0 | RWKV v5 Eagle 7B | |
| MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models | ✓ Link | 29.68 | LLaMA 7B + MiLe Loss (5-shot) | 2023-10-30 |
| Scaling Instruction-Finetuned Language Models | ✓ Link | 28.7 | Flan-T5-Small 80M | 2022-10-20 |
| The Falcon Series of Open Language Models | | 28.0 | Falcon 7B (5-shot) | 2023-11-28 |