arithmetic-reasoning-on-gsm8k

Arithmetic Reasoning

Leaderboard

Paper	Code	Accuracy (%)	Parameters (billions)	Model name	Release date
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles	✓ Link	97.72		Claude 3.5 Sonnet (HPT)	2024-06-18
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems	✓ Link	97.1		DUP prompt upon GPT-4	2024-04-23
Qwen2 Technical Report	✓ Link	96.7	72	Qwen2-Math-72B-Instruct (greedy)	2024-07-15
(no paper listed)		96.4	7	SFT-Mistral-7B (Metamath, OVM, Smart Ensemble)
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	✓ Link	96.0		OpenMath2-Llama3.1-70B (majority@256)	2024-10-02
(no paper listed)		95.2	75	Jiutian-大模型
(no paper listed)		95.1	7	DAMOMath-7B(MetaMath, OVM, BS, Ensemble)
The Claude 3 Model Family: Opus, Sonnet, Haiku		95		Claude 3 Opus (0-shot chain-of-thought)	2024-03-04
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	✓ Link	94.9		OpenMath2-Llama3.1-70B	2024-10-02
Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models	✓ Link	94.8		GPT-4 (Teaching-Inspired)	2024-10-10
(no paper listed)		94.13	7	SFT-Mistral-7B (Metamath + ovm +ensemble)
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	✓ Link	94.1		OpenMath2-Llama3.1-8B (majority@256)	2024-10-02
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs	✓ Link	94.0		Qwen2-72B-Instruct-Step-DPO (0-shot CoT)	2024-06-26
(no paper listed)		93.2	7	DAMOMath-7B(MetaMath, OVM, Ensemble)
The Claude 3 Model Family: Opus, Sonnet, Haiku		92.3		Claude 3 Sonnet (0-shot chain-of-thought)	2024-03-04
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing	✓ Link	92	70	AlphaLLM (with MCTS)	2024-04-18
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data	✓ Link	91.7		OpenMath2-Llama3.1-8B	2024-10-02
PaLM 2 Technical Report	✓ Link	91.0		PaLM 2 (few-shot, k=8, SC)	2023-05-17
Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling	✓ Link	90.91		GaC(Qwen2-72B-Instruct + Llama-3-70B-Instruct)	2024-06-18
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	✓ Link	90.8	70	OpenMath-CodeLlama-70B (w/ code, SC, k=50)	2024-02-15
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	✓ Link	90.4	70	DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code)	2024-06-18
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	✓ Link	90.1	70	OpenMath-Llama2-70B (w/ code, SC, k=50)	2024-02-15
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	✓ Link	89.6	70	DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code)	2024-06-18
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations	✓ Link	89.1	7	Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL+ PRM rerank, k=256)	2023-12-14
(no paper listed)		89.0	13	Llama SFT (Metamath ToRA Ensemble)
Solving Quantitative Reasoning Problems with Language Models	✓ Link	89	62	Minerva 62B (maj5@100)	2022-06-29
The Claude 3 Model Family: Opus, Sonnet, Haiku		88.9		Claude 3 Haiku (0-shot chain-of-thought)	2024-03-04
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	✓ Link	88.3	70	ToRA-70B (SC, k=50)	2023-09-29
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models	✓ Link	88.2	7	DeepSeekMATH-RL-7B	2024-02-05
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	✓ Link	88.2	7	DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code)	2024-06-18
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	✓ Link	88.0	34	OpenMath-CodeLlama-34B (w/ code, SC, k=50)	2024-02-15
Model Card and Evaluations for Claude Models		88		Claude 2 (0-shot chain-of-thought)	2023-07-11
(no paper listed)		87.41	4	Shivaay-4B (8-shot chain-of-thought)
Solving math word problems with process- and outcome-based feedback		87.3	70	DeepMind 70B Model (SFT+ORM-RL, ORM reranking)	2022-11-25
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning	✓ Link	87.2	7	MMOS-DeepSeekMath-7B(0-shot,k=50)	2024-02-23
Solving math word problems with process- and outcome-based feedback		87.1	70	DeepMind 70B Model (SFT+PRM-RL, PRM reranking)	2022-11-25
Sparks of Artificial General Intelligence: Early experiments with GPT-4	✓ Link	87.1		GPT-4	2023-03-22
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	✓ Link	86.9	7	OpenMath-Mistral-7B (w/ code, SC, k=50)	2024-02-15
Orca-Math: Unlocking the potential of SLMs in Grade School Math		86.8	7	Orca-Math 7B (fine-tuned)	2024-02-16
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	✓ Link	86.8	7	DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code)	2024-06-18
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	✓ Link	86.8	13	OpenMath-CodeLlama-13B (w/ code, SC, k=50)	2024-02-15
Gemini: A Family of Highly Capable Multimodal Models	✓ Link	86.5		Gemini Pro (maj1@32)	2023-12-19
(no paper listed)		85.5		Codex (Self-Evaluation Guided Decoding, PAL, multiple reasoning chains, 9-shot gen, 5-shot eval)
Model Card and Evaluations for Claude Models		85.2		Claude 1.3 (0-shot chain-of-thought)	2023-07-11
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	✓ Link	85.1	34	ToRA-Code-34B (SC, k=50)	2023-09-29
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	✓ Link	84.8	7	OpenMath-CodeLlama-7B (w/ code, SC, k=50)	2024-02-15
OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning	✓ Link	84.7	7	OVM-Mistral-7B (verify100@1)	2023-11-16
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	✓ Link	84.7	70	OpenMath-Llama2-70B (w/ code)	2024-02-15
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	✓ Link	84.6	70	OpenMath-CodeLlama-70B (w/ code)	2024-02-15
LEVER: Learning to Verify Language-to-Code Generation with Execution	✓ Link	84.5	175	code-davinci-002 175B (LEVER, 8-shot)	2023-02-16
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	✓ Link	84.3	70	ToRA 70B	2023-09-29
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations	✓ Link	84.1	7	Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL)	2023-12-14
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	✓ Link	83.9	70	MathCoder-L-70B	2023-10-05
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct	✓ Link	83.2	7	WizardMath-7B-V1.1	2023-08-18
Making Large Language Models Better Reasoners with Step-Aware Verifier		83.2	175	DIVERSE 175B (8-shot)	2022-06-06
OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning	✓ Link	82.6	7	OVM-Mistral-7B (verify20@1)	2023-11-16
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	✓ Link	82.6	7	DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code)	2024-06-18
The ART of LLM Refinement: Ask, Refine, and Trust		82.6		ChatGPT (Ask, Refine, Trust)	2023-11-14
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	✓ Link	82.5	8	DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code)	2024-06-18
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models	✓ Link	82.3	70	MetaMath 70B	2023-09-21
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning	✓ Link	82.3	70	MuggleMATH 70B	2023-10-09
Large Language Models Can Self-Improve		82.1	540	PaLM 540B (Self Improvement, Self Consistency)	2022-10-20
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	✓ Link	81.7	34	MathCoder-CL-34B	2023-10-05
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct	✓ Link	81.6	70	WizardMath-70B-V1.0	2023-08-18
TinyGSM: achieving >80% on GSM8k with small language models		81.5	2.6	Phi-GSM+V 1.3B+1.3B (verify48@1)	2023-12-14
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	✓ Link	81.1	7	DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code)	2024-06-18
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving	✓ Link	81.1	8	DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code)	2024-06-18
Model Card and Evaluations for Claude Models		80.9		Claude Instant 1.1 (0-shot chain-of-thought)	2023-07-11
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	✓ Link	80.7	34	ToRA-Code 34B	2023-09-29
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	✓ Link	80.7	34	OpenMath-CodeLlama-34B (w/ code)	2024-02-15
PaLM 2 Technical Report	✓ Link	80.7		PaLM 2 (few-shot, k=8, CoT)	2023-05-17
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning	✓ Link	80.5	7	MMOS-DeepSeekMath-7B(0-shot)	2024-02-23
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning	✓ Link	80.4	34	MMOS-CODE-34B(0-shot)	2024-02-23
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	✓ Link	80.2	7	OpenMath-Mistral-7B (w/ code)	2024-02-15
(no paper listed)		80.2		Self-Evaluation Guided Decoding (Codex, PAL, single reasoning chain, 9-shot gen, 5-shot eval)
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	✓ Link	78.8	13	OpenMath-CodeLlama-13B (w/ code)	2024-02-15
Solving Quantitative Reasoning Problems with Language Models	✓ Link	78.5	540	Minerva 540B (CoT)	2022-06-29
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks	✓ Link	78.3		Camelidae-8×34B (5-shot)	2024-01-05
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks	✓ Link	77.8		Qwen2idae-16x14B (5-shot)	2024-01-05
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models	✓ Link	77.7	7	MetaMath-Mistral-7B	2023-09-21
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data	✓ Link	77.3	7	OpenChat-3.5 7B	2023-09-20
Solving math word problems with process- and outcome-based feedback		76.5	70	DeepMind 70B Model (STaR, maj1@96)	2022-11-25
(no paper listed)		76.4	7	Arithmo2-Mistral-7B
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	✓ Link	75.9	7	OpenMath-CodeLlama-7B (w/ code)	2024-02-15
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	✓ Link	75.8	13	ToRA-Code 13B	2023-09-29
(no paper listed)		74.7	7	Arithmo-Mistral-7B
Self-Consistency Improves Chain of Thought Reasoning in Language Models	✓ Link	74.4	540	PaLM 540B maj1@40 (8-shot)	2022-03-21
Large Language Models Can Self-Improve		74.4	540	PaLM 540B (Self Consistency)	2022-10-20
TinyGSM: achieving >80% on GSM8k with small language models		74.3	2.7	Phi-GSM 2.7B (fine-tuned)	2023-12-14
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	✓ Link	74.1	7	MathCoder-CL-13B	2023-10-05
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning	✓ Link	74	13	MuggleMATH 13B	2023-10-09
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning	✓ Link	73.9	7	MMOS-CODE-7B(0-shot)	2024-02-23
CodeT5+: Open Code Large Language Models for Code Understanding and Generation	✓ Link	73.8	0.77	CodeT5+	2023-05-13
CAPO: Cost-Aware Prompt Optimization	✓ Link	73.73		Llama-3.3-70B + CAPO	2025-04-22
OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning	✓ Link	73.7	7	OVM-Llama2-7B (verify100@1)	2023-11-16
Large Language Models Can Self-Improve		73.5	540	PaLM 540B (Self Improvement, CoT Prompting)	2022-10-20
KwaiYiiMath: Technical Report		73.3	13	KwaiYiiMath 13B	2023-10-11
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving	✓ Link	72.6	7	ToRA-Code 7B	2023-09-29
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	✓ Link	72.6	13	MathCoder-L-13B	2023-10-05
(no paper listed)		72.3		DBRX Base 132B
(no paper listed)		71.9		Self-Evaluation Guided Decoding (Codex, CoT, single reasoning chain, 9-shot gen, 5-shot eval)
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models	✓ Link	71.0	13	MetaMath 13B	2023-09-21
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning	✓ Link	69.8	7	MuggleMATH 7B	2023-10-09
LLaMA: Open and Efficient Foundation Language Models	✓ Link	69.7	65	LLaMA 65B-maj1@k	2023-02-27
Solving Quantitative Reasoning Problems with Language Models	✓ Link	68.5	62	Minerva 62B (maj1@100)	2022-06-29
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models	✓ Link	68.01	175	code-davinci-002 (Least-to-Most Prompting)	2022-05-21
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	✓ Link	67.8	7	MathCoder-CL-7B	2023-10-05
(no paper listed)		66.9		DBRX Instruct 132B
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models	✓ Link	66.4	7	MetaMath 7B	2023-09-21
CAPO: Cost-Aware Prompt Optimization	✓ Link	65.07		Mistral-Small-24B + CAPO	2025-04-22
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models	✓ Link	64.8	79	RFT 70B	2023-08-03
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning	✓ Link	64.2	7	MathCoder-L-7B	2023-10-05
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct	✓ Link	63.9	13	WizardMath-13B-V1.0	2023-08-18
Solving Math Word Problems via Cooperative Reasoning induced Language Models	✓ Link	63.2	12	GPT-J (CoRe)	2022-10-28
The Unreasonable Effectiveness of Eccentric Automatic Prompts		61	70	Llama-2 70B (on 100 first questions, 4-shot, auto-optimized prompting)	2024-02-09
CAPO: Cost-Aware Prompt Optimization	✓ Link	60.2		Qwen2.5-32B + CAPO	2025-04-22
Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning		59.59	70	LLaMA 2 70B (CoT-Influx)	2023-12-14
Orca 2: Teaching Small Language Models How to Reason		59.14	13	Orca 2 13B	2023-11-18
Transcending Scaling Laws with 0.1% Extra Compute		58.5	540	U-PaLM	2022-10-20
Large Language Models are Zero-Shot Reasoners	✓ Link	58.1	540	PaLM-540B (few-Shot-cot)	2022-05-24
GPT-4 Technical Report	✓ Link	57.1		GPT-3.5 (few-shot, k=5)	2023-03-15
Solving Quantitative Reasoning Problems with Language Models	✓ Link	56.8	8	Minerva 8B (maj5@100)	2022-06-29
Llama 2: Open Foundation and Fine-Tuned Chat Models	✓ Link	56.8	70	LLaMA 2 70B (one-shot)	2023-07-18
Solving Quantitative Reasoning Problems with Language Models	✓ Link	56.5	540	PaLM 540B (8-shot)	2022-06-29
Large Language Models Can Self-Improve		56.5	540	PaLM 540B (CoT Prompting)	2022-10-20
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models	✓ Link	55.3	13	RFT 13B	2023-08-03
Large Language Models are Zero-Shot Reasoners	✓ Link	55.0	175	Finetuned GPT-3 175B + verifier	2022-05-24
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct	✓ Link	54.9	7	WizardMath-7B-V1.0	2023-08-18
LLaMA: Open and Efficient Foundation Language Models	✓ Link	53.1	33	LLaMA 33B-maj1@k	2023-02-27
Solving Quantitative Reasoning Problems with Language Models	✓ Link	52.4	62	Minerva 62B (8-shot)	2022-06-29
Mistral 7B	✓ Link	52.2	7	Mistral 7B (maj@8)	2023-10-10
Llemma: An Open Language Model For Mathematics	✓ Link	51.5	34	Llemma 34B	2023-10-16
Large Language Models are Zero-Shot Reasoners	✓ Link	51.5	175	Text-davinci-002-175B (zero-plus-few-Shot-cot (8 samples))	2022-05-24
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models	✓ Link	51.2	7	RFT 7B	2023-08-03
LLaMA: Open and Efficient Foundation Language Models	✓ Link	50.9	65	LLaMA 65B	2023-02-27
Orca 2: Teaching Small Language Models How to Reason		47.23	7	Orca 2 7B	2023-11-18
The Unreasonable Effectiveness of Eccentric Automatic Prompts		43	13	Llama-2 13B (on 100 first questions, 4-shot, auto-optimized prompting)	2024-02-09
Large Language Models are Zero-Shot Reasoners	✓ Link	41.3	175	text-davinci-002 175B (2-shot, CoT)	2022-05-24
The Unreasonable Effectiveness of Eccentric Automatic Prompts		41	7	Mistral 7B (on 100 first questions, 4-shot, auto-optimized prompting)	2024-02-09
Large Language Models are Zero-Shot Reasoners	✓ Link	40.7	175	text-davinci-002 175B (0-shot, CoT)	2022-05-24
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM	✓ Link	37.1		Branch-Train-MiX 4x7B (sampling top-2 experts)	2024-03-12
Llemma: An Open Language Model For Mathematics	✓ Link	36.4	7	Llemma 7B	2023-10-16
LLaMA: Open and Efficient Foundation Language Models	✓ Link	35.6	33	LLaMA 33B	2023-02-27
Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning	✓ Link	35.2	13	Vicuna (SYRELM)	2023-12-09
Solving Quantitative Reasoning Problems with Language Models	✓ Link	33.0	62	PaLM 62B (8-shot)	2022-06-29
Large Language Models Can Self-Improve		32.2	540	PaLM 540B (Self Improvement, Standard-Prompting)	2022-10-20
LLaMA: Open and Efficient Foundation Language Models	✓ Link	29.3	13	LLaMA 13B-maj1@k	2023-02-27
Solving Quantitative Reasoning Problems with Language Models	✓ Link	28.4	8	Minerva 8B-maj1@k (8-shot)	2022-06-29
Composing Ensembles of Pre-trained Models via Iterative Consensus		20.8	0.355	GPT-2-Medium 355M + question-solution classifier (BS=5)	2022-10-20
Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions	✓ Link	19.5	2.7	GPT-Neo-2.7B + Self-Sampling	2022-05-28
Composing Ensembles of Pre-trained Models via Iterative Consensus		18.3	0.355	GPT-2-Medium 355M (fine-tuned, BS=5)	2022-10-20
LLaMA: Open and Efficient Foundation Language Models	✓ Link	18.1	7	LLaMA 7B (maj1@k)	2023-02-27
Large Language Models are Zero-Shot Reasoners	✓ Link	17.9	540	PaLM 540B (few-shot)	2022-05-24
Large Language Models Can Self-Improve		17.9	540	PaLM 540B (Standard-Prompting)	2022-10-20
LLaMA: Open and Efficient Foundation Language Models	✓ Link	17.8	13	LLaMA 13B	2023-02-27
Composing Ensembles of Pre-trained Models via Iterative Consensus		16.8	0.355	GPT-2-Medium 355M + question-solution classifier (BS=1)	2022-10-20
Solving Quantitative Reasoning Problems with Language Models	✓ Link	16.2	8	Minerva 8B (8-shot)	2022-06-29
Composing Ensembles of Pre-trained Models via Iterative Consensus		12.2	0.355	GPT-2-Medium 355M (BS=5)	2022-10-20
LLaMA: Open and Efficient Foundation Language Models	✓ Link	11.0	7	LLaMA 7B	2023-02-27
Large Language Models are Zero-Shot Reasoners	✓ Link	10.4	175	Text-davinci-002-175B (0-shot)	2022-05-24
Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions	✓ Link	7.5	0.125	GPT-Neo 125M + Self-Sampling	2022-05-28
UL2: Unifying Language Learning Paradigms	✓ Link	4.4	20	UL2 20B (chain-of-thought)	2022-05-10
Solving Quantitative Reasoning Problems with Language Models	✓ Link	4.1	8	PaLM 8B (8-shot)	2022-06-29
UL2: Unifying Language Learning Paradigms	✓ Link	4.1	20	UL2 20B (0-shot)	2022-05-10
