Paper | Code | Accuracy (%) | Params (B) | Model | Date
--- | --- | --- | --- | --- | ---
| | 89.7 | | Gemini 2.0 Flash Experimental | |
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 88.1 | 72 | Qwen2.5-Math-72B-Instruct (TIR, greedy) | 2024-09-18 |
MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems | ✓ Link | 87.9 | | GPT-4 Turbo (MACM, w/ code, voting) | 2024-04-06 |
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 85.9 | 72 | Qwen2.5-Math-72B-Instruct (CoT, greedy) | 2024-09-18 |
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 85.2 | 7 | Qwen2.5-Math-7B-Instruct (TIR, greedy) | 2024-09-18 |
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ Link | 84.3 | | GPT-4-code model (CSV, w/ code, SC, k=16) | 2023-08-15 |
Qwen2 Technical Report | ✓ Link | 84.0 | 72 | Qwen2-Math-72B-Instruct (greedy) | 2024-07-15 |
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 83.6 | 7 | Qwen2.5-Math-7B-Instruct (CoT, greedy) | 2024-09-18 |
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 79.9 | 1.5 | Qwen2.5-Math-1.5B-Instruct (TIR, greedy) | 2024-09-18 |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ Link | 79.6 | | OpenMath2-Llama3.1-70B (majority@256) | 2024-10-02 |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ Link | 76.1 | | OpenMath2-Llama3.1-8B (majority@256) | 2024-10-02 |
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 75.8 | 1.5 | Qwen2.5-Math-1.5B-Instruct (CoT, greedy) | 2024-09-18 |
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ Link | 73.5 | | GPT-4-code model (CSV, w/ code) | 2023-08-15 |
Cumulative Reasoning with Large Language Models | ✓ Link | 72.2 | | CR (GPT-4-turbo model, w/ code) | 2023-08-08 |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ Link | 71.9 | | OpenMath2-Llama3.1-70B | 2024-10-02 |
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ Link | 71.2 | | LogicNet (with code interpreter) | 2023-08-15 |
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | ✓ Link | 70.8 | | Qwen2-72B-Instruct-Step-DPO (0-shot CoT, w/o code) | 2024-06-26 |
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ Link | 69.7 | | GPT-4-code model (w/ code) | 2023-08-15 |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | ✓ Link | 67.8 | | OpenMath2-Llama3.1-8B | 2024-10-02 |
AlphaMath Almost Zero: Process Supervision without Process | ✓ Link | 66.3 | | AlphaMath-7B-SBS@3 | 2024-05-06 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 64.9 | 62 | Minerva 62B (maj5@256) | 2022-06-29 |
| | 64.5 | 7 | DAMOMath-7B | |
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ Link | 63.7 | 7 | MMOS-DeepSeekMath-7B (0-shot, k=50) | 2024-02-23 |
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | ✓ Link | 60.8 | | GPT-4-code model (w/o code) | 2023-08-15 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 60.4 | 70 | OpenMath-CodeLlama-70B (w/ code, SC, k=50) | 2024-02-15 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 60.2 | 34 | OpenMath-CodeLlama-34B (w/ code, SC, k=50) | 2024-02-15 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 60.0 | 34 | ToRA-Code 34B model (w/ code, SC, k=50) | 2023-09-29 |
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | ✓ Link | 58.8 | 7 | DeepSeekMath-RL-7B (w/ code, greedy decoding) | 2024-02-05 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 58.3 | 70 | OpenMath-Llama2-70B (w/ code, SC, k=50) | 2024-02-15 |
Cumulative Reasoning with Large Language Models | ✓ Link | 58.0 | | CR (GPT-4 model, w/o code) | 2023-08-08 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 57.6 | 13 | OpenMath-CodeLlama-13B (w/ code, SC, k=50) | 2024-02-15 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 57.2 | 7 | OpenMath-Mistral-7B (w/ code, SC, k=50) | 2024-02-15 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 56.9 | 70 | ToRA 70B (w/ code, SC, k=50) | 2023-09-29 |
Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models | | 56.4 | | SKiC (GPT-4 model) | 2023-08-01 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 56.1 | 70 | DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 55.6 | 7 | OpenMath-CodeLlama-7B (w/ code, SC, k=50) | 2024-02-15 |
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ Link | 55.0 | 7 | MMOS-DeepSeekMath-7B (0-shot) | 2024-02-23 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 54.9 | 70 | DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
Progressive-Hint Prompting Improves Reasoning in Large Language Models | ✓ Link | 53.9 | | PHP (GPT-4 model) | 2023-04-19 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 53.6 | 7 | DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
Gemini: A Family of Highly Capable Multimodal Models | ✓ Link | 53.2 | | Gemini Ultra (4-shot) | 2023-12-19 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 52.9 | 7 | DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
PAL: Program-aided Language Models | ✓ Link | 51.8 | | GPT-4 model (w/ code, PAL) | 2022-11-18 |
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | ✓ Link | 51.7 | 7 | DeepSeekMath-RL-7B (greedy decoding) | 2024-02-05 |
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing | ✓ Link | 51.0 | | AlphaLLM (with MCTS) | 2024-04-18 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 50.8 | 34 | ToRA-Code 34B (w/ code) | 2023-09-29 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 50.7 | 70 | OpenMath-CodeLlama-70B (w/ code) | 2024-02-15 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 50.3 | | Minerva 540B (maj1@k, k=64) | 2022-06-29 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 49.7 | 70 | ToRA 70B (w/ code) | 2023-09-29 |
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ Link | 49.5 | 34 | MMOS-CODE-34B (0-shot) | 2024-02-23 |
Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 48.8 | 7 | DeepSeekMath-7B-KPMath-Plus | 2024-03-04 |
PaLM 2 Technical Report | ✓ Link | 48.8 | | PaLM 2 (few-shot, k=4, SC) | 2023-05-17 |
Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 48.6 | 34 | Llemma-34B-KPMath-Plus | 2024-03-04 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 48.3 | 34 | OpenMath-CodeLlama-34B (w/ code) | 2024-02-15 |
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ✓ Link | 48.1 | 67 | Shepherd + DeepSeek-67B (SFT on MetaMath + PRM rerank, k=256) | 2023-12-14 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 48.1 | 13 | ToRA-Code 13B (w/ code) | 2023-09-29 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 47.6 | 8 | Minerva 8B (maj5@256) | 2022-06-29 |
Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 46.8 | 7 | Mistral-7B-KPMath-Plus | 2024-03-04 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 46.6 | 8 | DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 46.3 | 70 | OpenMath-Llama2-70B (w/ code) | 2024-02-15 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 45.5 | 13 | OpenMath-CodeLlama-13B (w/ code) | 2024-02-15 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 45.5 | 7 | DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 2024-06-18 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 45.3 | 8 | DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 45.2 | 34 | MathCoder-CL-34B | 2023-10-05 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 45.1 | 34 | MathCoder-L-34B | 2023-10-05 |
Augmenting Math Word Problems via Iterative Question Composing | ✓ Link | 45.0 | 72 | MMIQC-72B | 2024-01-17 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 44.6 | 7 | ToRA-Code 7B (w/ code) | 2023-09-29 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 44.5 | 7 | OpenMath-Mistral-7B (w/ code) | 2024-02-15 |
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | ✓ Link | 44.3 | 7 | MMOS-CODE-7B (0-shot) | 2024-02-23 |
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | ✓ Link | 43.6 | 7 | OpenMath-CodeLlama-7B (w/ code) | 2024-02-15 |
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ✓ Link | 43.5 | 7 | Shepherd + Mistral-7B (SFT on MetaMath + PRM RL + PRM rerank, k=256) | 2023-12-14 |
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | ✓ Link | 43.5 | 7 | DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | 2024-06-18 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 43.4 | 62 | Minerva 62B (maj1@k, k=64) | 2022-06-29 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 43.0 | 13 | ToRA 13B (w/ code) | 2023-09-29 |
Sparks of Artificial General Intelligence: Early experiments with GPT-4 | ✓ Link | 42.5 | | GPT-4 | 2023-03-22 |
| | 41.8 | 7 | SFT-Mistral-7B | |
Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 41.0 | 13 | Llama2-13B-KPMath-Plus | 2024-03-04 |
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | ✓ Link | 40.1 | 7 | ToRA 7B (w/ code) | 2023-09-29 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 35.9 | 13 | MathCoder-CL-13B | 2023-10-05 |
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ Link | 35.6 | 70 | MuggleMATH-70B | 2023-10-09 |
PaLM 2 Technical Report | ✓ Link | 34.3 | | PaLM 2 (few-shot, k=4, CoT) | 2023-05-17 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 33.6 | 540 | Minerva 540B | 2022-06-29 |
Galactica: A Large Language Model for Science | ✓ Link | 33.6 | 540 | Minerva 540B (5-shot) mCoT | 2022-11-16 |
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ✓ Link | 33.0 | 7 | Shepherd + Mistral-7B (SFT on MetaMath + PRM RL) | 2023-12-14 |
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ Link | 33.0 | 7 | WizardMath-7B-V1.1 | 2023-08-18 |
Gemini: A Family of Highly Capable Multimodal Models | ✓ Link | 32.6 | | Gemini Pro (4-shot) | 2023-12-19 |
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ Link | 30.7 | 13 | MuggleMATH-13B | 2023-10-09 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 30.2 | 7 | MathCoder-CL-7B | 2023-10-05 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 29.9 | 13 | MathCoder-L-13B | 2023-10-05 |
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ Link | 29.9 | | Qwen2idae-16x14B (4-shot) | 2024-01-05 |
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | ✓ Link | 28.9 | 7 | OpenChat-3.5-1210 7B | 2023-09-20 |
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | ✓ Link | 28.6 | 7 | OpenChat-3.5 7B | 2023-09-20 |
Mixtral of Experts | ✓ Link | 28.4 | | Mixtral 8x7B (maj@4) | 2024-01-08 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 27.6 | 62 | Minerva 62B (4-shot) | 2022-06-29 |
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ Link | 26.0 | 70 | MetaMath 70B | 2023-09-21 |
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | ✓ Link | 25.8 | 7 | MuggleMATH 7B | 2023-10-09 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 25.4 | 8 | Minerva 8B (maj1@k, k=64) | 2022-06-29 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | ✓ Link | 23.3 | 7 | MathCoder-L-7B | 2023-10-05 |
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ Link | 22.7 | 70 | WizardMath-70B-V1.0 | 2023-08-18 |
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | ✓ Link | 22.6 | | Camelidae-8×34B (4-shot) | 2024-01-05 |
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ Link | 22.5 | 13 | MetaMath 13B | 2023-09-21 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 20.5 | 65 | LLaMA 65B (maj1@k) | 2023-02-27 |
Galactica: A Large Language Model for Science | ✓ Link | 20.4 | 120 | GAL 120B (5-shot) mCoT | 2022-11-16 |
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | ✓ Link | 19.4 | 7 | MetaMath 7B | 2023-09-21 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 19.1 | 175 | davinci-002 175B | 2022-06-29 |
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | ✓ Link | 17.8 | | Branch-Train-MiX 4x7B (sampling top-2 experts) | 2024-03-12 |
Galactica: A Large Language Model for Science | ✓ Link | 16.6 | 120 | GAL 120B <work> | 2022-11-16 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 15.2 | 33 | LLaMA 33B (maj1@k) | 2023-02-27 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 14.1 | 8 | Minerva 8B | 2022-06-29 |
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ Link | 14.0 | 13 | WizardMath-13B-V1.0 | 2023-08-18 |
Mistral 7B | ✓ Link | 13.1 | 7 | Mistral 7B (maj@4) | 2023-10-10 |
Galactica: A Large Language Model for Science | ✓ Link | 12.7 | 30 | GAL 30B (5-shot) mCoT | 2022-11-16 |
Mixtral of Experts | ✓ Link | 12.7 | 7 | Mistral 7B (maj@4) | 2024-01-08 |
Galactica: A Large Language Model for Science | ✓ Link | 11.4 | 30 | GAL 30B <work> | 2022-11-16 |
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | ✓ Link | 10.7 | 7 | WizardMath-7B-V1.0 | 2023-08-18 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 10.6 | 65 | LLaMA 65B | 2023-02-27 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 8.8 | 540 | PaLM 540B | 2022-06-29 |
Galactica: A Large Language Model for Science | ✓ Link | 8.8 | 540 | PaLM 540B (5-shot) mCoT | 2022-11-16 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 8.8 | 13 | LLaMA 13B (maj1@k) | 2023-02-27 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 7.1 | 33 | LLaMA 33B | 2023-02-27 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 6.9 | 7 | LLaMA 7B (maj1@k) | 2023-02-27 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 6.9 | 1.5 | GPT-2 (1.5B) | 2021-03-05 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 6.4 | 0.7 | GPT-2 (0.7B) | 2021-03-05 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 6.2 | 0.3 | GPT-2 (0.3B) | 2021-03-05 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 5.6 | 13 | GPT-3 13B | 2021-03-05 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 5.6 | 8 | PaLM 8B (fine-tuned) | 2022-06-29 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 5.4 | 0.1 | GPT-2 (0.1B) | 2021-03-05 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 5.2 | 175 | GPT-3 175B (few-shot) | 2021-03-05 |
Galactica: A Large Language Model for Science | ✓ Link | 5.2 | 175 | GPT-3 175B (8-shot) | 2022-11-16 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 4.4 | 62 | PaLM 62B | 2022-06-29 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 3.9 | 13 | LLaMA 13B | 2023-02-27 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 3.0 | 13 | GPT-3 13B (few-shot) | 2021-03-05 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 2.9 | 7 | LLaMA 7B | 2023-02-27 |
Measuring Mathematical Problem Solving With the MATH Dataset | ✓ Link | 2.9 | 2.7 | GPT-3 2.7B | 2021-03-05 |
Solving Quantitative Reasoning Problems with Language Models | ✓ Link | 1.5 | 8 | PaLM 8B | 2022-06-29 |