Code Generation on MBPP

Code Generation

Leaderboard

Paper	Code	Accuracy (%)	Model	Release Date
Execution Guided Line-by-Line Code Generation	✓ Link	96.6	EG-CFG (DeepSeek-V3-0324)	2025-06-12
QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks		94.2	QualityFlow (Sonnet-3.5)	2025-01-20
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving	✓ Link	93.2	o1-mini + MapCoder (Hamming.ai)	2024-05-18
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging	✓ Link	92.4	MGDebugger (DeepSeek-V3-0324)	2024-10-02
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation	✓ Link	91.8	GPT-4 + AgentCoder	2023-12-20
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging	✓ Link	90.7	CodeSim (GPT4o)	2025-02-08
		90.0	Jiutian Large Model
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation	✓ Link	89.9	GPT-3.5 Turbo (ChatGPT) + AgentCoder	2023-12-20
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving	✓ Link	89.7	MapCoder (GPT-4o)	2024-05-18
How Does Naming Affect LLMs on Code Analysis Tasks?		87.5	GPT-4 (ChatGPT Plus)	2023-07-24
The Claude 3 Model Family: Opus, Sonnet, Haiku		86.4	Claude 3 Opus	2024-03-04
Planning-Driven Programming: A Large Language Model Programming Workflow	✓ Link	84.8	LPW (GPT-4o)	2024-11-21
SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents		83.8±0.6	GPT-3.5 Turbo + FlowGenScrum + Test	2024-03-23
AFlow: Automating Agentic Workflow Generation	✓ Link	83.4	AFlow(GPT-4o-mini)	2024-10-14
How Does Naming Affect LLMs on Code Analysis Tasks?		83.2	GPT-3.5 Turbo (ChatGPT)	2023-07-24
Execution Guided Line-by-Line Code Generation	✓ Link	83.2	EG-CFG (DeepSeek Coder 1.3b Instruct)	2025-06-12
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving	✓ Link	83.1	MapCoder (GPT-4)	2024-05-18
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models	✓ Link	82.3	o1-mini + Language Agent Tree Search (Hamming.ai)	2023-10-06
How Does Naming Affect LLMs on Code Analysis Tasks?		82	GPT-4 (Bing Chat)	2023-07-24
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models	✓ Link	81.1	GPT-3.5 Turbo + Language Agent Tree Search	2023-10-06
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging	✓ Link	80.8	MGDebugger (CodeQwen1.5)	2024-10-02
The Claude 3 Model Family: Opus, Sonnet, Haiku		80.4	Claude 3 Haiku	2024-03-04
Teaching Large Language Models to Self-Debug	✓ Link	80.2	GPT-4 (Self-Debugging with unit tests + trace)	2023-04-11
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence	✓ Link	80	GPT-4 (few-shot)	2024-01-25
The Claude 3 Model Family: Opus, Sonnet, Haiku		79.4	Claude 3 Sonnet	2024-03-04
How Does Naming Affect LLMs on Code Analysis Tasks?		76.2	Bard (PaLM 2/chat-bison-001)	2023-07-24
Teaching Large Language Models to Self-Debug	✓ Link	72.8	GPT-3.5 Turbo (Self-Debugging with unit tests + trace)	2023-04-11
How Does Naming Affect LLMs on Code Analysis Tasks?		71.4	Claude	2023-07-24
Teaching Large Language Models to Self-Debug	✓ Link	70.8	code-davinci-002 175B (Self-Debugging with unit tests + trace)	2023-04-11
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence	✓ Link	70.8	GPT-3.5 Turbo (few-shot)	2024-01-25
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence	✓ Link	70	DeepSeek-Coder-Instruct 33B (few-shot)	2024-01-25
INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair	✓ Link	69.8	GPT-3.5 Turbo + INTERVENOR	2023-11-16
LEVER: Learning to Verify Language-to-Code Generation with Execution	✓ Link	68.9	code-davinci-002 175B + LEVER	2023-02-16
CodeT: Code Generation with Generated Tests	✓ Link	67.7	code-davinci-002 175B + CodeT	2022-07-21
Teaching Large Language Models to Self-Debug	✓ Link	67.6	GPT-3.5 Turbo (3-shot)	2023-04-11
Coder Reviewer Reranking for Code Generation	✓ Link	66.9	code-davinci-002 175B + Reviewer	2022-11-29
Coder Reviewer Reranking for Code Generation	✓ Link	66.4	code-davinci-002 175B + Coder-Reviewer	2022-11-29
StarCoder 2 and The Stack v2: The Next Generation	✓ Link	66.2	StarCoder2-15B	2024-02-29
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence	✓ Link	66	DeepSeek-Coder-Base 33B (few-shot)	2024-01-25
Code Llama: Open Foundation Models for Code	✓ Link	65.5	Code Llama - Python 70B (3-shot)	2023-08-24
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence	✓ Link	65.4	DeepSeek-Coder-Instruct 6.7B (few-shot)	2024-01-25
Coder Reviewer Reranking for Code Generation	✓ Link	63	code-davinci-002 175B + MBR-Exec	2022-11-29
Code Llama: Open Foundation Models for Code	✓ Link	62.4	Code Llama 70B (3-shot)	2023-08-24
Code Llama: Open Foundation Models for Code	✓ Link	62.2	Code Llama - Instruct 70B (3-shot)	2023-08-24
CodeT: Code Generation with Generated Tests	✓ Link	61.9	code-davinci-001 175B + CodeT	2022-07-21
Teaching Large Language Models to Self-Debug	✓ Link	61.4	code-davinci-002 175B (3-shot)	2023-04-11
Code Llama: Open Foundation Models for Code	✓ Link	61.2	Unnatural Code Llama 34B (3-shot)	2023-08-24
Mixtral of Experts	✓ Link	60.7	Mixtral 8x7B (3-shot)	2024-01-08
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence	✓ Link	60.6	DeepSeek-Coder-Base 6.7B (few-shot)	2024-01-25
Natural Language to Code Translation with Execution	✓ Link	58.2	code-davinci-001 175B + MBR-Exec	2022-04-25
Code Llama: Open Foundation Models for Code	✓ Link	57	Code Llama - Instruct 34B (3-shot)	2023-08-24
Code Llama: Open Foundation Models for Code	✓ Link	56.2	Code Llama - Python 34B (3-shot)	2023-08-24
CodeT: Code Generation with Generated Tests	✓ Link	55.4	code-cushman-001 12B (CodeT)	2022-07-21
Code Llama: Open Foundation Models for Code	✓ Link	55	Code Llama 34B (3-shot)	2023-08-24
Teaching Large Language Models to Self-Debug	✓ Link	53.2	StarCoder 15.5B (Self-Debugging with unit tests + trace)	2023-04-11
StarCoder: may the source be with you!	✓ Link	52.7	StarCoder 15.5B	2023-05-09
Code Llama: Open Foundation Models for Code	✓ Link	52.2	GPT-3.5 Turbo	2023-08-24
WizardCoder: Empowering Code Large Language Models with Evol-Instruct	✓ Link	51.8	WizardCoder 15B	2023-06-14
PaLM 2 Technical Report	✓ Link	50	PaLM 2-S* (few-shot)	2023-05-17
CodeT: Code Generation with Generated Tests	✓ Link	49.5	CodeGen-Mono 16B + CodeT	2022-07-21
Code Llama: Open Foundation Models for Code	✓ Link	49.4	Code Llama - Instruct 13B (3-shot)	2023-08-24
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence	✓ Link	49.4	DeepSeek-Coder-Instruct 1.3B (few-shot)	2024-01-25
StarCoder: may the source be with you!	✓ Link	49	StarCoderBase 15.5B	2023-05-09
Code Llama: Open Foundation Models for Code	✓ Link	49	Code Llama - Python 13B (3-shot)	2023-08-24
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks	✓ Link	48.6	Qwen2idae-16x14B (4-shot)	2024-01-05
Coder Reviewer Reranking for Code Generation	✓ Link	48.3	code-cushman-001 12B + MBR-Exec	2022-11-29
Code Llama: Open Foundation Models for Code	✓ Link	47.6	Code Llama - Python 7B (3-shot)	2023-08-24
Mistral 7B	✓ Link	47.5	Mistral 7B (3-shot)	2023-10-10
Coder Reviewer Reranking for Code Generation	✓ Link	47.3	CodeGen 16B + MBR-Exec	2022-11-29
Teaching Large Language Models to Self-Debug	✓ Link	47.2	StarCoder 15.5B (3-shot)	2023-04-11
PaLM: Scaling Language Modeling with Pathways	✓ Link	47	PaLM Coder 540B	2022-04-05
Code Llama: Open Foundation Models for Code	✓ Link	47	Code Llama 13B (3-shot)	2023-08-24
Coder Reviewer Reranking for Code Generation	✓ Link	46.2	CodeGen 16B + Coder-Reviewer	2022-11-29
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence	✓ Link	46.2	DeepSeek-Coder-Base 1.3B (few-shot)	2024-01-25
INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair	✓ Link	45.4	GPT-3.5 Turbo (few-shot)	2023-11-16
Llama 2: Open Foundation and Fine-Tuned Chat Models	✓ Link	45	Llama 2 70B (zero-shot)	2023-07-18
Code Llama: Open Foundation Models for Code	✓ Link	44.4	Code Llama - Instruct 7B (3-shot)	2023-08-24
Coder Reviewer Reranking for Code Generation	✓ Link	44.1	CodeGen 16B + Reviewer	2022-11-29
Textbooks Are All You Need II: phi-1.5 technical report	✓ Link	43.5	phi-1.5-web 1.3B	2023-09-11
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM	✓ Link	42.6	Branch-Train-Merge 4x7B (top-2)	2024-03-12
Code Llama: Open Foundation Models for Code	✓ Link	41.4	Code Llama 7B (3-shot)	2023-08-24
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks	✓ Link	41.4	Camelidae-8×34B (4-shot)	2024-01-05
INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair	✓ Link	39.8	GPT-3.5 Turbo (0-shot)	2023-11-16
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM	✓ Link	39.4	Branch-Train-MiX 4x7B (sampling top-2 experts)	2024-03-12
LLaMA: Open and Efficient Foundation Language Models	✓ Link	37.7	LLaMA 65B (0-shot)	2023-02-27
PaLM: Scaling Language Modeling with Pathways	✓ Link	36.8	PaLM 540B	2022-04-05
StarCoder: may the source be with you!	✓ Link	35	SantaCoder 1.1B	2023-05-09
CodeT: Code Generation with Generated Tests	✓ Link	34.4	InCoder 6.7B + CodeT	2022-07-21
Llama 2: Open Foundation and Fine-Tuned Chat Models	✓ Link	33	Llama 2 34B (0-shot)	2023-07-18
Llama 2: Open Foundation and Fine-Tuned Chat Models	✓ Link	30.6	Llama 2 13B (0-shot)	2023-07-18
LLaMA: Open and Efficient Foundation Language Models	✓ Link	30.2	LLaMA 33B (0-shot)	2023-02-27
Coder Reviewer Reranking for Code Generation	✓ Link	26.7	InCoder 6.7B + MBR-Exec	2022-11-29
Coder Reviewer Reranking for Code Generation	✓ Link	26.1	InCoder 6.7B + Coder-Reviewer	2022-11-29
Coder Reviewer Reranking for Code Generation	✓ Link	24.4	InCoder 6.7B + Reviewer	2022-11-29
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X	✓ Link	24.4	CodeGeeX-13B	2023-03-30
LLaMA: Open and Efficient Foundation Language Models	✓ Link	22	LLaMA 13B (0-shot)	2023-02-27
Llama 2: Open Foundation and Fine-Tuned Chat Models	✓ Link	20.8	Llama 2 7B (0-shot)	2023-07-18
InCoder: A Generative Model for Code Infilling and Synthesis	✓ Link	19.4	InCoder 6.7B (0-shot)	2022-04-12
LLaMA: Open and Efficient Foundation Language Models	✓ Link	17.7	LLaMA 7B (0-shot)	2023-02-27

