Code Information Retrieval on Cybench

Leaderboard

Paper	Code	Unguided Performance	Model Name	Release Date
US AISI and UK AISI Joint Pre-Deployment Test: Anthropic’s Claude 3.5 Sonnet (October 2024 Release)		35%	Claude 3.5 Sonnet (old, US AISI scaffold, pass@10)	2024-11-19
US AISI and UK AISI Joint Pre-Deployment Test: Anthropic’s Claude 3.5 Sonnet (October 2024 Release)		35%	o1-preview (US AISI scaffold, pass@10)	2024-11-19
Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models	✓ Link	17.5%	Claude 3.5 Sonnet	2024-08-15
