OpenCodePapers

on-cybench

Dataset Link
Results over time
Leaderboard
| Paper | Code | Unguided Performance | Model Name | Release Date |
| --- | --- | --- | --- | --- |
| US AISI and UK AISI Joint Pre-Deployment Test: Anthropic’s Claude 3.5 Sonnet (October 2024 Release) | | 35% | Claude 3.5 Sonnet (old, US AISI scaffold, pass@10) | 2024-11-19 |
| US AISI and UK AISI Joint Pre-Deployment Test: Anthropic’s Claude 3.5 Sonnet (October 2024 Release) | | 35% | o1-preview (US AISI scaffold, pass@10) | 2024-11-19 |
| Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models | ✓ Link | 17.5% | Claude 3.5 Sonnet | 2024-08-15 |