Tasks: 108
MCP servers (tools): 32 (604)
Local toolkits (tools): 7 (16)
Best Pass@1 score: 38.6%
| Model | Type | Pass@1 | Pass@3 | Pass^3 | # Turns |
|---|---|---|---|---|---|
| Claude-4.5-Sonnet | Proprietary | 38.6 ± 2.7 | 51.9 | 20.4 | 20.2 |
| GPT-5 | Proprietary | 30.6 ± 1.5 | 43.5 | 16.7 | 18.7 |
| Claude-4-Sonnet | Proprietary | 29.9 ± 1.6 | 41.7 | 17.6 | 27.3 |
| GPT-5-high | Proprietary | 29.0 ± 3.1 | 42.6 | 16.7 | 19.0 |
| Grok-4 | Proprietary | 27.5 ± 1.7 | 38.9 | 16.7 | 20.3 |
| Claude-4.5-haiku | Proprietary | 26.2 ± 1.9 | 39.8 | 13.0 | 21.9 |
| DeepSeek-V3.2-Exp | Open-Source | 20.1 ± 1.2 | 27.8 | 12.0 | 26.0 |
| GLM-4.6 | Open-Source | 18.8 ± 2.2 | 29.6 | 9.3 | 27.9 |
| Grok-Code-Fast-1 | Proprietary | 18.5 ± 2.0 | 30.6 | 9.3 | 20.2 |
| Grok-4-Fast | Proprietary | 18.5 ± 2.0 | 32.4 | 5.6 | 15.9 |
| o3 | Proprietary | 17.0 ± 0.9 | 25.0 | 9.3 | 19.4 |
| o4-mini | Proprietary | 14.8 ± 0.8 | 26.9 | 3.7 | 16.6 |
| GPT-5-mini | Proprietary | 14.5 ± 1.2 | 23.1 | 5.6 | 19.7 |
| Qwen-3-Coder | Open-Source | 14.5 ± 1.9 | 21.3 | 6.5 | 28.5 |
| Kimi-K2-0905 | Open-Source | 13.0 ± 2.0 | 22.2 | 5.6 | 26.6 |
| Gemini-2.5-Pro | Proprietary | 10.5 ± 1.9 | 21.3 | 2.8 | 26.5 |
| Gemini-2.5-Flash | Proprietary | 3.7 ± 1.5 | 8.3 | 0.0 | 8.3 |

The table is sorted in descending order by Pass@1.
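
The page does not define the Pass@1, Pass@3, and Pass^3 columns. The sketch below shows one common reading, which is an assumption here and not the benchmark's confirmed method: each task is run three times independently, Pass@1 is the mean per-run success rate (with its standard deviation across runs), Pass@3 counts tasks solved in at least one run, and Pass^3 counts tasks solved in every run. The function name `summarize` and the toy data are hypothetical.

```python
from statistics import mean, stdev


def summarize(results):
    """Summarize pass metrics (assumed definitions, see lead-in).

    results: dict mapping task id -> list of bools, one per independent run.
    Every task must have the same number of runs, k.
    """
    runs = list(results.values())
    n = len(runs)
    k = len(runs[0])
    # Success rate of each run, in percent.
    per_run = [100 * sum(r[i] for r in runs) / n for i in range(k)]
    return {
        "pass@1": mean(per_run),
        "pass@1_std": stdev(per_run) if k > 1 else 0.0,
        # Solved in at least one of the k runs.
        f"pass@{k}": 100 * sum(any(r) for r in runs) / n,
        # Solved in all k runs.
        f"pass^{k}": 100 * sum(all(r) for r in runs) / n,
    }


# Hypothetical toy data: 4 tasks, 3 runs each.
toy = {
    "t1": [True, True, True],
    "t2": [True, False, False],
    "t3": [False, False, False],
    "t4": [False, True, True],
}
summary = summarize(toy)
print(summary)
```

Under these assumed definitions, Pass^3 ≤ Pass@1 ≤ Pass@3 always holds, which matches every row of the table.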

I