- Tasks: 108
- MCP servers (tools): 32 (604)
- Local toolkits (tools): 7 (16)
- Best Pass@1 score: 38.6%

| Model | Type | Pass@1 (%) | Pass@3 (%) | Pass^3 (%) | # Turns |
|---|---|---|---|---|---|
| Claude-4.5-Sonnet | Proprietary | 38.6 ± 2.7 | 51.9 | 20.4 | 20.2 |
| GPT-5 | Proprietary | 30.6 ± 1.5 | 43.5 | 16.7 | 18.7 |
| Claude-4-Sonnet | Proprietary | 29.9 ± 1.6 | 41.7 | 17.6 | 27.3 |
| GPT-5-high | Proprietary | 29.0 ± 3.1 | 42.6 | 16.7 | 19.0 |
| Grok-4 | Proprietary | 27.5 ± 1.7 | 38.9 | 16.7 | 20.3 |
| Claude-4.5-haiku | Proprietary | 26.2 ± 1.9 | 39.8 | 13.0 | 21.9 |
| DeepSeek-V3.2-Exp | Open-Source | 20.1 ± 1.2 | 27.8 | 12.0 | 26.0 |
| GLM-4.6 | Open-Source | 18.8 ± 2.2 | 29.6 | 9.3 | 27.9 |
| Grok-Code-Fast-1 | Proprietary | 18.5 ± 2.0 | 30.6 | 9.3 | 20.2 |
| Grok-4-Fast | Proprietary | 18.5 ± 2.0 | 32.4 | 5.6 | 15.9 |
| o3 | Proprietary | 17.0 ± 0.9 | 25.0 | 9.3 | 19.4 |
| o4-mini | Proprietary | 14.8 ± 0.8 | 26.9 | 3.7 | 16.6 |
| GPT-5-mini | Proprietary | 14.5 ± 1.2 | 23.1 | 5.6 | 19.7 |
| Qwen-3-Coder | Open-Source | 14.5 ± 1.9 | 21.3 | 6.5 | 28.5 |
| Kimi-K2-0905 | Open-Source | 13.0 ± 2.0 | 22.2 | 5.6 | 26.6 |
| Gemini-2.5-Pro | Proprietary | 10.5 ± 1.9 | 21.3 | 2.8 | 26.5 |
| Gemini-2.5-Flash | Proprietary | 3.7 ± 1.5 | 8.3 | 0.0 | 8.3 |

The table is sorted in descending order by Pass@1.
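
The page does not define the three pass metrics, so the following is only a sketch under common conventions, not the benchmark's own evaluation code. It assumes each task is attempted in three independent runs, that Pass@1 is the per-run success rate averaged over runs (with the ± taken here as the sample standard deviation across runs), that Pass@3 counts a task as solved if any run succeeds, and that Pass^3 counts it only if all runs succeed. The function name `summarize` and the input layout are hypothetical.

```python
from statistics import mean, stdev


def summarize(results):
    """Compute pass metrics (in percent) from per-task run outcomes.

    `results` is a list with one tuple per task; each tuple holds k booleans,
    one per independent run (k = 3 under the assumption stated above).
    """
    n_tasks = len(results)
    k = len(results[0])

    # Success rate of each run taken separately, as a percentage of tasks.
    per_run = [100 * sum(task[r] for task in results) / n_tasks for r in range(k)]

    return {
        # Assumed definition: mean and spread of the per-run success rate.
        "pass@1_mean": mean(per_run),
        "pass@1_std": stdev(per_run) if k > 1 else 0.0,
        # Assumed definition: task counts if at least one run succeeds.
        "pass@k": 100 * sum(any(task) for task in results) / n_tasks,
        # Assumed definition: task counts only if every run succeeds.
        "pass^k": 100 * sum(all(task) for task in results) / n_tasks,
    }


# Toy data for four hypothetical tasks, three runs each (not benchmark data).
toy = [
    (True, True, True),
    (True, False, False),
    (False, False, False),
    (False, True, False),
]
metrics = summarize(toy)
```

Under these definitions Pass^3 ≤ Pass@1 ≤ Pass@3 always holds, which matches the ordering in every row of the table. The gap between Pass@3 and Pass^3 gives a rough sense of how consistent a model is across repeated attempts.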