| Statistic | Value |
|---|---|
| Tasks | 108 |
| MCP servers (tools) | 32 (604) |
| Local toolkits (tools) | 7 (16) |
| Best Pass@1 score | 49.4% |

Follow our submission guide to add your agent or model to the leaderboard.
| Model | Type | Date | Pass@1 | Pass@3 | Pass^3 | # Turns | Total Cost |
|---|---|---|---|---|---|---|---|
| Gemini-3-Flash | Proprietary | 2025-12-18 | 49.4 ± 0.4 | 59.3 | 36.1 | 28.6 | — |
| GPT-5.2-XHigh* | Proprietary | 2025-12-18 | 43.8 ± 1.2 | 50.9 | 33.3 | 28.2 | — |
| Claude-4.5-Opus | Proprietary | 2025-11-27 | 43.5 ± 0.8 | 57.4 | 30.6 | 18.7 | — |
| GPT-5.2-High* | Proprietary | 2025-12-17 | 41.7 ± 1.3 | 54.6 | 28.7 | 23.9 | — |
| MiniMax-M2.1 | Open-Source | 2025-12-25 | 40.7 ± 0.8 | 51.9 | 27.8 | 17.8 | — |
| Claude-4.5-Sonnet | Proprietary | 2025-10-28 | 38.9 ± 3.0 | 52.8 | 20.4 | 20.2 | $96 |
| GPT-5-High* | Proprietary | 2025-12-17 | 37.7 ± 1.2 | 50.9 | 19.4 | 25.7 | — |
| GPT-5.1-High* | Proprietary | 2025-12-17 | 37.0 ± 2.7 | 50.0 | 20.4 | 19.0 | — |
| Gemini-3-Pro | Proprietary | 2025-11-22 | 36.4 ± 0.4 | 48.1 | 23.1 | 19.0 | — |
| DeepSeek-V3.2-Thinking | Open-Source | 2025-12-01 | 35.2 ± 0.8 | 54.6 | 16.7 | 43.7 | — |
| Claude-4-Sonnet | Proprietary | 2025-10-28 | 29.9 ± 1.6 | 41.7 | 17.6 | 27.3 | $127 |
| Grok-4 | Proprietary | 2025-10-28 | 27.5 ± 1.7 | 38.9 | 16.7 | 20.3 | $121 |
| Claude-4.5-haiku | Proprietary | 2025-10-28 | 26.2 ± 1.9 | 39.8 | 13.0 | 21.9 | $36 |
| GLM-4.7 | Open-Source | 2025-12-25 | 23.8 ± 1.2 | 36.1 | 10.2 | 27.8 | — |
| DeepSeek-V3.2-Exp | Open-Source | 2025-10-28 | 20.1 ± 1.2 | 27.8 | 12.0 | 26.0 | $5 |
| GLM-4.6 | Open-Source | 2025-10-28 | 18.8 ± 2.2 | 29.6 | 9.3 | 27.9 | $43 |
| Grok-Code-Fast-1 | Proprietary | 2025-10-28 | 18.5 ± 2.0 | 30.6 | 9.3 | 20.2 | $4 |
| Grok-4-Fast | Proprietary | 2025-10-28 | 18.5 ± 2.0 | 32.4 | 5.6 | 15.9 | $3 |
| Kimi-K2-thinking | Open-Source | 2025-11-22 | 17.6 ± 2.0 | 29.6 | 4.6 | 24.4 | — |
| o3 | Proprietary | 2025-10-28 | 17.0 ± 0.9 | 25.0 | 9.3 | 19.4 | $53 |
| o4-mini | Proprietary | 2025-10-28 | 14.8 ± 0.8 | 26.9 | 3.7 | 16.6 | $26 |
| GPT-5-mini | Proprietary | 2025-10-28 | 14.5 ± 1.2 | 23.1 | 5.6 | 19.7 | $11 |
| Qwen-3-Coder | Open-Source | 2025-10-28 | 14.5 ± 1.9 | 21.3 | 6.5 | 28.5 | — |
| Kimi-K2-0905 | Open-Source | 2025-10-28 | 13.0 ± 2.0 | 22.2 | 5.6 | 26.6 | $22 |
| Gemini-2.5-Pro | Proprietary | 2025-10-28 | 10.5 ± 1.9 | 21.3 | 2.8 | 26.5 | $41 |
| Gemini-2.5-Flash | Proprietary | 2025-10-28 | 3.7 ± 1.5 | 8.3 | 0.0 | 8.3 | $4 |

*OpenAI models achieve better native performance through the Responses API. We therefore modified the codebase to support the Responses API for these models, updated their results, and removed the earlier results obtained with the Chat Completions API.
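
The leaderboard does not define its Pass@1, Pass@3, and Pass^3 columns, so the following is only a sketch under a common convention: each task is attempted in several independent runs, Pass@1 is the mean per-run success rate, Pass@k counts a task as solved if at least one run succeeds, and Pass^k counts it only if every run succeeds. Treating the ± value as the standard deviation across runs is likewise an assumption. The function name `summarize` and the task IDs are hypothetical.

```python
from statistics import mean, pstdev


def summarize(results):
    """Aggregate per-task run outcomes into leaderboard-style percentages.

    `results` maps a task ID to a list of booleans, one per independent run.
    All tasks are assumed to have the same number of runs.
    """
    outcomes = list(results.values())
    n_tasks = len(outcomes)
    n_runs = len(outcomes[0])

    # Success rate of each run taken on its own, as a percentage of tasks.
    per_run = [100 * sum(o[i] for o in outcomes) / n_tasks for i in range(n_runs)]

    return {
        # Assumed convention: mean success rate over the individual runs.
        "pass@1": mean(per_run),
        # Assumed convention: spread of the per-run success rates.
        "pass@1_spread": pstdev(per_run),
        # Solved in at least one of the k runs.
        f"pass@{n_runs}": 100 * sum(any(o) for o in outcomes) / n_tasks,
        # Solved in every one of the k runs.
        f"pass^{n_runs}": 100 * sum(all(o) for o in outcomes) / n_tasks,
    }


# Toy data: four hypothetical tasks, three runs each.
results = {
    "task-a": [True, True, True],
    "task-b": [True, False, False],
    "task-c": [False, False, False],
    "task-d": [False, True, True],
}
summary = summarize(results)
print(summary)
```

Under this reading, Pass^k can never exceed Pass@1, and Pass@1 can never exceed Pass@k, which matches the ordering of the three columns in every row of the table.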