| # Tasks | # MCP servers (# tools) | # Local toolkits (# tools) | Best Pass@1 Score |
|---|---|---|---|
| 108 | 32 (604) | 7 (16) | 54.6% |

Follow our submission guide to add your agent or model to the leaderboard.
| Model | Type | Agent | Date | Pass@1 | Pass@3 | Pass^3 | # Turns |
|---|---|---|---|---|---|---|---|
| GPT-5.4-xhigh | Proprietary | Default | 2026-03-06 | 54.6 | — | — | — |
| GPT-5.3-Codex-xhigh | Proprietary | Default | 2026-03-06 | 51.9 | — | — | — |
| Gemini-3-Flash | Proprietary | Default | 2025-12-18 | 49.4 ± 0.4 | 59.3 | 36.1 | 28.6 |
| Claude-4.6-Opus | Proprietary | Claude Agent SDK | 2026-03-06 | 47.2† | — | — | — |
| Claude-4.6-Sonnet | Proprietary | Default | 2026-02-23 | 44.8 ± 2.9 | 59.3 | 30.6 | 23.4 |
| GPT-5.2-xhigh‡ | Proprietary | Default | 2025-12-18 | 43.8 ± 1.2 | 50.9 | 33.3 | 28.2 |
| Claude-4.5-Opus | Proprietary | Default | 2025-11-27 | 43.5 ± 0.8 | 57.4 | 30.6 | 18.7 |
| GPT-5.2-high‡ | Proprietary | Default | 2025-12-17 | 41.7 ± 1.3 | 54.6 | 28.7 | 23.9 |
| MiniMax-M2.1 | Open-Source | Default | 2025-12-25 | 40.7 ± 0.8 | 51.9 | 27.8 | 17.8 |
| GLM-5 | Open-Source | Default | 2026-02-13 | 39.2 ± 1.2 | 51.9 | 25.9 | 16.5 |
| Claude-4.5-Sonnet | Proprietary | Default | 2025-10-28 | 38.9 ± 3.0 | 52.8 | 20.4 | 20.2 |
| GPT-5-high‡ | Proprietary | Default | 2025-12-17 | 37.7 ± 1.2 | 50.9 | 19.4 | 25.7 |
| Qwen3.5-Plus | Open-Source | Default | 2026-02-21 | 37.7 ± 1.2 | 49.1 | 25.9 | 17.4 |
| GPT-5.1-high‡ | Proprietary | Default | 2025-12-17 | 37.0 ± 2.7 | 50.0 | 20.4 | 19.0 |
| Gemini-3-Pro | Proprietary | Default | 2025-11-22 | 36.4 ± 0.4 | 48.1 | 23.1 | 19.0 |
| DeepSeek-V3.2-Thinking | Open-Source | Default | 2025-12-01 | 35.2 ± 0.8 | 54.6 | 16.7 | 43.7 |
| Claude-4-Sonnet | Proprietary | Default | 2025-10-28 | 29.9 ± 1.6 | 41.7 | 17.6 | 27.3 |
| Kimi-K2.5 | Open-Source | Default | 2026-02-04 | 27.8 ± 0.8 | 38.9 | 14.8 | 17.2 |
| Grok-4 | Proprietary | Default | 2025-10-28 | 27.5 ± 1.7 | 38.9 | 16.7 | 20.3 |
| Claude-4.5-haiku | Proprietary | Default | 2025-10-28 | 26.2 ± 1.9 | 39.8 | 13.0 | 21.9 |
| GLM-4.7 | Open-Source | Default | 2025-12-25 | 23.8 ± 1.2 | 36.1 | 10.2 | 27.8 |
| DeepSeek-V3.2-Exp | Open-Source | Default | 2025-10-28 | 20.1 ± 1.2 | 27.8 | 12.0 | 26.0 |
| GLM-4.6 | Open-Source | Default | 2025-10-28 | 18.8 ± 2.2 | 29.6 | 9.3 | 27.9 |
| Grok-Code-Fast-1 | Proprietary | Default | 2025-10-28 | 18.5 ± 2.0 | 30.6 | 9.3 | 20.2 |
| Grok-4-Fast | Proprietary | Default | 2025-10-28 | 18.5 ± 2.0 | 32.4 | 5.6 | 15.9 |
| Kimi-K2-thinking | Open-Source | Default | 2025-11-22 | 17.6 ± 2.0 | 29.6 | 4.6 | 24.4 |
| o3 | Proprietary | Default | 2025-10-28 | 17.0 ± 0.9 | 25.0 | 9.3 | 19.4 |
| o4-mini | Proprietary | Default | 2025-10-28 | 14.8 ± 0.8 | 26.9 | 3.7 | 16.6 |
| GPT-5-mini | Proprietary | Default | 2025-10-28 | 14.5 ± 1.2 | 23.1 | 5.6 | 19.7 |
| Qwen-3-Coder | Open-Source | Default | 2025-10-28 | 14.5 ± 1.9 | 21.3 | 6.5 | 28.5 |
| Kimi-K2-0905 | Open-Source | Default | 2025-10-28 | 13.0 ± 2.0 | 22.2 | 5.6 | 26.6 |
| Gemini-2.5-Pro | Proprietary | Default | 2025-10-28 | 10.5 ± 1.9 | 21.3 | 2.8 | 26.5 |
| Gemini-2.5-Flash | Proprietary | Default | 2025-10-28 | 3.7 ± 1.5 | 8.3 | 0.0 | 8.3 |
Results bearing this badge were independently evaluated by us; sources for all other results are linked in the corresponding model names.
† Claude-4.6-Opus was evaluated once due to budget constraints.
‡ OpenAI models perform better natively through the Responses API. We therefore modified the codebase to support the Responses API for these models, updated their results accordingly, and removed the earlier results that were based on the Chat Completions API.
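
The column definitions are not stated here, so the following is only a sketch of one common way such figures are aggregated. It assumes each model is run three times on every task, that Pass@1 is the mean per-run success rate (with the ± value taken as the standard deviation across runs), that Pass@3 counts tasks solved in at least one run, and that Pass^3 counts tasks solved in all three runs. The function name `summarize_runs`, the `toy_runs` data, and the choice of population standard deviation are illustrative assumptions, not the leaderboard's actual evaluation code.

```python
from statistics import mean, pstdev


def summarize_runs(runs):
    """Aggregate k independent runs into leaderboard-style percentages.

    runs[r][t] is True if task t succeeded in run r (hypothetical layout).
    """
    n_tasks = len(runs[0])
    if n_tasks == 0 or any(len(run) != n_tasks for run in runs):
        raise ValueError("every run must cover the same non-empty task set")

    # Success rate of each run taken on its own, in percent.
    per_run = [100.0 * sum(run) / n_tasks for run in runs]
    # Regroup outcomes by task: one tuple of k results per task.
    per_task = list(zip(*runs))

    return {
        "pass@1": mean(per_run),
        # Assumption: population std across runs; the real choice is unknown.
        "pass@1_std": pstdev(per_run),
        # Task counts if ANY of the k runs solved it.
        "pass@k": 100.0 * sum(any(t) for t in per_task) / n_tasks,
        # Task counts only if ALL k runs solved it.
        "pass^k": 100.0 * sum(all(t) for t in per_task) / n_tasks,
    }


# Toy data: 3 runs over 4 tasks (not real leaderboard results).
toy_runs = [
    [True, True, False, False],
    [True, False, True, False],
    [True, True, False, False],
]
summary = summarize_runs(toy_runs)
print(summary)
```

Under these assumptions Pass^3 can never exceed Pass@1, and Pass@1 can never exceed Pass@3, which matches the ordering seen in every fully populated row of the table. A model evaluated only once, as in the † footnote, has no spread across runs and no meaningful Pass@3 or Pass^3.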