| # Tasks | # MCP servers (# tools) | # Local toolkits (# tools) | Best Pass@1 Score |
|---|---|---|---|
| 108 | 32 (604) | 7 (16) | 43.5% |

Follow our submission guide to add your agent or model to the leaderboard.
| Model | Type | Date | Pass@1 (%) | Pass@3 (%) | Pass^3 (%) | # Turns | Total Cost |
|---|---|---|---|---|---|---|---|
| Claude-4.5-Opus | Proprietary | 2025-11-27 | 43.5 ± 0.8 | 57.4 | 30.6 | 18.7 | — |
| Claude-4.5-Sonnet | Proprietary | 2025-10-28 | 38.9 ± 3.0 | 52.8 | 20.4 | 20.2 | $96 |
| Gemini-3-Pro | Proprietary | 2025-11-22 | 36.4 ± 0.4 | 48.1 | 23.1 | 19.0 | — |
| DeepSeek-V3.2-Thinking | Open-Source | 2025-12-01 | 35.2 ± 0.8 | 54.6 | 16.7 | 43.7 | — |
| GPT-5.1 | Proprietary | 2025-11-22 | 33.3 ± 0.8 | 43.5 | 22.2 | 15.5 | — |
| GPT-5 | Proprietary | 2025-10-28 | 30.6 ± 1.5 | 43.5 | 16.7 | 18.7 | $40 |
| Claude-4-Sonnet | Proprietary | 2025-10-28 | 29.9 ± 1.6 | 41.7 | 17.6 | 27.3 | $127 |
| GPT-5-high | Proprietary | 2025-10-28 | 29.0 ± 3.1 | 42.6 | 16.7 | 19.0 | $64 |
| Grok-4 | Proprietary | 2025-10-28 | 27.5 ± 1.7 | 38.9 | 16.7 | 20.3 | $121 |
| Claude-4.5-haiku | Proprietary | 2025-10-28 | 26.2 ± 1.9 | 39.8 | 13.0 | 21.9 | $36 |
| DeepSeek-V3.2-Exp | Open-Source | 2025-10-28 | 20.1 ± 1.2 | 27.8 | 12.0 | 26.0 | $5 |
| GLM-4.6 | Open-Source | 2025-10-28 | 18.8 ± 2.2 | 29.6 | 9.3 | 27.9 | $43 |
| Grok-Code-Fast-1 | Proprietary | 2025-10-28 | 18.5 ± 2.0 | 30.6 | 9.3 | 20.2 | $4 |
| Grok-4-Fast | Proprietary | 2025-10-28 | 18.5 ± 2.0 | 32.4 | 5.6 | 15.9 | $3 |
| Kimi-K2-thinking | Open-Source | 2025-11-22 | 17.6 ± 2.0 | 29.6 | 4.6 | 24.4 | — |
| o3 | Proprietary | 2025-10-28 | 17.0 ± 0.9 | 25.0 | 9.3 | 19.4 | $53 |
| o4-mini | Proprietary | 2025-10-28 | 14.8 ± 0.8 | 26.9 | 3.7 | 16.6 | $26 |
| GPT-5-mini | Proprietary | 2025-10-28 | 14.5 ± 1.2 | 23.1 | 5.6 | 19.7 | $11 |
| Qwen-3-Coder | Open-Source | 2025-10-28 | 14.5 ± 1.9 | 21.3 | 6.5 | 28.5 | — |
| Kimi-K2-0905 | Open-Source | 2025-10-28 | 13.0 ± 2.0 | 22.2 | 5.6 | 26.6 | $22 |
| Gemini-2.5-Pro | Proprietary | 2025-10-28 | 10.5 ± 1.9 | 21.3 | 2.8 | 26.5 | $41 |
| Gemini-2.5-Flash | Proprietary | 2025-10-28 | 3.7 ± 1.5 | 8.3 | 0.0 | 8.3 | $4 |
The table is sorted in descending order by Pass@1. Qwen’s pricing varies by region, so we can’t provide an exact cost.
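
The page does not define its metrics, so the following is only a sketch under the common convention: Pass@1 is the per-run success rate averaged over runs, Pass@3 counts a task as solved if any of three runs succeeds, and Pass^3 counts it only if all three succeed. The three-runs-per-task setup, the use of the sample standard deviation for the ± value, and the `summarize` helper and its toy input are all assumptions for illustration, not the benchmark's own evaluation code.

```python
from statistics import mean, stdev


def summarize(results):
    """Aggregate per-task outcomes into leaderboard-style percentages.

    results[t][r] is True when run r solved task t (hypothetical layout).
    """
    n_tasks = len(results)
    n_runs = len(results[0])

    # Success rate of each run taken on its own, as a percentage of tasks.
    per_run = [
        100 * sum(task[r] for task in results) / n_tasks
        for r in range(n_runs)
    ]

    return {
        # Mean single-run success rate across the runs.
        "pass@1": mean(per_run),
        # Run-to-run spread (assumed to be what the "±" column reports).
        "std": stdev(per_run) if n_runs > 1 else 0.0,
        # Task counts if at least one run solved it.
        "pass@k": 100 * sum(any(task) for task in results) / n_tasks,
        # Task counts only if every run solved it.
        "pass^k": 100 * sum(all(task) for task in results) / n_tasks,
    }


# Toy data: 4 tasks, 3 runs each (made-up values, not leaderboard results).
toy = [
    [True, True, True],     # solved in every run
    [True, False, False],   # solved once
    [False, False, False],  # never solved
    [False, True, True],    # solved twice
]
scores = summarize(toy)
print(scores)
```

Under these definitions Pass^3 ≤ Pass@1 ≤ Pass@3 always holds, which matches every row of the table; a wide gap between Pass@3 and Pass^3 suggests that a model solves many tasks only intermittently.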