Tasks: 108
MCP servers (tools): 32 (604)
Local toolkits (tools): 7 (16)
Best Pass@1 score: 49.4%

Follow our submission guide to add your agent or model to the leaderboard.

| Model | Type | Date | Pass@1 | Pass@3 | Pass^3 | # Turns | Total Cost |
|---|---|---|---|---|---|---|---|
| Gemini-3-Flash | Proprietary | 2025-12-18 | 49.4 ± 0.4 | 59.3 | 36.1 | 28.6 | |
| GPT-5.2-XHigh* | Proprietary | 2025-12-18 | 43.8 ± 1.2 | 50.9 | 33.3 | 28.2 | |
| Claude-4.5-Opus | Proprietary | 2025-11-27 | 43.5 ± 0.8 | 57.4 | 30.6 | 18.7 | |
| GPT-5.2-High* | Proprietary | 2025-12-17 | 41.7 ± 1.3 | 54.6 | 28.7 | 23.9 | |
| MiniMax-M2.1 | Open-Source | 2025-12-25 | 40.7 ± 0.8 | 51.9 | 27.8 | 17.8 | |
| Claude-4.5-Sonnet | Proprietary | 2025-10-28 | 38.9 ± 3.0 | 52.8 | 20.4 | 20.2 | $96 |
| GPT-5-High* | Proprietary | 2025-12-17 | 37.7 ± 1.2 | 50.9 | 19.4 | 25.7 | |
| GPT-5.1-High* | Proprietary | 2025-12-17 | 37.0 ± 2.7 | 50.0 | 20.4 | 19.0 | |
| Gemini-3-Pro | Proprietary | 2025-11-22 | 36.4 ± 0.4 | 48.1 | 23.1 | 19.0 | |
| DeepSeek-V3.2-Thinking | Open-Source | 2025-12-01 | 35.2 ± 0.8 | 54.6 | 16.7 | 43.7 | |
| Claude-4-Sonnet | Proprietary | 2025-10-28 | 29.9 ± 1.6 | 41.7 | 17.6 | 27.3 | $127 |
| Grok-4 | Proprietary | 2025-10-28 | 27.5 ± 1.7 | 38.9 | 16.7 | 20.3 | $121 |
| Claude-4.5-haiku | Proprietary | 2025-10-28 | 26.2 ± 1.9 | 39.8 | 13.0 | 21.9 | $36 |
| GLM-4.7 | Open-Source | 2025-12-25 | 23.8 ± 1.2 | 36.1 | 10.2 | 27.8 | |
| DeepSeek-V3.2-Exp | Open-Source | 2025-10-28 | 20.1 ± 1.2 | 27.8 | 12.0 | 26.0 | $5 |
| GLM-4.6 | Open-Source | 2025-10-28 | 18.8 ± 2.2 | 29.6 | 9.3 | 27.9 | $43 |
| Grok-Code-Fast-1 | Proprietary | 2025-10-28 | 18.5 ± 2.0 | 30.6 | 9.3 | 20.2 | $4 |
| Grok-4-Fast | Proprietary | 2025-10-28 | 18.5 ± 2.0 | 32.4 | 5.6 | 15.9 | $3 |
| Kimi-K2-thinking | Open-Source | 2025-11-22 | 17.6 ± 2.0 | 29.6 | 4.6 | 24.4 | |
| o3 | Proprietary | 2025-10-28 | 17.0 ± 0.9 | 25.0 | 9.3 | 19.4 | $53 |
| o4-mini | Proprietary | 2025-10-28 | 14.8 ± 0.8 | 26.9 | 3.7 | 16.6 | $26 |
| GPT-5-mini | Proprietary | 2025-10-28 | 14.5 ± 1.2 | 23.1 | 5.6 | 19.7 | $11 |
| Qwen-3-Coder | Open-Source | 2025-10-28 | 14.5 ± 1.9 | 21.3 | 6.5 | 28.5 | |
| Kimi-K2-0905 | Open-Source | 2025-10-28 | 13.0 ± 2.0 | 22.2 | 5.6 | 26.6 | $22 |
| Gemini-2.5-Pro | Proprietary | 2025-10-28 | 10.5 ± 1.9 | 21.3 | 2.8 | 26.5 | $41 |
| Gemini-2.5-Flash | Proprietary | 2025-10-28 | 3.7 ± 1.5 | 8.3 | 0.0 | 8.3 | $4 |

*OpenAI models perform better natively through the Responses API. We therefore modified the codebase to support the Responses API for these models, updated their results accordingly, and removed the earlier results that were based on the Chat Completions API.
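
The page does not define the Pass@1, Pass@3, and Pass^3 columns, so the sketch below illustrates one common reading, and every part of it is an assumption: each task is run three times independently, Pass@1 is the success rate averaged over runs (with the ± taken as the sample standard deviation across runs), Pass@3 counts a task as solved if any run succeeds, and Pass^3 counts it only if all runs succeed. The function name `summarize_runs` and the example data are hypothetical and are not taken from the leaderboard's codebase.

```python
from math import sqrt


def summarize_runs(results):
    """Summarize repeated runs per task.

    results: dict mapping task id -> list of k booleans, one per run.
    Returns fractions in [0, 1] under the assumed metric definitions.
    """
    tasks = list(results.values())
    n_tasks = len(tasks)
    k = len(tasks[0])

    # Success rate of each run taken separately.
    run_rates = [sum(1 for runs in tasks if runs[i]) / n_tasks for i in range(k)]

    # Assumed Pass@1: mean single-run success rate.
    pass_at_1 = sum(run_rates) / k

    # Assumed "±": sample standard deviation of the per-run success rates.
    if k > 1:
        spread = sqrt(sum((r - pass_at_1) ** 2 for r in run_rates) / (k - 1))
    else:
        spread = 0.0

    # Assumed Pass@k: a task counts if at least one run succeeded.
    pass_at_k = sum(1 for runs in tasks if any(runs)) / n_tasks

    # Assumed Pass^k: a task counts only if every run succeeded.
    pass_hat_k = sum(1 for runs in tasks if all(runs)) / n_tasks

    return {
        "pass@1": pass_at_1,
        "spread": spread,
        "pass@k": pass_at_k,
        "pass^k": pass_hat_k,
    }


# Hypothetical data: four tasks, three runs each.
example = {
    "task_a": [True, True, True],
    "task_b": [True, True, False],
    "task_c": [False, False, False],
    "task_d": [False, True, True],
}
summary = summarize_runs(example)
```

Under these assumed definitions, Pass^3 can never exceed Pass@1, and Pass@1 can never exceed Pass@3. Every row of the table is consistent with that ordering.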