Tasks: 108
MCP servers (tools): 32 (604)
Local toolkits (tools): 7 (16)
Best Pass@1 score: 54.6%

Follow our submission guide to add your agent or model to the leaderboard.

| Model | Type | Agent | Date | Pass@1 | Pass@3 | Pass^3 | # Turns |
|---|---|---|---|---|---|---|---|
| GPT-5.4-xhigh | Proprietary | Default | 2026-03-06 | 54.6 | | | |
| GPT-5.3-Codex-xhigh | Proprietary | Default | 2026-03-06 | 51.9 | | | |
| Gemini-3-Flash | Proprietary | Default | 2025-12-18 | 49.4 ± 0.4 | 59.3 | 36.1 | 28.6 |
| Claude-4.6-Opus | Proprietary | Claude Agent SDK | 2026-03-06 | 47.2 | | | |
| Claude-4.6-Sonnet | Proprietary | Default | 2026-02-23 | 44.8 ± 2.9 | 59.3 | 30.6 | 23.4 |
| GPT-5.2-xhigh | Proprietary | Default | 2025-12-18 | 43.8 ± 1.2 | 50.9 | 33.3 | 28.2 |
| Claude-4.5-Opus | Proprietary | Default | 2025-11-27 | 43.5 ± 0.8 | 57.4 | 30.6 | 18.7 |
| GPT-5.2-high | Proprietary | Default | 2025-12-17 | 41.7 ± 1.3 | 54.6 | 28.7 | 23.9 |
| MiniMax-M2.1 | Open-Source | Default | 2025-12-25 | 40.7 ± 0.8 | 51.9 | 27.8 | 17.8 |
| GLM-5 | Open-Source | Default | 2026-02-13 | 39.2 ± 1.2 | 51.9 | 25.9 | 16.5 |
| Claude-4.5-Sonnet | Proprietary | Default | 2025-10-28 | 38.9 ± 3.0 | 52.8 | 20.4 | 20.2 |
| GPT-5-high | Proprietary | Default | 2025-12-17 | 37.7 ± 1.2 | 50.9 | 19.4 | 25.7 |
| Qwen3.5-Plus | Open-Source | Default | 2026-02-21 | 37.7 ± 1.2 | 49.1 | 25.9 | 17.4 |
| GPT-5.1-high | Proprietary | Default | 2025-12-17 | 37.0 ± 2.7 | 50.0 | 20.4 | 19.0 |
| Gemini-3-Pro | Proprietary | Default | 2025-11-22 | 36.4 ± 0.4 | 48.1 | 23.1 | 19.0 |
| DeepSeek-V3.2-Thinking | Open-Source | Default | 2025-12-01 | 35.2 ± 0.8 | 54.6 | 16.7 | 43.7 |
| Claude-4-Sonnet | Proprietary | Default | 2025-10-28 | 29.9 ± 1.6 | 41.7 | 17.6 | 27.3 |
| Kimi-K2.5 | Open-Source | Default | 2026-02-04 | 27.8 ± 0.8 | 38.9 | 14.8 | 17.2 |
| Grok-4 | Proprietary | Default | 2025-10-28 | 27.5 ± 1.7 | 38.9 | 16.7 | 20.3 |
| Claude-4.5-haiku | Proprietary | Default | 2025-10-28 | 26.2 ± 1.9 | 39.8 | 13.0 | 21.9 |
| GLM-4.7 | Open-Source | Default | 2025-12-25 | 23.8 ± 1.2 | 36.1 | 10.2 | 27.8 |
| DeepSeek-V3.2-Exp | Open-Source | Default | 2025-10-28 | 20.1 ± 1.2 | 27.8 | 12.0 | 26.0 |
| GLM-4.6 | Open-Source | Default | 2025-10-28 | 18.8 ± 2.2 | 29.6 | 9.3 | 27.9 |
| Grok-Code-Fast-1 | Proprietary | Default | 2025-10-28 | 18.5 ± 2.0 | 30.6 | 9.3 | 20.2 |
| Grok-4-Fast | Proprietary | Default | 2025-10-28 | 18.5 ± 2.0 | 32.4 | 5.6 | 15.9 |
| Kimi-K2-thinking | Open-Source | Default | 2025-11-22 | 17.6 ± 2.0 | 29.6 | 4.6 | 24.4 |
| o3 | Proprietary | Default | 2025-10-28 | 17.0 ± 0.9 | 25.0 | 9.3 | 19.4 |
| o4-mini | Proprietary | Default | 2025-10-28 | 14.8 ± 0.8 | 26.9 | 3.7 | 16.6 |
| GPT-5-mini | Proprietary | Default | 2025-10-28 | 14.5 ± 1.2 | 23.1 | 5.6 | 19.7 |
| Qwen-3-Coder | Open-Source | Default | 2025-10-28 | 14.5 ± 1.9 | 21.3 | 6.5 | 28.5 |
| Kimi-K2-0905 | Open-Source | Default | 2025-10-28 | 13.0 ± 2.0 | 22.2 | 5.6 | 26.6 |
| Gemini-2.5-Pro | Proprietary | Default | 2025-10-28 | 10.5 ± 1.9 | 21.3 | 2.8 | 26.5 |
| Gemini-2.5-Flash | Proprietary | Default | 2025-10-28 | 3.7 ± 1.5 | 8.3 | 0.0 | 8.3 |
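
The page does not define the Pass@1, Pass@3, and Pass^3 columns, so the following is only a sketch under commonly used conventions, not the leaderboard's actual scoring code. It assumes each task is attempted in three independent runs, that Pass@1 is the mean per-run success rate (with the ± taken here as the sample standard deviation across runs), that Pass@3 counts a task as solved if any run succeeds, and that Pass^3 counts it only if every run succeeds. The `summarize` function and the toy task names are hypothetical.

```python
from statistics import mean, stdev


def summarize(results):
    """Aggregate per-task run outcomes into percentage scores.

    `results` maps a task id to a list of booleans, one per independent run.
    All tasks are assumed to have the same number of runs, k.
    """
    k = len(next(iter(results.values())))
    assert all(len(runs) == k for runs in results.values())
    n = len(results)

    # Success rate of each run taken on its own, as a percentage.
    per_run = [100 * sum(runs[i] for runs in results.values()) / n for i in range(k)]

    return {
        # Assumed: Pass@1 is the mean of the per-run success rates.
        "pass@1": mean(per_run),
        # Assumed: the reported spread is the sample standard deviation.
        "pass@1_std": stdev(per_run) if k > 1 else 0.0,
        # Assumed: Pass@k counts a task if at least one run succeeded.
        f"pass@{k}": 100 * sum(any(runs) for runs in results.values()) / n,
        # Assumed: Pass^k counts a task only if all runs succeeded.
        f"pass^{k}": 100 * sum(all(runs) for runs in results.values()) / n,
    }


# Toy data: four hypothetical tasks, three runs each.
toy = {
    "task-a": [True, True, True],
    "task-b": [True, False, False],
    "task-c": [False, False, False],
    "task-d": [False, True, True],
}
summary = summarize(toy)
```

Under these assumptions Pass^3 ≤ Pass@1 ≤ Pass@3 always holds, which matches the ordering of the three columns in every fully populated row of the table.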

Results bearing this badge were independently evaluated by us; sources for all other results are linked in the corresponding model names.

Claude-4.6-Opus was evaluated once due to budget constraints.

OpenAI models perform better when called natively through the Responses API. We therefore modified the codebase to support the Responses API for these models, updated their results accordingly, and removed the earlier results obtained with the Chat Completions API.