Tasks: 108
MCP servers: 32 (604 tools)
Local toolkits: 7 (16 tools)
Best Pass@1 score: 55.6%

Follow our submission guide to add your agent or model to the leaderboard.

| Model | Type | Agent | Date | Pass@1 | Pass@3 | Pass^3 | # Turns |
|---|---|---|---|---|---|---|---|
| GPT-5.5-xhigh | Proprietary | Default | 2026-04-24 | 55.6 | | | |
| GPT-5.4-xhigh | Proprietary | Default | 2026-03-06 | 54.6 | | | |
| DeepSeek-V4-Pro Max | Open-Source | Default | 2026-04-25 | 52.8 ± 1.9 | 63.9 | 38.9 | 24.1 |
| Claude-Opus-4.7 | Proprietary | Default | 2026-04-25 | 52.8 | | | 16.2 |
| GPT-5.3-Codex-xhigh | Proprietary | Default | 2026-03-06 | 51.9 | | | |
| Kimi-K2.6 | Open-Source | Default | 2026-04-21 | 50.0 | | | |
| Gemini-3-Flash | Proprietary | Default | 2025-12-18 | 49.4 ± 0.4 | 59.3 | 36.1 | 28.6 |
| Gemini-3.1-Pro | Proprietary | Default | 2026-03-13 | 48.8 ± 2.3 | 62.0 | 34.3 | 27.9 |
| DeepSeek-V4-Flash Max | Open-Source | Default | 2026-04-25 | 48.2 ± 0.9 | 57.4 | 37.0 | 26.1 |
| Claude-Opus-4.6 | Proprietary | Claude Agent SDK | 2026-03-06 | 47.2 | | | |
| MiniMax-M2.7 | Open-Source | Default | 2026-03-18 | 46.3 | | | |
| Claude-Sonnet-4.6 | Proprietary | Default | 2026-02-23 | 44.8 ± 2.9 | 59.3 | 30.6 | 23.4 |
| GPT-5.2-xhigh | Proprietary | Default | 2025-12-18 | 43.8 ± 1.2 | 50.9 | 33.3 | 28.2 |
| Claude-Opus-4.5 | Proprietary | Default | 2025-11-27 | 43.5 ± 0.8 | 57.4 | 30.6 | 18.7 |
| GPT-5.4-mini-xhigh | Proprietary | Default | 2026-03-17 | 42.9 | | | |
| GPT-5.2-high | Proprietary | Default | 2025-12-17 | 41.7 ± 1.3 | 54.6 | 28.7 | 23.9 |
| MiniMax-M2.1 | Open-Source | Default | 2025-12-25 | 40.7 ± 0.8 | 51.9 | 27.8 | 17.8 |
| GLM-5.1 | Open-Source | Default | 2026-04-07 | 40.7 | | | |
| Qwen3.6-Plus | Proprietary | Default | 2026-04-02 | 39.8 | | | |
| GLM-5 | Open-Source | Default | 2026-02-13 | 39.2 ± 1.2 | 51.9 | 25.9 | 16.5 |
| Claude-Sonnet-4.5 | Proprietary | Default | 2025-10-28 | 38.9 ± 3.0 | 52.8 | 20.4 | 20.2 |
| MiniMax-M2.5 | Open-Source | Default | 2026-03-18 | 38.3 | | | |
| Qwen3.5-397B-A17B | Open-Source | Default | 2026-04-02 | 38.3 | | | |
| GPT-5-high | Proprietary | Default | 2025-12-17 | 37.7 ± 1.2 | 50.9 | 19.4 | 25.7 |
| Qwen3.5-Plus | Open-Source | Default | 2026-02-21 | 37.7 ± 1.2 | 49.1 | 25.9 | 17.4 |
| GPT-5.1-high | Proprietary | Default | 2025-12-17 | 37.0 ± 2.7 | 50.0 | 20.4 | 19.0 |
| Gemini-3-Pro | Proprietary | Default | 2025-11-22 | 36.4 ± 0.4 | 48.1 | 23.1 | 19.0 |
| GPT-5.4-nano-xhigh | Proprietary | Default | 2026-03-17 | 35.5 | | | |
| DeepSeek-V3.2-Thinking | Open-Source | Default | 2025-12-01 | 35.2 ± 0.8 | 54.6 | 16.7 | 43.7 |
| Qwen3.5-27B | Open-Source | Default | 2026-04-17 | 31.5 | | | |
| Claude-Sonnet-4 | Proprietary | Default | 2025-10-28 | 29.9 ± 1.6 | 41.7 | 17.6 | 27.3 |
| Qwen3.5-35BA3B | Open-Source | Default | 2026-04-17 | 28.7 | | | |
| Kimi-K2.5 | Open-Source | Default | 2026-02-04 | 27.8 ± 0.8 | 38.9 | 14.8 | 17.2 |
| Grok-4 | Proprietary | Default | 2025-10-28 | 27.5 ± 1.7 | 38.9 | 16.7 | 20.3 |
| Qwen3.6-35BA3B | Open-Source | Default | 2026-04-17 | 26.9 | | | |
| Claude-Haiku-4.5 | Proprietary | Default | 2025-10-28 | 26.2 ± 1.9 | 39.8 | 13.0 | 21.9 |
| GLM-4.7 | Open-Source | Default | 2025-12-25 | 23.8 ± 1.2 | 36.1 | 10.2 | 27.8 |
| DeepSeek-V3.2-Exp | Open-Source | Default | 2025-10-28 | 20.1 ± 1.2 | 27.8 | 12.0 | 26.0 |
| GLM-4.6 | Open-Source | Default | 2025-10-28 | 18.8 ± 2.2 | 29.6 | 9.3 | 27.9 |
| Grok-Code-Fast-1 | Proprietary | Default | 2025-10-28 | 18.5 ± 2.0 | 30.6 | 9.3 | 20.2 |
| Grok-4-Fast | Proprietary | Default | 2025-10-28 | 18.5 ± 2.0 | 32.4 | 5.6 | 15.9 |
| Kimi-K2-thinking | Open-Source | Default | 2025-11-22 | 17.6 ± 2.0 | 29.6 | 4.6 | 24.4 |
| o3 | Proprietary | Default | 2025-10-28 | 17.0 ± 0.9 | 25.0 | 9.3 | 19.4 |
| o4-mini | Proprietary | Default | 2025-10-28 | 14.8 ± 0.8 | 26.9 | 3.7 | 16.6 |
| GPT-5-mini | Proprietary | Default | 2025-10-28 | 14.5 ± 1.2 | 23.1 | 5.6 | 19.7 |
| Qwen-3-Coder | Open-Source | Default | 2025-10-28 | 14.5 ± 1.9 | 21.3 | 6.5 | 28.5 |
| Gemini-3.1-Flash-Lite | Proprietary | Default | 2026-03-13 | 14.2 ± 1.2 | 20.4 | 7.4 | 31.2 |
| Kimi-K2-0905 | Open-Source | Default | 2025-10-28 | 13.0 ± 2.0 | 22.2 | 5.6 | 26.6 |
| Gemini-2.5-Pro | Proprietary | Default | 2025-10-28 | 10.5 ± 1.9 | 21.3 | 2.8 | 26.5 |
| Gemini-2.5-Flash | Proprietary | Default | 2025-10-28 | 3.7 ± 1.5 | 8.3 | 0.0 | 8.3 |

Results bearing this badge were independently evaluated by us; sources for all other results are linked in the corresponding model names.

Claude-Opus was evaluated once due to budget constraints.
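
The page does not define the Pass@1, Pass@3, and Pass^3 columns, so the sketch below shows one common reading, stated here as an assumption: each task is attempted in three independent runs, Pass@1 is the mean per-run success rate (with the ± value as the spread across runs), Pass@3 counts tasks solved in at least one run, and Pass^3 counts tasks solved in all three. The function name `summarize_runs`, the toy data, and the choice of population standard deviation are all hypothetical and not taken from the benchmark's codebase.

```python
from statistics import mean, pstdev


def summarize_runs(results):
    """Summarize per-task outcomes over repeated runs.

    results: one list per task, holding one boolean per run
             (True means the task was solved in that run).
    Returns percentages under the assumed metric definitions.
    """
    n_tasks = len(results)
    n_runs = len(results[0])
    # Success rate of each individual run, in percent.
    per_run = [
        100 * sum(task[r] for task in results) / n_tasks
        for r in range(n_runs)
    ]
    return {
        "pass@1": mean(per_run),          # average single-run success rate
        "pass@1_std": pstdev(per_run),    # spread across runs (assumed population std)
        "pass@k": 100 * sum(any(t) for t in results) / n_tasks,  # solved at least once
        "pass^k": 100 * sum(all(t) for t in results) / n_tasks,  # solved every time
    }


# Toy data: 4 hypothetical tasks, 3 runs each.
toy = [
    [True, True, True],
    [True, False, True],
    [False, False, True],
    [False, False, False],
]
summary = summarize_runs(toy)
print(summary)
```

Under this reading, Pass@3 is an upper bound on Pass@1 and Pass^3 a lower bound, which matches the ordering of the three columns in every fully reported row of the table. A model evaluated only once would have a Pass@1 value but no spread, Pass@3, or Pass^3.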

OpenAI models achieve better native performance with the Responses API. We therefore modified the codebase to support it for these models, updated the results accordingly, and removed the earlier results that were based on the Chat Completions API.