
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Real-world language agents must handle complex, multi-step workflows across diverse applications. The Tool Decathlon (Toolathlon for short) is a benchmark for language agents that offers diverse applications and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional applications such as WooCommerce, Kubernetes, and BigQuery. It includes 108 manually sourced or crafted tasks, each requiring interaction with multiple applications and taking approximately 20 interaction turns on average to complete. Each task is strictly verifiable through a dedicated evaluation script.

| Model | Type | Agent | Date | Pass@1 | Pass@3 | Pass^3 | # Turns |
|---|---|---|---|---|---|---|---|
| GPT-5.4-xHigh* | Proprietary | Default | 2026-03-06 | 54.6 | - | - | - |
| Gemini-3-Flash | Proprietary | Default | 2025-12-18 | 49.4 ± 0.4 | 59.3 | 36.1 | 28.6 |
| Claude-4.6-Opus | Proprietary | Claude Agent SDK | 2026-03-06 | 47.2 | - | - | - |
| MiniMax-M2.1 | Open-Source | Default | 2025-12-25 | 40.7 ± 0.8 | 51.9 | 27.8 | 17.8 |
| GLM-5 | Open-Source | Default | 2026-02-13 | 39.2 ± 1.2 | 51.9 | 25.9 | 16.5 |
| Qwen3.5-Plus | Open-Source | Default | 2026-02-21 | 37.7 ± 1.2 | 49.1 | 25.9 | 17.4 |
| DeepSeek-V3.2-Thinking | Open-Source | Default | 2025-12-01 | 35.2 ± 0.8 | 54.6 | 16.7 | 43.7 |
| Kimi-K2.5 | Open-Source | Default | 2026-02-04 | 27.8 ± 0.8 | 38.9 | 14.8 | 17.2 |
| Grok-4 | Proprietary | Default | 2025-10-28 | 27.5 ± 1.7 | 38.9 | 16.7 | 20.3 |

* This result is sourced from the provider’s published report and has not yet been independently reproduced by Toolathlon. The source document is linked from the model name.
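
The page does not define the Pass@1, Pass@3, and Pass^3 columns. A minimal sketch of how such metrics are commonly computed, assuming each task is run three independent times, that Pass@1 is the success rate averaged over all runs, that Pass@3 counts a task as solved if any of the three runs passes, and that Pass^3 counts it only if all three pass. The function name `summarize` and the task ids are hypothetical and are not taken from the Toolathlon codebase.

```python
def summarize(results: dict[str, list[bool]]) -> dict[str, float]:
    """Aggregate per-task run outcomes into percentage scores.

    `results` maps a task id to the pass/fail outcome of each independent run.
    The metric definitions are assumptions, not confirmed by the source page.
    """
    runs_per_task = {len(runs) for runs in results.values()}
    if len(runs_per_task) != 1:
        raise ValueError("every task must have the same number of runs")
    k = runs_per_task.pop()
    n = len(results)

    total_passes = sum(sum(runs) for runs in results.values())
    return {
        # Success rate averaged over every individual run.
        "pass@1": 100 * total_passes / (n * k),
        # Share of tasks solved in at least one of the k runs.
        f"pass@{k}": 100 * sum(any(runs) for runs in results.values()) / n,
        # Share of tasks solved in all k runs (a consistency measure).
        f"pass^{k}": 100 * sum(all(runs) for runs in results.values()) / n,
    }


# Hypothetical outcomes for four tasks, three runs each.
demo = {
    "task-a": [True, True, True],
    "task-b": [True, False, False],
    "task-c": [False, False, False],
    "task-d": [False, True, True],
}
print(summarize(demo))
```

Under these assumed definitions, Pass^3 can never exceed Pass@1, which in turn can never exceed Pass@3. The complete rows in the table follow that ordering.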