
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Real-world language agents must handle complex, multi-step workflows across diverse applications. The Tool Decathlon (Toolathlon) is a benchmark for language agents that offers diverse applications and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional applications such as WooCommerce, Kubernetes, and BigQuery. It includes 108 manually sourced or crafted tasks, each requiring interaction with multiple applications and taking approximately 20 interaction turns on average to complete. Each task is strictly verifiable through a dedicated evaluation script.

Model              Type         Pass@1       Pass@3  Pass^3  # Turns
Claude-4.5-Sonnet  Proprietary  38.6 ± 2.7   51.9    20.4    20.2
GPT-5              Proprietary  30.6 ± 1.5   43.5    16.7    18.7
Grok-4             Proprietary  27.5 ± 1.7   38.9    16.7    20.3
DeepSeek-V3.2-Exp  Open-Source  20.1 ± 1.2   27.8    12.0    26.0
GLM-4.6            Open-Source  18.8 ± 2.2   29.6     9.3    27.9
Qwen-3-Coder       Open-Source  14.5 ± 1.9   21.3     6.5    28.5
Kimi-K2-0905       Open-Source  13.0 ± 2.0   22.2     5.6    26.6
Gemini-2.5-Pro     Proprietary  10.5 ± 1.9   21.3     2.8    26.5
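
The page does not define the Pass@1, Pass@3, and Pass^3 columns. The sketch below shows one common way such numbers are computed, and it rests on assumptions: that each task is run three times independently, that Pass@1 is the per-run success rate averaged over runs (with ± taken here as the sample standard deviation across runs), that Pass@3 counts a task as solved if any run succeeds, and that Pass^3 counts it only if all runs succeed. The function name `summarize` and the input layout are hypothetical, not part of Toolathlon.

```python
from statistics import mean, stdev


def summarize(results: dict[str, list[bool]]) -> dict[str, float]:
    """Aggregate per-task run outcomes into percentage metrics.

    `results` maps a task id to a list of success flags, one per run.
    Every task is assumed to have the same number of runs, k.
    """
    n_tasks = len(results)
    k = len(next(iter(results.values())))

    # Success rate of each individual run, as a percentage of all tasks.
    per_run = [
        100.0 * sum(runs[i] for runs in results.values()) / n_tasks
        for i in range(k)
    ]

    return {
        # Mean and spread of the single-run success rate.
        "pass@1_mean": mean(per_run),
        "pass@1_std": stdev(per_run) if k > 1 else 0.0,
        # A task counts if at least one of its k runs succeeded.
        "pass@k": 100.0 * sum(any(runs) for runs in results.values()) / n_tasks,
        # A task counts only if all k of its runs succeeded.
        "pass^k": 100.0 * sum(all(runs) for runs in results.values()) / n_tasks,
    }


# Toy data: 4 hypothetical tasks, 3 runs each.
toy = {
    "task_a": [True, True, True],
    "task_b": [True, True, False],
    "task_c": [False, False, False],
    "task_d": [False, True, True],
}
print(summarize(toy))
```

Under these definitions Pass^3 can never exceed Pass@1, and Pass@1 can never exceed Pass@3, which matches the ordering of the columns in every row of the table. The gap between Pass@3 and Pass^3 gives a rough sense of how consistent a model is across repeated attempts.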
  • 💰NV Market
  • 🏢Travel Reimbursement
  • 💻HF Upload
  • 🛒Product Recall
  • 🎓Homework Grader
  • 📖Prompt Box
  • 🎤Final Performance Analysis