Today, we’re excited to announce the very first release of Toolathlon — a benchmark designed to quantitatively evaluate how well LLM agents perform on long‑horizon tasks across diverse, realistic scenarios.

Motivation

Over the past six months, we’ve witnessed remarkable progress in LLM agents. They’ve become significantly more capable in areas such as vibe coding, deep research, and browsing, delivering substantial convenience for users in these domains. However, the real world is not limited to those use cases. Consider a few everyday examples:
  • As a teaching assistant, you might need to download students’ Python assignments from your email inbox, run the scripts according to a grading rubric, and record the grades in Canvas end‑to‑end.
  • As an online shop owner on Shopify/WooCommerce, you might need to identify problematic items in orders, then email the relevant customers with a customized Google Form link for follow‑up.
  • As a web developer, you might need to deploy a service to a Kubernetes cluster, test it in a browser, and archive the test report in your remote GitHub repository.
While none of these tasks are overly complex for a professional, they are:
  1. Tedious and detail‑oriented,
  2. Hard for non‑experts to pick up quickly, and
  3. Ubiquitous in everyday workflows across many domains.
On the other hand, we already have the core ingredients to let agents handle such workflows: powerful models (Claude‑4.5‑Sonnet, GPT‑5, etc.) and the ability to connect them to real‑world applications (via MCP, REST APIs, GUI control, etc.). This raises some natural questions:
  • In complex, realistic, and common multi‑step tasks, how do different models actually perform?
  • How big are the performance gaps between them?
  • Can they really take a loosely defined task and deliver exactly the outcome we expect?
Toolathlon was built to answer these questions. We drew on widely used, genuinely useful real-world scenarios from various applications, avoiding both tasks that are trivially easy for humans and purely artificial “gotcha” challenges, and carefully created 108 tasks. Each task comes with fully verifiable scripts to check correctness, and most start from realistic initial states rather than empty ones such as blank documents or databases with no records.

Through these tasks, we aim to provide fresh insight into the agentic capabilities of today’s LLMs and make Toolathlon a compass for driving agents to better meet real-world needs. You can learn more on our website.

Model Performance

[Figure: model scores comparison]

The figure above reports each model’s average score over three runs, with error bars. The agent scaffold used is the openai-agent-sdk framework officially adopted by Toolathlon.

The Benchmark

Each task in Toolathlon has the following structure:
task/
├── preprocess/               # Code for setting up the initial working state (optional)
├── docs/                     # Task instructions
│   ├── task.md
│   └── agent_system_prompt.md
├── initial_workspace/        # Files related to the local initial working state (optional)
├── groundtruth_workspace/    # Files related to the standard answer (optional)
├── evaluation/               # Code for testing the correctness of the task
│   ├── main.py               # Main evaluation script
│   └── ...
├── ...                       # Other required resources (optional)
└── task_config.json          # Sets up the tools and MCP servers required for the task (optional)
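For concreteness, a hypothetical evaluation/main.py might compare the agent’s final workspace against groundtruth_workspace file by file. This is only a sketch under simplifying assumptions (a byte-for-byte comparison of local files); real Toolathlon checks may instead query live applications such as Notion or Canvas, and the script name and arguments below are invented for illustration.

# Hypothetical sketch of an evaluation script, not the benchmark's actual code.
import sys
from pathlib import Path

def workspaces_match(result_dir: str, groundtruth_dir: str) -> bool:
    """Return True if every groundtruth file exists in the result workspace with identical bytes."""
    groundtruth = Path(groundtruth_dir)
    result = Path(result_dir)
    for expected in groundtruth.rglob("*"):
        if expected.is_dir():
            continue
        produced = result / expected.relative_to(groundtruth)
        if not produced.exists() or produced.read_bytes() != expected.read_bytes():
            return False
    return True

if __name__ == "__main__":
    ok = workspaces_match(sys.argv[1], sys.argv[2])
    print("PASS" if ok else "FAIL")
    sys.exit(0 if ok else 1)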
As the layout shows, each task ships with a strictly executable verification script. To make these tasks solvable by models, we adapted the openai-agent-sdk agent framework and added multiple MCP servers and common tools, allowing models to operate these applications and complete the tasks within the framework. The task_config.json file must contain at least the following two properties (a hypothetical example follows the list):
  • needed_mcp_servers specifies the MCP servers that the agent may use.
  • needed_local_tools specifies the custom toolkits that the agent may use.
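For illustration, a minimal task_config.json could look like the snippet below. The two property names come from the benchmark; the specific server and tool names, and the exact value format, are made up for this sketch and may differ in the actual benchmark.

{
  "needed_mcp_servers": ["notion", "google_forms"],
  "needed_local_tools": ["local_email", "excel_toolkit"]
}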

Task Examples

NVIDIA Market

Analyze NVIDIA’s institutional ownership trends across 8 quarters, adjust for the stock split, and populate results_template.xlsx with common holdings only.

Experiment Recordings

Update the Notion table with best scores and steps per benchmark from W&B runs, combining same-named runs and averaging available metrics.

Canvas Homework Grader Python

Grade Homework2 by downloading the latest Python submissions from the email inbox, running them to check for errors, and assigning 10 (pass) or 0 (fail) in Canvas based on correctness.

Toolathlon Framework

To make our evaluation more realistic and easy to use, we designed the Toolathlon framework around the following features (a small sketch of the error-handling and output-truncation behavior appears after the list):
  • Multiple Language Model Support: Compare OpenAI, Anthropic, Google, and open source models through a standardized API.
  • Autonomous Agent Design: All functionality is provided as “tools,” and models can manage their execution autonomously through prompts.
  • Error Handling: Tool errors do not terminate the task, but instead return an error message, allowing the agent to retry or adjust its strategy.
  • Overlong Output Management: Automatically truncates excessively long tool responses and provides paging/search tools to access the full content.
  • Context and History Management: Provides tools for querying, deleting, and retrieving history, supporting long tasks that exceed the model context limit.
  • Containerized Isolation: Each task can run in a separate Docker/Podman container, ensuring that tasks do not interfere with each other.
  • Parallel Execution: Multiple processes run tasks in batches, scaling linearly with additional computing resources for stable, efficient large-scale evaluation.
  • State Preservation and Verification: Saves the complete workspace after a task finishes and runs the evaluation script to compare the results against expectations.
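To make the error-handling and overlong-output points concrete, here is a minimal Python sketch of a tool-call wrapper. It is our illustration of the idea rather than Toolathlon’s actual implementation; the function name, message format, and truncation threshold are all invented.

# Hypothetical sketch: tool errors become messages instead of ending the run,
# and overlong tool output is truncated with a pointer to the paging/search tools.
MAX_TOOL_OUTPUT_CHARS = 20_000  # assumed truncation threshold

def call_tool_safely(tool_fn, *args, **kwargs):
    """Run a tool and convert exceptions into an error string the agent can read."""
    try:
        result = str(tool_fn(*args, **kwargs))
    except Exception as exc:  # a failing tool never terminates the task
        return f"[tool_error] {type(exc).__name__}: {exc}"
    if len(result) > MAX_TOOL_OUTPUT_CHARS:
        # Keep the head of the output and point the agent at the paging/search tools.
        omitted = len(result) - MAX_TOOL_OUTPUT_CHARS
        return (result[:MAX_TOOL_OUTPUT_CHARS]
                + f"\n[truncated {omitted} chars; use the paging/search tools for the rest]")
    return result

Wrapping every tool call this way keeps a single failing tool or an oversized response from derailing a long run, which matters for multi-step tasks that touch many applications.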