Skip to main content
In one Toolathlon task, an agent was asked to build a personal website in Notion. The agent-facing instructions listed four required sections. The evaluator, however, checked for five—including an Exhibitions section that the task itself never requested. An agent could follow the written task faithfully and still fail.That is not an agent-capability failure. It is a measurement failure.Today, we are releasing Toolathlon-Verified, a major repair and validation release of Toolathlon. It preserves the original 108-task scope while revising the task definitions, initial states, ground truth, evaluators, and execution infrastructure that turn those tasks into a trustworthy benchmark.
A benchmark failure should point to agent capability—not ambiguous instructions, stale data, brittle formatting checks, leaked state, or an external service that has not propagated yet.

The scale of the update

Toolathlon-Verified is not a new, smaller subset. Both versions contain the same 108 tasks, with no task added, removed, or renamed. Instead, artifacts changed across more than three quarters of the benchmark.
AreaVerified change scope
Task packages with net changes83 / 108
Tasks with changes under evaluation/76
Tasks with changes under preprocess/28
Tasks with changes under groundtruth_workspace/19
Tasks with changes under initial_workspace/14
Tasks with net changes to agent-facing task instructions14
These categories overlap. They describe where the final benchmark snapshot differs, not separate groups of tasks.The development-and-review range from 500e3d86 to the Verified snapshot d57361c0 contains 339 commits: 309 non-merge commits and 30 merge commits. We treat that number as evidence of the engineering and review effort—not as a claim that there were “339 bugs.” The history also includes infrastructure work, model and agent support, documentation, experiments, reviews, and revisions to earlier candidate fixes.

A task is a contract, not just a prompt

A long-horizon task is defined by more than its instruction text. The agent must receive a coherent contract across five layers:
  1. what the task asks for;
  2. what state the environment starts in;
  3. what tools can actually express;
  4. what the ground truth records; and
  5. what the evaluator accepts.
Toolathlon-Verified aligns these layers. The changes were often small in code but substantial in meaning:
  • Cross-region pricing: the iPad education-price task compared prices in Mainland China, the United States, Hong Kong, and Singapore without defining a common currency. The revised task explicitly converts each total to HKD and defines the expected numeric output.
  • Invoice completeness: the travel-reimbursement task now defines a complete invoice as one containing an invoice number, tax amount, and description. A missing field or N/A is treated as incomplete, and preprocessing preserves missing tax values as N/A rather than rendering them as zero.
  • Legal revision detection: the revised-terms task now defines a substantive legal change, excludes mere renumbering or relocation, and aligns the prompt, ground truth, and evaluator around the same evidence requirements.
  • Time-sensitive requirements: tasks such as language-school research no longer hard-code a quickly expiring application year. They specify the applicant context and how to use current-year information with a controlled fallback.
These are not prompt copyedits. They convert ambiguous, contradictory, or time-sensitive requests into executable contracts that agents can understand and evaluators can audit.

Correcting errors in both directions

Benchmark errors can reject valid work, but they can also reward incomplete or incorrect work. Toolathlon-Verified addresses both directions.

When correct work was rejected

In live-transactions, the evaluator called a logging helper with positional arguments. A vestigial parameter shifted the launch time into the log-name parameter, so the check queried the wrong logging target and could reject an otherwise correct run. The fix switched to explicit keyword arguments and added a bounded retry for logging visibility.Other false negatives came from incidental representation details: a trailing newline in a LaTeX file, equivalent timestamps represented in UTC rather than local time, harmless ordering differences, schema aliases, or a remote service returning a newly written object a few seconds late.

When incorrect work could pass

In language-school, a comparison helper returned (is_different, error_message), but the caller tested the tuple directly. Non-empty tuples are truthy in Python, so scalar mismatches could silently evade the failure path. The same task also contained an incorrect CMU IELTS ground-truth value, which was corrected from 7.5 to 7.In apply-phd-email, preprocessing cleared the sender mailbox but not the receiver mailbox read by the evaluator. A correct ZIP attachment left over from an earlier run could therefore make a later, incorrect submission pass. Toolathlon-Verified clears the state that grading actually observes.The goal was not to make graders uniformly more permissive. Candidate changes were accepted, rejected, or partially adopted according to the task contract. Proposals that would have widened the accepted answer space beyond the intended task were rejected or tightened.

Reproducible state and trusted grading

Stateful applications make agent evaluation more realistic, but they also create more ways for runs to interfere with one another. Toolathlon-Verified strengthens isolation at both task and grading time:
  • five previously unseeded preprocessing scripts now use fixed random seeds;
  • Kubernetes clusters are namespaced per instance, and Canvas cleanup is limited to task-owned courses;
  • affected tasks clear the historical email, application, or service state that could interfere with the current run;
  • evaluator, preprocessing, and ground-truth artifacts are withheld from the agent; and
  • scoring restores those artifacts from private, hash-checked copies and removes agent-created name collisions first.
The default phased containerized and decoupled runners both apply this protection. The evaluator uses trusted configuration and host-observed execution state rather than trusting fields that an agent could write into its own trajectory.
  1. Task contract
  2. Deterministic setup
  3. Isolated Agent run
  4. Trusted grading
  5. Failure analysis
  6. Rerun and publish

External-service failures are not agent failures

Toolathlon tasks use real services, so a successful write and an immediately visible read are not always the same event. Toolathlon-Verified separates two failure layers:
  • Transport reliability: bounded timeouts and retries handle transient connection failures, rate limits, and server errors for services such as Canvas, Notion, WooCommerce, Poste, and Google APIs. Deterministic client errors still fail immediately.
  • Evaluation visibility: 43 task evaluators now use shared or equivalent bounded whole-check retries for cases where a service accepted a write but has not exposed it to the grader yet.
The release also adds readiness probes for Canvas, Poste, WooCommerce, and Kubernetes, propagates preprocessing failures instead of silently continuing, preloads Kubernetes task images, and namespaces Kubernetes clusters per instance.Several failures required service-specific fixes. The PDF tool could deadlock after repeated searches because a non-reentrant lock was acquired recursively. The arXiv wrapper could retain a sticky failure state after a transient error. Canvas initialization could appear ready before Rails and the database were actually ready. Notion’s rotating OAuth refresh token could be lost or raced across short-lived containers. Each issue could look like weak agent behavior from the outside; none measured agent capability.

A more comparable execution interface

We also added a decoupled execution mode. Preprocessing, MCP services, state, and evaluation remain controlled by the task container, while an agent scaffold can run on the host through a single MCP gateway. This makes it possible to compare different scaffolds without rebuilding the task environment around each one.At the model boundary, Toolathlon normalizes model-visible tool names and handles provider differences in reasoning and tool-call payloads. The intent is not to erase meaningful scaffold differences, but to prevent avoidable protocol quirks from becoming benchmark failures.

How we verified the fixes

Toolathlon-Verified was not produced by applying a blanket tolerance rule. Candidate fixes were reviewed against the full task contract—prompt, initial state, available tools, ground truth, and evaluator. Review records include accepted, rejected, and partially accepted changes. In one reconciliation pass, 14 disputed task changes were resolved individually: nine candidate changes were adopted, while five existing implementations were retained or further strengthened.Key failures were traced through real agent submissions, reproduced on the evaluation service, and rerun after repair. Targeted regression tests were added where appropriate across task evaluators, preprocessing, runners, tool adapters, and grading isolation.Finally, we evaluated each of the six reported models across three benchmark runs. The released trajectories expose model messages and tool calls for inspection, while the leaderboard reports final grading outcomes. Together they provide auditable evidence for the benchmark; we do not claim that every published trajectory was manually reviewed line by line.

A new Verified baseline

The chart below reports mean Pass@1 across the three runs for each model. Error bars show the across-run standard deviation—not a confidence interval. All six rows use the Default agent configuration and are dated June 30, 2026.
Toolathlon-Verified Pass@1 ranking: Claude Opus 4.8 (max) 76.2 plus or minus 3.4, GPT-5.5 (xhigh) 73.5 plus or minus 1.2, Gemini 3.5 Flash (high) 67.3 plus or minus 1.2, GLM 5.2 (max) 59.9 plus or minus 1.9, kimi-k2.7-code 58.0 plus or minus 4.3, and Deepseek-v4-pro (max) 55.9 plus or minus 1.2. Error bars show standard deviation across three runs.
The live leaderboard also reports Pass@3, Pass³, average turns, model type, agent configuration, and evaluation date. A green check marks rows independently evaluated by the Toolathlon team.
Toolathlon-Verified begins a separate official score series; its results are not directly comparable with the original release. Prompts, expected outputs, evaluators, initial states, and the execution stack changed. Differences from earlier scores should not be interpreted as like-for-like model gains or regressions.

Limitations and what comes next

“Verified” does not mean frozen forever. Toolathlon deliberately evaluates agents against real applications, APIs, data sources, and authentication systems. Those systems will continue to change. New models may also reveal valid solution paths that existing evaluators do not yet capture.The practical lesson is that an agent benchmark must be maintained as a measurement system, not published once as a static collection of tasks. We will continue to investigate reproducible failures, distinguish benchmark defects from agent errors, and update the Verified series when the evidence warrants it.

Acknowledgements

Toolathlon-Verified is primarily maintained and developed by @jxhe and @lockon-n. We thank @bugmaker00 for the valuable feedback provided throughout the verification and repair process.

Resources