Quick Start
There are four ways to run the Toolathlon evaluation:
- Use our public evaluation service: see EVAL_SERVICE_README.md for more details.
- Set up your own Toolathlon evaluation service on your own machine, as detailed below.
- If you are a heavy user who will run the Toolathlon evaluation a lot, you can also contact us ([email protected] / [email protected]); we may be able to provide a dedicated evaluation service for you, free of charge.
- If you have an API endpoint and just want to test your model, you can contact us ([email protected] / [email protected]) and we are happy to run the Toolathlon evaluation for you against that endpoint.
Using Our Public Evaluation Service
We provide Toolathlon evaluation as a service on public servers where we have already set up all the required MCP accounts, so you do not need to worry about that setup. You do not even need to install any MCP-related dependencies; evaluation can be run simply by communicating with our public server (see EVAL_SERVICE_README.md for details).

Besides using the evaluation service, the rest of this guide explains how to set up the Toolathlon evaluation on your own machine.
Installation Dependencies
uv
Make sure you have uv installed; otherwise, please install it first (see the uv documentation). Just run:
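The commands below are a sketch: the one-liner is uv's standard installer, and `uv sync` assumes the project's Python dependencies are managed by uv, so adjust to your setup if needed.

```bash
# Install uv via its official installer if you do not have it yet.
curl -LsSf https://astral.sh/uv/install.sh | sh

# Assumption: the project's Python dependencies are managed by uv, so this resolves and
# installs them from the repository's lockfile.
uv sync
```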
Docker/Podman
For each task, we set up a separate container in which it is executed. We assume you have docker or podman installed and correctly configured; please specify which of the two you use in configs/global_configs.py.
Then, pull our prepared image:
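The image name below is only a placeholder; substitute the image the repository points to, and swap in podman if that is what you configured in configs/global_configs.py.

```bash
# Placeholder image name: replace <toolathlon-image> with the image referenced by the repository.
docker pull <toolathlon-image>

# If you configured podman instead of docker in configs/global_configs.py:
podman pull <toolathlon-image>
```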
Configure Global Configs
Simply set these two environment variables. Note that TOOLATHLON_OPENAI_BASE_URL must be an OpenAI-SDK-compatible endpoint, as our agent scaffold relies on this:
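For example (the values below are placeholders; the OpenRouter URL is just one OpenAI-SDK-compatible choice):

```bash
# Replace with your own key and any OpenAI-SDK-compatible base URL.
export TOOLATHLON_OPENAI_API_KEY="sk-..."
export TOOLATHLON_OPENAI_BASE_URL="https://openrouter.ai/api/v1"
```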
You can also use any model deployed on your own machine, e.g., served via vLLM or SGLang; in that case you do not need to set the API key.
(Optional) We also provide some pre-configured options in configs/global_configs.py to manage all LLM APIs. You may open this file, fill in the API keys, and specify later which provider you want to use.
You can find details about model providers in utils/api_model/model_provider.py.
Quick Example
After the above two steps, we provide a very quick example here. We use claude-sonnet-4-5 via OpenRouter in this example, so make sure you have configured TOOLATHLON_OPENAI_API_KEY and TOOLATHLON_OPENAI_BASE_URL accordingly if you want to run the example script without any modification. After it finishes, you can find the results under dumps_quick_start/anthropic_claude-sonnet-4.5/finalpool/find-alita-paper.
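As a sketch only: it is assumed here that the quick example goes through the single-task runner introduced later in this guide, so the script invocation below is illustrative rather than the repository's documented usage.

```bash
# Illustrative only: the actual quick-start command is provided by the repository.
bash scripts/run_single_containerized.sh find-alita-paper

# After the run completes, inspect the dumped trajectory.
ls dumps_quick_start/anthropic_claude-sonnet-4.5/finalpool/find-alita-paper
```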
Full Preparation
Choose a Proper Machine
To run our benchmark, we strongly suggest deploying it on a Linux machine with docker installed and direct Internet access. Although you can indeed run our benchmark without sudo, some configurations still require it (you may ask an administrator to help you with this), such as configuring podman and inotify parameters (see the "# k8s" part in global_preparation/install_env_minimal.sh).
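For reference, raising the inotify limits typically looks like the sketch below; the values here are illustrative, and the authoritative settings are those in the "# k8s" section of global_preparation/install_env_minimal.sh.

```bash
# Illustrative inotify limits (common for k8s-style workloads); the actual values used by
# the benchmark are set in global_preparation/install_env_minimal.sh.
sudo sysctl -w fs.inotify.max_user_watches=1048576
sudo sysctl -w fs.inotify.max_user_instances=8192
```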
Configure App-Aware Tokens, Keys and Credentials
Please read carefully through how2register_accounts.md and follow the guides. You need to register some accounts and configure the corresponding tokens/API keys/secrets in configs/token_key_session.py.
Misc Configuration
Simply run the following:
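As a sketch only, assuming the miscellaneous setup is covered by the minimal install script referenced in the machine-selection step above; check the repository for the authoritative command.

```bash
# Assumption: miscellaneous environment configuration is handled by the minimal install
# script mentioned earlier in this guide; verify against the repository before running.
bash global_preparation/install_env_minimal.sh
```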
Deploy Needed Apps

Deploy the needed local applications; this step runs deployment/*/scripts/setup.sh for each local application we deploy.
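A sketch of the deployment command, assuming this step uses the same global_preparation/deploy_containers.sh script that the parallel-evaluation section below asks you to re-run; its [true|false] argument is taken from the invocation shown there.

```bash
# Assumption: deployment is driven by the script that is re-run before parallel evaluation.
# Pass true or false as the first argument; consult the script itself for what the flag controls.
bash global_preparation/deploy_containers.sh true
```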
MCP Servers Verification
Once you have finished all the previous steps (all configs set and the app containers deployed), you can run the following script to check whether all MCP servers are working properly:
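The script name below is hypothetical; substitute the verification script the repository actually provides.

```bash
# Hypothetical name for the MCP-server check; replace it with the repository's actual script.
bash scripts/check_mcp_servers.sh
```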
Run Single Task

We use the same script, scripts/run_single_containerized.sh, to run any task; simply switch to another task in the input arguments:
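For instance, switching from the quick-example task to another one might look like this; the argument layout is an assumption, so check the script's usage for the exact arguments it expects.

```bash
# Illustrative: find-alita-paper is the task from the quick example; replace <another-task>
# with any other task name. The script's real argument list may include model and output
# options not shown here.
bash scripts/run_single_containerized.sh find-alita-paper
bash scripts/run_single_containerized.sh <another-task>
```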
See utils/api_model/model_provider.py for more details.
Evaluation in Parallel with Task Isolation
You can run the following to enable evaluation in parallel; we recommend re-deploying the app containers (i.e., run bash global_preparation/deploy_containers.sh [true|false] again) each time before you launch a formal parallel evaluation:
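The launcher name and flags below are hypothetical; substitute the parallel-evaluation script the repository provides.

```bash
# Hypothetical entry point: the repository's actual parallel launcher and its arguments
# may differ. {your_dump_path} is the output location described just below.
bash scripts/run_parallel.sh --max-workers 10 --dump-path {your_dump_path}
```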
This will run all the tasks in parallel with at most 10 workers, and you will find all output trajectories and the evaluation summary (eval_stats.json) in {your_dump_path}.
If you’d like to evaluate multiple models in sequence, we provide an ensemble script for you:
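The path below is a placeholder; substitute the ensemble script the repository ships.

```bash
# Placeholder path: use the repository's actual ensemble script, which evaluates the models
# you list one after another.
bash scripts/run_ensemble.sh
```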
Visualization
To facilitate viewing the reasoning trajectories of LLMs, we provide a replay tool in vis_traj that lets developers visualize any trajectory. After obtaining the results, you can simply run the following command:
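The entry point and flag below are hypothetical; check vis_traj in the repository for the actual command.

```bash
# Hypothetical invocation of the replay tool; the real entry point and arguments live in
# vis_traj. Point it at a result directory such as the quick-example dump above.
python vis_traj/replay.py --traj dumps_quick_start/anthropic_claude-sonnet-4.5/finalpool/find-alita-paper
```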
Supporting Multiple Agent Scaffolds
In addition to the scaffold we have implemented in Toolathlon based on the openai-agent-sdk, we are also committed to introducing more scaffolds for more comprehensive testing. Currently, we have a preliminary integration of OpenHands, which can be found in our openhands-compatibility branch. In the future, we hope to introduce more scaffolds, and we also welcome community contributions of Toolathlon implementations or testing results under other scaffolds.