Quick Start
Installation Dependencies
Make sure you have uv installed; otherwise, please install it first. Then, to set up the environment, just run:
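The command block appears to have been stripped here. A minimal sketch, assuming uv's official standalone installer and that `global_preparation/install_env.sh` (a script referenced later in this guide) sets up the environment:

```shell
# Install uv via its official standalone installer (skip if already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Set up the project environment; this script is referenced later in this guide
bash global_preparation/install_env.sh
```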
Configure Global Configs (Part 1: LLM APIs)
Please copy `configs/global_configs_example.py` to a new file `configs/global_configs.py`:
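The copy command itself was stripped; it is presumably just:

```shell
cp configs/global_configs_example.py configs/global_configs.py
```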
We use `configs/global_configs.py` to manage all LLM APIs. Open this file and fill in the API keys. Note that you do not need to fill in all of them; just fill in the keys for the providers you want to use. We recommend using OpenRouter, as it lets you access various LLMs by configuring only one API key.
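For illustration, the file might contain entries like the following (the field names here are hypothetical; use the actual names from `configs/global_configs_example.py`):

```python
# configs/global_configs.py (hypothetical field names; follow the example file)
openrouter_api_key = "sk-or-your-key-here"  # recommended: one key covers many models
openai_api_key = ""     # leave empty if you don't use this provider
anthropic_api_key = ""  # leave empty if you don't use this provider
```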
You can find details about the model providers in `utils/api_model/model_provider.py`.
Quick Example
After the above two steps, you can directly run this very quick example. We use claude-4.5-haiku-1001 via OpenRouter in this example, so make sure you have configured it. The resulting trajectory will be dumped under `dumps_quick_start/claude-4.5-haiku-1001/finalpool/SingleUserTurn-find-alita-paper`.
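The run command was stripped here; given the quick-start script named later in this guide, it was presumably along the lines of:

```shell
bash scripts/quick_start/quick_start_run.sh
```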
Full Preparation
Configure Global Configs (Part 2: Containerization)
Make sure you have Docker or Podman installed and correctly configured, then fill in your choice in `global_configs.py`.
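For illustration, the setting might look like this (the field name is hypothetical; check `global_configs.py` for the real one):

```python
# configs/global_configs.py (hypothetical field name)
# Choose your container runtime: "docker" or "podman"
container_engine = "docker"
```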
Configure App-Aware Tokens, Keys, and Credentials
Please copy `configs/token_key_session_example.py` to a new file `configs/token_key_session.py`:
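As above, the stripped copy command is presumably:

```shell
cp configs/token_key_session_example.py configs/token_key_session.py
```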
Please read `global_preparation/how2register_accounts.md` and follow the guides. You need to register some accounts and configure some tokens/API keys/secrets in `configs/token_key_session.py`.
Misc Configuration
Simply run the following:

Deployment Needed Apps
You need to run `deployment/*/scripts/setup.sh` for each local application we deployed.
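A sketch of running all setup scripts in one go (assuming each script can run non-interactively from the repo root):

```shell
# Run every deployed app's setup script; stop on the first failure
for setup in deployment/*/scripts/setup.sh; do
  bash "$setup" || exit 1
done
```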
MCP Servers Verification
After you have set up all the above configs and deployed the app containers, you can simply run this script to check whether all MCP servers are working properly:

Run Single Task
We use the same script `scripts/quick_start/quick_start_run.sh` to run any task; simply edit the task variable in this script:
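For example, to run the quick-start task from above, the variable inside the script would be set like this (the variable name `task` comes from this guide; the exact syntax in the script may differ):

```shell
# In scripts/quick_start/quick_start_run.sh:
task=SingleUserTurn-find-alita-paper
```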
Evaluation in Parallel with Task Isolation
To ensure that the execution of different tasks does not interfere with each other, we use containerization to run each task in an isolated environment. This also makes it possible to run tasks in parallel, greatly accelerating evaluation. The task image is `docker.io/lockon0927/toolathlon-task-image:1016beta`, which will be pulled automatically by `global_preparation/install_env.sh`, so you do not need to pull it manually.
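The launch command was stripped here; a hypothetical sketch (the script name and flags are assumptions, not the repo's actual interface):

```shell
# Hypothetical: run the full task pool with up to 10 parallel workers
bash scripts/run_all_tasks.sh --max-workers 10 --dump-path ./dumps
```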
This will run all the tasks in parallel with at most 10 workers, and you will find all output trajectories and the evaluation summary (`eval_stats.json`) under `./{your_dump_path}`.
If you’d like to evaluate multiple models in sequence, we provide an ensemble script for you: