| Toolkit | Description | Tools |
|---|---|---|
| sleep | Let the agent wait for a while when other servers raise errors. | 1 |
| claim_done | Declare that the task is complete. | 1 |
| handle_overlong_tool_outputs | Truncate overlong tool outputs to a preset threshold (100K characters) instead of placing the entire response into the context. | 4 |
| manage_context | Let the model check how much context it has used and truncate it. | 3 |
| history_search | Let the agent search the interaction history, for instance to recover important information lost after truncation. | 5 |
| python_execute | Execute a Python script. | 1 |
| web-search | Enable the agent to search the internet for content. | 1 |
MCP Server and Toolkits
Self-Implemented Toolkits
Toolkits we implemented to make the agent more robust and capable in our complex evaluation.
In addition to selecting from existing publicly available MCP servers, we also implemented several local servers for the model to call.
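
As an illustration of the truncation behavior described for `handle_overlong_tool_outputs`, the following is a minimal sketch in Python. The function name `truncate_tool_output`, the constant `DEFAULT_THRESHOLD`, and the wording of the truncation notice are hypothetical. The table states only that outputs are cut to a preset threshold of 100K characters, so this is not necessarily how the toolkit is implemented.

```python
# Hypothetical sketch: cap a tool's text output at a character threshold.
DEFAULT_THRESHOLD = 100_000  # 100K characters, the threshold given in the table


def truncate_tool_output(output: str, threshold: int = DEFAULT_THRESHOLD) -> str:
    """Return `output` unchanged if it fits, else its first `threshold` characters.

    A short notice is appended so the model can tell that content was omitted.
    The notice format is an assumption made for this example.
    """
    if len(output) <= threshold:
        return output
    omitted = len(output) - threshold
    notice = f"\n[... truncated {omitted} characters ...]"
    return output[:threshold] + notice
```

Appending a notice is a design choice in this sketch: it tells the agent that the response was cut, so it can fall back on a tool such as `history_search` if the omitted part turns out to matter.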