Hugging Face Model Upload

Toolathlon-Verified · Website task ID 19 · Canonical task huggingface-upload · Task source: d57361c0

Required Tools

MCP Servers

filesystem

terminal

huggingface

Local Tools

claim_done

python_execute

handle_overlong_tool_outputs

manage_context

history

Instruction

Please scan the workspace folder, pick the model checkpoint with the highest eval_accuracy, then push the best model’s folder to Hugging Face Hub as a model repo named MyAwesomeModel-TestRepo. Finalize the repo’s README.md with the detailed evaluation results for all 15 benchmarks (keep three decimal places). You must refer to the current README.md under workspace and ensure its completeness in the uploaded repo. Do not change any other content in the README.md besides the benchmark scores. You can use the hf_token.txt under the workspace if necessary.
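The README step asks for two things at once: every benchmark score must show exactly three decimal places, and nothing else in README.md may change. A minimal sketch, assuming (hypothetically) that the README reports scores in a Markdown table whose rows look like `| <benchmark name> | <score> |` and that the scores are already collected in a dict keyed by that name. The function `fill_scores`, the `ROW` pattern and the table layout are illustrative assumptions, not the task's actual README format.

```python
import re

# Assumed (hypothetical) row shape: "| <benchmark name> | <score> | ...".
# Groups: 1 = leading pipe, 2 = name, 3 = separator, 4 = old score, 5 = rest of row.
ROW = re.compile(r"^(\|\s*)([^|]+?)(\s*\|\s*)([^|]*?)(\s*\|.*)$")


def fill_scores(readme_text: str, scores: dict[str, float]) -> str:
    """Replace only the score cell of rows whose first cell is a key in `scores`.

    Every other line, including headers, prose and line endings, is copied unchanged.
    """
    out = []
    for line in readme_text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        m = ROW.match(body)
        if m and m.group(2) in scores:
            # Format with exactly three decimal places, e.g. 0.5 -> "0.500".
            body = f"{m.group(1)}{m.group(2)}{m.group(3)}{scores[m.group(2)]:.3f}{m.group(5)}"
        out.append(body + ending)
    return "".join(out)


if __name__ == "__main__":
    readme = "| Benchmark | Score |\n|---|---|\n| math_reasoning | {RESULT} |\n"
    print(fill_scores(readme, {"math_reasoning": 0.5}))
```

Rewriting line by line, rather than regenerating the whole file, is what keeps the rest of the README byte-for-byte identical, which the instruction explicitly requires.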

Initial State

Local Workspace

workspace/
├── checkpoints/
│   ├── step_100/
│   │   ├── config.json
│   │   └── pytorch_model.bin
│   ├── step_1000/
│   │   ├── config.json
│   │   └── pytorch_model.bin
│   ├── step_200/
│   │   ├── config.json
│   │   └── pytorch_model.bin
│   ├── step_300/
│   │   ├── config.json
│   │   └── pytorch_model.bin
│   ├── step_400/
│   │   ├── config.json
│   │   └── pytorch_model.bin
│   ├── step_500/
│   │   ├── config.json
│   │   └── pytorch_model.bin
│   ├── step_600/
│   │   ├── config.json
│   │   └── pytorch_model.bin
│   ├── step_700/
│   │   ├── config.json
│   │   └── pytorch_model.bin
│   ├── step_800/
│   │   ├── config.json
│   │   └── pytorch_model.bin
│   └── step_900/
│       ├── config.json
│       └── pytorch_model.bin
├── evaluation/
│   ├── benchmarks/
│   │   ├── code_generation/
│   │   │   └── eval.py
│   │   ├── common_sense/
│   │   │   └── eval.py
│   │   ├── creative_writing/
│   │   │   └── eval.py
│   │   ├── dialogue_generation/
│   │   │   └── eval.py
│   │   ├── instruction_following/
│   │   │   └── eval.py
│   │   ├── knowledge_retrieval/
│   │   │   └── eval.py
│   │   ├── logical_reasoning/
│   │   │   └── eval.py
│   │   ├── math_reasoning/
│   │   │   └── eval.py
│   │   ├── question_answering/
│   │   │   └── eval.py
│   │   ├── reading_comprehension/
│   │   │   └── eval.py
│   │   ├── safety_evaluation/
│   │   │   └── eval.py
│   │   ├── sentiment_analysis/
│   │   │   └── eval.py
│   │   ├── summarization/
│   │   │   └── eval.py
│   │   ├── text_classification/
│   │   │   └── eval.py
│   │   └── translation/
│   │       └── eval.py
│   ├── build/
│   │   ├── lib.linux-x86_64-cpython-313/
│   │   │   └── utils/
│   │   │       ├── __init__.cpython-313-x86_64-linux-gnu.so
│   │   │       └── benchmark_utils.cpython-313-x86_64-linux-gnu.so
│   │   └── temp.linux-x86_64-cpython-313/
│   │       └── utils/
│   │           ├── __init__.o
│   │           └── benchmark_utils.o
│   ├── utils/
│   │   ├── __init__.c
│   │   ├── __init__.cpython-313-x86_64-linux-gnu.so
│   │   ├── benchmark_utils.c
│   │   └── benchmark_utils.cpython-313-x86_64-linux-gnu.so
│   ├── .setup.py.swp
│   ├── eval.py
│   └── setup.py
├── figures/
│   ├── fig1.png
│   ├── fig2.png
│   └── fig3.png
└── README.md
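The checkpoint directories encode the training step in their names, and a plain lexicographic listing places step_1000 directly after step_100, so checkpoints should be ordered by the parsed step number. The workspace listing does not show where eval_accuracy is stored, so the minimal sketch below takes the scoring function as a parameter. The names `list_checkpoints`, `pick_best_checkpoint` and `accuracy_from_config` are hypothetical, and `accuracy_from_config` only illustrates one possible source: it assumes, without confirmation from the task, that each config.json carries an `eval_accuracy` field. A scorer that runs evaluation/eval.py could be passed in instead.

```python
import json
import re
from pathlib import Path
from typing import Callable

STEP_DIR = re.compile(r"^step_(\d+)$")


def list_checkpoints(root: Path) -> list[Path]:
    """Return step_<N> directories under root, ordered by N numerically."""
    found = []
    for child in root.iterdir():
        m = STEP_DIR.match(child.name)
        if m and child.is_dir():
            found.append((int(m.group(1)), child))
    return [path for _, path in sorted(found)]


def accuracy_from_config(ckpt: Path) -> float:
    # Hypothetical scorer: assumes config.json contains an "eval_accuracy" key.
    with open(ckpt / "config.json", encoding="utf-8") as fh:
        return float(json.load(fh)["eval_accuracy"])


def pick_best_checkpoint(
    root: Path, score_fn: Callable[[Path], float]
) -> tuple[Path, float]:
    """Score every checkpoint and return (path, score) for the highest score.

    On ties, max() keeps the first maximum it sees, i.e. the earliest step.
    """
    checkpoints = list_checkpoints(root)
    if not checkpoints:
        raise FileNotFoundError(f"no step_* directories under {root}")
    scored = [(ckpt, score_fn(ckpt)) for ckpt in checkpoints]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    best, score = pick_best_checkpoint(Path("workspace/checkpoints"), accuracy_from_config)
    print(best, score)
```

Keeping the scorer separate from the directory scan means the selection logic stays the same whether the accuracy comes from a file or from re-running the benchmarks.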

Runtime Setup

This task initializes application or service state during preprocessing. Review the pinned setup source.

Legacy Trajectories

These replays were produced on the original Toolathlon release. They are retained for historical inspection and are not Toolathlon-Verified results.

✅ Claude-4.5-Sonnet
❌ Deepseek-v3.2-Exp
