MCP Servers
wandb
notion
terminal
filesystem
Local Tools
history
claim_done
python_execute
manage_context
handle_overlong_tool_outputs

Instruction

In the table on the Notion page mcp_experiments_recordings, use the historical experiments of the W&B project mbzuai-llm/Guru to list the highest val-core acc mean@1/mean@k score for each benchmark named in the table headers, then calculate and fill in the Best Step for that run (format: step(average acc)). Instructions:
  • If multiple runs have the same name, treat them as one run for combined statistics.
  • The average score should only be calculated using the arithmetic mean of metrics available at that step; missing metrics are not included.
  • Only operate on the target page under the specified parent page; do not change column names or order.
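
The aggregation rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not the task's reference solution: the input shape `(run_name, step, metrics)`, the helper names `merge_runs` and `summarize`, the metric keys, and the four-decimal formatting are all hypothetical, and "Best Step" is assumed to mean the step with the highest average over the metrics logged at that step. Fetching the history from W&B and writing to Notion are left out.

```python
from collections import defaultdict
from statistics import mean


def merge_runs(records):
    """Combine records from same-named runs into one step -> metrics map.

    `records` is an iterable of (run_name, step, metrics) tuples, where
    `metrics` maps a benchmark key to a score or None (hypothetical shape).
    If two same-named runs log the same metric at the same step, the later
    record wins (an assumption; the task does not say how to resolve this).
    """
    merged = defaultdict(lambda: defaultdict(dict))
    for run_name, step, metrics in records:
        for key, value in metrics.items():
            if value is not None:  # missing metrics are not included
                merged[run_name][step][key] = value
    return merged


def summarize(steps):
    """Return (highest score per metric, 'step(average)') for one merged run."""
    best_per_metric = {}
    for metrics in steps.values():
        for key, value in metrics.items():
            best_per_metric[key] = max(value, best_per_metric.get(key, value))

    # Arithmetic mean over only the metrics available at each step.
    averages = {step: mean(m.values()) for step, m in steps.items() if m}
    # On ties, the step encountered first is kept.
    best_step = max(averages, key=averages.get)
    return best_per_metric, f"{best_step}({averages[best_step]:.4f})"
```

For example, calling `summarize(merge_runs(records)["some-run"])` would yield the per-benchmark maxima and the formatted Best Step for the hypothetical run `some-run`; the number of decimals should be adjusted to whatever the table template expects.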

Initial State

Local Workspace

workspace
└── table_template.txt

Model Trajectory