MCP Servers

Local Tools

manage_context
handle_overlong_tool_outputs
Instruction
I’ve been learning about large-scale language models recently, and I’ve decided to train a decoder-only language model on my own. The first step is preparing pre-training data, so I need your help organizing the pre-training data for llama and gpt-neo into the ptdata sheet of the LLM Pre-training Data spreadsheet. The pre-training data should be sorted in descending order by data size, with columns named name, use in llm, size, and link, in that order. The use in llm value must be either llama or gpt-neo; if a dataset is used by both models, either value is acceptable. The size column holds the numeric value in GB, but the cell should not include “GB” itself. A Hugging Face link is preferred; if Hugging Face doesn’t host a dataset, we’ll fall back to another link.
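
For concreteness, here is a minimal sketch of how the finished ptdata sheet could be produced, assuming the spreadsheet is an .xlsx file handled with pandas and openpyxl; the file name, example rows, and links are placeholders, not real dataset entries.

```python
import pandas as pd

# Placeholder rows illustrating the required schema; these are NOT real datasets.
rows = [
    {"name": "example-dataset-a", "use in llm": "llama",
     "size": 120.0, "link": "https://huggingface.co/datasets/example-a"},
    {"name": "example-dataset-b", "use in llm": "gpt-neo",
     "size": 45.5, "link": "https://huggingface.co/datasets/example-b"},
]

# Column order matters: name, use in llm, size, link.
df = pd.DataFrame(rows, columns=["name", "use in llm", "size", "link"])

# Sort descending by numeric size (GB, with no "GB" suffix in the cells).
df = df.sort_values("size", ascending=False)

# Write to the "ptdata" sheet; the file name here is a placeholder.
with pd.ExcelWriter("LLM Pre-training Data.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="ptdata", index=False)
```

Sorting the frame before writing guarantees the descending-by-size order the sheet requires, and keeping size as a plain float satisfies the no-“GB”-in-cell rule.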