MCP Servers
fetch
scholarly
google_sheet
playwright_with_chunk
Local Tools
history
web_search
claim_done
python_execute
manage_context
handle_overlong_tool_outputs

Instruction

I’ve been learning about large-scale language models recently, and I’ve decided to train a decoder-only language model on my own. The first step is to prepare pre-training data, so I need your help organizing the pre-training data for llama and gpt-neo into the ptdata sheet of the LLM Pre-training Data spreadsheet. The data should be sorted in descending order by data size, with columns named name, use in llm (the value must be either llama or gpt-neo; if a dataset is used by both models, either value is acceptable), size (the numeric value in GB, without the “GB” unit in the cell), and link, in that order. A Hugging Face link is preferred; if Hugging Face doesn’t have the dataset, we’ll try to provide another link.
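
As a rough sketch of how the rows could be assembled and sorted before being written to the ptdata sheet: the snippet below assumes pandas is available and uses hypothetical placeholder datasets and links (not the actual pre-training data); the real values would come from the research step and be written via the google_sheet tool.

```python
import pandas as pd

# Hypothetical placeholder rows illustrating the required schema:
# name, use in llm, size (numeric GB, no unit), link.
rows = [
    {"name": "example-dataset-a", "use in llm": "llama",
     "size": 800, "link": "https://huggingface.co/datasets/example-a"},
    {"name": "example-dataset-b", "use in llm": "gpt-neo",
     "size": 120, "link": "https://huggingface.co/datasets/example-b"},
]

df = pd.DataFrame(rows, columns=["name", "use in llm", "size", "link"])
df = df.sort_values("size", ascending=False)  # descending by data size

# Header row followed by data rows, in the shape a sheet-update call expects.
values = [df.columns.tolist()] + df.values.tolist()
print(values)
```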