MCP Servers
fetch
scholarly
google_sheet
playwright_with_chunk
Local Tools
history
web_search
claim_done
python_execute
manage_context
handle_overlong_tool_outputs

Instruction

I’ve been learning about large-scale language models recently, and I’ve decided to train a decoder-only language model on my own. The first step is to prepare pre-training data, so I need your help organizing the pre-training data for llama and gpt-neo into the ptdata sheet of the LLM Pre-training Data spreadsheet. The data should be sorted in descending order by data size, with columns named name, use in llm (the value must be either llama or gpt-neo; if a dataset is used by both models, either value is acceptable), size (the numeric value in GB, without the “GB” unit in the cell), and link, in that order. A Hugging Face link is preferred; if Hugging Face doesn’t have the dataset, we’ll try to provide another link.
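
As a rough sketch of how the rows could be assembled and sorted before being written to the ptdata sheet: the snippet below assumes pandas is available and uses hypothetical placeholder datasets and links (not the actual pre-training data); the real values would come from the research step and be written via the google_sheet tool.

```python
import pandas as pd

# Hypothetical placeholder rows illustrating the required schema:
# name, use in llm, size (numeric GB, no unit), link.
rows = [
    {"name": "example-dataset-a", "use in llm": "llama",
     "size": 800, "link": "https://huggingface.co/datasets/example-a"},
    {"name": "example-dataset-b", "use in llm": "gpt-neo",
     "size": 120, "link": "https://huggingface.co/datasets/example-b"},
]

df = pd.DataFrame(rows, columns=["name", "use in llm", "size", "link"])
df = df.sort_values("size", ascending=False)  # descending by data size

# Header row followed by data rows, in the shape a sheet-update call expects.
values = [df.columns.tolist()] + df.values.tolist()
print(values)
```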