Teaching a Small Language Model a specific task requires more than raw and unstructured text. The model must learn the explicit relationship between a given command and the appropriate response. This requires structuring the raw data into an instruction dataset. Supervised fine-tuning relies on the model learning to predict the next token based on a specific prompt format, making the initial data organization a significant factor for the success of the training process.
An instruction dataset consists of organized examples that demonstrate how the model should behave. Rather than reading a continuous block of text, the model processes discrete pairs of prompts and completions.
Each example in an instruction dataset typically contains three main fields: an instruction that states the task, an optional input that supplies context, and an output that gives the desired response. You must map your raw data into these distinct categories.
When preparing your dataset, these components are usually stored as structured JSON objects. A widely adopted standard for this structure is the Alpaca format, where each example is formatted as a single dictionary within a larger JSON array.
[
  {
    "instruction": "Identify the primary programming language used in the provided code snippet.",
    "input": "def calculate_area(radius):\n return 3.14 * radius ** 2",
    "output": "The primary programming language used in the snippet is Python."
  }
]
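Before training, it is worth verifying that every record actually carries these fields. A minimal sketch using only the standard library; the `validate_alpaca` helper and the inline sample record are illustrative, not part of any particular training framework:

```python
import json

# A small Alpaca-format dataset, as it would appear in a JSON file
raw = """
[
  {
    "instruction": "Identify the primary programming language used in the provided code snippet.",
    "input": "def calculate_area(radius):\\n return 3.14 * radius ** 2",
    "output": "The primary programming language used in the snippet is Python."
  }
]
"""

REQUIRED_FIELDS = {"instruction", "input", "output"}

def validate_alpaca(records):
    """Raise if any record is missing one of the Alpaca fields."""
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
    return records

examples = validate_alpaca(json.loads(raw))
print(len(examples))  # number of usable training examples
```

Running this check over the full dataset up front catches malformed records before they silently corrupt a long fine-tuning run.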
If you are training a conversational agent rather than a single-turn instruction follower, your data should reflect a multi-turn dialogue. The ChatML format is an industry standard for structuring conversational datasets. Instead of generic instruction and input fields, the data is organized by roles.
The roles typically include a system prompt that defines the overall behavior of the model, a user prompt representing the human input, and an assistant prompt representing the model's reply.
[
  {
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "How do I print a string in Python?"},
      {"role": "assistant", "content": "You can print a string using the print() function."}
    ]
  }
]
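Role-based data benefits from a structural check as well. The sketch below encodes one common convention, an optional leading system message followed by alternating user/assistant turns ending on the assistant; this rule is an assumption of the sketch, not a requirement of every chat format:

```python
def validate_conversation(messages):
    """Check: optional system message, then alternating user/assistant turns."""
    turns = messages[:]
    if turns and turns[0]["role"] == "system":
        turns = turns[1:]  # a single leading system prompt is allowed
    expected = "user"
    for msg in turns:
        if msg["role"] != expected:
            return False
        expected = "assistant" if expected == "user" else "user"
    # a training example should end on an assistant reply
    return bool(turns) and turns[-1]["role"] == "assistant"

conversation = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I print a string in Python?"},
    {"role": "assistant", "content": "You can print a string using the print() function."},
]
print(validate_conversation(conversation))
```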
Structuring data in this role-based format ensures the model learns the cadence of a back-and-forth conversation, recognizing when it is supposed to "listen" and when it is supposed to "speak."
While JSON is excellent for storing and organizing data, a neural network cannot process JSON dictionaries directly. Before the data reaches the tokenizer, the structured fields must be concatenated into a single, continuous text string. This is achieved using a prompt template.
A prompt template injects specific separator tokens between your instruction, input, and output fields. These separators act as mathematical boundaries, helping the attention mechanism of the model distinguish between the user's command and its own generated text.
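A template of this kind can be implemented as a simple string-formatting function. The sketch below uses the separators from the original Alpaca template; if your base model was trained with different markers, substitute them here:

```python
# The Alpaca prompt template: a preamble plus "### ..." separator headers
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def to_prompt(example):
    """Concatenate the structured fields into one training string."""
    return ALPACA_TEMPLATE.format(**example)

example = {
    "instruction": "Identify the primary programming language used in the provided code snippet.",
    "input": "def calculate_area(radius):\n return 3.14 * radius ** 2",
    "output": "The primary programming language used in the snippet is Python.",
}
text = to_prompt(example)
print(text)
```

The `### Instruction:` and `### Response:` headers are the separator tokens discussed above: consistent boundary markers the model learns to associate with the start of the command and the start of its own reply.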
Data pipeline mapping structured JSON fields into a single text string for tokenization.
If you are fine-tuning a base model that already has an established instruction format, you must format your strings to match that exact template. Failing to use the correct formatting template will result in degraded performance, as the model will not recognize the boundaries between the prompt and the expected response.
The structural formatting of the data is only one part of the equation. The actual content within those structured fields dictates the final behavior of the model.
When formatting instruction datasets, maintaining high data quality is more important than simply amassing a large volume of text. A model trained on 1,000 carefully curated, diverse instruction pairs will consistently outperform one trained on 50,000 repetitive or poorly formatted examples.
Consider, for instance, the distribution of your target sequence lengths. Let L represent the number of tokens in a given output. If your dataset only contains examples where L is small, perhaps 10 to 20 tokens, the fine-tuned model will heavily bias toward generating short responses. Even when prompted for a detailed explanation during deployment, the model will likely output a brief sentence. To prevent this, ensure your output fields contain a diverse range of lengths, matching the distribution of responses you expect the model to produce in a production environment.
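You can audit this distribution before training. The sketch below uses whitespace word counts as a rough stand-in for token counts (a real tokenizer will give different numbers, but the shape of the distribution is similar); the sample outputs are illustrative:

```python
from statistics import mean, median

outputs = [
    "Python.",
    "The primary programming language used in the snippet is Python.",
    "The snippet defines a function that computes the area of a circle. "
    "It multiplies an approximation of pi by the square of the radius, "
    "so the language is Python and the result is a float.",
]

# Whitespace word count as a cheap proxy for the tokenizer's token count
lengths = [len(text.split()) for text in outputs]

print(f"min={min(lengths)} median={median(lengths)} "
      f"mean={mean(lengths):.1f} max={max(lengths)}")
```

A narrow spread between the minimum and maximum here is the warning sign: it indicates the fine-tuned model will be biased toward a single response length.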