Transforming raw data into a structured format is essential for effective model learning. An LLM fine-tuning dataset is not just a collection of text; it is a carefully assembled set of examples, each demonstrating a specific input-to-output behavior. This process involves defining a consistent structure and applying it to raw information to create clear, learnable instances.
The fundamental unit of most instruction-based fine-tuning datasets is an input-output pair. Your goal is to convert unstructured content, like an article, a document, or a log file, into a series of these pairs. Each pair serves as a single training example.
Consider a short paragraph of raw text about the planet Mars:
Mars, often called the Red Planet, is the fourth planet from the Sun. Its reddish appearance is due to iron oxide prevalent on its surface. Mars has a thin atmosphere and two small moons, Phobos and Deimos. It is a terrestrial planet with a cold, desert-like surface.
From this single paragraph, you can generate multiple training instances, each targeting a different skill:
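As a sketch, the pairs below are illustrative instances you might derive from that paragraph; the exact instructions and wording are examples, not a fixed recipe:

```python
# Illustrative training instances derived from the Mars paragraph.
# Each one targets a different skill: summarization, question
# answering, and entity extraction.
mars_examples = [
    {
        "instruction": "Summarize the following text in one sentence.",
        "input": "Mars, often called the Red Planet, is the fourth planet from the Sun...",
        "output": "Mars is a cold, desert-like terrestrial planet with a reddish, iron oxide-rich surface.",
    },
    {
        "instruction": "Answer the question using the text.",
        "input": "Question: Why does Mars appear red?",
        "output": "Its reddish appearance is due to iron oxide prevalent on its surface.",
    },
    {
        "instruction": "Extract the names of the moons of Mars.",
        "input": "Mars has a thin atmosphere and two small moons, Phobos and Deimos.",
        "output": "Phobos and Deimos",
    },
]
```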
This transformation from a block of text into discrete, task-oriented examples is the core of creating a custom dataset. The quality and diversity of these examples will directly influence the final capabilities of your fine-tuned model.
To maintain consistency, especially when your data has multiple components like an instruction, context, and a specific input, you should use a prompt template. A template is a fixed string format that organizes the different parts of your input into a single prompt. This ensures the model always sees inputs in the same structure it was trained on, which is important for reliable performance.
A common structure for a single data point is a dictionary containing keys like instruction, input, and output.
{
    "instruction": "Identify the main scientific reason provided.",
    "input": "The document states that Mars appears red because its surface is rich in iron oxide.",
    "output": "The main scientific reason is the prevalence of iron oxide on its surface."
}
You can then use a prompt template to combine the instruction and input fields into a coherent prompt for the model. For instance, you might use a template like this:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
Applying this template to the JSON example above creates the final formatted prompt that the model will process during training. The model's task is to learn to generate the text from the output field whenever it sees a prompt that starts this way.
Here is a simple Python function to demonstrate this formatting:
def create_prompt(instruction, input_text):
    """Formats the instruction and input into a standardized prompt."""
    prompt_template = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input_text}
### Response:"""
    return prompt_template.format(instruction=instruction, input_text=input_text)
# Example usage:
data_point = {
    "instruction": "Identify the main scientific reason provided.",
    "input": "The document states that Mars appears red because its surface is rich in iron oxide.",
    "output": "The main scientific reason is the prevalence of iron oxide on its surface."
}

# The model sees this combined prompt and learns to generate the output
formatted_prompt = create_prompt(data_point["instruction"], data_point["input"])
print(formatted_prompt)
The output field is not part of the input prompt; it is the target generation that the model learns to produce. During training, the combination of formatted_prompt and output constitutes one complete training example.
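One common way to assemble the complete training string is to concatenate the formatted prompt with the target output; many training frameworks also append an end-of-sequence token at this point. A minimal, self-contained sketch (the template is inlined here so the function stands alone):

```python
def build_training_text(data_point):
    """Builds the full training string: the formatted prompt followed by
    the target output. The model learns to generate everything that
    appears after "### Response:"."""
    prompt = (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n"
        "### Instruction:\n{instruction}\n"
        "### Input:\n{input}\n"
        "### Response:"
    ).format(**data_point)
    return prompt + "\n" + data_point["output"]

example = {
    "instruction": "Identify the main scientific reason provided.",
    "input": "The document states that Mars appears red because its surface is rich in iron oxide.",
    "output": "The main scientific reason is the prevalence of iron oxide on its surface.",
}
training_text = build_training_text(example)
```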
The overall process can be visualized as a pipeline that converts raw, unstructured information into a final, model-ready dataset file. This involves extracting meaningful pairs and structuring them consistently.
Figure: the workflow for creating a custom dataset. Raw text is processed into structured instruction and output pairs, which are then saved into a final dataset file such as JSON Lines.
Once you have a list of structured data points (e.g., a list of Python dictionaries), you need to save them to a file. While a standard JSON file containing a single large list is an option, it is not ideal for large datasets. A better format is JSON Lines (or .jsonl).
In a JSON Lines file, each line is a separate, self-contained JSON object.
Standard JSON (data.json):
[
    {
        "instruction": "Summarize the text.",
        "input": "Mars is a cold, desert-like planet...",
        "output": "Mars is a cold planet with a desert environment."
    },
    {
        "instruction": "What are the moons of Mars?",
        "input": "The two moons of Mars are Phobos and Deimos.",
        "output": "Phobos and Deimos."
    }
]
JSON Lines (data.jsonl):
{"instruction": "Summarize the text.", "input": "Mars is a cold, desert-like planet...", "output": "Mars is a cold planet with a desert environment."}
{"instruction": "What are the moons of Mars?", "input": "The two moons of Mars are Phobos and Deimos.", "output": "Phobos and Deimos."}
JSON Lines offers several advantages for machine learning workflows: files can be read and written one record at a time, so large datasets never need to fit in memory; new examples can be appended without rewriting the whole file; and a malformed line invalidates only one record rather than the entire dataset. Most dataset-loading tools also support the format directly.
Here is how you can write your structured data to a .jsonl file in Python:
import json

# Assume 'structured_data' is a list of dictionaries
structured_data = [
    {
        "instruction": "Summarize the text.",
        "input": "Mars is a cold, desert-like planet...",
        "output": "Mars is a cold planet with a desert environment."
    },
    {
        "instruction": "What are the moons of Mars?",
        "input": "The two moons of Mars are Phobos and Deimos.",
        "output": "Phobos and Deimos."
    }
]

file_path = "my_dataset.jsonl"
with open(file_path, 'w') as f:
    for entry in structured_data:
        json_record = json.dumps(entry)
        f.write(json_record + '\n')

print(f"Dataset saved to {file_path}")
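To load the file back, or to validate it before training, you can parse each line independently. A minimal sketch (the function name `load_jsonl` is illustrative) that skips malformed lines instead of aborting the whole load:

```python
import json

def load_jsonl(path):
    """Reads a JSON Lines file back into a list of dictionaries.
    Malformed lines are reported and skipped rather than raising."""
    records = []
    with open(path, "r") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # A bad line invalidates one record, not the whole file.
                print(f"Skipping malformed line {line_number}: {err}")
    return records
```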
By following these steps, you transform your initial collection of raw information into a clean, structured dataset. This file is now ready for the final preparation step, tokenization, where the text will be converted into a numerical format the model can directly process.