Having explored the principles of instruction tuning and the required data formats, let's put theory into practice. This section guides you through the essential steps of taking raw, potentially unstructured data and transforming it into a high-quality dataset suitable for Supervised Fine-tuning (SFT) of an instruction-following LLM. Careful data preparation is a significant factor in the success of fine-tuning.
Our goal is to convert a source dataset, often found in various formats, into a consistent structure that clearly delineates instructions, optional context or input, and the desired model output. We will focus on creating text prompts that the model will learn to complete during SFT.
Instruction tuning datasets often come in structured formats like JSON or CSV. A common pattern, inspired by datasets like Alpaca, is a list of dictionaries, where each dictionary represents an instruction-following example. Let's assume our raw data is in a JSON Lines (`.jsonl`) file, where each line is a JSON object like this:
{"instruction": "Convert the temperature from Celsius to Fahrenheit.", "input": "25", "output": "77"}
{"instruction": "Explain the concept of photosynthesis in simple terms.", "input": "", "output": "Photosynthesis is the process plants use to turn sunlight, water, and carbon dioxide into food (sugar) and oxygen."}
{"instruction": "Write a short poem about a rainy day.", "input": "", "output": "Grey skies weep soft tears,\nWindow panes reflect the gloom,\nPuddles mirror clouds,\nNature sighs, a damp perfume."}
{"instruction": "", "input": "Translate 'hello' to French", "output": "Bonjour"}
{"instruction": "Summarize the main points.", "input": "Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by humans and animals. Leading AI textbooks define the field as the study of \"intelligent agents\": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.", "output": ""}
This raw data might contain inconsistencies: missing instructions, missing outputs, or empty inputs where they might be implicitly needed.
For SFT, we typically concatenate the instruction, input (if available), and output into a single text sequence, often using specific separators or templates. This combined text serves as the training example. The model is trained to predict the `output` part, given the `instruction` and `input` as context.
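To make that objective concrete, here is a minimal sketch of response-only loss masking, assuming a Hugging Face tokenizer (`gpt2` is just a placeholder checkpoint and `build_labels` is a hypothetical helper, not a library function). Prompt tokens receive the label `-100`, which PyTorch's cross-entropy loss ignores, so only the response tokens contribute to the loss. Note that some SFT setups simply train on the full concatenated sequence; masking the prompt is a common refinement.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint

def build_labels(prompt_text, output_text):
    # Tokenize the prompt and response separately so the boundary is known.
    prompt_ids = tokenizer(prompt_text, add_special_tokens=False)["input_ids"]
    output_ids = tokenizer(output_text, add_special_tokens=False)["input_ids"]
    # -100 is the ignore index for PyTorch's cross-entropy loss: the model
    # sees the prompt as context but is only scored on the response tokens.
    return {
        "input_ids": prompt_ids + output_ids,
        "labels": [-100] * len(prompt_ids) + output_ids,
    }
```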
A widely used template structure looks like this:
```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}
```
If the `input` field is empty, the template might be simplified:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{output}
```
Choosing a consistent format is important for the model to learn the pattern of instruction following. The `###` markers clearly delineate the different parts of the prompt, and the introductory sentence sets the context for the model.
Let's use the Hugging Face `datasets` library, a standard tool for handling datasets in the NLP ecosystem.
1. Load the Data: Assuming our raw data is in `raw_data.jsonl`:
```python
from datasets import load_dataset

# Load the raw dataset
raw_dataset = load_dataset('json', data_files='raw_data.jsonl', split='train')

print(f"Initial number of examples: {len(raw_dataset)}")
print("Example entry:")
print(raw_dataset[0])
```
2. Clean and Filter: We need to remove examples that are unsuitable for training. Common issues include missing instructions or outputs.
```python
# Filter out examples with empty instructions or outputs
def filter_invalid_examples(example):
    return example['instruction'] is not None and \
           example['instruction'].strip() != "" and \
           example['output'] is not None and \
           example['output'].strip() != ""

cleaned_dataset = raw_dataset.filter(filter_invalid_examples)
print(f"Number of examples after cleaning: {len(cleaned_dataset)}")

# Optional: Further filtering based on length (e.g., remove extremely short/long examples)
# min_length = 10
# max_length = 1024  # Example character length limit
# cleaned_dataset = cleaned_dataset.filter(lambda x: min_length < len(x['instruction']) + len(x['output']) < max_length)
# print(f"Number of examples after length filtering: {len(cleaned_dataset)}")
```
This step is essential. Training on poorly formed examples can significantly degrade model performance or teach undesirable behaviors.
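Beyond empty fields, exact duplicates are another frequent issue in scraped or aggregated instruction data. The sketch below is illustrative (`is_first_occurrence` is a hypothetical helper, not a library function): it drops repeated examples by tracking a key built from each example's fields. The stateful closure works here because `filter` runs in a single process.

```python
# Drop exact duplicates by keying on the (instruction, input, output) triple.
seen = set()

def is_first_occurrence(example):
    key = (
        example['instruction'].strip(),
        (example.get('input') or "").strip(),
        example['output'].strip(),
    )
    if key in seen:
        return False
    seen.add(key)
    return True

cleaned_dataset = cleaned_dataset.filter(is_first_occurrence)
print(f"Number of examples after deduplication: {len(cleaned_dataset)}")
```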
3. Format the Data: Apply the chosen template to each example.
```python
def format_instruction(example):
    if example.get('input') and example['input'].strip():
        # Format with input
        prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    else:
        # Format without input
        prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{example['instruction']}

### Response:
{example['output']}"""

    # Append an End-of-Sequence token if required by the model/training framework
    # prompt += "</s>"  # Example for models requiring an EOS token
    return {"text": prompt}
formatted_dataset = cleaned_dataset.map(format_instruction, remove_columns=raw_dataset.column_names)
print("Example formatted entry:")
print(formatted_dataset[0]['text'])
```
The resulting `formatted_dataset` contains a single column, `text`, where each entry is a complete prompt ready for SFT.
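For context, here is a hedged sketch of how a training framework might consume this `text` column, using TRL's `SFTTrainer`. Treat it as illustrative rather than definitive: the exact argument names vary across TRL versions (recent releases configure the text field through `SFTConfig` rather than on the trainer itself), and `gpt2` is a placeholder model name.

```python
from trl import SFTTrainer

# Illustrative only; check your TRL version's documentation for exact arguments.
trainer = SFTTrainer(
    model="gpt2",                     # placeholder model name
    train_dataset=formatted_dataset,
    dataset_text_field="text",        # points the trainer at our formatted prompts
)
trainer.train()
```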
4. Visualize Data Characteristics (Optional): Understanding the distribution of prompt lengths after formatting can be useful for setting training parameters like maximum sequence length.
```python
import json
import pandas as pd

# Calculate character lengths of the formatted prompts
lengths = [len(x['text']) for x in formatted_dataset]
df = pd.DataFrame({'length': lengths})

# Basic statistics
print(df['length'].describe())

# Create a histogram using Plotly format
# (Note: For large datasets, sample or use appropriate binning)
# This example uses fixed data for demonstration
chart_data = {"layout":{"title":{"text":"Distribution of Formatted Prompt Lengths"},"xaxis":{"title":{"text":"Prompt Length (Characters)"}},"yaxis":{"title":{"text":"Count"}},"bargap":0.1},"data":[{"type":"histogram","x":[250, 300, 150, 450, 500, 600, 350, 280, 420, 550, 180, 220, 380, 480, 580, 320],"marker":{"color":"#228be6"}}]}

# Display the chart data (in a real environment, you'd render this JSON)
print("Chart JSON:")
print(json.dumps(chart_data))
```
```plotly
{"layout":{"title":{"text":"Distribution of Formatted Prompt Lengths"},"xaxis":{"title":{"text":"Prompt Length (Characters)"}},"yaxis":{"title":{"text":"Count"}},"bargap":0.1,"width":600,"height":400},"data":[{"type":"histogram","x":[250, 300, 150, 450, 500, 600, 350, 280, 420, 550, 180, 220, 380, 480, 580, 320],"marker":{"color":"#228be6"}}]}
```
> Distribution of character lengths for formatted prompts in the prepared dataset. Understanding this helps in setting appropriate sequence length limits during training.
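Character counts are only a proxy for what matters during training, which is token counts. Below is a minimal sketch of measuring token lengths directly, again assuming a Hugging Face tokenizer with `gpt2` as a placeholder checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint

# Token counts, not characters, determine whether an example fits within
# the model's maximum sequence length.
token_lengths = [len(tokenizer(x['text'])['input_ids']) for x in formatted_dataset]
print(f"Longest prompt: {max(token_lengths)} tokens")
```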
5. Save the Processed Dataset: Save the final dataset in a format suitable for your training framework. Parquet is often a good choice for efficiency.
```python
# Save to disk as JSON Lines
formatted_dataset.to_json("processed_instruction_dataset.jsonl", orient="records", lines=True)

# Or save in Arrow/Parquet format for efficiency
# formatted_dataset.save_to_disk("processed_instruction_dataset_arrow")
# formatted_dataset.to_parquet("processed_instruction_dataset.parquet")

print("Processed dataset saved.")
```
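As an optional sanity check, the saved file can be reloaded with the same `load_dataset` call from step 1 to confirm the examples survived the round trip:

```python
# Reload the saved file and verify the row count matches what was written.
reloaded = load_dataset('json', data_files='processed_instruction_dataset.jsonl', split='train')
assert len(reloaded) == len(formatted_dataset), "Row count changed during save/load"
print(f"Verified {len(reloaded)} examples round-tripped correctly.")
```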
This practical exercise demonstrates the core workflow for preparing instruction tuning data. While specific datasets and requirements might necessitate more sophisticated cleaning or formatting logic, the fundamental steps of loading, cleaning, transforming, and saving remain consistent. A well-prepared dataset is the foundation for effective instruction fine-tuning.