In this section, we convert a raw dataset into a structured format suitable for instruction fine-tuning. The process follows a complete, repeatable workflow: load the raw data, clean it, structure it into a consistent prompt format, and finally tokenize it for the model.
For this exercise, we will use the databricks-dolly-15k dataset, which contains instruction-response pairs written by humans. This provides a realistic starting point for building a high-quality dataset.
Our first step is to load the data and get a feel for its contents. We'll use the Hugging Face datasets library, an essential tool in the LLM ecosystem that simplifies data loading and manipulation. The data could live in a local JSON Lines (.jsonl) file, where each line is a JSON object, or on the Hugging Face Hub; here we load it directly from the Hub.
from datasets import load_dataset
# Load the dataset from a local file or the Hugging Face Hub
# For this example, we'll assume it's available on the Hub.
raw_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
# Let's see the structure and a few examples
print(raw_dataset)
print(raw_dataset[0])
The output will show us the dataset's features (columns) and the content of the first example. Typically, you'll see fields like instruction, context, and response.
Dataset({
    features: ['instruction', 'context', 'response', 'category'],
    num_rows: 15011
})
{
    'instruction': 'When did Virgin Australia start operating?',
    'context': 'Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.',
    'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.',
    'category': 'closed_qa'
}
Our goal is to transform these separate fields into a single, structured text sequence that the model can learn from.
Data is rarely perfect. It often contains empty fields, duplicates, or irrelevant information. A small amount of high-quality data is far more valuable than a large amount of noisy data.
Let's perform some basic cleaning. A common issue is entries where the instruction or response is missing or too short to be useful. We can filter these out.
# Filter out examples where the instruction is too short
initial_size = len(raw_dataset)
filtered_dataset = raw_dataset.filter(lambda example: len(example['instruction']) > 3)
# Filter out examples where the response is too short
filtered_dataset = filtered_dataset.filter(lambda example: len(example['response']) > 3)
print(f"Original size: {initial_size}")
print(f"Size after filtering: {len(filtered_dataset)}")
This is a simple filtering step. In a production scenario, you might add more sophisticated rules, such as removing examples with specific keywords, filtering out non-English text if that's your target, or de-duplicating similar instructions.
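To make these production rules concrete, here is a minimal sketch of a keyword filter and an exact-match de-duplication pass. The keyword list and the normalization (lowercasing, whitespace stripping) are illustrative assumptions rather than values from the dataset, and the shared set used for de-duplication assumes the default single-process filter.

# Hypothetical noise markers; replace with patterns relevant to your data.
banned_keywords = ["lorem ipsum", "click here"]

def passes_keyword_check(example):
    text = (example['instruction'] + " " + example['response']).lower()
    return not any(keyword in text for keyword in banned_keywords)

filtered_dataset = filtered_dataset.filter(passes_keyword_check)

# Exact-match de-duplication: keep only the first occurrence of each instruction.
# Note: the shared set only works with the default single-process filter.
seen_instructions = set()

def is_first_occurrence(example):
    key = example['instruction'].strip().lower()
    if key in seen_instructions:
        return False
    seen_instructions.add(key)
    return True

filtered_dataset = filtered_dataset.filter(is_first_occurrence)
print(f"Size after extra cleaning: {len(filtered_dataset)}")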
A model learns to follow instructions by recognizing patterns. A consistent prompt structure is therefore essential for effective fine-tuning. The model needs to clearly distinguish between the instruction, any provided input, and the expected response.
We will adopt a simple and effective template that separates these components.
### Instruction:
{instruction}
### Input:
{context}
### Response:
{response}
If a sample has no context, we will omit the ### Input: section entirely to avoid feeding the model an empty field. Let's create a Python function to format each example.
def format_prompt(example):
    """Formats a single example into a standardized prompt string."""
    instruction = f"### Instruction:\n{example['instruction']}"
    # Use context if it exists and is not empty
    if example.get('context') and example['context'].strip():
        context = f"### Input:\n{example['context']}"
    else:
        context = ""
    response = f"### Response:\n{example['response']}"
    # Join the parts, filtering out empty strings
    full_prompt = "\n\n".join(filter(None, [instruction, context, response]))
    return {"text": full_prompt}
This function creates a new column named text containing the fully formatted prompt. We can apply this transformation to our entire dataset using the map method, which is highly efficient.
# Apply the formatting function to each example
structured_dataset = filtered_dataset.map(format_prompt)
# Let's inspect an example with context and one without
print("--- Example with context ---")
print(structured_dataset[0]['text'])
print("\n--- Example without context ---")
# Find an example without context to print
for ex in structured_dataset:
    if "### Input:" not in ex['text']:
        print(ex['text'])
        break
This unified text field is exactly what the model will see during the training process.
The final step in data preparation is tokenization: converting the formatted text strings into a sequence of integer IDs that the model can process. This step also requires careful handling of special tokens.
We need to choose a tokenizer that matches the base model we intend to fine-tune. For this example, let's use the tokenizer for meta-llama/Llama-2-7b-hf. It's also important to add an end-of-sequence (eos) token to the end of each prompt. This token signals to the model that the response is complete, teaching it when to stop generating text during inference.
from transformers import AutoTokenizer
# Load the tokenizer for a specific model
model_id = "meta-llama/Llama-2-7b-hf"
# Note: You may need to authenticate with Hugging Face to access this model
# from huggingface_hub import login; login()
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Set a padding token if one is not already defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Append the EOS token to the end of each text
    text_with_eos = [s + tokenizer.eos_token for s in examples["text"]]
    # Tokenize the text
    return tokenizer(
        text_with_eos,
        truncation=True,  # Truncate sequences longer than max length
        max_length=512,   # A common max length
        padding=False,    # We will handle padding later with a data collator
    )
# Apply the tokenization
tokenized_dataset = structured_dataset.map(
    tokenize_function,
    batched=True,  # Process examples in batches for efficiency
    remove_columns=structured_dataset.column_names  # Remove old text columns
)
# Check the output of the tokenization
print(tokenized_dataset[0].keys())
print(tokenized_dataset[0]['input_ids'][:20]) # Print first 20 token IDs
The output now consists of input_ids and an attention_mask, which are the standard inputs for a Transformer model. We have successfully transformed our raw data into a model-ready format.
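Because we set padding=False, padding is deferred to training time through a data collator. The sketch below uses DataCollatorForLanguageModeling with mlm=False as one reasonable choice for causal language modeling; your training setup may call for a different collator or custom label masking.

from transformers import DataCollatorForLanguageModeling

# Pads each batch to the length of its longest sequence and builds labels for
# causal language modeling (mlm=False); padded positions are masked with -100.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Quick sanity check: collate the first two tokenized examples into one batch.
batch = data_collator([tokenized_dataset[i] for i in range(2)])
print(batch['input_ids'].shape)
print(batch['labels'].shape)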
The following diagram summarizes our data preparation pipeline.
The data preparation workflow, from raw source file to a tokenized dataset ready for training.
After all this work, it's wise to save our final dataset to disk. This allows us to quickly load it for training later without repeating these preprocessing steps.
# Save the dataset to a local directory
tokenized_dataset.save_to_disk("./fine-tuning-dataset")
# You can load it back anytime using:
# from datasets import load_from_disk
# reloaded_dataset = load_from_disk("./fine-tuning-dataset")
With our dataset now built, cleaned, structured, tokenized, and saved, we are fully prepared for the next stage: using it to fine-tune a large language model.