In this section, we convert a raw dataset into a structured format suitable for instruction fine-tuning. The process follows a complete, repeatable workflow: load the raw data, clean it, structure it into a consistent prompt format, and finally tokenize it for the model.
For this exercise, we will use the databricks-dolly-15k dataset, which contains instruction-response pairs written by humans. This provides a realistic starting point for building a high-quality dataset.
Our first step is to load the data and get a feel for its contents. We'll use the Hugging Face datasets library, an essential tool in the LLM ecosystem that simplifies data loading and manipulation. The data could live in a local JSON Lines (.jsonl) file, where each line is a JSON object, or on the Hugging Face Hub; here we load it directly from the Hub.
from datasets import load_dataset
# Load the dataset from a local file or the Hugging Face Hub
# For this example, we'll assume it's available on the Hub.
raw_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
# Let's see the structure and a few examples
print(raw_dataset)
print(raw_dataset[0])
The output will show us the dataset's features (columns) and the content of the first example. Typically, you'll see fields like instruction, context, and response.
Dataset({
    features: ['instruction', 'context', 'response', 'category'],
    num_rows: 15011
})
{
    'instruction': 'When did Virgin Australia start operating?',
    'context': 'Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.',
    'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.',
    'category': 'closed_qa'
}
Our goal is to transform these separate fields into a single, structured text sequence that the model can learn from.
Data is rarely perfect. It often contains empty fields, duplicates, or irrelevant information. A small amount of high-quality data is far more valuable than a large amount of noisy data.
Let's perform some basic cleaning. A common issue is entries where the instruction or response is missing or too short to be useful. We can filter these out.
# Filter out examples where the instruction is too short
initial_size = len(raw_dataset)
filtered_dataset = raw_dataset.filter(lambda example: len(example['instruction']) > 3)
# Filter out examples where the response is too short
filtered_dataset = filtered_dataset.filter(lambda example: len(example['response']) > 3)
print(f"Original size: {initial_size}")
print(f"Size after filtering: {len(filtered_dataset)}")
This is a simple filtering step. In a production scenario, you might add more sophisticated rules, such as removing examples with specific keywords, filtering out non-English text if that's your target, or de-duplicating similar instructions.
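To make these production rules concrete, here is a minimal sketch of a keyword filter and an exact-match de-duplication pass. The keyword list and the normalization (lowercasing, whitespace stripping) are illustrative assumptions rather than values from the dataset, and the shared set used for de-duplication assumes the default single-process filter.

# Hypothetical noise markers; replace with patterns relevant to your data.
banned_keywords = ["lorem ipsum", "click here"]

def passes_keyword_check(example):
    text = (example['instruction'] + " " + example['response']).lower()
    return not any(keyword in text for keyword in banned_keywords)

filtered_dataset = filtered_dataset.filter(passes_keyword_check)

# Exact-match de-duplication: keep only the first occurrence of each instruction.
# Note: the shared set only works with the default single-process filter.
seen_instructions = set()

def is_first_occurrence(example):
    key = example['instruction'].strip().lower()
    if key in seen_instructions:
        return False
    seen_instructions.add(key)
    return True

filtered_dataset = filtered_dataset.filter(is_first_occurrence)
print(f"Size after extra cleaning: {len(filtered_dataset)}")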
A model learns to follow instructions by recognizing patterns. A consistent prompt structure is therefore essential for effective fine-tuning. The model needs to clearly distinguish between the instruction, any provided input, and the expected response.
We will adopt a simple and effective template that separates these components.
### Instruction:
{instruction}
### Input:
{context}
### Response:
{response}
If a sample has no context, we will omit the ### Input: section entirely to avoid feeding the model an empty field. Let's create a Python function to format each example.
def format_prompt(example):
    """Formats a single example into a standardized prompt string."""
    instruction = f"### Instruction:\n{example['instruction']}"
    # Use context if it exists and is not empty
    if example.get('context') and example['context'].strip():
        context = f"### Input:\n{example['context']}"
    else:
        context = ""
    response = f"### Response:\n{example['response']}"
    # Join the parts, filtering out empty strings
    full_prompt = "\n\n".join(filter(None, [instruction, context, response]))
    return {"text": full_prompt}
This function creates a new column named text containing the fully formatted prompt. We can apply this transformation to our entire dataset using the map method, which is highly efficient.
# Apply the formatting function to each example
structured_dataset = filtered_dataset.map(format_prompt)
# Let's inspect an example with context and one without
print("--- Example with context ---")
print(structured_dataset[0]['text'])
print("\n--- Example without context ---")
# Find an example without context to print
for ex in structured_dataset:
    if "### Input:" not in ex['text']:
        print(ex['text'])
        break
This unified text field is exactly what the model will see during the training process.
The final step in data preparation is tokenization: converting the formatted text strings into a sequence of integer IDs that the model can process. This step also requires careful handling of special tokens.
We need to choose a tokenizer that matches the base model we intend to fine-tune. For this example, let's use the tokenizer for meta-llama/Llama-2-7b-hf. It's also important to add an end-of-sequence (eos) token to the end of each prompt. This token signals to the model that the response is complete, teaching it when to stop generating text during inference.
from transformers import AutoTokenizer
# Load the tokenizer for a specific model
model_id = "meta-llama/Llama-2-7b-hf"
# Note: You may need to authenticate with Hugging Face to access this model
# from huggingface_hub import login; login()
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Set a padding token if one is not already defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Append the EOS token to the end of each text
    text_with_eos = [s + tokenizer.eos_token for s in examples["text"]]
    # Tokenize the text
    return tokenizer(
        text_with_eos,
        truncation=True,  # Truncate sequences longer than max length
        max_length=512,   # A common max length
        padding=False,    # We will handle padding later with a data collator
    )
# Apply the tokenization
tokenized_dataset = structured_dataset.map(
    tokenize_function,
    batched=True,  # Process examples in batches for efficiency
    remove_columns=structured_dataset.column_names  # Remove old text columns
)
# Check the output of the tokenization
print(tokenized_dataset[0].keys())
print(tokenized_dataset[0]['input_ids'][:20]) # Print first 20 token IDs
The output now consists of input_ids and an attention_mask, which are the standard inputs for a Transformer model. We have successfully transformed our raw data into a model-ready format.
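Because we set padding=False, padding is deferred to training time through a data collator. The sketch below uses DataCollatorForLanguageModeling with mlm=False as one reasonable choice for causal language modeling; your training setup may call for a different collator or custom label masking.

from transformers import DataCollatorForLanguageModeling

# Pads each batch to the length of its longest sequence and builds labels for
# causal language modeling (mlm=False); padded positions are masked with -100.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Quick sanity check: collate the first two tokenized examples into one batch.
batch = data_collator([tokenized_dataset[i] for i in range(2)])
print(batch['input_ids'].shape)
print(batch['labels'].shape)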
The following diagram summarizes our data preparation pipeline.
The data preparation workflow, from raw source file to a tokenized dataset ready for training.
After all this work, it's wise to save our final dataset to disk. This allows us to quickly load it for training later without repeating these preprocessing steps.
# Save the dataset to a local directory
tokenized_dataset.save_to_disk("./fine-tuning-dataset")
# You can load it back anytime using:
# from datasets import load_from_disk
# reloaded_dataset = load_from_disk("./fine-tuning-dataset")
With our dataset now built, cleaned, structured, tokenized, and saved, we are fully prepared for the next stage: using it to fine-tune a large language model.