While fine-tuning a large language model is a complex process involving training infrastructure and cost management, the most significant factor for success is the quality of the training data. A well-prepared dataset enables the model to learn specific tasks, adopt a particular style, or gain knowledge from a new domain. The toolkit provides a suite of utilities to structure, clean, and format your data, which is the foundational first step in any fine-tuning project.
Before a model can learn from your data, it needs to be organized into a consistent format. Most fine-tuning tasks use one of two primary structures:
Completion format: each example pairs a prompt with the completion text the model should produce when it sees a similar prompt. This format is suitable for tasks like classification, simple Q&A, or text generation from a starting point.
Chat format: each example is a list of messages with roles such as system, user, and assistant. It is used to train models for conversational tasks, where context from previous turns is important.
The toolkit represents these with the TrainingExample and TrainingDataset classes. A TrainingExample is a single data point, while a TrainingDataset is a collection of these examples, along with metadata about the dataset's format.
Here is how you would create a TrainingExample for each format:
from kerb.fine_tuning import TrainingExample
# Completion format example
completion_example = TrainingExample(
    prompt="Translate to French: Hello",
    completion="Bonjour"
)

# Chat format example
chat_example = TrainingExample(
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "How do I create a list in Python?"},
        {"role": "assistant", "content": "You can create a list using square brackets: my_list = [1, 2, 3]"}
    ]
)
Manually creating TrainingExample objects is useful, but you will often start with a large collection of raw data, perhaps in a list of dictionaries. The prepare_dataset function is designed to streamline the conversion of this raw data into a structured TrainingDataset, while also performing validation and cleaning.
This function can automatically handle several important steps:
Validation: checking that every example matches the declared format (for chat data, that each message contains the role and content keys).
Deduplication: removing entries that are exact copies of one another.
Shuffling: randomizing the order of examples so it does not bias training.
Suppose you have raw conversational data. You can process it into a clean, ready-to-use dataset with a single function call.
from kerb.fine_tuning import prepare_dataset, DatasetFormat, FineTuningProvider
raw_data = [
    {
        "messages": [
            {"role": "user", "content": "What is a dictionary?"},
            {"role": "assistant", "content": "A dictionary is a key-value data structure in Python."}
        ]
    },
    # Identical to the first entry to show deduplication
    {
        "messages": [
            {"role": "user", "content": "What is a dictionary?"},
            {"role": "assistant", "content": "A dictionary is a key-value data structure in Python."}
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "How do I iterate over a list?"},
            {"role": "assistant", "content": "Use a for loop: for item in my_list: print(item)"}
        ]
    }
]
# Prepare the dataset, enabling all cleaning options
dataset = prepare_dataset(
    data=raw_data,
    format=DatasetFormat.CHAT,
    provider=FineTuningProvider.OPENAI,
    validate=True,
    deduplicate=True,
    shuffle=True
)
print(f"Original examples: {len(raw_data)}")
print(f"Prepared examples: {len(dataset)}")
Notice that the final dataset has fewer examples than the raw data, as the duplicate entry was automatically removed. The provider argument helps tailor validation rules for specific services like OpenAI.
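If you are curious how deduplication of chat examples can work in principle, a common approach is to hash a canonical serialization of each example and drop repeats. The sketch below is a standalone illustration of that idea, not the toolkit's actual implementation:

import hashlib
import json

def dedupe_examples(examples):
    """Drop examples whose serialized content has been seen before."""
    seen = set()
    unique = []
    for example in examples:
        # Canonical JSON (sorted keys) so key ordering does not affect the hash
        key = hashlib.sha256(
            json.dumps(example, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique

print(len(dedupe_examples(raw_data)))  # 2 for the raw_data above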
Before investing time and money into a fine-tuning job, it is good practice to analyze your dataset's quality. A dataset with issues like empty entries, extreme length variations, or personally identifiable information (PII) can lead to poor model performance or privacy risks.
The analyze_dataset function provides a high-level statistical overview, including token counts, duplicate counts, and label distributions.
from kerb.fine_tuning import analyze_dataset
# Assuming 'dataset' is the TrainingDataset from the previous step
stats = analyze_dataset(dataset)
print(f"Total examples: {stats.total_examples}")
print(f"Total tokens: {stats.total_tokens}")
print(f"Average tokens per example: {stats.avg_tokens_per_example:.2f}")
print(f"Duplicate count: {stats.duplicate_count}")
For a more granular check, you can use specialized functions. For instance, detect_pii helps you find and remove sensitive information, which is a significant step for ensuring privacy and safety.
from kerb.fine_tuning.quality import detect_pii
text_with_pii = "Contact me at [email protected] or 555-123-4567"
pii_found = detect_pii(text_with_pii)
if pii_found:
    print("PII Detected:")
    for pii_type, values in pii_found.items():
        print(f" {pii_type}: {values}")
Running these quality checks helps you identify and fix problems early, saving you from failed training runs and leading to a more effective fine-tuned model.
Many fine-tuning projects aim to create "instruction-tuned" models that are experts at specific tasks. The system prompt is a powerful tool for this, as it sets the context and persona for the model. For consistency, it is best to use a standardized system prompt across all examples in your dataset.
The standardize_system_prompts function lets you apply a single system message to an entire dataset, replacing any existing ones. This ensures the model receives a consistent instruction set during training.
from kerb.fine_tuning.prompts import standardize_system_prompts
# Assuming 'dataset' is our prepared dataset
standard_prompt = "You are an expert Python programmer. Provide clear and accurate code examples."
standardized_dataset = standardize_system_prompts(dataset, standard_prompt)
# All examples in 'standardized_dataset' now have the same system prompt
print("System prompt has been standardized across the dataset.")
Different model providers require fine-tuning data to be formatted in a specific way, often as a JSONL (JSON Lines) file. The toolkit simplifies this by providing functions to convert your TrainingDataset into the required structure for major providers.
Once your dataset is prepared, you can format it for OpenAI and write it to a file.
import tempfile
import os
from kerb.fine_tuning import to_openai_format, write_jsonl
# Convert the dataset to the format expected by OpenAI's API
openai_formatted_data = to_openai_format(dataset)
# Write the data to a JSONL file
with tempfile.TemporaryDirectory() as temp_dir:
    file_path = os.path.join(temp_dir, "training_data.jsonl")
    write_jsonl(openai_formatted_data, file_path)
    print(f"Dataset written to {file_path}")

    # You can inspect the first line of the file to see the format
    with open(file_path, 'r') as f:
        print("\nFirst line of JSONL file:")
        print(f.readline().strip())
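For chat data, each line of the file is a standalone JSON object containing a messages array, which is the structure OpenAI's chat fine-tuning endpoint expects. The snippet below shows what writing one such record looks like without the toolkit, as a minimal sketch of the file format:

import json

record = {
    "messages": [
        {"role": "system", "content": "You are an expert Python programmer."},
        {"role": "user", "content": "What is a dictionary?"},
        {"role": "assistant", "content": "A dictionary is a key-value data structure in Python."}
    ]
}

# One JSON object per line, newline-terminated, UTF-8 encoded
with open("training_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")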
This final JSONL file is what you would upload to the provider's service to start a fine-tuning job. By following this structured preparation process, from raw data to a validated, provider-specific file, you establish a solid foundation for creating a high-performing custom model.