While fine-tuning a large language model is a complex process involving training infrastructure and cost management, the most significant factor for success is the quality of the training data. A well-prepared dataset enables the model to learn specific tasks, adopt a particular style, or gain knowledge from a new domain. The toolkit provides a suite of utilities to structure, clean, and format your data, which is the foundational first step in any fine-tuning project.
Before a model can learn from your data, it needs to be organized into a consistent format. Most fine-tuning tasks use one of two primary structures:
Completion format: each example pairs a prompt with the completion text the model should produce when it sees a similar prompt. This format is suitable for tasks like classification, simple Q&A, or text generation from a starting point.
Chat format: each example is a list of messages with roles such as system, user, and assistant. It is used to train models for conversational tasks, where context from previous turns is important.
The toolkit represents these with the TrainingExample and TrainingDataset classes. A TrainingExample is a single data point, while a TrainingDataset is a collection of these examples, along with metadata about the dataset's format.
Here is how you would create a TrainingExample for each format:
from kerb.fine_tuning import TrainingExample
# Completion format example
completion_example = TrainingExample(
    prompt="Translate to French: Hello",
    completion="Bonjour"
)

# Chat format example
chat_example = TrainingExample(
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "How do I create a list in Python?"},
        {"role": "assistant", "content": "You can create a list using square brackets: my_list = [1, 2, 3]"}
    ]
)
Manually creating TrainingExample objects is useful, but you will often start with a large collection of raw data, perhaps in a list of dictionaries. The prepare_dataset function is designed to streamline the conversion of this raw data into a structured TrainingDataset, while also performing validation and cleaning.
This function can automatically handle several important steps:
Validation: checking that every example matches the declared format (for chat data, that each message contains the role and content keys).
Deduplication: removing entries that are exact copies of one another.
Shuffling: randomizing the order of examples so it does not bias training.
Suppose you have raw conversational data. You can process it into a clean, ready-to-use dataset with a single function call.
from kerb.fine_tuning import prepare_dataset, DatasetFormat, FineTuningProvider
raw_data = [
    {
        "messages": [
            {"role": "user", "content": "What is a dictionary?"},
            {"role": "assistant", "content": "A dictionary is a key-value data structure in Python."}
        ]
    },
    # Identical to the first entry to show deduplication
    {
        "messages": [
            {"role": "user", "content": "What is a dictionary?"},
            {"role": "assistant", "content": "A dictionary is a key-value data structure in Python."}
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "How do I iterate over a list?"},
            {"role": "assistant", "content": "Use a for loop: for item in my_list: print(item)"}
        ]
    }
]
# Prepare the dataset, enabling all cleaning options
dataset = prepare_dataset(
    data=raw_data,
    format=DatasetFormat.CHAT,
    provider=FineTuningProvider.OPENAI,
    validate=True,
    deduplicate=True,
    shuffle=True
)
print(f"Original examples: {len(raw_data)}")
print(f"Prepared examples: {len(dataset)}")
Notice that the final dataset has fewer examples than the raw data, as the duplicate entry was automatically removed. The provider argument helps tailor validation rules for specific services like OpenAI.
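If you are curious how deduplication of chat examples can work in principle, a common approach is to hash a canonical serialization of each example and drop repeats. The sketch below is a standalone illustration of that idea, not the toolkit's actual implementation:

import hashlib
import json

def dedupe_examples(examples):
    """Drop examples whose serialized content has been seen before."""
    seen = set()
    unique = []
    for example in examples:
        # Canonical JSON (sorted keys) so key ordering does not affect the hash
        key = hashlib.sha256(
            json.dumps(example, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique

print(len(dedupe_examples(raw_data)))  # 2 for the raw_data above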
Before investing time and money into a fine-tuning job, it is good practice to analyze your dataset's quality. A dataset with issues like empty entries, extreme length variations, or personally identifiable information (PII) can lead to poor model performance or privacy risks.
The analyze_dataset function provides a high-level statistical overview, including token counts, duplicate counts, and label distributions.
from kerb.fine_tuning import analyze_dataset
# Assuming 'dataset' is the TrainingDataset from the previous step
stats = analyze_dataset(dataset)
print(f"Total examples: {stats.total_examples}")
print(f"Total tokens: {stats.total_tokens}")
print(f"Average tokens per example: {stats.avg_tokens_per_example:.2f}")
print(f"Duplicate count: {stats.duplicate_count}")
For a more granular check, you can use specialized functions. For instance, detect_pii helps you find and remove sensitive information, which is a significant step for ensuring privacy and safety.
from kerb.fine_tuning.quality import detect_pii
text_with_pii = "Contact me at [email protected] or 555-123-4567"
pii_found = detect_pii(text_with_pii)
if pii_found:
    print("PII Detected:")
    for pii_type, values in pii_found.items():
        print(f" {pii_type}: {values}")
Running these quality checks helps you identify and fix problems early, saving you from failed training runs and leading to a more effective fine-tuned model.
Many fine-tuning projects aim to create "instruction-tuned" models that are experts at specific tasks. The system prompt is a powerful tool for this, as it sets the context and persona for the model. For consistency, it is best to use a standardized system prompt across all examples in your dataset.
The standardize_system_prompts function lets you apply a single system message to an entire dataset, replacing any existing ones. This ensures the model receives a consistent instruction set during training.
from kerb.fine_tuning.prompts import standardize_system_prompts
# Assuming 'dataset' is our prepared dataset
standard_prompt = "You are an expert Python programmer. Provide clear and accurate code examples."
standardized_dataset = standardize_system_prompts(dataset, standard_prompt)
# All examples in 'standardized_dataset' now have the same system prompt
print("System prompt has been standardized across the dataset.")
Different model providers require fine-tuning data to be formatted in a specific way, often as a JSONL (JSON Lines) file. The toolkit simplifies this by providing functions to convert your TrainingDataset into the required structure for major providers.
Once your dataset is prepared, you can format it for OpenAI and write it to a file.
import tempfile
import os
from kerb.fine_tuning import to_openai_format, write_jsonl
# Convert the dataset to the format expected by OpenAI's API
openai_formatted_data = to_openai_format(dataset)
# Write the data to a JSONL file
with tempfile.TemporaryDirectory() as temp_dir:
    file_path = os.path.join(temp_dir, "training_data.jsonl")
    write_jsonl(openai_formatted_data, file_path)
    print(f"Dataset written to {file_path}")

    # You can inspect the first line of the file to see the format
    with open(file_path, 'r') as f:
        print("\nFirst line of JSONL file:")
        print(f.readline().strip())
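For chat data, each line of the file is a standalone JSON object containing a messages array, which is the structure OpenAI's chat fine-tuning endpoint expects. The snippet below shows what writing one such record looks like without the toolkit, as a minimal sketch of the file format:

import json

record = {
    "messages": [
        {"role": "system", "content": "You are an expert Python programmer."},
        {"role": "user", "content": "What is a dictionary?"},
        {"role": "assistant", "content": "A dictionary is a key-value data structure in Python."}
    ]
}

# One JSON object per line, newline-terminated, UTF-8 encoded
with open("training_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")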
This final JSONL file is what you would upload to the provider's service to start a fine-tuning job. By following this structured preparation process, from raw data to a validated, provider-specific file, you establish a solid foundation for creating a high-performing custom model.