Once you've generated your synthetic data, the next practical step is to prepare it for the fine-tuning process. Fine-tuning frameworks and libraries don't magically understand arbitrary data arrangements. Instead, they expect data to be organized in specific structures and formats. Getting this right is fundamental for a successful fine-tuning run. Think of it like calling a function in a program: you need to pass the arguments in the expected order and format for the function to work correctly. This section details common data formats and structuring conventions used in LLM fine-tuning.
While the landscape of LLM tools is always evolving, a few data formats have become prevalent due to their simplicity and utility.
JSON Lines, often abbreviated as JSONL, is a widely adopted format for fine-tuning datasets. In a JSONL file, each line is a complete, self-contained JSON object. This structure offers several advantages: files can be parsed one line at a time without loading everything into memory, new examples can be appended without rewriting the file, and a single malformed line can be skipped without invalidating the rest of the dataset.
For instruction fine-tuning, a typical JSONL entry might look like this:
{"instruction": "Translate the following English sentence to French.", "input": "Hello, world!", "output": "Bonjour, le monde!"}
{"instruction": "Summarize the main points of the provided text.", "input": "The text discusses the benefits of regular exercise, including improved cardiovascular health, weight management, and enhanced mental well-being. It also mentions the importance of consistency.", "output": "Regular exercise offers benefits like better heart health, weight control, and improved mental state, with consistency being important."}
{"instruction": "Write a Python function that calculates the factorial of a number.", "output": "def factorial(n):\n    if n == 0:\n        return 1\n    else:\n        return n * factorial(n-1)"}
Notice the input field is optional. If an instruction doesn't require additional context beyond the instruction itself, the input field can be omitted or left as an empty string. Some frameworks might use slightly different key names, such as prompt and completion, or a single text field that contains a fully formatted prompt including the instruction, input, and a placeholder for the response. Always consult the documentation for the specific fine-tuning framework you are using.
CSV and TSV files are simpler, text-based formats where data is organized into rows, with values in each row separated by commas or tabs, respectively. While less flexible than JSONL for complex nested data, they can be suitable for straightforward prompt-completion pairs.
A CSV file for fine-tuning might have columns like prompt and completion:
prompt,completion
"What is the capital of France?","Paris"
"Explain the concept of photosynthesis in simple terms.","Photosynthesis is the process plants use to convert light energy into chemical energy, making their own food."
When using CSV or TSV, pay close attention to how newlines and special characters within your text fields are handled, as they can sometimes cause parsing issues.
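Python's built-in csv module handles this quoting for you: fields containing commas or newlines are quoted on write and restored intact on read. A small round-trip sketch (column names match the example above):

```python
import csv
import io

rows = [
    {"prompt": "List two colors, separated by a comma.", "completion": "red, blue"},
    {"prompt": "Write a three-line haiku.", "completion": "Line one\nLine two\nLine three"},
]

# csv.DictWriter automatically quotes fields containing commas or newlines
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["prompt", "completion"])
writer.writeheader()
writer.writerows(rows)

# Round-trip: csv.DictReader restores the embedded commas and newlines intact
parsed = list(csv.DictReader(io.StringIO(buf.getvalue())))
```

If you build CSV files by naive string concatenation instead, embedded commas and newlines will silently corrupt rows, so prefer a proper CSV library on both the writing and reading side.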
The way you structure your synthetic data within these formats depends heavily on the fine-tuning objective.
As seen in the JSONL example, instruction fine-tuning (IFT) datasets typically consist of instruction-response pairs. The goal is to teach the LLM to follow directives effectively. The synthetic data you generate should clearly delineate the instruction being given, any optional input or context it applies to, and the desired response.
Many fine-tuning scripts will then internally combine these fields into a single formatted prompt string that the model sees during training. A common templating pattern might look like this:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}
If the input field is empty, that part of the template would be omitted. Consistency in applying such templates is important for the model to learn the pattern.
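A helper that applies this template might look like the sketch below. The function name build_prompt is illustrative, not from any particular framework; the point is that the Input section is dropped entirely, not left blank, when there is no input:

```python
TEMPLATE_WITH_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

TEMPLATE_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

def build_prompt(example):
    # Omit the Input section entirely when the field is missing or empty
    if example.get("input"):
        return TEMPLATE_WITH_INPUT.format(**example)
    return TEMPLATE_NO_INPUT.format(instruction=example["instruction"],
                                    output=example["output"])
```

Applying one deterministic function like this to every record is what gives the model the consistent pattern mentioned above.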
For fine-tuning models to be effective chatbots or conversational agents, the data needs to represent multi-turn dialogues. Each entry in your dataset, typically a JSON object in a JSONL file, would represent an entire conversation or a segment of one. Within this object, you'd usually have a list of messages, where each message specifies the role (e.g., system, user, assistant) and the content of the message.
Here's an example of a single JSONL entry for conversational fine-tuning:
{
"messages": [
{"role": "system", "content": "You are a helpful assistant that provides concise answers."},
{"role": "user", "content": "What is the primary benefit of using synthetic data for LLM fine-tuning?"},
{"role": "assistant", "content": "It allows for creating targeted, task-specific datasets, especially when real-world data is scarce or difficult to obtain."}
]
}
This structured format allows the model to learn the flow of conversation and appropriate responses based on the preceding dialogue and the roles of the speakers.
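Because one malformed conversation can derail a training run, it is worth validating entries against this schema before training. The check below is a sketch (the function name and exact rules are illustrative, not from any framework): every message needs a known role and non-empty string content, and a useful training example needs at least one assistant turn to learn from.

```python
VALID_ROLES = {"system", "user", "assistant"}

def validate_conversation(entry):
    """Return True if a conversational record has well-formed messages."""
    messages = entry.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for msg in messages:
        if not isinstance(msg, dict) or msg.get("role") not in VALID_ROLES:
            return False
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            return False
    # Training targets come from assistant turns, so require at least one
    return any(m["role"] == "assistant" for m in messages)
```

Running such a filter over synthetic conversations catches generation glitches (missing roles, empty turns) early, before they surface as cryptic errors in the fine-tuning script.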
If your goal is to adapt a pre-existing LLM to a specific domain (e.g., legal texts, medical research) or a particular style, your synthetic data might consist of large blocks of text representative of that domain or style. In this case, the structure can be very simple, often just a JSONL file where each line contains a JSON object with a single text field:
{"text": "This is a long passage of synthetic text written in the style of 19th-century scientific articles..."}
{"text": "Another example document focusing on legal terminology and case precedents..."}
The model is then further pretrained on this specialized corpus, allowing its internal representations and generation patterns to adapt.
The following diagram illustrates the transformation of raw synthetic data components into a JSONL entry and then how it might be formatted into a prompt string for an instruction fine-tuning task.
(Diagram: data transformation from individual components to a JSONL entry, and then into a formatted prompt string ready for a fine-tuning framework.)
Regardless of the format, the text data you prepare will ultimately be converted into tokens by the LLM's tokenizer. While the raw data structure doesn't usually include tokens directly, it should be designed to make the later addition of any required special tokens (e.g., <s> for start of sequence, </s> for end of sequence, [INST] and [/INST] for Llama 2 instruction markers) straightforward.
For instance, if you're using a chat format with role and content fields, the fine-tuning script will typically iterate through the messages and apply the model-specific chat template, inserting special tokens between turns or around user/assistant messages automatically. If you are creating monolithic prompt strings, you might need to include these special tokens directly in your data generation or templating logic.
While not strictly a formatting issue for individual data points, remember that your overall synthetic dataset needs to be split into training, validation, and sometimes test sets. This is typically done by creating separate files (e.g., train.jsonl, validation.jsonl) or by adding a field to each data entry indicating its assignment (though separate files are more common for fine-tuning). The structure within each file remains consistent with the formats discussed above.
Many popular LLM libraries, such as Hugging Face datasets, provide convenient utilities for loading data in standard formats like JSONL or CSV. They often allow you to define dataset features explicitly and can handle much of the parsing and preprocessing. Using such libraries can save you time and help ensure your data is loaded correctly by the fine-tuning scripts. For example, the datasets library can load a JSONL file with a simple command:
# Example Python code snippet
from datasets import load_dataset

# Load a dataset from a JSONL file
# Assumes your data is in 'my_synthetic_data.jsonl'
dataset = load_dataset('json', data_files='my_synthetic_data.jsonl')

# For instruction fine-tuning, you might then map it to a specific format
def format_instruction(example):
    if example.get('input'):
        return {'text': f"Instruction: {example['instruction']}\nInput: {example['input']}\nOutput: {example['output']}"}
    return {'text': f"Instruction: {example['instruction']}\nOutput: {example['output']}"}

formatted_dataset = dataset.map(format_instruction)
This snippet illustrates how you might load and then transform your structured JSONL data to fit the exact string format expected by a particular model or fine-tuning script.
In summary, structuring your synthetic data appropriately is a critical preparatory step for fine-tuning. While JSONL is a versatile and common choice, always refer to the documentation of your chosen fine-tuning framework or model for specific formatting requirements. The hands-on practical later in this chapter will give you an opportunity to apply these principles by creating and structuring a synthetic dataset for a specific task.
© 2025 ApX Machine Learning