Instruction-Based vs. Conversational Data Formats

The structure of your fine-tuning data directly shapes the model's learned abilities. To teach a model a new skill, you must present examples of that skill in a consistent, machine-readable format. For most fine-tuning tasks, this data is structured in one of two primary formats: instruction-based or conversational. The format you choose is a foundational decision that aligns your dataset with the model's intended use case.

The Instruction-Based Format

Instruction-based datasets are designed to teach a model how to perform a specific, discrete task. The model learns to map a given prompt to a desired completion. This format is ideal for applications like summarization, translation, classification, and question answering, where the interaction is a single request and response.

The most common structure is a collection of JSON objects, where each object represents a single training example. While the field names can vary, they typically contain an instruction, an optional input for context, and the expected output.

For example, consider a dataset for teaching a model to summarize articles. A single data point might look like this:

{
  "instruction": "Summarize the following article in three sentences.",
  "input": "The Falcon 9 is a two-stage rocket designed and manufactured by SpaceX for the reliable and safe transport of satellites and the Dragon spacecraft into orbit. The rocket's first stage is reusable, capable of re-entering the atmosphere and landing vertically after separating from the second stage. This reusability has significantly reduced the cost of access to space.",
  "output": "The Falcon 9 is a two-stage, reusable rocket from SpaceX used for orbital transport. Its first stage can land vertically after launch, a feature that lowers launch costs. This capability makes it a cost-effective choice for deploying satellites and spacecraft."
}
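Datasets in this shape are often stored with one JSON object per line (JSON Lines). As a minimal sketch, assuming that storage layout and the field names used above, the following checks each record before it reaches training. The function name `load_instruction_examples` is hypothetical, and treating `instruction` and `output` as the only required fields is an assumption for this illustration.

```python
import json

# Assumed schema: "instruction" and "output" are required, "input" is optional.
REQUIRED_FIELDS = ("instruction", "output")


def load_instruction_examples(lines):
    """Parse JSON Lines text into validated instruction examples."""
    examples = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        record.setdefault("input", "")  # normalize the optional field
        examples.append(record)
    return examples


sample_lines = [
    '{"instruction": "Translate to French.", "input": "Good morning.", "output": "Bonjour."}',
    '{"instruction": "Name a primary color.", "output": "Red."}',
]
examples = load_instruction_examples(sample_lines)
```

Failing early on a malformed record is usually cheaper than discovering it partway through a training run.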

During training, these structured fields are often combined into a single string using a prompt template. The template standardizes the input format, signaling to the model what kind of task it needs to perform. For the example above, a template might be:

prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
"""

The model is then trained to generate the text from the output field as the completion to this formatted prompt. The consistency of the prompt template across all examples is important for stable training.
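The templating step can be sketched as a small function. This is an illustration rather than any particular library's implementation: the function name `build_training_pair` and the shorter header used when `input` is empty are assumptions made for this example.

```python
HEADER_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request."
)
# Assumed wording for examples that have no input field.
HEADER_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request."
)


def build_training_pair(example):
    """Return (prompt, completion) strings for one instruction example."""
    instruction = example["instruction"]
    context = example.get("input", "")
    if context:
        prompt = (
            f"{HEADER_WITH_INPUT}\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    else:
        # Drop the Input section entirely instead of leaving it empty.
        prompt = (
            f"{HEADER_NO_INPUT}\n\n"
            f"### Instruction:\n{instruction}\n\n"
            "### Response:\n"
        )
    return prompt, example["output"]


prompt, completion = build_training_pair(
    {
        "instruction": "Classify the sentiment as positive or negative.",
        "input": "The battery life is fantastic.",
        "output": "positive",
    }
)
print(prompt + completion)
```

Keeping the prompt and completion separate makes it possible to compute the loss only on the completion tokens, which many training setups do; whether that applies depends on the training library in use.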

The Conversational Format

Conversational datasets teach a model to engage in multi-turn dialogue. Instead of a single instruction and response, the data is structured as a sequence of exchanges. This format is essential for building chatbots, virtual assistants, or any application that must maintain context across several interactions.

Data is typically represented as a list of turns, where each turn specifies the speaker's role (user or assistant) and their message (content). Many open-source models like Llama are fine-tuned on data structured this way.

Here is an example of a conversation with two user-assistant exchanges, for training a technical support assistant:

{
  "messages": [
    {
      "role": "user",
      "content": "I'm trying to connect to my new database, but I'm getting a 'Connection Timed Out' error. What should I check first?"
    },
    {
      "role": "assistant",
      "content": "A 'Connection Timed Out' error usually points to a network or firewall issue. Can you confirm that the database server's IP address is reachable from your machine and that port 5432 is open on any firewalls between you and the server?"
    },
    {
      "role": "user",
      "content": "I can ping the IP, but how do I check the port?"
    },
    {
      "role": "assistant",
      "content": "You can use a tool like `telnet` or `nc` (netcat). For example, from your command line, you could run `telnet YOUR_DATABASE_IP 5432`. If it connects, the port is open; otherwise, it's likely blocked."
    }
  ]
}

This entire list of messages constitutes a single training example. The model learns to generate each assistant response based on the conversation history that precedes it. During processing, the list of turns is converted into a single formatted string, often using special tokens defined by the base model's tokenizer to delineate roles and turns.
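A minimal sketch of that conversion follows. The markers `<|user|>`, `<|assistant|>`, and `<|end|>` are placeholders invented for this illustration; real models define their own special tokens, and in practice the template that ships with the base model's tokenizer should be used. The sketch also shows how one conversation can yield one (history, response) pair per assistant turn.

```python
# Placeholder markers for illustration only; not any real model's tokens.
ROLE_MARKERS = {"user": "<|user|>", "assistant": "<|assistant|>"}
END_OF_TURN = "<|end|>"


def render_turn(message):
    """Format one message as role marker, content, and end-of-turn marker."""
    return f"{ROLE_MARKERS[message['role']]}\n{message['content']}{END_OF_TURN}\n"


def render_conversation(messages):
    """Flatten a list of turns into a single training string."""
    return "".join(render_turn(m) for m in messages)


def assistant_targets(messages):
    """Yield (history_string, response) for each assistant turn."""
    for i, message in enumerate(messages):
        if message["role"] == "assistant":
            history = render_conversation(messages[:i])
            # The history ends with the assistant marker so the model
            # continues from that point.
            yield history + ROLE_MARKERS["assistant"] + "\n", message["content"]


messages = [
    {"role": "user", "content": "I can ping the IP, but how do I check the port?"},
    {"role": "assistant", "content": "You can use a tool like telnet or nc."},
    {"role": "user", "content": "The connection does not go through."},
    {"role": "assistant", "content": "Then the port is likely blocked."},
]
text = render_conversation(messages)
pairs = list(assistant_targets(messages))
```

With Hugging Face Transformers, tokenizers that ship a chat template expose `apply_chat_template` for this purpose, which avoids hand-writing the markers.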

Comparing Data Structures

The primary difference between these two formats lies in their representation of context. The instruction-based format is stateless, treating each example as an independent task. The conversational format is stateful, teaching the model to build upon prior turns in the dialogue.

Diagram illustrating the information flow for instruction-based (single-turn) and conversational (multi-turn) data formats.

How to Choose the Right Format

Your choice of data format should be driven by the final application's requirements.

Use the instruction-based format when:
- The task is atomic and does not depend on previous interactions. Examples include document summarization, sentiment analysis, or code generation from a specification.
- You want to build a reliable tool for a well-defined, repeatable function.
- The input and output are clearly defined and self-contained.

Use the conversational format when:
- The model must remember previous parts of the interaction to respond coherently.
- The application is a chatbot, customer support agent, or role-playing character.
- The interaction is expected to be a dialogue that evolves over multiple turns.

It is also possible to frame single-turn tasks within a conversational structure. For instance, an instruction can simply become the user message in a conversation with a single exchange, with the expected output as the assistant's reply. However, for strictly task-oriented fine-tuning, the instruction-based format is more direct and often easier to construct. Selecting the appropriate data structure is the first step in ensuring your model learns the behavior you intend to build.
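As a minimal sketch of that reframing, the hypothetical helper below converts an instruction-style record into the `messages` structure shown earlier. Joining the instruction and input with a blank line is an assumption; other separators work as long as they are applied consistently across the dataset.

```python
def instruction_to_messages(example):
    """Wrap one instruction example as a single user/assistant exchange."""
    user_content = example["instruction"]
    if example.get("input"):
        # Assumed convention: instruction, blank line, then the context.
        user_content += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }


converted = instruction_to_messages(
    {
        "instruction": "Translate to French.",
        "input": "Good morning.",
        "output": "Bonjour.",
    }
)
```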
