When preparing datasets for a Small Language Model, feeding it plain text or generic JSON is insufficient. Base models are trained on unstructured text, but instruction-tuned models expect a highly specific conversational structure. This structure consists of special vocabulary tokens that separate different parts of an interaction. If you fine-tune a model using a format it was not originally trained on, the loss function will struggle to converge, and the resulting text generation will degrade significantly.
Prompt templates act as the structural blueprint for interactions. They tell the model who is speaking, what the system rules are, and when it is time to generate a response. A standard interaction usually involves a system message, a user prompt, and the expected assistant reply. To delineate these roles, architectures rely on special control tokens added to their vocabulary during the initial pre-training phase.
Figure: Raw dataset interactions mapped to architecture-specific prompt structures.
Different research groups have developed distinct formatting conventions over time. Understanding these differences is necessary when swapping out base models in your pipeline.
Alpaca Format
The Alpaca format emerged early in the instruction-tuning timeline. It relies on markdown-like headers rather than specialized vocabulary tokens. While older, it is still frequently used for many open-weight models. It usually looks like this:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a Python script to add two numbers.
### Response:
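As a minimal sketch, the template above can be assembled with a small helper. The function name `build_alpaca_prompt` and the exact whitespace (blank lines between sections, a newline after the response header) are assumptions made for illustration; match whatever spacing the checkpoint you are tuning was trained with.

```python
ALPACA_PREAMBLE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def build_alpaca_prompt(instruction: str, response: str = "") -> str:
    """Render one example in the Alpaca layout (hypothetical helper).

    Leave `response` empty to get an inference prompt that stops right
    after the "### Response:" header.
    """
    prompt = (
        f"{ALPACA_PREAMBLE}\n\n"
        f"### Instruction:\n{instruction.strip()}\n\n"
        "### Response:\n"
    )
    # Assumed spacing: the response starts on the line after the header.
    return prompt + response.strip()


print(build_alpaca_prompt("Write a Python script to add two numbers."))
```

Because the role boundaries here are ordinary text rather than reserved tokens, nothing prevents the same header strings from appearing inside user content, which is one reason later formats moved to dedicated tokens.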
ChatML Format
Chat Markup Language (ChatML) is used by modern models like Qwen and various Mistral variants. It introduces special tokens to define role boundaries explicitly, reducing the chance of the model hallucinating user inputs.
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
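To make the layout concrete, here is a rough sketch of a renderer that turns a list of role/content dictionaries into the ChatML string shown above. `render_chatml` and its `add_generation_prompt` flag are names chosen for this illustration, not part of any library, and the sketch assumes every turn is closed with `<|im_end|>` followed by a newline.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = False) -> str:
    """Render role/content dictionaries as a ChatML string (illustrative)."""
    parts = []
    for message in messages:
        # Each turn: start token, role name, newline, content, end token.
        parts.append(
            f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should answer next.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(render_chatml(chat, add_generation_prompt=True))
```

With the flag set, the output ends on an open assistant turn, which is the shape you want at inference time. For training data you would instead include the assistant message in the list and leave the flag off.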
Llama Format
Meta implemented a unique structure for their Llama series. For Llama 2, it uses strict bracketed tags like [INST] and <<SYS>>.
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
Hello! [/INST]
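A single-turn Llama 2 prompt can be sketched the same way. The helper name `build_llama2_prompt` is hypothetical, and the code assumes the tokenizer adds the beginning-of-sequence token on its own, so only the bracketed tags are written here. The blank line after the closing system tag follows the commonly used Llama 2 layout; verify it against your tokenizer's template before relying on it.

```python
def build_llama2_prompt(user_message: str, system_message: str = "") -> str:
    """Build a single-turn Llama 2 style prompt (illustrative helper)."""
    user_message = user_message.strip()
    if system_message:
        # The system block is nested inside the first [INST] segment.
        user_message = (
            f"<<SYS>>\n{system_message.strip()}\n<</SYS>>\n\n{user_message}"
        )
    return f"[INST] {user_message} [/INST]"


print(build_llama2_prompt("Hello!", "You are a helpful assistant."))
```

Note how the system message is not a separate turn in this format: it is folded into the first user instruction, which is a structural difference from ChatML.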
Manually concatenating strings for these formats is error-prone. Forgetting a single space or an end-of-sequence token can severely impact training. The Hugging Face Transformers library provides a built-in method called apply_chat_template attached directly to the tokenizer.
Instead of writing custom string manipulation scripts, you structure your data as a list of dictionaries. Each dictionary contains a specific role and the corresponding text content.
messages = [
    {"role": "system", "content": "You are a helpful AI."},
    {"role": "user", "content": "Explain backpropagation."},
]
When you pass this list to tokenizer.apply_chat_template(messages, tokenize=False), the tokenizer maps the generic dictionaries to the exact string format the specific model expects. The tokenize=False argument keeps the output as a human-readable string rather than a list of integer token IDs. This is recommended during the initial data preparation phase so you can visually inspect the formatting before passing the data to the model.
To prepare data for supervised fine-tuning, the assistant's response must also be included in the message list. The model learns to generate the exact sequence of text that follows the assistant prompt token.
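# Assumes `tokenizer` was already loaded for the target model
# (for example with AutoTokenizer.from_pretrained).
training_messages = [
    {"role": "system", "content": "You are a helpful AI."},
    {"role": "user", "content": "Explain backpropagation."},
    {"role": "assistant", "content": "Backpropagation is an algorithm used to train neural networks..."},
]
# tokenize=False returns the rendered string so it can be inspected.
formatted_prompt = tokenizer.apply_chat_template(training_messages, tokenize=False)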
This approach completely decouples your raw dataset from the specific model architecture. You can take a dataset formatted as standard JSON roles and fine-tune an Alpaca-style model today, then switch to a ChatML model tomorrow without rewriting a single line of your data processing pipeline.
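To get a raw dataset into that role-based shape in the first place, a small conversion step is usually enough. The sketch below assumes records with `instruction`, `input`, and `output` fields, a common instruction-dataset layout but an assumption here; `record_to_messages` is a hypothetical helper, not a library function.

```python
from typing import Dict, List, Optional


def record_to_messages(record: Dict[str, str],
                       system_prompt: Optional[str] = None) -> List[Dict[str, str]]:
    """Convert one raw record into a list of role/content dictionaries."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})

    # Assumed field names: "instruction", optional "input", optional "output".
    user_text = record["instruction"]
    if record.get("input"):
        user_text = f"{user_text}\n\n{record['input']}"
    messages.append({"role": "user", "content": user_text})

    if record.get("output"):
        messages.append({"role": "assistant", "content": record["output"]})
    return messages


example = {
    "instruction": "Explain backpropagation.",
    "output": "Backpropagation is an algorithm used to train neural networks...",
}
print(record_to_messages(example, system_prompt="You are a helpful AI."))
```

The resulting list is model-agnostic, so the only model-specific step left is the template rendering itself.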
When formatting prompts for fine-tuning, the arrangement directly impacts the loss calculation. Although the model processes the full token sequence, we only want it to update its weights based on generating the assistant's response, so the loss is masked out for the system and user prompt tokens.
If the prompt template is incorrectly formatted, the loss mask can misalign with the actual token boundaries. The cross-entropy loss will then be computed on system or user prompt tokens as well. This pushes the model to update its weights toward predicting the user's question rather than its own response, degrading conversational performance.
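One common way to implement this masking is to copy the input IDs into a label list and overwrite the prompt positions with -100, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below uses plain Python lists and made-up token IDs; `mask_prompt_labels` is a hypothetical helper, and it assumes you already know how many tokens belong to the system and user portion.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_prompt_labels(input_ids: List[int], prompt_length: int) -> List[int]:
    """Return labels where the first `prompt_length` positions are ignored."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must lie within the sequence")
    # Prompt positions contribute no loss; assistant positions keep their IDs.
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Made-up IDs: four prompt tokens, then three assistant tokens (last one EOS).
input_ids = [101, 102, 103, 104, 201, 202, 2]
labels = mask_prompt_labels(input_ids, prompt_length=4)
print(labels)  # [-100, -100, -100, -100, 201, 202, 2]
```

If the template is rendered differently from what was assumed when `prompt_length` was computed, the boundary shifts and the wrong tokens end up contributing to the loss, which is the misalignment described above.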
During text generation, models need a specific signal to stop generating output. This is governed by an End-Of-Sequence (EOS) token. When formatting prompts for training, you must ensure the EOS token is appended to the absolute end of the assistant's response.
If you omit this token, the fine-tuned model may ramble endlessly during inference because it never saw an example of stopping. With most chat templates, apply_chat_template closes the final assistant turn with the model's end-of-turn or EOS token, but this depends on the template, so inspect the rendered string to confirm. If you are forced to build templates manually for a custom architecture, you must explicitly append tokenizer.eos_token to the final training string.
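For manually built templates, a small guard can make the EOS step hard to forget. This is a sketch only: `ensure_eos` is a hypothetical helper, and the `"</s>"` string stands in for whatever `tokenizer.eos_token` returns for your model.

```python
def ensure_eos(text: str, eos_token: str) -> str:
    """Append `eos_token` to `text` unless it is already the final token."""
    if text.endswith(eos_token):
        return text
    return text + eos_token


EOS = "</s>"  # placeholder; in practice pass tokenizer.eos_token
sample = "### Response:\nBackpropagation is an algorithm..."
print(ensure_eos(sample, EOS))
```

Making the helper idempotent means it can be applied to every training string, including ones where a template already added the token, without producing a doubled EOS.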