In the previous chapter, we established the technical mechanics of Small Language Models and supervised fine-tuning. Now, we turn our attention to the data. A machine learning model is only as effective as the information it processes during training. For fine-tuning, this means converting raw text into a highly structured format that the model architecture expects.
Raw text cannot be fed directly into a neural network; it must first be converted into numerical representations. Consider a standard training dataset: the model requires each text to be converted into an array of integers through tokenization. When processing data in batches, sequences of variable length must be standardized, so we pad shorter sequences to match the longest one in the batch to maintain uniform matrix dimensions. If the maximum sequence length in the batch is L_max and a given input has length L, we append L_max − L padding tokens to the array.
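The two steps above can be sketched in a few lines of Python. This is a minimal illustration with a hypothetical toy word-level vocabulary; real tokenizers use learned subword schemes, but the padding logic is the same.

```python
# Toy word-level vocabulary (illustrative only; id 0 is reserved for padding).
vocab = {"<pad>": 0, "fine": 1, "tune": 2, "the": 3,
         "model": 4, "on": 5, "custom": 6, "data": 7}

def tokenize(text):
    # Map each whitespace-separated word to its integer id.
    return [vocab[word] for word in text.split()]

def pad_batch(sequences, pad_id=0):
    # Pad every sequence with pad_id to the length of the longest
    # sequence in the batch, giving uniform matrix dimensions.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = [tokenize("fine tune the model"), tokenize("on custom data")]
print(pad_batch(batch))  # [[1, 2, 3, 4], [5, 6, 7, 0]]
```

The shorter three-token sequence receives one padding token (L_max − L = 4 − 3 = 1) so both rows have length four.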
In this chapter, you will learn how to prepare custom data for supervised fine-tuning. We will cover the formatting rules for instruction datasets, organizing text into structured instruction and response pairs. You will apply tokenizers, implement padding strategies, and generate attention masks so the model mathematically ignores padding tokens during loss calculation. Finally, you will build a complete data pipeline that reads custom datasets and outputs the precise tensor shapes required by specific model architectures.
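Since attention masks appear throughout the chapter, here is a hedged preview of the idea: a mask marks each position as real (1) or padding (0), so padded positions can be excluded from attention and from the loss. This sketch assumes pad id 0 and uses plain Python lists; the chapter builds the full tensor-based version.

```python
def attention_mask(padded_batch, pad_id=0):
    # 1 for real tokens, 0 for padding tokens, position by position.
    return [[0 if tok == pad_id else 1 for tok in seq]
            for seq in padded_batch]

# One full-length sequence and one sequence with a trailing pad token.
print(attention_mask([[1, 2, 3, 4], [5, 6, 7, 0]]))
# [[1, 1, 1, 1], [1, 1, 1, 0]]
```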
2.1 Structuring Instruction Datasets
2.2 Tokenization and Padding Strategies
2.3 Handling Attention Masks
2.4 Formatting Prompts for Specific Architectures
2.5 Practice: Building a Custom Dataset Pipeline