Combining prompt formatting, tokenization, and attention mask generation into a single automated pipeline allows for consistent data preparation. A data pipeline transforms raw text into the exact tensor shapes required by the model architecture during training.
Building this pipeline requires a systematic approach. We will load a raw dataset, apply a formatting template to merge instructions and responses, tokenize the resulting text, and finally convert the data into multidimensional arrays formatted for PyTorch.
We begin by loading our raw data into memory. For this exercise, assume we have a JSON Lines file named training_data.jsonl. Each line in the file is a JSON object containing an instruction field and a response field. We use the Hugging Face datasets library to load this file efficiently.
from datasets import load_dataset
# Load the JSONL dataset
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
# Inspect the first row
print(dataset[0])
# Output: {'instruction': 'Translate to French: Hello', 'response': 'Bonjour'}
Loading the data into a Dataset object provides access to optimized memory mapping operations. The dataset is not loaded entirely into RAM at once, which prevents memory crashes when working with files that contain millions of rows.
The model cannot understand separate instruction and response fields. It expects a single continuous string formatted with specific structural markers. We define a standard Python function to concatenate these fields into the prompt format required by the model architecture.
def format_instruction(example):
    """
    Combines the instruction and response into a single standardized string.
    """
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return {"formatted_prompt": prompt}
We apply this function across the entire dataset using the .map() method. This operation creates a new column in our dataset containing the finalized text strings.
# Apply the formatting function to all rows
dataset = dataset.map(format_instruction)
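To confirm the template behaves as expected, we can run the formatting function on a single row in isolation. This standalone sketch uses a hypothetical sample dict mirroring one line of training_data.jsonl, so no Dataset object is required:

```python
# Hypothetical sample row mirroring one line of training_data.jsonl
example = {"instruction": "Translate to French: Hello", "response": "Bonjour"}

def format_instruction(example):
    """Combines the instruction and response into a single standardized string."""
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return {"formatted_prompt": prompt}

row = format_instruction(example)
print(row["formatted_prompt"])
# ### Instruction:
# Translate to French: Hello
#
# ### Response:
# Bonjour
```

The same string appears in the `formatted_prompt` column of every row after the `.map()` call.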
With the text structurally formatted, we must convert the characters into integer sequences. We instantiate a tokenizer corresponding to our base model. Neural networks require static input dimensions during batch processing. If the batch size is B and the sequence length is L, the resulting input tensor must always have the dimensions (B, L).
To achieve this uniform shape, we configure the tokenizer to truncate sequences that exceed our maximum length and pad sequences that fall short.
from transformers import AutoTokenizer
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("base-model-identifier")
# Assign a padding token if the base model lacks one
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    """
    Tokenizes the formatted prompts, applying padding and truncation.
    """
    return tokenizer(
        examples["formatted_prompt"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )
By setting padding="max_length" and max_length=512, every sequence in our dataset will consist of exactly 512 tokens. The tokenizer automatically generates two important arrays for each entry: input_ids, which represent the text, and attention_mask, which tells the model to ignore the padded positions during the attention computation.
We apply the tokenization function to the dataset. We use the batched=True parameter to process multiple examples simultaneously. This significantly accelerates the tokenization process by leveraging parallel execution. We also remove the original text columns, as the neural network only requires the numerical tensors.
# Apply tokenization and remove textual columns
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names,
)
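The relationship between the two arrays is easiest to see on a short example. The following sketch uses illustrative placeholder token IDs and a padding length of 8 instead of 512 to keep the output readable; it reproduces the structure the tokenizer emits, not its actual vocabulary:

```python
# Minimal sketch of the two arrays the tokenizer returns for one entry.
# Token IDs and pad_id are illustrative placeholders, and we pad to
# length 8 here instead of 512 for readability.
max_length = 8
token_ids = [101, 7592, 2088, 102]  # hypothetical IDs for a short prompt
pad_id = 0                          # hypothetical padding token ID

input_ids = token_ids + [pad_id] * (max_length - len(token_ids))
attention_mask = [1] * len(token_ids) + [0] * (max_length - len(token_ids))

print(input_ids)       # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The zeros in the attention mask line up exactly with the padded positions in input_ids, which is what allows the model to exclude them from its computations.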
Figure: Data processing stages transforming text entries into padded integer arrays.
The final step in the pipeline is to configure the data format for our machine learning framework. Currently, the dataset columns contain standard Python lists. The training loop requires PyTorch tensors. We apply the .set_format() method to convert these lists directly into tensors that can be transferred to the GPU.
# Set the format to PyTorch tensors
tokenized_dataset.set_format("torch")
# Verify the tensor dimensions
print(tokenized_dataset[0]["input_ids"].shape)
The output shape will be torch.Size([512]). When the PyTorch DataLoader groups these individual rows into batches during the training loop, the resulting matrices will perfectly align. For example, a batch size of 8 will produce a tensor of shape torch.Size([8, 512]).
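This alignment can be checked with a framework-free sketch of the stacking arithmetic. The rows below are placeholder lists of zeros standing in for tokenized sequences; the point is that stacking into a rectangular batch only works because every row has the same length:

```python
# Shape arithmetic for batching, without instantiating a real DataLoader.
# Each row holds exactly seq_len placeholder token IDs, so the rows
# stack cleanly into a (batch_size, seq_len) matrix.
seq_len = 512
batch_size = 8

rows = [[0] * seq_len for _ in range(batch_size)]
assert all(len(r) == seq_len for r in rows)  # uniform length makes stacking valid

batch_shape = (len(rows), len(rows[0]))
print(batch_shape)  # (8, 512)
```

If any row had a different length, this stacking step would fail, which is precisely why the truncation and padding configured earlier are required.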
The custom dataset pipeline is now complete. The raw text inputs have been standardized, converted to integers, bounded to a uniform length, and transformed into framework-specific tensors ready for the fine-tuning process.