Combining prompt formatting, tokenization, and attention mask generation into a single automated pipeline allows for consistent data preparation. A data pipeline transforms raw text into the exact tensor shapes required by the model architecture during training.
Building this pipeline requires a systematic approach. We will load a raw dataset, apply a formatting template to merge instructions and responses, tokenize the resulting text, and finally convert the data into multidimensional arrays formatted for PyTorch.
We begin by loading our raw data into memory. For this exercise, assume we have a JSON Lines file named training_data.jsonl. Each line in the file is a JSON object containing an instruction field and a response field. We use the Hugging Face datasets library to load this file efficiently.
from datasets import load_dataset
# Load the JSONL dataset
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
# Inspect the first row
print(dataset[0])
# Output: {'instruction': 'Translate to French: Hello', 'response': 'Bonjour'}
Loading the data into a Dataset object provides access to optimized memory mapping operations. The dataset is not loaded entirely into RAM at once, which prevents memory crashes when working with files that contain millions of rows.
The model cannot understand separate instruction and response fields. It expects a single continuous string formatted with specific structural markers. We define a standard Python function to concatenate these fields into the prompt format required by the model architecture.
def format_instruction(example):
    """
    Combines the instruction and response into a single standardized string.
    """
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return {"formatted_prompt": prompt}
We apply this function across the entire dataset using the .map() method. This operation creates a new column in our dataset containing the finalized text strings.
# Apply the formatting function to all rows
dataset = dataset.map(format_instruction)
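To confirm the template behaves as expected, we can run the formatting function on a single row in isolation. This standalone sketch uses a hypothetical sample dict mirroring one line of training_data.jsonl, so no Dataset object is required:

```python
# Hypothetical sample row mirroring one line of training_data.jsonl
example = {"instruction": "Translate to French: Hello", "response": "Bonjour"}

def format_instruction(example):
    """Combines the instruction and response into a single standardized string."""
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return {"formatted_prompt": prompt}

row = format_instruction(example)
print(row["formatted_prompt"])
# ### Instruction:
# Translate to French: Hello
#
# ### Response:
# Bonjour
```

The same string appears in the `formatted_prompt` column of every row after the `.map()` call.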
With the text structurally formatted, we must convert the characters into integer sequences. We instantiate a tokenizer corresponding to our base model. Neural networks require static input dimensions during batch processing. If the batch size is B and the sequence length is L, the resulting input tensor must always have the dimensions (B, L).
To achieve this uniform shape, we configure the tokenizer to truncate sequences that exceed our maximum length and pad sequences that fall short.
from transformers import AutoTokenizer
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("base-model-identifier")
# Assign a padding token if the base model lacks one
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    """
    Tokenizes the formatted prompts, applying padding and truncation.
    """
    return tokenizer(
        examples["formatted_prompt"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )
By setting padding="max_length" and max_length=512, every sequence in our dataset will consist of exactly 512 tokens. The tokenizer automatically generates two important arrays for each entry: input_ids, which represent the text, and attention_mask, which tells the model to ignore the padded positions during the attention computation.
We apply the tokenization function to the dataset. We use the batched=True parameter to process multiple examples simultaneously. This significantly accelerates the tokenization process by leveraging parallel execution. We also remove the original text columns, as the neural network only requires the numerical tensors.
# Apply tokenization and remove textual columns
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names,
)
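The relationship between the two arrays is easiest to see on a short example. The following sketch uses illustrative placeholder token IDs and a padding length of 8 instead of 512 to keep the output readable; it reproduces the structure the tokenizer emits, not its actual vocabulary:

```python
# Minimal sketch of the two arrays the tokenizer returns for one entry.
# Token IDs and pad_id are illustrative placeholders, and we pad to
# length 8 here instead of 512 for readability.
max_length = 8
token_ids = [101, 7592, 2088, 102]  # hypothetical IDs for a short prompt
pad_id = 0                          # hypothetical padding token ID

input_ids = token_ids + [pad_id] * (max_length - len(token_ids))
attention_mask = [1] * len(token_ids) + [0] * (max_length - len(token_ids))

print(input_ids)       # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The zeros in the attention mask line up exactly with the padded positions in input_ids, which is what allows the model to exclude them from its computations.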
Figure: Data processing stages transforming text entries into padded integer arrays.
The final step in the pipeline is to configure the data format for our machine learning framework. Currently, the dataset columns contain standard Python lists. The training loop requires PyTorch tensors. We apply the .set_format() method to convert these lists directly into tensors that can be transferred to the GPU.
# Set the format to PyTorch tensors
tokenized_dataset.set_format("torch")
# Verify the tensor dimensions
print(tokenized_dataset[0]["input_ids"].shape)
The output shape will be torch.Size([512]). When the PyTorch DataLoader groups these individual rows into batches during the training loop, the resulting matrices will perfectly align. For example, a batch size of 8 will produce a tensor of shape torch.Size([8, 512]).
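This alignment can be checked with a framework-free sketch of the stacking arithmetic. The rows below are placeholder lists of zeros standing in for tokenized sequences; the point is that stacking into a rectangular batch only works because every row has the same length:

```python
# Shape arithmetic for batching, without instantiating a real DataLoader.
# Each row holds exactly seq_len placeholder token IDs, so the rows
# stack cleanly into a (batch_size, seq_len) matrix.
seq_len = 512
batch_size = 8

rows = [[0] * seq_len for _ in range(batch_size)]
assert all(len(r) == seq_len for r in rows)  # uniform length makes stacking valid

batch_shape = (len(rows), len(rows[0]))
print(batch_shape)  # (8, 512)
```

If any row had a different length, this stacking step would fail, which is precisely why the truncation and padding configured earlier are required.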
The custom dataset pipeline is now complete. The raw text inputs have been standardized, converted to integers, bounded to a uniform length, and transformed into framework-specific tensors ready for the fine-tuning process.