Loading a pre-trained Small Language Model into local memory verifies that your environment and hardware are compatible. Before writing training loops, you must confirm that the base model loads correctly and can generate text. This step establishes the model's baseline behavior before any supervised fine-tuning occurs.
For this exercise, we use the Hugging Face transformers library to interface with the model, loading a small, standard architecture that fits comfortably on consumer hardware.
Language models do not process raw text. They require a tokenizer to translate human-readable strings into numerical token IDs. These IDs correspond to specific entries in the model's embedding matrix. You initialize the tokenizer using the AutoTokenizer class, pointing it to a specific model repository.
from transformers import AutoTokenizer
model_id = "Qwen/Qwen1.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
The tokenizer downloads the vocabulary file associated with the model. This ensures that the token IDs generated by your script perfectly match the token IDs the model was exposed to during its initial training phase.
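The round trip from string to token IDs and back can be illustrated with a toy word-level vocabulary. This is only a sketch of the concept: the real Qwen tokenizer uses a subword (BPE-style) vocabulary with tens of thousands of entries, and the words and IDs below are invented for illustration.

```python
# Toy illustration of tokenization (NOT the real BPE tokenizer):
# a vocabulary maps strings to integer IDs, and decoding inverts the mapping.
vocab = {"machine": 0, "learning": 1, "is": 2, "fun": 3}
inv_vocab = {i: s for s, i in vocab.items()}

def encode(text):
    """Map whitespace-separated, lowercased words to token IDs."""
    return [vocab[w] for w in text.lower().split()]

def decode(ids):
    """Map token IDs back to a space-joined string."""
    return " ".join(inv_vocab[i] for i in ids)

ids = encode("Machine learning is fun")
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # machine learning is fun
```

The real tokenizer performs the same two mappings, just over learned subword units rather than whole words, which is why matching the exact vocabulary file from the model repository matters.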
Next, you bring the model weights into memory. These are the same weight matrices that appeared in the gradient equations earlier in the chapter. The from_pretrained function loads these exact matrices from the model repository into your system's memory.
To respect the memory limits discussed in the previous section, we instruct PyTorch to load the model using 16-bit precision.
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
The device_map="auto" argument evaluates your available hardware and automatically places the model layers on the GPU. If no GPU is available, it defaults to the CPU. The torch_dtype=torch.bfloat16 argument halves the memory footprint compared to standard 32-bit floats, which is a standard practice when preparing models for parameter-efficient fine-tuning.
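A quick back-of-the-envelope calculation shows why the precision choice matters. Assuming roughly 0.5 billion parameters (the exact count differs slightly), the weights alone require about 2 GB in 32-bit floats but only about 1 GB in bfloat16; activations and the KV cache add more on top.

```python
# Rough memory estimate for the weights alone (activations and the
# KV cache are extra). The 0.5B parameter count is approximate.
n_params = 0.5e9
bytes_fp32 = n_params * 4  # 32-bit floats: 4 bytes per parameter
bytes_bf16 = n_params * 2  # bfloat16: 2 bytes per parameter

print(f"fp32: {bytes_fp32 / 1e9:.1f} GB")  # ~2.0 GB
print(f"bf16: {bytes_bf16 / 1e9:.1f} GB")  # ~1.0 GB
```

Halving the footprint like this is what leaves headroom on consumer GPUs for the optimizer state and adapter weights used later in parameter-efficient fine-tuning.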
With the model and tokenizer active, you can execute a basic text generation task. This involves preparing a prompt, converting it into a tensor, and passing it to the model.
prompt = "Machine learning is a field of artificial intelligence that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
When you call model.generate, the system performs a sequence of forward passes. During each pass, the model calculates logits representing the probability distribution for the next token.
The parameters passed to the generate function control how the model samples from these probabilities. The max_new_tokens parameter stops the loop after 50 tokens to prevent unbounded generation. Setting do_sample=True activates probabilistic sampling, and the temperature parameter divides the logits before the softmax operation. A temperature of 0.7 sharpens the distribution somewhat, concentrating probability on likely tokens while still allowing variation in the generated text. A lower temperature would make the output more deterministic; a value above 1.0 would flatten the distribution and increase randomness.
Figure: the standard text generation pipeline, converting human text to numerical tensors and back into text.
When you print the response, you will notice that the model attempts to complete the sentence based on its pre-training data. At this stage, the output might wander off topic or fail to follow specific instructions. Because the model has not yet undergone supervised fine-tuning for your specific use case, it acts purely as a statistical text continuator. This behavior demonstrates exactly why you need to format custom datasets and adjust the internal parameters to create a reliable, task-specific application.