Now that we understand what inference is – using a pre-trained LLM to generate text, answer questions, or perform other language tasks – let's look at the hardware resources needed to make this happen. Compared to training, the hardware requirements for inference are generally less demanding, but they still scale directly with the size of the model you intend to run.
The most immediate hardware consideration for running LLM inference is memory. Specifically, you need enough memory to hold the model's parameters.
Think of the model parameters as a very large instruction manual. VRAM is like a workbench right next to the worker (the GPU). If the manual fits on the workbench, the worker can reference it very quickly. If the manual is too large and has to be kept on a shelf across the room (system RAM), the worker has to constantly walk back and forth, slowing down the entire process significantly.
Therefore, the primary question when considering hardware for inference is: "Do I have enough VRAM to load the desired model?"
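A practical first step is simply checking how much VRAM your GPU offers. Below is a minimal sketch, assuming PyTorch is installed and a CUDA-capable (NVIDIA) GPU is visible to it; other frameworks and vendors expose similar queries.

```python
# Minimal sketch: report free and total VRAM on the first visible GPU.
# Assumes PyTorch is installed and a CUDA-capable GPU is present.
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    gib = 1024 ** 3
    print(f"GPU 0: {free_bytes / gib:.1f} GiB free of {total_bytes / gib:.1f} GiB total")
else:
    print("No CUDA GPU detected; inference would fall back to the CPU and system RAM.")
```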
We'll explore how to estimate this in Chapter 5, but the basic principle is that larger models (more parameters) require more VRAM. Using techniques like quantization (introduced briefly in Chapter 3) can reduce the memory footprint, allowing larger models to fit into less VRAM, but the fundamental need remains: the (potentially compressed) model must fit.
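Ahead of the fuller treatment in Chapter 5, the back-of-the-envelope arithmetic is simply parameter count times bytes per parameter, plus some working overhead. The 7B example, the precisions, and the 20% overhead factor below are illustrative assumptions, not measurements for any specific model.

```python
# Rough estimate of VRAM needed to hold a model's weights at a given precision.
# The 1.2x overhead factor is an illustrative allowance for activations,
# the KV cache, and framework buffers; real usage varies by model and workload.
def estimate_vram_gib(params_billions: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return weight_bytes * overhead / 1024 ** 3

for precision, bytes_per_param in [("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    print(f"7B parameters at {precision}: ~{estimate_vram_gib(7, bytes_per_param):.1f} GiB")
```

Running this shows why quantization matters: the same 7B-parameter model drops from roughly 16 GiB at FP16 to under 4 GiB at 4-bit precision.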
While having enough VRAM is necessary to load the model, the GPU's processing capability determines how fast the inference runs. Inference involves performing many calculations (matrix multiplications) using the model parameters and your input prompt to generate the output.
So, while VRAM determines if you can run a model, the GPU's compute power and memory bandwidth dictate how fast you can run it.
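One useful rule of thumb: when generating a single stream of text, each new token requires reading roughly the full set of model weights from VRAM, so memory bandwidth often caps tokens per second. The sketch below uses illustrative hardware numbers (not benchmark results) to show the shape of that calculation.

```python
# Rough ceiling on single-stream generation speed: each token requires
# streaming approximately the whole model through the GPU, so throughput
# is bounded near memory_bandwidth / model_size.
# The model size and bandwidth values are illustrative assumptions.
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    return bandwidth_gb_per_s / model_size_gb

model_size_gb = 14.0        # e.g. a 7B-parameter model stored in FP16
bandwidth_gb_per_s = 900.0  # e.g. a high-end consumer GPU

print(f"Theoretical ceiling: ~{max_tokens_per_second(model_size_gb, bandwidth_gb_per_s):.0f} tokens/sec")
```

Real throughput will be lower than this ceiling, but the relationship explains why two GPUs with the same VRAM can generate text at very different speeds.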
This chart illustrates the general impact of different hardware components on the speed of LLM inference, assuming there is enough VRAM to load the model in the first place. GPU compute and memory bandwidth are the main drivers of performance, measured in tokens per second.
System RAM and the CPU also play supporting roles during inference: system RAM holds the operating system, the inference software, and the model weights while they are loaded from disk and transferred to VRAM, while the CPU handles tasks such as tokenizing your prompt, coordinating data transfers, and running the surrounding application logic.
In summary, for inference, you need:

- Enough VRAM to hold the model's (possibly quantized) parameters.
- Sufficient GPU compute power and memory bandwidth to generate tokens at an acceptable speed.
- Adequate system RAM and CPU capacity to load the model and run the surrounding software.
Understanding these roles is important when selecting hardware or choosing which models are feasible to run on your existing system. The focus for most users interacting with pre-trained LLMs will be meeting these inference requirements, which are significantly less intensive than the requirements for training models from scratch.