In the previous chapters, we established that Large Language Models are defined by their parameters, the numerical values learned during training that encapsulate the model's capabilities. We also identified the key hardware components involved, especially the GPU and its dedicated memory, VRAM. Now, let's explore the direct link between the number of parameters a model has and the amount of memory it requires to function.
Think of the model's parameters like the complete text of an enormous encyclopedia. When the encyclopedia is stored on a bookshelf (your computer's disk storage), it holds a vast amount of information, but you can't read it instantly. To actually use the information (run the LLM), you need to bring the relevant volumes (the parameters) to your reading desk where you can access them quickly.
For an LLM, the "reading desk" is the computer's active memory. While system RAM can be used, the most effective place to load the parameters for fast operation is the GPU's VRAM. Why VRAM? As discussed in Chapter 2, GPUs are designed for massively parallel computation, exactly what the mathematical operations inside an LLM demand. To achieve their remarkable speed, GPUs need extremely fast access to the data they are working on, and loading the model's parameters directly into the VRAM attached to the GPU provides that high-speed access.
If the parameters sat only in system RAM, the GPU would constantly have to wait for data to be transferred over a comparatively slow connection (the system bus), creating a significant bottleneck and drastically reducing performance. Therefore, for efficient LLM inference, the primary goal is to fit all the necessary model parameters into the available VRAM.
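To make this concrete, here is a minimal sketch, assuming PyTorch and a CUDA-capable GPU are available, showing that a tensor of parameters only occupies VRAM once it is explicitly moved onto the device:

```python
import torch

# Tensors created on the CPU live in system RAM until moved to the GPU.
weights = torch.randn(4096, 4096)   # ~16.8 million parameters, created in system RAM

if torch.cuda.is_available():
    weights = weights.to("cuda")              # copy across the bus into the GPU's VRAM
    used = torch.cuda.memory_allocated()      # bytes of VRAM currently held by tensors
    print(f"VRAM used by these parameters: {used / 1e6:.1f} MB")
else:
    print("No GPU detected; the parameters stay in system RAM.")
```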
Model parameters are loaded from slower disk storage into the GPU's fast VRAM to enable efficient processing during inference.
Each parameter in an LLM is essentially a number. Storing billions of these numbers naturally requires a significant amount of memory. The core relationship is straightforward:
If a model has 7 billion parameters, you need enough memory to store 7 billion numbers. If another model has 70 billion parameters, you need roughly ten times that amount of memory just to hold its parameters.
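A quick back-of-the-envelope calculation illustrates the scale. The bytes-per-parameter value below is an assumption (2 bytes, i.e. 16-bit precision); the next section explains how this figure varies with the data type:

```python
def parameter_memory_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Rough parameter-only memory estimate, ignoring activations and other overhead."""
    return num_params * bytes_per_param / 1e9

for billions in (7, 70):
    gb = parameter_memory_gb(billions * 10**9)
    print(f"{billions}B parameters -> ~{gb:.0f} GB just to hold the weights")
```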
This is the most fundamental factor connecting model size to hardware requirements. The sheer volume of parameters dictates the minimum memory capacity needed, primarily in terms of VRAM. While other factors like activation memory (which we'll touch upon later) also consume VRAM, the space needed for the parameters themselves is usually the largest component.
But how much space does each individual parameter take up? That depends on the numerical format, or precision, used to store it. We'll examine this detail more fully in the next section on data types, but the small sketch below gives a first sense of the differences.
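Assuming NumPy is installed, the `itemsize` attribute reports how many bytes one value of a given format occupies:

```python
import numpy as np

# Bytes occupied by one parameter in a few common numeric formats.
for dtype in ("float32", "float16", "int8"):
    print(f"{dtype}: {np.dtype(dtype).itemsize} byte(s) per parameter")
```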