In the previous section, we established that the millions or billions of parameters making up an LLM must be loaded into your computer's memory before the model can run. Specifically, for fast processing during inference, these parameters ideally reside in the GPU's dedicated memory, the VRAM. But how much memory does each parameter actually take up? The answer depends on its data type and precision.
Think of data types like containers for numbers inside the computer. Just like you have different-sized containers in your kitchen for storing different amounts of food, computers use different-sized digital containers to store numbers. The size of the container determines the precision of the number, essentially how much detail it can represent.
Traditionally, many machine learning models, including early LLMs, were trained using a data type called Floating-Point 32, often abbreviated as FP32. Each parameter stored in FP32 format occupies 32 bits of memory.
Since there are 8 bits in a byte, an FP32 parameter takes up:
$$\frac{32 \text{ bits}}{8 \text{ bits/byte}} = 4 \text{ bytes}$$

This might not sound like much, but remember, LLMs have billions of parameters. Let's consider a relatively small 7 billion parameter model (often written as 7B). To store its parameters using FP32, the memory requirement is:

$$7{,}000{,}000{,}000 \text{ parameters} \times 4 \text{ bytes/parameter} = 28{,}000{,}000{,}000 \text{ bytes}$$

That's 28 billion bytes, which translates to 28 gigabytes (GB) of memory, primarily VRAM, just for the model weights! This is a significant amount and highlights why running larger models requires substantial hardware.
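If you want to verify this arithmetic yourself, the short Python sketch below reproduces the calculation. The variable names are ours, and it follows this section's convention of 1 GB = 10^9 bytes.

```python
# Memory needed to store the weights of a 7B-parameter model in FP32.
num_parameters = 7_000_000_000
bytes_per_parameter = 32 // 8              # FP32: 32 bits = 4 bytes

total_bytes = num_parameters * bytes_per_parameter
total_gb = total_bytes / 1e9               # 1 GB = 10^9 bytes, as in the text

print(f"FP32 weights: {total_gb:.0f} GB")  # FP32 weights: 28 GB
```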
To make LLMs more accessible and efficient, developers often use lower precision data types. A very common one is Floating-Point 16 (FP16), sometimes called "half-precision". As the name suggests, each FP16 parameter uses only 16 bits.
Calculating the memory per parameter:
$$\frac{16 \text{ bits}}{8 \text{ bits/byte}} = 2 \text{ bytes}$$

Now, let's recalculate the memory needed for our 7B parameter model using FP16:

$$7{,}000{,}000{,}000 \text{ parameters} \times 2 \text{ bytes/parameter} = 14{,}000{,}000{,}000 \text{ bytes}$$

This is 14 GB. By switching from FP32 to FP16, we've cut the memory required for the model parameters in half! This makes a huge difference. It means you might be able to run the model on a GPU with 16 GB of VRAM, whereas the FP32 version would require a GPU with more than 28 GB of VRAM.
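If you have PyTorch installed, you can confirm the per-parameter sizes directly: `element_size()` reports how many bytes a single tensor element occupies. This is only an illustrative check of the arithmetic, not a full measurement of a real model's memory use.

```python
import torch

# Bytes occupied by one parameter in FP32 vs. FP16.
fp32_weights = torch.zeros(1_000, dtype=torch.float32)
fp16_weights = fp32_weights.half()              # cast to half precision (FP16)

print(fp32_weights.element_size())              # 4 bytes per parameter
print(fp16_weights.element_size())              # 2 bytes per parameter

# Scale up to a 7B-parameter model (1 GB = 10^9 bytes):
num_parameters = 7_000_000_000
print(num_parameters * fp16_weights.element_size() / 1e9)  # 14.0 GB
```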
Does using half the precision hurt the model's performance? Sometimes there's a small decrease in accuracy, but for many tasks, especially inference (running the model), the difference is often negligible, making FP16 a popular choice for deploying LLMs.
We can potentially reduce memory usage even further using integer data types, such as Integer 8 (INT8). An INT8 parameter uses just 8 bits.
Memory per parameter:
$$\frac{8 \text{ bits}}{8 \text{ bits/byte}} = 1 \text{ byte}$$

For our 7B example model, the memory requirement becomes:

$$7{,}000{,}000{,}000 \text{ parameters} \times 1 \text{ byte/parameter} = 7{,}000{,}000{,}000 \text{ bytes}$$

That's only 7 GB! This is half the memory of FP16 and a quarter of the original FP32 requirement. Using INT8 allows even larger models to run on more modest hardware.
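The same arithmetic generalizes to any precision. The helper below is a small illustrative function (the name is ours, not part of any library) that estimates the weight-storage requirement for a given parameter count and bit width:

```python
def estimate_weight_memory_gb(num_parameters: int, bits_per_parameter: int) -> float:
    """Estimate memory (in GB, 1 GB = 10^9 bytes) to store model weights only."""
    bytes_per_parameter = bits_per_parameter / 8
    return num_parameters * bytes_per_parameter / 1e9

# The three precisions discussed above, for a 7B-parameter model:
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    print(f"{name}: {estimate_weight_memory_gb(7_000_000_000, bits):.0f} GB")
# FP32: 28 GB, FP16: 14 GB, INT8: 7 GB
```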
However, representing parameters with only 8 bits significantly reduces the range and precision compared to floating-point types. This can lead to a more noticeable drop in model accuracy if not handled carefully. Techniques like quantization-aware training or sophisticated conversion methods are often needed to minimize this accuracy loss.
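To see where that accuracy loss comes from, here is a minimal sketch of naive symmetric INT8 quantization with a single per-tensor scale. The scheme and function names are illustrative, not a specific library's API; the rounding step is where precision is lost, and it is this error that quantization-aware training and more careful conversion methods aim to minimize.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Naively map FP32 weights to INT8 using one symmetric per-tensor scale."""
    scale = np.abs(weights).max() / 127.0                # largest weight maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 representation."""
    return q.astype(np.float32) * scale

weights = np.random.randn(5).astype(np.float32)
q, scale = quantize_int8(weights)
print(weights)                       # original FP32 values
print(dequantize_int8(q, scale))     # close, but with small rounding errors
```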
The choice of data type directly impacts the VRAM needed. Let's visualize the estimated memory requirement for storing the parameters of a hypothetical 7 billion parameter model using these different precisions:
Estimated VRAM needed solely for storing the parameters of a 7 billion parameter LLM, based on the data type precision used.
As the chart shows, moving from FP32 to FP16 halves the memory footprint, and moving to INT8 halves it again. This reduction is fundamental to making LLMs runnable on consumer-grade GPUs or optimizing their performance in resource-constrained environments.
In practice, FP16 has become a very common standard for LLM inference due to its excellent balance of memory savings, speed improvements on modern GPUs, and generally minimal impact on output quality. INT8 is increasingly used, especially when memory or speed is paramount, often involving a process we'll introduce next: quantization. Understanding these data types is the first step in estimating the hardware you'll need.
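In code, choosing the precision is often just an argument at load time. The sketch below assumes the Hugging Face transformers library (plus accelerate for automatic device placement); the checkpoint name is a placeholder, not a real model.

```python
import torch
from transformers import AutoModelForCausalLM

# "your-org/your-7b-model" is a placeholder; substitute any causal LM checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-7b-model",
    torch_dtype=torch.float16,   # request the weights in half precision (FP16)
    device_map="auto",           # spread layers across available GPU(s)
)
```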