Estimating the hardware needed to run a Large Language Model (LLM) can seem complex, but we can start with a straightforward approximation. The most significant factor determining the memory requirement, specifically the Graphics Processing Unit's Video RAM (VRAM), is the size of the model itself, measured by its number of parameters.
Think of each parameter in the model as a number that needs to be stored somewhere accessible for computation. Since GPUs are the workhorses for running LLMs (as discussed in Chapter 2), these parameters are primarily loaded into the GPU's dedicated memory, the VRAM.
How much space does each parameter take? This depends on the precision or data type used to store it. Common precisions include:
- FP32 (Single-Precision Floating Point): Each parameter takes 32 bits, which is equal to 4 bytes. This offers high precision but requires more memory.
- FP16 (Half-Precision Floating Point): Each parameter takes 16 bits, or 2 bytes. This cuts the memory requirement roughly in half compared to FP32, often with a minimal impact on performance for inference.
- INT8 (8-bit Integer): Each parameter takes only 8 bits, or 1 byte. This further reduces memory usage significantly but can sometimes lead to a noticeable decrease in the model's accuracy. This is often achieved through a process called quantization (which we briefly introduced in Chapter 3).
The most common starting point for estimation assumes the model will be run using FP16 precision, as it offers a good balance between memory usage and performance.
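For quick reference, these per-parameter sizes boil down to a simple lookup. Here is a minimal Python sketch; the dictionary name is purely illustrative:

```python
# Bytes occupied by a single parameter at each common precision (bits / 8).
BYTES_PER_PARAM = {
    "FP32": 4,  # 32-bit float: full single precision
    "FP16": 2,  # 16-bit float: the usual default for inference
    "INT8": 1,  # 8-bit integer: typically reached via quantization
}
```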
The Basic Calculation
Based on this, we can establish a simple rule of thumb to estimate the minimum VRAM needed just to load the model parameters:
$$\text{Required VRAM (GB)} \approx \frac{\text{Parameter Count (in billions)} \times 10^9 \times \text{Bytes per Parameter}}{1024^3 \text{ (bytes/GB)}}$$
However, a simpler mental shortcut, especially for FP16, is often used:
$$\text{Required VRAM (GB)} \approx \text{Parameter Count (in billions)} \times 2 \text{ (bytes for FP16)}$$
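As a minimal sketch, this rule of thumb translates directly into a few lines of Python; the function name and its defaults are illustrative, not part of any library:

```python
def estimate_weight_vram_gb(param_count_billions: float, bytes_per_param: float = 2.0) -> float:
    """Estimate VRAM (in GB, where 1 GB = 1024**3 bytes) needed just to hold the weights.

    bytes_per_param: 4 for FP32, 2 for FP16 (the default), 1 for INT8.
    """
    total_bytes = param_count_billions * 1e9 * bytes_per_param
    return total_bytes / (1024 ** 3)
```

Note that the mental shortcut (parameter count in billions × 2) effectively treats a gigabyte as 10^9 bytes, which is why it gives 14 GB for a 7B model while dividing by 1024^3 gives about 13 GB; both are close enough for a first pass.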
Let's look at a couple of examples:
- A 7 Billion Parameter Model (e.g., Llama 2 7B):
  - Using FP16 precision (2 bytes/parameter):
    $$\text{VRAM} \approx 7 \text{ billion} \times 2 \text{ bytes} = 14 \times 10^9 \text{ bytes}$$
    $$\text{VRAM} \approx \frac{14 \times 10^9}{1024^3} \approx 13.04 \text{ GB}$$
    So, you'd need approximately 14 GB of VRAM just to hold the model weights in FP16.
  - Using FP32 precision (4 bytes/parameter):
    $$\text{VRAM} \approx 7 \text{ billion} \times 4 \text{ bytes} = 28 \times 10^9 \text{ bytes}$$
    $$\text{VRAM} \approx \frac{28 \times 10^9}{1024^3} \approx 26.07 \text{ GB}$$
    Loading the same model in full precision would require about 28 GB of VRAM.
  - Using INT8 precision (1 byte/parameter, post-quantization):
    $$\text{VRAM} \approx 7 \text{ billion} \times 1 \text{ byte} = 7 \times 10^9 \text{ bytes}$$
    $$\text{VRAM} \approx \frac{7 \times 10^9}{1024^3} \approx 6.52 \text{ GB}$$
    Using 8-bit quantization significantly reduces the requirement to roughly 7 GB of VRAM.
- A 70 Billion Parameter Model (e.g., Llama 2 70B):
  - Using FP16 precision (2 bytes/parameter):
    $$\text{VRAM} \approx 70 \text{ billion} \times 2 \text{ bytes} = 140 \times 10^9 \text{ bytes}$$
    $$\text{VRAM} \approx \frac{140 \times 10^9}{1024^3} \approx 130.39 \text{ GB}$$
    This larger model demands around 140 GB of VRAM in FP16, which often requires multiple high-end GPUs.
Figure: Estimated VRAM needed solely for storing the parameters of a 7 Billion parameter model at different numerical precisions.
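Plugging both models into a compact version of the earlier sketch reproduces these numbers (the function remains illustrative):

```python
def estimate_weight_vram_gb(param_count_billions, bytes_per_param=2.0):
    # Parameters (in billions) * bytes each, converted to GB (1024**3 bytes).
    return param_count_billions * 1e9 * bytes_per_param / (1024 ** 3)

for model, billions in [("Llama 2 7B", 7), ("Llama 2 70B", 70)]:
    for precision, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
        print(f"{model} @ {precision}: ~{estimate_weight_vram_gb(billions, nbytes):.2f} GB")
```

This prints about 13.04 GB for Llama 2 7B in FP16 and roughly 130.39 GB for Llama 2 70B, matching the hand calculations above.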
Important Considerations
This rule of thumb provides a baseline estimate for storing the model's parameters. It's a fundamental starting point for determining if your hardware might be sufficient. However, remember that this calculation only accounts for the model weights themselves. Running the model involves more than just storing it. As we'll discuss next, factors like activations during inference, the length of the input prompt and generated output (context length), and software overhead will consume additional VRAM. Therefore, you should always budget for more VRAM than this simple calculation suggests.
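One way to turn this advice into a planning number is to scale the weight estimate by an overhead factor. The 25% headroom in the sketch below is purely an assumption for illustration; actual overhead depends on context length, batch size, and the serving framework, which we'll look at next.

```python
def estimate_serving_vram_gb(param_count_billions: float,
                             bytes_per_param: float = 2.0,
                             overhead_factor: float = 1.25) -> float:
    """Rough planning estimate: weight memory plus an assumed fudge factor for
    activations, context (prompt + generated tokens), and framework overhead."""
    weights_gb = param_count_billions * 1e9 * bytes_per_param / (1024 ** 3)
    return weights_gb * overhead_factor

# Example: Llama 2 7B in FP16 with 25% assumed headroom -> roughly 16 GB.
print(f"~{estimate_serving_vram_gb(7):.1f} GB")
```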