Let's put the rules of thumb we've discussed into practice. Remember, these are estimates primarily focused on the memory needed to hold the model weights. Actual usage will be higher due to activations, operating system overhead, and the specific software you use.

Our main estimation tool will be the relationship:

$$\text{Required VRAM} \approx \text{Parameter Count} \times \text{Bytes Per Parameter}$$

We'll focus on FP16 (16-bit floating point) precision, which is very common for inference. In FP16, each parameter requires 2 bytes of storage.

### Example 1: Estimating VRAM for a 7 Billion Parameter Model (FP16)

Consider a model often referred to as a "7B" model, meaning it has approximately 7 billion parameters.

- Parameters: 7,000,000,000
- Bytes per Parameter (FP16): 2 bytes

Calculation:

$$VRAM_{FP16} \approx 7{,}000{,}000{,}000 \,\text{parameters} \times 2 \,\text{bytes/parameter}$$

$$VRAM_{FP16} \approx 14{,}000{,}000{,}000 \,\text{bytes}$$

To convert bytes to gigabytes (GB), we divide by $1024^3$ (or roughly by one billion for a quick estimate).

$$14{,}000{,}000{,}000 \,\text{bytes} \div (1024 \times 1024 \times 1024) \approx 13.04 \,\text{GB}$$

Result: A 7B parameter model running at FP16 precision requires roughly 13-14 GB of VRAM just to store the model weights.

### Example 2: Estimating VRAM for a 13 Billion Parameter Model (FP16)

Now let's look at a slightly larger "13B" model.

- Parameters: 13,000,000,000
- Bytes per Parameter (FP16): 2 bytes

Calculation:

$$VRAM_{FP16} \approx 13{,}000{,}000{,}000 \,\text{parameters} \times 2 \,\text{bytes/parameter}$$

$$VRAM_{FP16} \approx 26{,}000{,}000{,}000 \,\text{bytes}$$

Converting to GB:

$$26{,}000{,}000{,}000 \,\text{bytes} \div (1024^3) \approx 24.21 \,\text{GB}$$

Result: A 13B parameter model at FP16 needs approximately 24-25 GB of VRAM for its weights. This already exceeds the VRAM available on many consumer GPUs.

### Example 3: Comparing FP16 vs. INT8 for a 7B Model

Let's revisit the 7B model but consider using INT8 (8-bit integer) precision through quantization. In INT8, each parameter requires only 1 byte.

- Parameters: 7,000,000,000
- Bytes per Parameter (INT8): 1 byte

Calculation:

$$VRAM_{INT8} \approx 7{,}000{,}000{,}000 \,\text{parameters} \times 1 \,\text{byte/parameter}$$

$$VRAM_{INT8} \approx 7{,}000{,}000{,}000 \,\text{bytes}$$

Converting to GB:

$$7{,}000{,}000{,}000 \,\text{bytes} \div (1024^3) \approx 6.52 \,\text{GB}$$

Result: By using INT8 quantization, the VRAM requirement for the 7B model's weights drops to roughly 6.5-7 GB. This makes it feasible to run on GPUs with less VRAM, like those with 8 GB or 12 GB, although you still need headroom for activations and overhead.

*Figure: Estimated VRAM required just for the weights of a 7 billion parameter model using FP16 (16-bit floating point, about 13.04 GB at 2 bytes/param) and INT8 (8-bit integer, about 6.52 GB at 1 byte/param) precision.*
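If you would rather not repeat this arithmetic by hand, the estimate is easy to script. The short Python sketch below applies the weights-only formula from above and reproduces the numbers from all three examples. The function and dictionary names are illustrative choices for this sketch, not part of any library.

```python
# Minimal sketch of the weights-only estimate:
# Required VRAM ~= parameter count x bytes per parameter.
# Names here (BYTES_PER_PARAM, estimate_weight_vram_gb) are illustrative.

BYTES_PER_PARAM = {
    "fp16": 2,  # 16-bit float, common for inference
    "int8": 1,  # 8-bit integer, typical after quantization
}

def estimate_weight_vram_gb(param_count: int, precision: str = "fp16") -> float:
    """Estimate the VRAM (in GB) needed just to hold the model weights."""
    total_bytes = param_count * BYTES_PER_PARAM[precision]
    return total_bytes / (1024 ** 3)  # bytes -> GB, dividing by 1024^3 as above

print(f"7B  at FP16: {estimate_weight_vram_gb(7_000_000_000, 'fp16'):.2f} GB")   # ~13.04
print(f"13B at FP16: {estimate_weight_vram_gb(13_000_000_000, 'fp16'):.2f} GB")  # ~24.21
print(f"7B  at INT8: {estimate_weight_vram_gb(7_000_000_000, 'int8'):.2f} GB")   # ~6.52
```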
**About Weights**

Remember, these calculations provide a baseline for the model weights only. You need additional VRAM for:

- **Activations:** Intermediate calculations performed during inference. The amount needed depends heavily on the context length (how much text the model is processing at once) and the batch size (how many requests are processed simultaneously).
- **KV Cache:** Stores information about the sequence being generated, and it grows with the length of the generated text.
- **Software Overhead:** The runtime environment, libraries (like CUDA), and the operating system all consume some VRAM.

A safer rule of thumb is often to add a buffer of 20-40% on top of the weight VRAM estimate, or simply ensure your GPU has significantly more VRAM than the calculated weight requirement. For instance, to comfortably run the 7B FP16 model (estimated 13-14 GB for weights), a GPU with 16 GB or ideally 24 GB of VRAM would be preferable, especially for longer contexts.

Compare these estimates to the specifications of your hardware (as discussed in the "Checking Hardware Specifications" section) to gauge whether running a specific model is feasible on your system. A quick script for that comparison is sketched below.
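The following rough Python sketch turns the buffer rule of thumb into a reusable check: it inflates the weights-only estimate by a configurable buffer (20% by default, from the 20-40% range above) and compares the result against a GPU's VRAM. The function names and the list of GPU sizes are illustrative assumptions, and the output is only a coarse go/no-go signal, not a guarantee.

```python
# Rough sketch: weights-only estimate plus a safety buffer, compared against
# a GPU's VRAM. Buffer default of 20% reflects the 20-40% rule of thumb above;
# function names and GPU sizes are illustrative, not from any library.

def weights_vram_gb(param_count: int, bytes_per_param: int) -> float:
    """Weights-only estimate in GB: parameters x bytes per parameter."""
    return param_count * bytes_per_param / (1024 ** 3)

def fits_on_gpu(param_count: int, bytes_per_param: int,
                gpu_vram_gb: float, buffer: float = 0.20) -> bool:
    """True if the weights plus the buffer fit within the given VRAM."""
    needed_gb = weights_vram_gb(param_count, bytes_per_param) * (1 + buffer)
    return needed_gb <= gpu_vram_gb

seven_b = 7_000_000_000
for gpu_gb in (8, 12, 16, 24):  # common consumer GPU VRAM sizes
    fp16 = "fits" if fits_on_gpu(seven_b, 2, gpu_gb) else "too tight"
    int8 = "fits" if fits_on_gpu(seven_b, 1, gpu_gb) else "too tight"
    print(f"{gpu_gb:>2} GB GPU -> 7B FP16: {fp16}, 7B INT8: {int8}")
```

For long contexts or larger batch sizes, raising the buffer toward the 40% end of the range gives a more conservative answer.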