Alongside your CPU, Random Access Memory (RAM) is another significant factor determining which Large Language Models (LLMs) you can run locally and how well they perform. Think of RAM as your computer's workbench: it's the temporary, high-speed memory space where your computer holds the data and programs it's actively working on.
When you want to use an LLM, the first step is loading the model's data into memory. LLMs are "large" primarily because they consist of billions of parameters, which are essentially the learned values the model uses to make predictions. All these parameters need a place to reside while the model is active, and that place is primarily your system's RAM, especially if you don't have a dedicated GPU or if the GPU's memory (VRAM) isn't large enough.
Sufficient RAM is important because the model's parameters must fit in memory all at once, and there must also be room left over for the intermediate calculations performed during inference, your operating system, and any other running applications.
The amount of RAM needed depends directly on the size of the LLM you intend to run. Model size is typically measured in billions of parameters (e.g., 7B for 7 billion parameters, 13B for 13 billion).
A rough rule of thumb for unquantized models is that you need slightly more RAM, in gigabytes (GB), than the number of parameters in billions multiplied by the size of each parameter in bytes (often 2 bytes or more). For example, a 7B parameter model where each parameter takes 2 bytes would need roughly 7×2=14 GB of RAM just for the parameters, plus extra for calculations and your operating system.
However, modern techniques have significantly improved efficiency. Quantization, a process we'll cover in detail in Chapter 3, reduces the amount of memory each parameter requires, often shrinking RAM needs by 50-75% or more with minimal impact on quality for many tasks. This makes running larger models feasible on consumer hardware.
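To make this arithmetic concrete, here is a minimal Python sketch of the back-of-envelope estimate. The 2 GB overhead figure is an illustrative assumption, not a measured value; real overhead depends on the inference software, the context length, and your operating system.

```python
def estimate_ram_gb(params_billions: float, bytes_per_param: float, overhead_gb: float = 2.0) -> float:
    """Rough RAM estimate: parameter storage plus a fixed overhead allowance.

    bytes_per_param is about 2.0 for 16-bit weights and about 0.5 for 4-bit quantization.
    overhead_gb is an illustrative buffer for the OS and inference scratch space.
    """
    weights_gb = params_billions * bytes_per_param  # 1 billion params at 1 byte each is roughly 1 GB
    return weights_gb + overhead_gb

# A 7B model: ~16 GB at 16-bit precision vs. ~5.5 GB at 4-bit quantization (with the 2 GB buffer).
print(estimate_ram_gb(7, 2.0))   # ≈ 16.0
print(estimate_ram_gb(7, 0.5))   # ≈ 5.5
```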
Actual RAM requirements depend heavily on the specific model format (like GGUF, discussed later) and the quantization level, but as a general guideline you can estimate the RAM required to load and run a commonly quantized model (e.g., 4-bit quantization) from its parameter count, plus a small buffer for the OS and inference overhead.
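Using that same back-of-envelope arithmetic, the short loop below prints illustrative 4-bit estimates for a few common model sizes. These are rough planning numbers derived from the rule of thumb above, not measured requirements for any specific model.

```python
# Back-of-envelope 4-bit estimates: 0.5 bytes per parameter plus an illustrative 2 GB overhead buffer.
for size_b in (7, 13, 34, 70):
    est_gb = size_b * 0.5 + 2.0
    print(f"{size_b}B model at 4-bit quantization: roughly {est_gb:.1f} GB of RAM")
```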
Remember to account for your operating system (Windows, macOS, or Linux) and any background applications, which also consume RAM. The amount listed in your computer's specifications is the total installed RAM; the RAM actually available to the LLM will be less.
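A quick, cross-platform way to see the difference between installed and available memory is a short Python check. This sketch assumes the third-party psutil package is installed (pip install psutil).

```python
# Requires the third-party psutil package: pip install psutil
import psutil

mem = psutil.virtual_memory()
total_gb = mem.total / 1024**3          # total installed RAM
available_gb = mem.available / 1024**3  # what is actually free for a model right now

print(f"Total RAM:     {total_gb:.1f} GB")
print(f"Available RAM: {available_gb:.1f} GB")
```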
If you try to load a model that requires more RAM than you have available, one of two things will likely happen: the load fails outright with an out-of-memory error, or your operating system starts swapping data between RAM and the much slower disk, which can make inference painfully slow or leave the machine unresponsive.
It's useful to distinguish system RAM from Graphics Card RAM (Video RAM, or VRAM), which we discuss in the next section. If you have a capable GPU with sufficient VRAM, parts of the LLM (or the entire model, if VRAM is large enough) can be loaded onto the GPU for much faster processing. However, system RAM still plays a role: the model file is typically read through system RAM while it is loaded onto the GPU, any layers that do not fit in VRAM remain in system RAM and run on the CPU, and the inference software and operating system themselves need memory to run.
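For example, inference runtimes that support partial GPU offload split the model between VRAM and system RAM. The sketch below assumes the llama-cpp-python bindings and a hypothetical local GGUF file; the path and the number of offloaded layers are placeholders for illustration.

```python
# Minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example-7b-q4.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,  # layers offloaded to VRAM; the remaining layers stay in system RAM on the CPU
)

output = llm("Explain why RAM matters for local LLMs in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```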
As covered in the "Checking Your System Specifications" section later in this chapter, you can easily find out how much RAM your computer has. On Linux, for example, you can run:
free -h
in a terminal, or open the System Monitor application.
Understanding your system's RAM capacity is fundamental before you start downloading models: it directly determines which models you can realistically run locally. While techniques like quantization help stretch your hardware further, having ample RAM provides a smoother and more flexible experience when starting with local LLMs.