We've seen how the sheer number of parameters in a Large Language Model dictates how much memory (especially VRAM) is needed to simply hold the model. But storing the model is only half the story. To actually use the model, whether for generating text or answering questions (a process called inference), the computer needs to perform a vast number of calculations. The larger the model, the more calculations are involved.
At their core, LLMs work by processing input data (like your prompt) through layers of artificial neurons. This involves many mathematical operations, primarily matrix multiplications and additions. Think of it like a huge, complex chain of calculations where the result of one step feeds into the next. Each parameter in the model plays a role in these calculations. When you ask an LLM to generate text, it performs an enormous number of these arithmetic operations, all on floating-point numbers (numbers that can carry a fractional part).
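To make this concrete, here is a minimal sketch of a single fully connected layer expressed as a matrix-vector product in NumPy. The layer sizes here are arbitrary choices for illustration; real transformer layers are larger and combine several such operations.

```python
import numpy as np

# One fully connected layer: every output value is a weighted sum of the inputs.
# Each weight (parameter) contributes one multiply and one add, so the amount of
# work grows directly with the number of parameters.
inputs = np.random.rand(4096).astype(np.float32)         # activations from the previous layer
weights = np.random.rand(4096, 4096).astype(np.float32)  # ~16.8 million parameters
bias = np.random.rand(4096).astype(np.float32)

outputs = weights @ inputs + bias   # ~2 * 4096 * 4096 floating-point operations
print(outputs.shape)                # (4096,)
```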
How do we measure the computational power needed for these tasks? A standard measure is FLOPS, which stands for Floating-point Operations Per Second. (Written with a lowercase "s", FLOPs usually refers to a raw count of operations; FLOPS in all caps describes how many such operations a piece of hardware can perform each second.)
We often see prefixes like GigaFLOPS (GFLOPS, billions of FLOPS), TeraFLOPS (TFLOPS, trillions of FLOPS), or even PetaFLOPS (PFLOPS, quadrillions of FLOPS) because modern hardware, especially GPUs, can perform an astonishing number of these calculations very quickly.
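To get a feel for these prefixes, the following back-of-the-envelope calculation estimates how long a fixed amount of floating-point work would take at different throughputs. The workload size and the throughput figures are illustrative assumptions, not benchmarks of any particular chip.

```python
# Rough estimate: time = number of floating-point operations / hardware throughput.
# All figures below are illustrative assumptions, not measured values.

total_operations = 2e12  # hypothetical workload: 2 trillion floating-point operations

throughputs = {
    "CPU-class (~100 GFLOPS)": 100e9,
    "Consumer GPU (~20 TFLOPS)": 20e12,
    "Data-center GPU (~300 TFLOPS)": 300e12,
}

for name, ops_per_second in throughputs.items():
    seconds = total_operations / ops_per_second
    print(f"{name}: {seconds:.3f} s")
```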
CPUs are designed for general-purpose tasks, executing instructions sequentially or a few at a time. GPUs, on the other hand, contain thousands of simpler cores that work in parallel. This parallel architecture makes them extremely efficient at performing the same type of calculation (like the matrix multiplications at the heart of LLMs) across large amounts of data simultaneously.
This means GPUs typically have much higher FLOPS ratings than CPUs for the kinds of operations central to deep learning and LLMs. This is a primary reason why GPUs are the preferred hardware for running these models.
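As a rough experiment, you can measure the throughput your own machine achieves on a single matrix multiplication. The sketch below uses NumPy, which runs on the CPU through its BLAS backend, so the result will typically be far below a modern GPU's rating; measuring a GPU would require a library such as CuPy or PyTorch.

```python
import time
import numpy as np

# An (n x n) @ (n x n) matrix multiplication performs roughly 2 * n**3
# floating-point operations (one multiply and one add per term).
n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

operations = 2 * n**3
print(f"Achieved throughput: {operations / elapsed / 1e9:.1f} GFLOPS")
```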
There's a direct relationship between the size of an LLM (its number of parameters) and the computational power (FLOPS) needed to run it effectively: because every parameter participates in the calculations, the work required to generate each token grows roughly in proportion to the parameter count.
Think of it like assembling a product. A simple product with few parts (small model) can be assembled quickly even with basic tools (lower FLOPS). A complex product with thousands of parts (large model) requires a sophisticated assembly line with many robotic arms working in parallel (higher FLOPS) to be completed efficiently. If you used the basic tools for the complex product, it would take an impractically long time.
This chart shows a conceptual representation. While the relationship is not perfectly linear in practice, larger models fundamentally require significantly more computation, and therefore hardware with higher FLOPS, to run inference efficiently.
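A widely used approximation for dense transformer models is that generating one token takes roughly 2 floating-point operations per parameter (one multiply and one add for each weight). The sketch below uses that approximation to compare models of different sizes; the parameter counts and response length are illustrative assumptions.

```python
# Rough inference-compute estimate using the common approximation of
# ~2 floating-point operations per parameter per generated token.

def inference_flops(num_parameters: float, tokens_generated: int) -> float:
    """Approximate floating-point operations to generate a response."""
    return 2 * num_parameters * tokens_generated

for params in (7e9, 13e9, 70e9):  # 7B, 13B, and 70B parameter models (illustrative)
    total = inference_flops(params, tokens_generated=100)
    print(f"{params / 1e9:.0f}B parameters, 100 tokens: ~{total / 1e12:.1f} trillion operations")
```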
It's important to note (and we'll discuss this more in the next chapter) that the computational demands for training an LLM from scratch are orders of magnitude higher than for running inference (using a pre-trained model). Training involves repeatedly processing massive datasets and adjusting all the model parameters, requiring enormous amounts of computation sustained over long periods.
For inference, which is the focus for most users, the goal is usually to get a response quickly enough for smooth interaction. While still computationally intensive, especially for large models, inference requires far fewer operations than training. However, a GPU with sufficient FLOPS is still essential for a good user experience with larger models.
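To see the gap in scale, the sketch below compares a rough training estimate (using the commonly cited approximation of about 6 floating-point operations per parameter per training token, covering the forward and backward passes) against a single inference request. The model size, dataset size, and response length are all illustrative assumptions.

```python
# Compare rough compute estimates for training vs. a single inference request.
# Approximations: ~6 operations per parameter per training token (forward + backward),
# ~2 operations per parameter per generated token. All figures are illustrative.

params = 7e9            # assumed 7B-parameter model
training_tokens = 1e12  # assumed 1 trillion training tokens
response_tokens = 200   # assumed length of one generated response

training_flops = 6 * params * training_tokens
inference_flops = 2 * params * response_tokens

print(f"Training:  ~{training_flops:.2e} operations")
print(f"Inference: ~{inference_flops:.2e} operations")
print(f"Training needs ~{training_flops / inference_flops:.1e}x more compute")
```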
In summary, alongside memory (VRAM), the computational throughput measured in FLOPS is a significant factor linking model size to hardware requirements. Larger models demand more calculations, requiring hardware (typically GPUs) with higher FLOPS ratings to deliver results at an acceptable speed.