As we've established, working with Large Language Models (LLMs) involves two main activities: training and inference. Training is the computationally intensive process of creating or significantly modifying a model, while inference is the far more common task of using a pre-trained model to generate text, answer questions, or perform other language tasks.
Think about how you typically interact with AI today. You might use a chatbot, a translation service, or a text summarization tool. In nearly all these cases, you are performing inference. You are sending input to a model that has already been trained, and it's generating an output based on its existing knowledge.
The hardware requirements for these two activities differ dramatically. Training an LLM, especially a large one, demands immense computational resources. It often requires clusters of powerful GPUs working together for days, weeks, or even months, along with substantial amounts of system RAM and extremely fast storage. This is typically the domain of large research labs and tech companies with dedicated infrastructure.
Inference, while still computationally demanding compared to traditional software, generally requires significantly less hardware. The primary need is enough VRAM on a GPU (or specialized accelerator) to hold the model's parameters and enough compute power to process the input and generate the output reasonably quickly. System RAM and CPU speed also play supporting roles. This makes running inference feasible on a wider range of hardware, from high-end consumer GPUs to cloud-based instances, depending on the model's size.
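To make this concrete, here is a minimal sketch of the kind of back-of-the-envelope estimate involved. The function name, the 16-bit precision, and the 20% overhead allowance are illustrative assumptions rather than fixed rules.

```python
def estimate_inference_vram_gb(num_parameters, bytes_per_parameter=2, overhead_factor=1.2):
    """Rough VRAM estimate for running inference with a model of a given size.

    Assumptions (illustrative only):
    - bytes_per_parameter=2 corresponds to 16-bit weights (float16 or bfloat16).
    - overhead_factor adds roughly 20% for activations, the KV cache, and
      framework overhead; real overhead varies with context length and software.
    """
    return num_parameters * bytes_per_parameter * overhead_factor / 1024**3


# A hypothetical 7-billion-parameter model stored in 16-bit precision
print(f"{estimate_inference_vram_gb(7e9):.1f} GB")  # roughly 16 GB
```

Even this crude arithmetic shows why model size dominates the hardware question: doubling the parameter count roughly doubles the VRAM needed just to load the weights.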
Consider the goals of this course: understanding LLM sizes and their relationship to hardware requirements. For the vast majority of people, whether developers integrating LLMs into applications or individuals experimenting with readily available models, the practical question is: "What hardware do I need to run this model?" That question is fundamentally about the requirements for inference.
Fine-tuning, the process of adapting a pre-trained model with comparatively small adjustments, requires more resources than basic inference but is still generally less demanding than training from scratch. Even so, fine-tuning often exceeds the hardware readily available to individual users.
Therefore, for the remainder of this course, particularly when we discuss estimating hardware needs, our primary focus will be on the requirements for inference. Knowing how to estimate the VRAM, and to a lesser extent the compute, needed simply to run a pre-trained LLM is the most relevant skill for most people getting started in this area. The estimation techniques we cover in the next chapter are designed to help you determine whether a particular model can run effectively on hardware you might have access to for tasks like text generation or analysis.
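As a preview of that kind of check, the sketch below compares a few hypothetical model sizes against the VRAM of a single GPU. The 24 GB capacity, the model sizes, and the 20% overhead factor are all assumed for illustration, not taken from any specific hardware or model.

```python
def fits_on_gpu(num_parameters, bytes_per_parameter, gpu_vram_gb, overhead_factor=1.2):
    """Return True if the estimated weight memory plus overhead fits in VRAM."""
    required_gb = num_parameters * bytes_per_parameter * overhead_factor / 1024**3
    return required_gb <= gpu_vram_gb


gpu_vram_gb = 24  # assumed capacity of a high-end consumer GPU
for billions in (7, 13, 70):
    verdict = "fits" if fits_on_gpu(billions * 1e9, 2, gpu_vram_gb) else "does not fit"
    print(f"{billions}B parameters at 16-bit: {verdict} in {gpu_vram_gb} GB of VRAM")
```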