Now that we understand that Large Language Models are measured by their parameter count, let's look at some real-world examples. This will give you a better sense of the scale we're talking about and how models are often categorized, even if informally. Keep in mind that the field moves fast, so what's considered "large" today might seem moderate tomorrow.
Models can generally be grouped by their parameter counts. While there are no strict boundaries, here's a common way to think about them:
Small Models (Up to ~7 Billion Parameters)
These models are the most accessible in terms of computational requirements.
- Parameter Count: Typically ranges from tens of millions up to around 7 billion (7B).
- Examples:
- BERT-Large: While not always called an "LLM" in the modern sense (it's an encoder model), it has around 340 million parameters and was foundational.
- DistilBERT: A smaller, distilled version of BERT with about 66 million parameters, designed for efficiency.
- Phi-2: A model from Microsoft with 2.7 billion parameters, known for strong performance relative to its size.
- Gemma 2B: A smaller model from Google with 2 billion parameters.
- Characteristics:
- Relatively fast to run (inference).
- Lower memory requirements; some can run on CPUs or consumer-grade GPUs with limited VRAM (a rough memory estimate follows this list).
- Often perform well on specific tasks they were trained or fine-tuned for.
- May lack the broad world knowledge or complex reasoning abilities of larger models.
- Suitable for deployment on edge devices or applications where resource constraints are tight.
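To see why these models fit on modest hardware, it helps to estimate the memory needed just to hold the weights. Below is a minimal Python sketch, assuming a ~2.7B-parameter model and the usual bytes-per-parameter figures for common precisions; it ignores activation memory and runtime overhead, so real usage is somewhat higher.

```python
# Rough memory-footprint estimate for a small model (illustrative numbers only).
# Formula: parameters * bytes per parameter. Activation memory and runtime
# buffers are ignored here, so treat these as lower bounds.

def estimate_weights_gb(num_parameters: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold the model weights, in gigabytes."""
    return num_parameters * bytes_per_param / 1e9

params = 2_700_000_000  # e.g., a ~2.7B-parameter model such as Phi-2

for label, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{label}: ~{estimate_weights_gb(params, bytes_per_param):.1f} GB")

# Approximate output:
# FP32: ~10.8 GB
# FP16: ~5.4 GB
# INT8: ~2.7 GB
# INT4: ~1.4 GB
```

At reduced precision, a model of this size fits comfortably within the VRAM of a typical consumer GPU, which is why small models are practical on local hardware.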
Medium Models (~7 Billion to ~70 Billion Parameters)
This category represents a popular middle ground, balancing capability with resource needs. Many widely used open-source models fall into this range.
- Parameter Count: Roughly from 7 billion (7B) up to about 70 billion (70B).
- Examples:
- Llama 2 (7B, 13B, 70B): A family of popular open-source models from Meta, available in various sizes within this range.
- Mistral 7B: A well-regarded open-source model with 7 billion parameters, known for its efficiency.
- Falcon (7B, 40B): Another family of open-source models.
- Gemma 7B: Google's 7 billion parameter open model.
- Characteristics:
- Offer significantly more capability than small models across a wider range of tasks (text generation, summarization, translation, basic reasoning).
- Require capable GPUs with substantial VRAM (e.g., 16GB, 24GB, or more, especially toward the larger end of this range; see the estimate after this list).
- Represent a good balance for many research and development purposes.
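The VRAM figures above follow from the same weights-only arithmetic. Here is a sketch assuming FP16 storage (2 bytes per parameter); the KV cache and activations add more on top, which is why 7B models fit on 16GB cards at FP16 while larger models in this range need more memory or reduced-precision (quantized) weights.

```python
# Illustrative VRAM estimates for the medium-size range, weights only, at FP16
# (2 bytes per parameter). Real usage is higher once the KV cache and
# activation buffers are included.

def fp16_weights_gb(num_parameters: float) -> float:
    return num_parameters * 2 / 1e9

for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: ~{fp16_weights_gb(params):.0f} GB of VRAM for weights alone")

# Approximate output:
# 7B: ~14 GB of VRAM for weights alone
# 13B: ~26 GB of VRAM for weights alone
# 70B: ~140 GB of VRAM for weights alone
```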
Large Models (~70 Billion to Several Hundred Billion Parameters)
These models demonstrate powerful capabilities but come with significant hardware demands.
- Parameter Count: Generally starting around 70 billion (70B) and going up to a few hundred billion parameters.
- Examples:
- GPT-3 (175B): OpenAI's model (specifically the davinci version) has 175 billion parameters and was highly influential.
- Falcon 180B: A very large open-source model.
- BLOOM (176B): An open multilingual model.
- Characteristics:
- Exhibit strong performance in language understanding, generation, and reasoning.
- Hardware requirements are substantial, often needing multiple high-end GPUs working together (see the estimate after this list).
- Less common for individuals to run locally; often accessed via cloud platforms or APIs.
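To make "multiple high-end GPUs" concrete, here is a back-of-the-envelope sketch for a 175B-parameter model stored in FP16. The 80 GB per-GPU capacity and 90% usable-memory figures are assumptions for illustration, not properties of any particular system.

```python
# Back-of-the-envelope estimate of how many GPUs are needed just to hold a
# 175B-parameter model's weights in FP16, assuming hypothetical 80 GB
# accelerators and no quantization.

import math

params = 175e9
bytes_per_param = 2       # FP16
gpu_memory_gb = 80        # assumed memory per GPU (illustrative)
usable_fraction = 0.9     # leave headroom for activations and the KV cache

weights_gb = params * bytes_per_param / 1e9
gpus_needed = math.ceil(weights_gb / (gpu_memory_gb * usable_fraction))

print(f"Weights alone: ~{weights_gb:.0f} GB")
print(f"Minimum GPUs (80 GB each, 90% usable): {gpus_needed}")

# Approximate output:
# Weights alone: ~350 GB
# Minimum GPUs (80 GB each, 90% usable): 5
```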
Very Large / Frontier Models (Often > 500 Billion Parameters; Exact Counts Undisclosed)
These are the state-of-the-art models, often developed by major research labs and tech companies. Their exact parameter counts are frequently not publicly released, but they are understood to be significantly larger than the "Large" category, potentially reaching or exceeding a trillion parameters.
- Parameter Count: Estimated to be in the high hundreds of billions or over a trillion (1T+). Specific numbers are usually not confirmed by the developers.
- Examples:
- GPT-4: OpenAI's successor to GPT-3. Its parameter count is not public but is widely believed to be very large, possibly using a "Mixture of Experts" (MoE) architecture, which complicates a single parameter count (see the sketch at the end of this category).
- Google Gemini (Ultra, Pro, Nano): Google's family of advanced models. Parameter counts are not publicly specified.
- Claude 3 (Opus, Sonnet, Haiku): Anthropic's family of models. Again, parameter counts are not public.
- Characteristics:
- Represent the cutting edge in AI capabilities, excelling at complex reasoning, creativity, and nuanced understanding.
- Require massive computational resources (large clusters of GPUs or specialized hardware like TPUs) for both training and inference.
- Primarily accessible through APIs or managed cloud services due to their immense resource requirements.
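The Mixture-of-Experts point is easier to see with numbers. The sketch below uses entirely hypothetical expert counts and sizes (it does not describe GPT-4 or any real model) to show why total parameters and parameters active per token can differ by a large factor.

```python
# Illustrative Mixture-of-Experts (MoE) arithmetic with made-up numbers,
# showing why a single "parameter count" is ambiguous for such models:
# only a few experts are active per token, so the compute per token reflects
# far fewer parameters than the total stored in the model.

num_experts = 16          # hypothetical experts per MoE layer
experts_per_token = 2     # hypothetical experts routed to per token
params_per_expert = 40e9  # hypothetical parameters per expert (all layers combined)
shared_params = 20e9      # hypothetical shared (non-expert) parameters

total_params = shared_params + num_experts * params_per_expert
active_params = shared_params + experts_per_token * params_per_expert

print(f"Total parameters stored:     ~{total_params / 1e9:.0f}B")
print(f"Parameters active per token: ~{active_params / 1e9:.0f}B")

# Approximate output:
# Total parameters stored:     ~660B
# Parameters active per token: ~100B
```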
To help visualize the difference in scale, consider these approximate parameter counts:
Approximate parameter counts for selected LLMs on a logarithmic scale to show relative sizes across categories. Note that GPT-3 (175B) is significantly larger than Llama 2 (70B), which is much larger than Mistral (7B) and Phi-2 (2.7B).
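If you want to reproduce a chart like the one described above, the following matplotlib sketch (assuming matplotlib is installed) plots the publicly reported sizes from this section on a logarithmic axis.

```python
# A small matplotlib sketch of the comparison described above: approximate
# parameter counts on a logarithmic axis, using the figures from this section.

import matplotlib.pyplot as plt

models = ["Phi-2", "Mistral 7B", "Llama 2 70B", "GPT-3"]
params_billions = [2.7, 7, 70, 175]

fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(models, params_billions)
ax.set_yscale("log")  # log scale keeps the smaller models visible
ax.set_ylabel("Parameters (billions, log scale)")
ax.set_title("Approximate parameter counts of selected LLMs")
plt.tight_layout()
plt.show()
```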
Understanding these rough categories helps contextualize discussions about different LLMs. As you can see, the number of parameters directly influences a model's potential capabilities but also drastically changes the hardware needed to run it. In the following chapters, we will explore the specific hardware components involved and how their specifications relate to handling models of these varying sizes, focusing primarily on the requirements for inference (running the model).