By Ryan A. on Apr 18, 2025
Running Large Language Models (LLMs) locally on consumer hardware is increasingly feasible, offering benefits like enhanced privacy, cost savings, and customization options. NVIDIA's RTX 40 series GPUs, with their significant VRAM and compute capabilities, provide a strong platform for this endeavor.
Selecting the right LLM for your specific RTX 40 series card requires understanding the interplay between GPU specifications, model characteristics, and the available software frameworks. Success depends on matching the model's requirements, particularly its VRAM footprint, to the capabilities of your hardware.
The suitability of an RTX 40 series GPU for running LLMs primarily hinges on its Video Random Access Memory (VRAM). Larger models, or models running with less compression (quantization), demand more VRAM. Other factors like CUDA core count, Tensor Core performance (especially for lower precision formats like FP16/INT8), and memory bandwidth also influence inference speed (tokens per second).
Here's a quick overview of the VRAM available across the RTX 40 series desktop lineup:
GPU Model | VRAM (GB) | Typical LLM Use Case Potential
---|---|---
RTX 4090 | 24 | Large models (70B+ Q*), high performance
RTX 4080 / 4080 Super | 16 | Medium-large models (30B-70B Q), good performance
RTX 4070 Ti Super | 16 | Medium-large models (30B-70B Q), good performance
RTX 4070 Ti | 12 | Medium models (13B-30B Q), solid performance
RTX 4070 / 4070 Super | 12 | Medium models (13B-30B Q), solid performance
RTX 4060 Ti (16GB) | 16 | Medium-large models (30B-70B Q), value performance
RTX 4060 Ti (8GB) | 8 | Small-medium models (7B-13B Q), entry-level
RTX 4060 | 8 | Small models (3B-7B Q), basic use
* Q denotes quantized models. Performance varies significantly within tiers based on specific model, quantization, and framework.
Ensure you have the latest NVIDIA drivers installed, as they often include performance improvements and compatibility fixes relevant to CUDA and LLM workloads.
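If you want to confirm what your system reports before downloading anything, a quick PyTorch check (assuming PyTorch with CUDA support is installed) shows the detected GPU, its total VRAM, and the CUDA version of your PyTorch build:
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
    print(f"CUDA (PyTorch build): {torch.version.cuda}")
else:
    print("No CUDA-capable GPU detected - check your driver installation.")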
Selecting an LLM involves balancing several technical considerations:
LLMs are often categorized by their number of parameters (e.g., 7 billion, 13B, 70B). Larger models generally exhibit better reasoning and knowledge capabilities but require significantly more VRAM and compute power. The VRAM requirement scales roughly linearly with the number of parameters and depends on the precision used (e.g., FP16, INT8, INT4).
Quantization is a technique used to reduce the memory footprint and sometimes accelerate the inference speed of LLMs by representing the model's weights and activations with lower-precision data types (e.g., 8-bit integers (INT8) or 4-bit integers (INT4)) instead of the standard 16-bit floating-point (FP16 or BF16). Common quantization formats include:
- GGUF: Used primarily by llama.cpp, it bundles the model and quantization information into a single file. It supports various quantization methods (e.g., Q4_K_M, Q5_K_S, Q8_0) optimized for different quality/performance trade-offs.
- GPTQ: A GPU-focused post-training quantization format (supported by libraries such as AutoGPTQ).
Quantization significantly lowers VRAM usage, making it possible to run larger models on GPUs with limited memory. For example, a 7B parameter model might require ~14GB in FP16 but only ~4-5GB using 4-bit quantization (like Q4_K_M).
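As a rough back-of-the-envelope check, the sketch below estimates VRAM from parameter count and bits per weight. The 20% overhead factor for the KV cache and activations, and the ~4.5 bits per weight for Q4_K_M, are illustrative assumptions; real usage varies with context length and framework.
def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    # Weights only, plus ~20% headroom for KV cache and activations (assumed)
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1024**3

print(f"7B  @ FP16 : ~{estimate_vram_gb(7, 16):.1f} GB")
print(f"7B  @ 4-bit: ~{estimate_vram_gb(7, 4.5):.1f} GB")   # Q4_K_M averages roughly 4-5 bits/weight
print(f"70B @ 4-bit: ~{estimate_vram_gb(70, 4.5):.1f} GB")  # beyond 24GB: needs lower-bit quants or CPU offload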
Measured in tokens per second (tok/s), inference speed determines how quickly the model generates text. It's affected by the GPU's compute power (CUDA/Tensor cores), memory bandwidth, model size, quantization level, batch size, and the efficiency of the inference framework being used.
Models are often fine-tuned for specific tasks. Some are general-purpose chat models (e.g., Llama 3 Instruct), while others might specialize in coding (e.g., Code Llama), instruction following, or specific knowledge domains. Choose a model whose training aligns with your intended application.
LLMs come with different licenses that dictate how they can be used, modified, and distributed. Common licenses include Apache 2.0 (permissive), MIT, and specific community licenses like the Llama 2 & 3 Community License, which might have restrictions on commercial use for very large companies. Always check the model's license before using it, especially for commercial applications.
Diagram illustrating the relationship between GPU VRAM, Model Size, Quantization, and Local LLM Execution.
These recommendations focus on popular, high-performing models and assume the use of quantization (primarily 4-bit or 5-bit GGUF variants like Q4_K_M or Q5_K_M) unless otherwise stated. Performance notes are qualitative.
RTX 4090 (24GB): With the most VRAM in the consumer lineup, the 4090 can handle the largest models with reasonable quantization, or smaller models at higher precision.
RTX 4080, 4080 Super, and 4070 Ti Super (16GB): These GPUs offer a good balance and can run large models with quantization.
RTX 4070, 4070 Super, and 4070 Ti (12GB): These 12GB cards are competent mid-range options, suitable for medium-sized models with quantization.
RTX 4060 Ti (16GB): This card occupies an interesting position, offering the VRAM of a 4080 but with less compute power and memory bandwidth. It can fit similar models but will run them more slowly.
RTX 4060 and 4060 Ti (8GB): These entry-level cards are the most constrained by VRAM. Focus on smaller models or heavily quantized versions of medium models.
Several tools simplify the process of downloading and running LLMs locally:
Ollama: A streamlined CLI and local server that handles model downloads and serving with a couple of commands.
# Pull and run Llama 3 8B
ollama run llama3:8b
# List downloaded models
ollama list
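Ollama also exposes a local HTTP API (on port 11434 by default), so a downloaded model can be called from Python. A minimal sketch using only the standard library, assuming the Ollama server is running and llama3:8b has been pulled:
import json
import urllib.request

payload = {"model": "llama3:8b", "prompt": "What is CUDA?", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])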
llama.cpp: A lightweight C/C++ inference engine for GGUF models, usable directly from the command line (many other tools, including Ollama, use llama.cpp in the backend).
# Example: Run Llama 3 8B GGUF with GPU offload
./main -m ./models/llama-3-8b-instruct.Q4_K_M.gguf \
-p "User: What is CUDA? Assistant:" \
-n 512 --color -ngl 33 # -ngl: num GPU layers
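If you prefer to stay in Python, the llama-cpp-python bindings wrap the same engine. A minimal sketch, assuming the package is installed with CUDA support and reusing the GGUF file path from the command above:
from llama_cpp import Llama

# Load the GGUF model and offload all layers to the GPU
llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,  # -1 = offload every layer; lower this if you run out of VRAM
    n_ctx=2048,
)

output = llm("User: What is CUDA? Assistant:", max_tokens=256, stop=["User:"])
print(output["choices"][0]["text"])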
Hugging Face Transformers: The standard Python library for working with models programmatically, typically paired with accelerate for multi-GPU/CPU offloading and bitsandbytes for quantization.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Llama 3 weights are gated on Hugging Face: accept the license and log in first
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load with 4-bit quantization (requires the bitsandbytes package)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # Auto-distribute layers across available devices
)

# Basic generation (example)
inputs = tokenizer("Explain quantization:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
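A quick follow-up you can append to the script above to see how much VRAM the quantized model actually used:
# Peak VRAM allocated by PyTorch during the run
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")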
Published benchmarks provide a starting point, but the best way to assess performance is to test models directly on your hardware and workload. A few practical levers:
- Built-in tools: llama.cpp includes perplexity calculation and speed tests. Simple Python scripts using time can measure generation speed for specific prompts (see the sketch after this list).
- Quantization level: Experiment with different GGUF quantization variants or bitsandbytes settings to balance VRAM, speed, and output quality.
- GPU layer offloading (-ngl in llama.cpp): For GGUF models, adjust the number of layers offloaded to the GPU. Maximize this within your VRAM limits for best speed. Start high and decrease if you encounter out-of-memory errors.
- Inference framework: llama.cpp, ExLlamaV2, and TensorRT-LLM are often among the fastest for inference.
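For example, a minimal timing sketch, reusing the model and tokenizer from the Transformers example above (throughput will vary with prompt, context size, and quantization):
import time

prompt = "Explain quantization:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

# Tokens generated beyond the prompt, divided by wall-clock time
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tok/s")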
This provides a concrete workflow using llama.cpp. First, clone and build llama.cpp with CUDA support (detailed build instructions are on the llama.cpp GitHub page).
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Enable CUDA support during compilation
make LLAMA_CUBLAS=1
# (Adjust build flags as needed for your system)
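# Note: newer llama.cpp releases build with CMake instead of make, e.g.
#   cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
# and name the CLI binary llama-cli (under build/bin/) rather than main.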
Next, download a quantized GGUF model into a models directory.
# Example download (check Hugging Face for official sources)
mkdir models
wget -P ./models https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
Then run inference using the main binary (named llama-cli in newer builds), specifying the model, prompt, and GPU layers.
./main -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
-p "User: Write a python function for fibonacci sequence.\nAssistant:" \
-n 256 --color --ctx-size 2048 -ngl 35
The key flags:
- -m: Specifies the model file.
- -p: The initial prompt.
- -n: Maximum number of tokens to generate.
- --color: Enables colored output.
- --ctx-size: Sets the context window size (affects VRAM).
- -ngl: Number of layers to offload to the GPU. For Llama 3 8B (33 layers total), -ngl 33 or higher offloads all layers if VRAM allows.
While the model runs, you can monitor actual VRAM usage with nvidia-smi.
The NVIDIA RTX 40 series GPUs present a capable platform for running a wide range of LLMs locally. VRAM capacity is the most critical factor determining which models are feasible, with the 24GB RTX 4090 offering the most flexibility and the 8GB RTX 4060/4060 Ti providing an entry point for smaller or heavily quantized models.
Quantization techniques like GGUF and 4-bit loading via libraries are essential for fitting larger models into available VRAM. Choosing the right model involves considering its size, task suitability, license, and balancing these against the VRAM and performance characteristics of your specific RTX 40 GPU.
By understanding these factors and utilizing the available tools and frameworks, technical professionals can effectively run powerful LLMs directly on their desktop hardware, gaining advantages in privacy, customization, and offline accessibility.
© 2025 ApX Machine Learning. All rights reserved.