Best Local LLMs for Every NVIDIA RTX 40 Series GPU

By Ryan A. on Apr 18, 2025

Guest Author

Running Large Language Models (LLMs) locally on consumer hardware is increasingly feasible, offering benefits like enhanced privacy, cost savings, and customization options. NVIDIA's RTX 40 series GPUs, with their significant VRAM and compute capabilities, provide a strong platform for this endeavor.

Selecting the right LLM for your specific RTX 40 series card requires understanding the interplay between GPU specifications, model characteristics, and the available software frameworks. Success depends on matching the model's requirements, particularly its VRAM footprint, to the capabilities of your hardware.

Understanding RTX 40 Series GPUs for LLMs

The suitability of an RTX 40 series GPU for running LLMs primarily hinges on its Video Random Access Memory (VRAM). Larger models, or models running with less compression (quantization), demand more VRAM. Other factors like CUDA core count, Tensor Core performance (especially for lower precision formats like FP16/INT8), and memory bandwidth also influence inference speed (tokens per second).

Here's a quick overview of the VRAM available across the RTX 40 series desktop lineup:

GPU Model            VRAM (GB)   Typical LLM Use Case Potential
RTX 4090             24          Large models (70B+ Q*), high performance
RTX 4080 / Super     16          Medium-large models (30B-70B Q), good performance
RTX 4070 Ti Super    16          Medium-large models (30B-70B Q), good performance
RTX 4070 Ti          12          Medium models (13B-30B Q), solid performance
RTX 4070 / Super     12          Medium models (13B-30B Q), solid performance
RTX 4060 Ti          16          Medium-large models (30B-70B Q), value performance
RTX 4060 Ti          8           Small-medium models (7B-13B Q), entry-level
RTX 4060             8           Small models (3B-7B Q), basic use

* Q denotes quantized models. Performance varies significantly within tiers based on specific model, quantization, and framework.

Ensure you have the latest NVIDIA drivers installed, as they often include performance improvements and compatibility fixes relevant to CUDA and LLM workloads.
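
If you want to confirm that the driver and CUDA stack are visible before downloading any models, a quick check from PyTorch (assuming a CUDA-enabled PyTorch install) reports the detected GPU and its VRAM. This is a minimal sketch, not a required step:

    import torch

    # Sanity check: is the NVIDIA driver/CUDA stack visible to PyTorch?
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU: {props.name}")
        print(f"VRAM: {props.total_memory / 1024**3:.1f} GiB")
        print(f"Compute capability: {props.major}.{props.minor}")
    else:
        print("CUDA not available - check the NVIDIA driver installation")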

Factors for Choosing a Local LLM

Selecting an LLM involves balancing several technical considerations:

Model Size (Parameters)

LLMs are often categorized by their number of parameters (e.g., 7 billion, 13B, 70B). Larger models generally exhibit better reasoning and knowledge capabilities but require significantly more VRAM and compute power. The VRAM requirement scales roughly linearly with the number of parameters and depends on the precision used (e.g., FP16, INT8, INT4).
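
As a rough rule of thumb, the memory needed for the weights alone is simply the parameter count multiplied by the bytes used per parameter. The sketch below illustrates this scaling; it deliberately ignores the KV cache, activations, and framework overhead, which add several more gigabytes in practice:

    def estimate_weight_memory_gb(n_params_billion: float, bits_per_param: float) -> float:
        """Rough estimate of memory for the model weights alone, in decimal GB.
        Ignores KV cache, activations, and framework overhead."""
        return n_params_billion * 1e9 * (bits_per_param / 8) / 1e9

    # A 7B model: ~14 GB at FP16 (16-bit), ~3.5 GB at exactly 4 bits per weight.
    # Practical 4-bit formats such as Q4_K_M use roughly 4.5-5 bits per weight,
    # which is why they land closer to 4-5 GB in practice.
    print(estimate_weight_memory_gb(7, 16))  # 14.0
    print(estimate_weight_memory_gb(7, 4))   # 3.5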

Quantization

Quantization is a technique used to reduce the memory footprint and sometimes accelerate the inference speed of LLMs by representing the model's weights and activations with lower-precision data types (e.g., 8-bit integers (INT8) or 4-bit integers (INT4)) instead of the standard 16-bit floating-point (FP16 or BF16). Common quantization formats include:

  • GGUF (GPT-Generated Unified Format): Used extensively by llama.cpp, it bundles the model and quantization information into a single file. It supports various quantization methods (e.g., Q4_K_M, Q5_K_S, Q8_0) optimized for different quality/performance trade-offs.
  • GPTQ (Generative Pre-trained Transformer Quantization): An early popular post-training quantization method, often requiring specific library support (like AutoGPTQ).
  • AWQ (Activation-aware Weight Quantization): Another method aiming to preserve model quality better during quantization.
  • BitsAndBytes: A library integrated with Hugging Face Transformers enabling on-the-fly quantization (e.g., loading models in 8-bit or 4-bit directly).

Quantization significantly lowers VRAM usage, making it possible to run larger models on GPUs with limited memory. For example, a 7B parameter model might require ~14GB in FP16 but only ~4-5GB using 4-bit quantization (like Q4_K_M).
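
To make this concrete, here is a minimal sketch of loading a 4-bit GGUF model with GPU offload using the llama-cpp-python bindings (assuming they are installed with CUDA support; the model path is illustrative):

    from llama_cpp import Llama

    # Load a 4-bit GGUF model and offload layers to the GPU
    llm = Llama(
        model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
        n_gpu_layers=-1,  # -1 offloads every layer, if VRAM allows
        n_ctx=2048,       # context window; larger values consume more VRAM
    )

    out = llm("User: What is quantization?\nAssistant:", max_tokens=128)
    print(out["choices"][0]["text"])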

Inference Speed

Measured in tokens per second (tok/s), inference speed determines how quickly the model generates text. It's affected by the GPU's compute power (CUDA/Tensor cores), memory bandwidth, model size, quantization level, batch size, and the efficiency of the inference framework being used.

Task Suitability

Models are often fine-tuned for specific tasks. Some are general-purpose chat models (e.g., Llama 3 Instruct), while others might specialize in coding (e.g., Code Llama), instruction following, or specific knowledge domains. Choose a model whose training aligns with your intended application.

Licensing

LLMs come with different licenses that dictate how they can be used, modified, and distributed. Common licenses include Apache 2.0 (permissive), MIT, and specific community licenses like the Llama 2 & 3 Community License, which might have restrictions on commercial use for very large companies. Always check the model's license before using it, especially for commercial applications.

Diagram illustrating the relationship between GPU VRAM, Model Size, Quantization, and Local LLM Execution.

Best LLMs per RTX 40 GPU Tier

These recommendations focus on popular, high-performing models and assume the use of quantization (primarily 4-bit or 5-bit GGUF variants like Q4_K_M or Q5_K_M) unless otherwise stated. Performance is qualitative.

RTX 4090 (24GB VRAM)

With the most VRAM in the consumer lineup, the 4090 can handle the largest models with reasonable quantization or smaller models with higher precision.

  • Llama 3 70B Instruct (Quantized): A top-performing open model. Runs well with 4-bit or 5-bit quantization (e.g., Q4_K_M, Q5_K_M GGUF). Expect excellent performance.
  • Mixtral 8x7B (Quantized): A high-quality Mixture-of-Experts (MoE) model. Requires significant VRAM even when quantized (~25-30GB for Q4_K_M), pushing the limits but potentially feasible with specific quantization/offloading strategies. Performance is generally good.
  • Command R+ (Quantized): Cohere's 104B parameter model. Even at 4-bit quantization its weights exceed 24GB, so it only runs with substantial CPU offloading and noticeably lower speed; it remains a stretch option for users who want its strong reasoning capabilities.
  • DeepSeek-R1-Distill-Qwen-7B / Llama-8B (Quantized): Runs well on the 4090 with 4-bit quantization (~4.5-5GB VRAM). High-quality, coding-capable distilled models.
  • Fine-tuned 30B+ Models: Ideal for running specialized, fine-tuned models based on architectures like Llama or Mistral at higher quality settings.

RTX 4080 / Super & RTX 4070 Ti Super (16GB VRAM)

These GPUs offer a good balance, capable of running large models with quantization.

  • Llama 3 70B Instruct (Quantized): A 70B model at 4-bit (e.g., Q4_K_M GGUF) needs roughly 40GB, so it does not fit in 16GB of VRAM on its own. It is only practical with heavy CPU offloading or very aggressive quantization (e.g., Q2_K/Q3_K variants), and generation will be slow. Better targets for these cards are:
  • Mixtral 8x7B (Quantized): At 4-bit (e.g., Q4_K_M) the model occupies roughly 25-30GB, so a 16GB card must offload a meaningful share of layers to CPU RAM. It remains usable, particularly at moderate context sizes, but expect lower speeds than fully GPU-resident models.
  • Llama 3 8B Instruct (Unquantized/Quantized): Runs exceptionally well, even unquantized (FP16). Excellent performance.
  • Mistral 7B / Zephyr / OpenHermes (Unquantized/Quantized): Smaller, efficient models that run very fast.
  • Phi-3 Medium (Quantized): Microsoft's capable 14B-parameter model; runs very well with quantization.
  • DeepSeek-R1-Distill-Qwen-7B (Quantized): Runs comfortably at 4-bit (~4.5GB VRAM). Great mix of performance and size.
  • Fine-tuned 13B Models (Quantized): Can run fine-tuned 13B models comfortably.

RTX 4070 / Super & RTX 4070 Ti (12GB VRAM)

These 12GB cards are competent mid-range options, suitable for medium-sized models with quantization.

  • Llama 3 8B Instruct (Quantized): Runs very well. The FP16 weights (~16GB) won't fit in 12GB, but high-quality quantizations (Q5_K_M, Q8_0) fit easily and generate quickly.
  • Mistral 7B Instruct (Unquantized/Quantized): Similar to Llama 3 8B, runs extremely well, easily fits even with less aggressive quantization.
  • Phi-3 Medium (Quantized): Runs comfortably with 4-bit quantization (~7GB VRAM needed).
  • Mixtral 8x7B (Heavily Quantized): Possible with aggressive 3-bit or low 4-bit quantization (e.g., Q3_K_M, Q4_0), which still requires a substantial share of layers offloaded to CPU RAM on a 12GB card. Performance will be noticeably slower.
  • DeepSeek-R1-Distill-Qwen-7B (Quantized): Also runs well, especially with Q4 quantization.
  • Fine-tuned 7B/8B Models: Excellent platform for running specialized 7B/8B models.

RTX 4060 Ti (16GB VRAM)

This card is interesting, offering the VRAM of a 4080 but with less compute power and memory bandwidth. It can fit similar models but will run them slower.

  • Model Fit: Can technically load models similar to the 16GB 4070 Ti Super / 4080 (e.g., Mixtral Q4, potentially Llama 3 70B with heavy quantization/offloading).
  • Performance: Expect significantly lower tokens/second compared to the higher-tier 16GB cards due to fewer CUDA cores and narrower memory bus.
  • Recommendations: Llama 3 8B (excellent), Phi-3 Medium (excellent), Mixtral 8x7B (usable with Q4), DeepSeek-R1-Distill-Qwen-7B (very usable), Fine-tuned 7B/13B models.

RTX 4060 Ti (8GB VRAM) & RTX 4060 (8GB VRAM)

These entry-level cards are the most constrained by VRAM. Focus on smaller models or heavily quantized versions of medium models.

  • Llama 3 8B Instruct (Quantized): Runs well with 4-bit quantization (Q4_K_M fits comfortably within 8GB).
  • Mistral 7B Instruct (Quantized): Similar to Llama 3 8B, runs well with 4-bit quantization.
  • Phi-3 Mini / Small (Quantized): These smaller models (3.8B and 7B parameters, respectively) are ideal for 8GB cards, offering good performance and quality for their size.
  • DeepSeek-R1-Distill-Qwen-1.5B (Quantized): Light and fast, great on 8GB cards.
  • Other < 7B Models: Models like StableLM 3B, TinyLlama 1.1B run very fast.
  • Larger Models (e.g., 13B): Possible only with very aggressive quantization (e.g., Q2_K, Q3_K variants), expect slower performance and potential quality degradation.

Tools and Frameworks for Running Local LLMs

Several tools simplify the process of downloading and running LLMs locally:

  • Ollama: Provides a simple command-line interface and local server. Easy setup and model management.
    # Pull and run Llama 3 8B
    ollama run llama3:8b
    
    # List downloaded models
    ollama list
    
  • LM Studio / Jan: User-friendly graphical interfaces (GUIs) for downloading and interacting with various LLMs (often using llama.cpp in the backend).
  • Text Generation WebUI (Oobabooga): A comprehensive Gradio-based web interface supporting various models, loaders (Transformers, ExLlamaV2, Llama.cpp), and features like fine-tuning and chat modes.
  • Llama.cpp: A high-performance inference engine written in C++. Primarily uses the GGUF format and supports CPU and GPU (via CUDA/Metal) acceleration. Requires compilation but offers fine-grained control.
    # Example: Run Llama 3 8B GGUF with GPU offload
    ./main -m ./models/llama-3-8b-instruct.Q4_K_M.gguf \
           -p "User: What is CUDA? Assistant:" \
           -n 512 --color -ngl 33  # -ngl: number of layers to offload to the GPU
    
  • Hugging Face Transformers: The standard Python library for NLP. Can load and run many models, often combined with accelerate for multi-GPU/CPU offloading and bitsandbytes for quantization.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    import torch
    
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    
    # Configure 4-bit quantization via bitsandbytes
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # auto-distribute layers across available devices
    )
    
    # Basic generation (example)
    inputs = tokenizer("Explain quantization:", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    
  • vLLM / TensorRT-LLM: Specialized inference libraries focused on maximizing throughput and minimizing latency, particularly for serving scenarios. These often require model conversion steps and are targeted at advanced users prioritizing speed.
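
For illustration, a minimal offline-inference sketch with vLLM might look like the following (assuming vLLM is installed and the chosen model fits in VRAM; the sampling parameters are illustrative):

    from vllm import LLM, SamplingParams

    # Load the model once; vLLM manages KV-cache memory internally
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        dtype="bfloat16",
        gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may reserve
    )

    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain quantization in one paragraph."], params)
    print(outputs[0].outputs[0].text)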

Benchmarking and Performance Tuning

Published benchmarks provide a starting point, but the best way to assess performance is to test models directly on your hardware and workload.

  • Benchmarking: Many tools include built-in benchmarks (e.g., llama.cpp has perplexity calculation and speed tests). Simple Python scripts using time can measure generation speed for specific prompts; a minimal timing sketch follows this list.
  • Primary Metrics: Tokens per second (generation speed) and time-to-first-token (latency).
  • Tuning:
    • Quantization Level: Experiment with different GGUF quantizations (e.g., Q4_K_M vs. Q5_K_M vs. Q8_0) or bitsandbytes settings to balance VRAM, speed, and output quality.
    • GPU Layer Offloading (-ngl in Llama.cpp): For GGUF models, adjust the number of layers offloaded to the GPU. Maximize this within your VRAM limits for best speed. Start high and decrease if you encounter out-of-memory errors.
    • Batch Size: For frameworks supporting batching (like Transformers, vLLM), increasing batch size can improve throughput but also increases VRAM usage.
    • Context Length: Longer context windows consume more VRAM (due to the KV cache). Adjust based on your needs and available memory.
    • Framework Choice: Different frameworks have varying performance characteristics. Llama.cpp, ExLlamaV2, and TensorRT-LLM are often among the fastest for inference.
    • Software Stack: Ensure CUDA Toolkit, cuDNN, PyTorch, and framework versions are compatible and up-to-date.
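
As a minimal sketch of the timing approach mentioned above, the helper below measures generated tokens per second, assuming a model and tokenizer are already loaded as in the earlier Transformers example:

    import time
    import torch

    def measure_tokens_per_second(model, tokenizer, prompt, max_new_tokens=128):
        """Time a single generation call and return generated tokens per second."""
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        torch.cuda.synchronize()
        start = time.perf_counter()
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        n_generated = outputs.shape[-1] - inputs["input_ids"].shape[-1]
        return n_generated / elapsed

    # Example usage:
    # print(f"{measure_tokens_per_second(model, tokenizer, 'Explain CUDA:'):.1f} tok/s")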

Example: Running Llama 3 8B on an RTX 4070 (12GB) using Llama.cpp

This provides a concrete workflow:

  1. Install Llama.cpp: Clone the repository and build it with CUDA support (follow the instructions on the official llama.cpp GitHub page; note that recent releases build with CMake and name the CLI binary llama-cli, while the commands below show the older Makefile workflow).
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    # Enable CUDA support during compilation
    make LLAMA_CUBLAS=1 
    # (Adjust build flags as needed for your system)
    
  2. Download a Model: Get a GGUF quantized version of Llama 3 8B Instruct. Q4_K_M is a good balance for 12GB VRAM.
    # Example download (check Hugging Face for official sources)
    mkdir models
    wget -P ./models https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
    
  3. Run Inference: Execute the main binary, specifying the model, prompt, and GPU layers.
    ./main -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
       -p "User: Write a python function for fibonacci sequence.\nAssistant:" \
       -n 256 --color --ctx-size 2048 -ngl 35
    
    • -m: Specifies the model file.
    • -p: The initial prompt.
    • -n: Maximum number of tokens to generate.
    • --color: Enables colored output.
    • --ctx-size: Sets the context window size (affects VRAM).
    • -ngl: Number of layers to offload to the GPU. For Llama 3 8B (33 layers total), -ngl 33 or higher offloads all layers if VRAM allows.
  • Expected Outcome: On an RTX 4070, a Q4_K_M Llama 3 8B model should fit comfortably within the 12GB VRAM (using ~5-6GB) and achieve good inference speeds (often >30 tok/s, depending on specific settings and hardware). Monitor VRAM usage with nvidia-smi.
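
Besides nvidia-smi, VRAM usage can also be checked programmatically. Here is a small sketch using the NVIDIA management library bindings (assuming the pynvml package is installed); it reports device-wide memory usage, regardless of which framework allocated it:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"VRAM used: {mem.used / 1024**3:.1f} / {mem.total / 1024**3:.1f} GiB")
    pynvml.nvmlShutdown()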

Conclusion

The NVIDIA RTX 40 series GPUs present a capable platform for running a wide range of LLMs locally. VRAM capacity is the most critical factor determining which models are feasible, with the 24GB RTX 4090 offering the most flexibility and the 8GB RTX 4060/4060 Ti providing an entry point for smaller or heavily quantized models.

Quantization techniques like GGUF and 4-bit loading via libraries are essential for fitting larger models into available VRAM. Choosing the right model involves considering its size, task suitability, license, and balancing these against the VRAM and performance characteristics of your specific RTX 40 GPU.

By understanding these factors and utilizing the available tools and frameworks, technical professionals can effectively run powerful LLMs directly on their desktop hardware, gaining advantages in privacy, customization, and offline accessibility.

© 2025 ApX Machine Learning. All rights reserved.
