

LLM Inference: VRAM & Performance Calculator

Weight Precision: Precision used for model weights during inference. Lower precision uses less VRAM but may affect quality.
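As a rough illustration (not the calculator's exact formula), weight memory is approximately the parameter count times the bytes per parameter for the chosen precision. The small overhead factor in this sketch is an assumption.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# The 5% overhead factor is an assumption, not the calculator's exact model.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(num_params: float, precision: str = "fp16", overhead: float = 1.05) -> float:
    """Approximate VRAM needed just to hold the model weights, in GB."""
    return num_params * BYTES_PER_PARAM[precision] * overhead / 1e9

# Example: a 7B-parameter model in FP16 needs on the order of 14-15 GB for weights alone.
print(f"{weight_vram_gb(7e9, 'fp16'):.1f} GB")
```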

KV Cache Precision: Lower values reduce VRAM usage, especially for long sequences.
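For a sense of scale, the KV cache grows with the number of layers, KV heads, head dimension, sequence length, and batch size. The sketch below assumes a standard dense decoder with multi-head or grouped-query attention; the example dimensions are illustrative, not taken from any specific model in the calculator.

```python
# Approximate KV-cache size: 2 (keys and values) x layers x kv_heads x head_dim
# x tokens x batch x bytes per element. All dimensions below are illustrative.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: float = 2.0) -> float:
    """Rough KV-cache footprint in GB for a dense decoder-only model."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Example: 32 layers, 8 KV heads (GQA), head_dim 128, 4,096-token context, batch 1, FP16.
print(f"{kv_cache_gb(32, 8, 128, 4096, 1):.2f} GB")  # ~0.5 GB; roughly halves with an FP8 cache
```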

GPU: Select your GPU or set a custom VRAM amount.

GPU Count: Devices used for parallel inference.
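When weights and KV cache are sharded across several GPUs (for example with tensor parallelism), per-device memory drops roughly in proportion, minus some duplicated runtime overhead. The function and numbers below are assumptions for illustration only, not the calculator's multi-GPU scaling model.

```python
# Illustrative tensor-parallel split: weights and KV cache divide roughly evenly
# across devices, while some runtime overhead is duplicated on every GPU.
def per_gpu_vram_gb(weights_gb: float, kv_cache_gb: float,
                    num_gpus: int, duplicated_overhead_gb: float = 1.0) -> float:
    """Rough per-device VRAM when a model is split across num_gpus devices."""
    return (weights_gb + kv_cache_gb) / num_gpus + duplicated_overhead_gb

# Example: ~140 GB of FP16 weights plus 10 GB of KV cache spread across 4 GPUs.
print(f"{per_gpu_vram_gb(140.0, 10.0, 4):.1f} GB per GPU")  # ~38.5 GB each
```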

Input Parameters

Batch Size: 1 (presets: 1, 8, 16, 32)

Inputs processed simultaneously per step; affects throughput and latency (a toy scaling sketch follows below).
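Loosely speaking, larger batches raise total throughput until the GPU saturates, while each individual request gets a smaller share of it. The single-stream speed and saturation point below are placeholder numbers, not benchmark results.

```python
# Toy model of batching: aggregate decode throughput scales roughly linearly with
# batch size up to an assumed saturation point, after which it flattens.
def total_decode_tps(batch: int, single_stream_tps: float = 40.0,
                     saturation_batch: int = 16) -> float:
    """Illustrative aggregate tokens/sec as a function of batch size."""
    return single_stream_tps * min(batch, saturation_batch)

for batch in (1, 8, 16, 32):
    total = total_decode_tps(batch)
    print(f"batch={batch:>2}  total ~{total:4.0f} tok/s  per-request ~{total / batch:4.1f} tok/s")
```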

Sequence Length: 1,024 (presets: 8K, 16K, 33K, 66K, 131K)

Maximum tokens per input; impacts the KV cache (also affected by the attention structure) and activation memory, as in the KV-cache sketch above.

Concurrent Users: 1 (presets: 1, 4, 8, 16, 32)

Number of users running inference simultaneously; affects memory usage and per-user performance (see the queuing sketch below).
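The Dec 8 update noted below accounts for queuing when concurrent users exceed the batch size. A minimal sketch of that idea, using a simple fair time-sharing assumption rather than the calculator's exact formula, looks like this:

```python
# Toy queuing model: once concurrent users exceed the batch size, requests
# time-share the available decode slots and each user sees a slower speed.
def per_user_tps(single_stream_tps: float, batch: int, users: int) -> float:
    """Approximate per-user generation speed with simple fair time-sharing."""
    if users <= batch:
        # Every user decodes concurrently; each sees roughly single-stream speed
        # (ignoring the mild per-stream slowdown from batching).
        return single_stream_tps
    # Extra users wait in a queue, so the batch's output is shared across all users.
    return single_stream_tps * batch / users

# Example: ~40 tok/s per stream, batch 16, 32 users -> ~20 tok/s effective per user.
print(f"{per_user_tps(40.0, 16, 32):.1f} tok/s per user")
```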

Performance & Memory Results

The results panel reports estimated VRAM usage against the selected GPU's available memory, along with Generation Speed, Time to First Token, Total Throughput, and the current mode and batch size. The values update once a model and hardware are configured.

Inference Simulation

Simulates token generation for the selected configuration, for example FP16 weights and an FP16 KV cache on a 16 GB custom GPU with a 1,024-token input sequence. Configure a model and hardware to enable the simulation.
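To give a feel for where the Time to First Token estimate comes from, prefill cost grows with prompt length and parameter count. The sketch below uses the common "2 x parameters FLOPs per token" approximation and an assumed compute efficiency; it is a simplification, not the calculator's actual method.

```python
# Rough TTFT (prefill) estimate: processing the prompt costs about
# 2 * params FLOPs per token; divide by usable GPU compute. Illustrative only.
def ttft_seconds(prompt_tokens: int, params: float,
                 gpu_tflops: float, efficiency: float = 0.4) -> float:
    """Approximate time to first token from prompt-processing compute alone."""
    flops_needed = 2.0 * params * prompt_tokens
    return flops_needed / (gpu_tflops * 1e12 * efficiency)

# Example: 1,024-token prompt, 7B-parameter model, GPU with ~150 TFLOPS of FP16 compute.
print(f"~{ttft_seconds(1024, 7e9, 150.0) * 1000:.0f} ms")  # roughly 240 ms under these assumptions
```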

How Calculations Are Made

Memory usage is estimated from the model architecture (parameters, layers, hidden dimensions, active experts, and so on), the chosen quantization, sequence length, and batch size. Performance estimates combine model and hardware analysis with benchmarks, though benchmark accuracy varies. Results are approximate.

Learn more about how VRAM requirements are calculated →
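Putting the pieces together, a much-simplified version of such an estimate sums weight, KV-cache, and activation/workspace memory and compares the total with the GPU's capacity. The flat activation allowance and the example numbers below are assumptions for illustration, not the calculator's internal model.

```python
# Simplified end-to-end check (illustration only): total inference VRAM is
# roughly weights + KV cache + activations/workspace, compared against the GPU.
def fits_on_gpu(weights_gb: float, kv_cache_gb: float,
                activation_gb: float, gpu_vram_gb: float) -> bool:
    """Report the rough memory total and whether it fits in the selected GPU."""
    total = weights_gb + kv_cache_gb + activation_gb
    print(f"~{total:.1f} GB needed of {gpu_vram_gb:.0f} GB available")
    return total <= gpu_vram_gb

# Example: 7B model in FP16 (~14.7 GB), 0.5 GB KV cache, 1 GB workspace, 16 GB GPU.
fits_on_gpu(14.7, 0.5, 1.0, 16.0)  # ~16.2 GB needed, so it does not quite fit in 16 GB
```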

Recent Updates

  • Dec 8, 2025 - Fix per-user speed calculation to properly account for queuing when concurrent users exceed batch size.
  • Dec 5, 2025 - Fix TTFT calculation bug where the Flash Attention optimization was applied incorrectly. Fix TPS calculation for MoE models to account for active experts.
  • Dec 1, 2025 - Add multi-GPU scaling factor configuration. Fix AMD APU RAM availability.
  • Oct 28, 2025 - Add Time to First Token (TTFT) estimation. Update calculations to account for modern optimizations in inference frameworks.
  • Oct 22, 2025 - Fix activation memory partitioning in distributed training, improve large-model (>100B) architecture estimation with log-based scaling, and fix a negative-value bug with heavy offloading.

Frequently Asked Questions