LLM Inference: VRAM & Performance Calculator

Calculator inputs

  • Weight precision - precision for model weights during inference; lower precision uses less VRAM but may affect quality.
  • GPU - select your GPU or set a custom VRAM amount.
  • Devices - number of devices used for parallel inference.
  • Batch size (1-32, default 1) - inputs processed simultaneously per step; affects throughput and latency.
  • Sequence length (up to 131K tokens, default 2,048) - max tokens per input sequence; affects the KV cache and activation memory (see the sketch below this list).
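To make the batch size and sequence length trade-off concrete, here is a minimal sketch of how the KV cache grows for a dense transformer. The function name and the example configuration (32 layers, 8 KV heads, head dimension 128) are assumptions for illustration, not the calculator's internal formula.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Keys and values are stored for every layer, every position, and every
    sequence in the batch; an FP16 cache uses 2 bytes per element."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Example: a hypothetical 32-layer model with 8 KV heads of dimension 128, FP16 cache.
gib = kv_cache_bytes(32, 8, 128, seq_len=2048, batch_size=1) / 1024**3
print(f"KV cache: {gib:.2f} GiB")  # 0.25 GiB; doubling seq_len or batch doubles it
```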

Calculator outputs

  • Estimated VRAM usage, shown in GB and as a percentage of the selected GPU's VRAM (e.g. against the 12 GB of an RTX 3060).
  • Estimated generation speed and total throughput for the chosen precision (e.g. FP16) and batch size.
  • An inference simulation of the selected model on the selected hardware, available once both are configured.

How Calculations Are Made

Memory usage is estimated with analytical models that account for the architecture (parameter count, layers, hidden dimensions, active experts, etc.), quantization, sequence length, and batch size. Performance estimates combine model/hardware analysis with benchmarks, though benchmark accuracy varies. All results are approximate.

Learn more about how VRAM requirements are calculated →
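The sketch below illustrates the kind of estimate described above; it is not the calculator's actual formula. The per-weight byte costs, the fixed 1 GB overhead for activations and framework buffers, and the example 8B configuration are all assumptions for illustration.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}  # rough per-weight cost

def estimate_vram_gb(params_billion: float, precision: str,
                     num_layers: int, num_kv_heads: int, head_dim: int,
                     seq_len: int, batch_size: int,
                     overhead_gb: float = 1.0) -> float:
    """Back-of-the-envelope inference VRAM: weights + FP16 KV cache + fixed overhead."""
    weight_bytes = params_billion * 1e9 * BYTES_PER_PARAM[precision]
    kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * 2
    return (weight_bytes + kv_bytes) / 1024**3 + overhead_gb

# Example: an 8B dense model in FP16 at 2,048 tokens, batch size 1, on a 12 GB GPU.
needed = estimate_vram_gb(8, "fp16", num_layers=32, num_kv_heads=8, head_dim=128,
                          seq_len=2048, batch_size=1)
print(f"~{needed:.1f} GB needed vs. 12 GB available")  # ~16.2 GB: does not fit
```

Under the same assumptions, switching to INT4 weights drops the weight term to roughly 3.7 GB and the total to about 5 GB, which is why lower precision is the first lever to pull when VRAM is tight.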

Updates

  • 3 May 2025 - Fix MoE calculation bug for Qwen 3. Add more GPU options.
  • 2 May 2025 - Fix RTX 5090 VRAM. Temporarily disable hidden layer size in the Qwen 3 calculation.*

* Newly released models may have more inaccuracies. The Qwen 3 calculation will be improved once the model code is open-sourced and the architecture is better understood.
