LLM Inference: VRAM & Performance Calculator

Weight Precision: precision of the model weights during inference. Lower precision uses less VRAM but may affect quality.

KV Cache Precision: precision of the KV cache. Lower precision reduces VRAM, especially for long sequences.
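
For intuition, each precision option maps to a number of bytes per element, and the weight footprint alone is roughly parameter count times bytes per parameter. A minimal sketch (illustrative values, not the calculator's exact formula):

```python
# Illustrative only: maps precision choices to bytes per element and estimates
# weight VRAM as params * bytes-per-parameter (the calculator's formula may differ).
BYTES_PER_ELEMENT = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(num_params: float, precision: str = "fp16") -> float:
    """Approximate VRAM for the model weights alone, in GB."""
    return num_params * BYTES_PER_ELEMENT[precision] / 1e9

# Example: a hypothetical 7B-parameter model
print(weight_vram_gb(7e9, "fp16"))  # ~14.0 GB
print(weight_vram_gb(7e9, "int4"))  # ~3.5 GB
```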

GPU: select your GPU or set custom VRAM.

Number of Devices: how many devices are used for parallel inference.

Batch Size: 1
Inputs processed simultaneously per step (affects throughput and latency). Slider range: 1 to 32, log scale.

Sequence Length: 2,048
Maximum tokens per input; impacts the KV cache (also affected by attention structure) and activations. Slider range: up to 131K tokens, log scale.
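
To see why sequence length, batch size, and attention structure dominate the KV cache term, here is a rough sketch; the layer count, KV head count, and head dimension below are illustrative and not taken from the calculator:

```python
def kv_cache_gb(batch_size: int,
                seq_len: int,
                num_layers: int,
                num_kv_heads: int,
                head_dim: int,
                bytes_per_element: float = 2.0) -> float:
    """Approximate KV cache size in GB: 2 tensors (K and V) per layer, each of
    shape [batch, kv_heads, seq_len, head_dim], times bytes per element.
    Grouped-query attention (fewer KV heads than query heads) shrinks this directly."""
    elements = 2 * num_layers * num_kv_heads * head_dim * batch_size * seq_len
    return elements * bytes_per_element / 1e9

# Illustrative Llama-2-7B-like shape: 32 layers, 32 KV heads (MHA), head_dim 128
print(kv_cache_gb(1, 2048, 32, 32, 128))   # ~1.1 GB at FP16
# Same depth with GQA-style 8 KV heads and a 131,072-token context
print(kv_cache_gb(1, 131072, 32, 8, 128))  # ~17.2 GB at FP16
```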

Concurrent Users: 1
Number of users running inference simultaneously (affects memory usage and per-user performance). Slider range: 1 to 32.
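
One plausible way to fold concurrency into the memory estimate is to treat each concurrent user as holding its own batch of in-flight sequences, so the KV cache grows with batch size times concurrent users; this is an assumption for illustration, not necessarily how the calculator models it:

```python
def concurrent_kv_cache_gb(kv_per_sequence_gb: float,
                           batch_size: int,
                           concurrent_users: int) -> float:
    """Assumed model: every concurrent user keeps its own batch of sequences
    resident, so total KV cache scales with batch_size * concurrent_users."""
    return kv_per_sequence_gb * batch_size * concurrent_users

# Example: ~1.07 GB KV cache per 2,048-token sequence at FP16 (see earlier sketch),
# batch size 1, 8 concurrent users
print(concurrent_kv_cache_gb(1.07, 1, 8))  # ~8.6 GB of KV cache alone
```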

Performance & Memory Results

Status: Ready

VRAM: 0 GB of 12 GB used (0.0%)

Generation Speed: ...

Total Throughput: ...

Mode: Inference | Batch: 1

Inference Simulation

(FP16 Weights / FP16 KV Cache) on 16GB Custom GPU

Input sequence length: 2,048 tokens

Configure model and hardware to enable simulation

How Calculations Are Made

Memory usage is estimated with models that account for architecture (parameter count, layers, hidden dimensions, active experts, etc.), quantization, sequence length, and batch size. Performance estimates combine model/hardware analysis with benchmark data, whose accuracy varies. All results are approximate.
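
As a rough end-to-end illustration of such an estimate, the sketch below adds the weight and KV cache terms and applies a flat overhead fraction for activations, CUDA context, and fragmentation; the 10% overhead value is an assumption, not the calculator's internal figure:

```python
def total_inference_vram_gb(num_params: float,
                            bytes_per_weight: float,
                            kv_cache_gb: float,
                            overhead_fraction: float = 0.10) -> float:
    """Very rough total: weights + KV cache, scaled by an assumed overhead
    fraction for activations, CUDA context, and memory fragmentation."""
    weights_gb = num_params * bytes_per_weight / 1e9
    return (weights_gb + kv_cache_gb) * (1 + overhead_fraction)

# Example: 7B parameters in FP16 plus the ~1.07 GB KV cache from the earlier sketch
print(total_inference_vram_gb(7e9, 2.0, 1.07))  # ~16.6 GB
```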

Learn more about how VRAM requirements are calculated →

Recent Updates

  • June 24, 2025 - Add log scale for batch size and sequence length inputs.
  • June 7, 2025 - Fix KV cache calculation for non-MHA attention structures.
  • June 4, 2025 - Improve calculation speed.
  • May 27, 2025 - Add memory offloading options for CPU RAM and NVMe storage.
  • May 10, 2025 - Fix DeepSeek V3 MoE calculation bug. Improve MoE calculation and enable precision selection for fine-tuning.