
LLM Inference: VRAM & Performance Calculator

Weight Precision: Precision for model weights during inference. Lower precision uses less VRAM but may affect quality.

KV Cache Precision: Lower values reduce VRAM, especially for long sequences.

GPU: Select your GPU or set custom VRAM.

Number of GPUs: Devices for parallel inference.
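
To make the precision setting above concrete, here is a minimal Python sketch of how weight precision maps to VRAM for the weights alone. The bytes-per-parameter mapping and the 7B example are illustrative assumptions, not the calculator's exact model.

```python
# Illustrative only: approximate VRAM needed just to hold the model weights.
# The bytes-per-parameter values below are a common assumption, not the
# calculator's exact mapping.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Weight memory in GB: parameter count times bytes per parameter."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# Example: a 7B-parameter model.
print(f"{weight_memory_gb(7e9, 'fp16'):.1f} GB at FP16")  # ~14.0 GB
print(f"{weight_memory_gb(7e9, 'int4'):.1f} GB at INT4")  # ~3.5 GB
```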

Batch Size: 1

Inputs processed simultaneously per step (affects throughput & latency).

Sequence Length: 1,024

Max tokens per input; impacts KV cache (also affected by attention structure) & activations.
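
The KV cache term referenced above can be approximated with the sketch below. It assumes a standard transformer layout and is not the calculator's exact formula; the layer count, head counts, and head dimension in the example are illustrative, and `num_kv_heads` is where the attention structure (MHA vs. GQA/MQA) enters.

```python
# Rough KV-cache sizing under standard transformer assumptions (illustrative,
# not the calculator's exact formula). num_kv_heads encodes the attention
# structure: equal to the number of query heads for MHA, smaller for GQA,
# and 1 for MQA.
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int,
                bytes_per_value: float = 2.0) -> float:
    """Keys + values stored for every layer, KV head, and token position."""
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * seq_len * batch_size / 1e9

# Example: a 7B-class model (32 layers, 32 KV heads, head_dim 128) at FP16.
print(f"{kv_cache_gb(32, 32, 128, seq_len=1024, batch_size=1):.2f} GB")   # ~0.54 GB
# The same shape with GQA (8 KV heads) at a 131K context:
print(f"{kv_cache_gb(32, 8, 128, seq_len=131072, batch_size=1):.1f} GB")  # ~17.2 GB
```

Halving `bytes_per_value` (for example, an FP8 KV cache) halves these figures, which is why the KV cache precision setting matters most at long sequence lengths.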

Concurrent Users: 1

Number of users running inference simultaneously (affects memory usage and per-user performance).
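
As a rough illustration of the concurrency setting, the sketch below simply multiplies a per-request KV cache by the number of in-flight requests. Real serving stacks may share prompt prefixes or page the cache, so treat this as an upper-bound style assumption rather than the calculator's method.

```python
# Illustrative upper bound: every in-flight request keeps its own KV cache
# at the configured sequence length (real servers may share or page caches).
def total_kv_gb(per_request_kv_gb: float, batch_size: int,
                concurrent_users: int) -> float:
    """Total KV-cache VRAM across all simultaneous requests."""
    return per_request_kv_gb * batch_size * concurrent_users

# Example: 0.54 GB per request (from the sketch above), batch 1, 8 users.
print(f"{total_kv_gb(0.54, 1, 8):.1f} GB")  # ~4.3 GB of KV cache alone
```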

Performance & Memory Results

0.0%

VRAM

Ready

0 GB

of 12 GB VRAM

Generation Speed: ...

Total Throughput: ...

Mode: Inference | Batch: 1

Inference Simulation

(FP16 Weights / FP16 KV Cache) on 16GB Custom GPU

Input sequence length: 1,024 tokens

Configure model and hardware to enable simulation

How Calculations Are Made

Memory usage is estimated from the model's architecture (parameter count, layers, hidden dimensions, active experts, etc.), quantization, sequence length, and batch size. Performance estimates combine model and hardware analysis with benchmark data, whose accuracy varies. All results are approximate.

Learn more about how VRAM requirements are calculated →
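
For orientation, the pieces above can be combined into a single back-of-the-envelope estimate. This sketch is deliberately simplified: the flat `overhead_gb` allowance stands in for activations, buffers, and framework overhead that the calculator models in more detail, and every parameter value in the example is an assumption.

```python
# Simplified end-to-end estimate (a sketch, not the calculator's exact model):
# weights + KV cache + a flat allowance for activations and framework overhead.
def estimate_vram_gb(num_params: float, weight_bytes: float,
                     num_layers: int, num_kv_heads: int, head_dim: int,
                     kv_bytes: float, seq_len: int, batch_size: int,
                     concurrent_users: int = 1,
                     overhead_gb: float = 1.5) -> float:
    weights_gb = num_params * weight_bytes / 1e9
    kv_gb = (2 * num_layers * num_kv_heads * head_dim * kv_bytes
             * seq_len * batch_size * concurrent_users) / 1e9
    return weights_gb + kv_gb + overhead_gb

# Example: 7B model, FP16 weights and KV cache, 1,024-token context, batch 1.
print(f"{estimate_vram_gb(7e9, 2.0, 32, 32, 128, 2.0, 1024, 1):.1f} GB")  # ~16.0 GB
```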

Recent Updates

  • Sept 29, 2025 - Improve KV cache scaling calculation for MoE models. Fix bug in the expert calculation.
  • Sept 9, 2025 - Update GPU list with newer releases.
  • July 31, 2025 - Fix TPS scaling factor bug. Update activation calculation formula.
  • June 24, 2025 - Add log scale for batch size and sequence length inputs.
  • June 7, 2025 - Fix KV cache calculation for non-MHA attention structures.

Frequently Asked Questions