
Local LLMs: Find the Best Model to Run on Your Hardware

LLM Inference: VRAM & Performance Calculator

Precision for model weights during inference. Lower precision uses less VRAM but may affect quality.

KV cache precision. Lower precision reduces VRAM usage, especially for long sequences.

Select your GPU or set custom VRAM.

Number of devices for parallel inference.

Input Parameters

Batch Size: 1

Inputs processed simultaneously per step (affects throughput and latency).

Sequence Length: 1,024

Maximum tokens per input; impacts the KV cache (also shaped by the attention structure) and activation memory.
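The KV cache note above can be made concrete with a back-of-the-envelope formula. The sketch below is a minimal estimate assuming a standard decoder with grouped-query attention; the 8B-class shape used in the example (32 layers, 8 KV heads, head dimension 128) is illustrative, not tied to any specific model in the calculator:

```python
def kv_cache_bytes(seq_len, batch_size, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: one K and one V tensor per layer,
    each of shape [batch, n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Illustrative 8B-class shape, FP16 cache (2 bytes per element)
print(kv_cache_bytes(1024, 1, 32, 8, 128) / 2**30)    # 0.125 GiB at 1,024 tokens
print(kv_cache_bytes(131072, 1, 32, 8, 128) / 2**30)  # 16.0 GiB at 131K tokens
```

This is why the sequence-length slider dominates VRAM at long contexts: the cache grows linearly with both sequence length and batch size, while the weight footprint stays fixed.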

Concurrent Users: 1

Number of users running inference simultaneously (affects memory usage and per-user performance).
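The per-user slowdown mentioned above can be modeled simply: up to `batch_size` users are served in one batch, and beyond that requests queue and share batch slots (the behavior the Dec 8, 2025 update fixed). The function below is an illustrative model under that assumption, not the calculator's actual code:

```python
def per_user_tps(single_stream_tps, batch_size, concurrent_users):
    """Approximate tokens/sec seen by each user under batched serving."""
    if concurrent_users <= batch_size:
        # Everyone fits in one batch; each user runs at full speed.
        return single_stream_tps
    # Excess users queue, so each user effectively gets a
    # batch_size / concurrent_users share of serving time.
    return single_stream_tps * batch_size / concurrent_users

print(per_user_tps(50, 8, 4))   # 50: no queuing
print(per_user_tps(50, 8, 16))  # 25.0: each user waits half the time
```

In practice per-stream speed also dips somewhat as batches grow, so treat this as a first-order model.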

Performance & Memory Results

VRAM: 0 GB of 12 GB (0.0%)

Status: Ready

Generation Speed: ...

Time to First Token: ~0 ms

Total Throughput: ...

Mode: Inference | Batch: 1

Inference Simulation

(FP16 Weights / FP16 KV Cache) on a 16 GB custom GPU

Input sequence length: 1,024 tokens

Configure model and hardware to enable simulation

How Calculations Are Made

Memory usage is estimated with models that account for architecture (parameter count, layer count, hidden dimensions, active experts, etc.), quantization, sequence length, and batch size. Performance estimates draw on model/hardware analysis and published benchmarks, though benchmark accuracy varies. Treat all results as approximate.
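As a rough illustration of the kind of estimate described above (not the calculator's exact model), total inference VRAM can be approximated as weights plus KV cache plus a flat overhead margin for activations and runtime buffers; the precision table and overhead value here are assumptions:

```python
# Bytes per element for common precisions (illustrative subset)
BYTES_PER_ELEM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gib(n_params, weight_dtype, n_layers, n_kv_heads, head_dim,
                      seq_len, batch_size, kv_dtype="fp16", overhead_gib=1.0):
    """Weights + KV cache + flat overhead, in GiB. A first-order sketch."""
    weight_bytes = n_params * BYTES_PER_ELEM[weight_dtype]
    kv_bytes = (2 * n_layers * n_kv_heads * head_dim
                * seq_len * batch_size * BYTES_PER_ELEM[kv_dtype])
    return (weight_bytes + kv_bytes) / 2**30 + overhead_gib

# 8B-parameter model, FP16 weights and KV cache, 1,024-token context, batch 1
print(round(estimate_vram_gib(8e9, "fp16", 32, 8, 128, 1024, 1), 1))  # ~16.0 GiB
```

Dropping the weights to INT4 roughly quarters the weight term, which is why lower weight precision is the single biggest lever for fitting a model into limited VRAM.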

Learn more about how VRAM requirements are calculated →

Recent Updates

  • Feb 18, 2026 - Improve batch size scaling for fine-tuning.
  • Feb 3, 2026 - Add training cost estimation.
  • Dec 8, 2025 - Fix per-user speed calculation to properly account for queuing when concurrent users exceed batch size.
  • Dec 5, 2025 - Fix TTFT calculation bug where the Flash Attention optimization was applied incorrectly. Fix TPS calculation for MoE models to account for active experts.
  • Dec 1, 2025 - Add multi-GPU scaling factor configuration. Fix AMD APU RAM availability.

Frequently Asked Questions