

LLM Inference: VRAM & Performance Calculator

Weight Precision: Precision used for model weights during inference. Lower precision uses less VRAM but may affect quality.
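As a rough illustration (not the calculator's exact formula), weight memory is approximately the parameter count times the bytes per parameter for the chosen precision. The small overhead factor in this sketch is an assumption.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# The 5% overhead factor is an assumption, not the calculator's exact model.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(num_params: float, precision: str = "fp16", overhead: float = 1.05) -> float:
    """Approximate VRAM needed just to hold the model weights, in GB."""
    return num_params * BYTES_PER_PARAM[precision] * overhead / 1e9

# Example: a 7B-parameter model in FP16 needs on the order of 14-15 GB for weights alone.
print(f"{weight_vram_gb(7e9, 'fp16'):.1f} GB")
```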

KV Cache Precision: Lower values reduce VRAM usage, especially for long sequences.
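For a sense of scale, the KV cache grows with the number of layers, KV heads, head dimension, sequence length, and batch size. The sketch below assumes a standard dense decoder with multi-head or grouped-query attention; the example dimensions are illustrative, not taken from any specific model in the calculator.

```python
# Approximate KV-cache size: 2 (keys and values) x layers x kv_heads x head_dim
# x tokens x batch x bytes per element. All dimensions below are illustrative.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: float = 2.0) -> float:
    """Rough KV-cache footprint in GB for a dense decoder-only model."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Example: 32 layers, 8 KV heads (GQA), head_dim 128, 4,096-token context, batch 1, FP16.
print(f"{kv_cache_gb(32, 8, 128, 4096, 1):.2f} GB")  # ~0.5 GB; roughly halves with an FP8 cache
```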

GPU: Select your GPU or set a custom VRAM amount.

GPU Count: Devices used for parallel inference.
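When weights and KV cache are sharded across several GPUs (for example with tensor parallelism), per-device memory drops roughly in proportion, minus some duplicated runtime overhead. The function and numbers below are assumptions for illustration only, not the calculator's multi-GPU scaling model.

```python
# Illustrative tensor-parallel split: weights and KV cache divide roughly evenly
# across devices, while some runtime overhead is duplicated on every GPU.
def per_gpu_vram_gb(weights_gb: float, kv_cache_gb: float,
                    num_gpus: int, duplicated_overhead_gb: float = 1.0) -> float:
    """Rough per-device VRAM when a model is split across num_gpus devices."""
    return (weights_gb + kv_cache_gb) / num_gpus + duplicated_overhead_gb

# Example: ~140 GB of FP16 weights plus 10 GB of KV cache spread across 4 GPUs.
print(f"{per_gpu_vram_gb(140.0, 10.0, 4):.1f} GB per GPU")  # ~38.5 GB each
```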

Input Parameters

Batch Size: 1 (presets: 1, 8, 16, 32)

Inputs processed simultaneously per step; affects throughput and latency (a toy scaling sketch follows below).
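Loosely speaking, larger batches raise total throughput until the GPU saturates, while each individual request gets a smaller share of it. The single-stream speed and saturation point below are placeholder numbers, not benchmark results.

```python
# Toy model of batching: aggregate decode throughput scales roughly linearly with
# batch size up to an assumed saturation point, after which it flattens.
def total_decode_tps(batch: int, single_stream_tps: float = 40.0,
                     saturation_batch: int = 16) -> float:
    """Illustrative aggregate tokens/sec as a function of batch size."""
    return single_stream_tps * min(batch, saturation_batch)

for batch in (1, 8, 16, 32):
    total = total_decode_tps(batch)
    print(f"batch={batch:>2}  total ~{total:4.0f} tok/s  per-request ~{total / batch:4.1f} tok/s")
```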

Sequence Length: 1,024 (presets: 8K, 16K, 33K, 66K, 131K)

Maximum tokens per input; impacts the KV cache (also affected by the attention structure) and activation memory, as in the KV-cache sketch above.

Concurrent Users: 1 (presets: 1, 4, 8, 16, 32)

Number of users running inference simultaneously; affects memory usage and per-user performance (see the queuing sketch below).
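The Dec 8 update noted below accounts for queuing when concurrent users exceed the batch size. A minimal sketch of that idea, using a simple fair time-sharing assumption rather than the calculator's exact formula, looks like this:

```python
# Toy queuing model: once concurrent users exceed the batch size, requests
# time-share the available decode slots and each user sees a slower speed.
def per_user_tps(single_stream_tps: float, batch: int, users: int) -> float:
    """Approximate per-user generation speed with simple fair time-sharing."""
    if users <= batch:
        # Every user decodes concurrently; each sees roughly single-stream speed
        # (ignoring the mild per-stream slowdown from batching).
        return single_stream_tps
    # Extra users wait in a queue, so the batch's output is shared across all users.
    return single_stream_tps * batch / users

# Example: ~40 tok/s per stream, batch 16, 32 users -> ~20 tok/s effective per user.
print(f"{per_user_tps(40.0, 16, 32):.1f} tok/s per user")
```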

Performance & Memory Results

The results panel reports estimated VRAM usage against the selected GPU's available memory, along with Generation Speed, Time to First Token, Total Throughput, and the current mode and batch size. The values update once a model and hardware are configured.

Inference Simulation

Simulates token generation for the selected configuration, for example FP16 weights and an FP16 KV cache on a 16 GB custom GPU with a 1,024-token input sequence. Configure a model and hardware to enable the simulation.
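To give a feel for where the Time to First Token estimate comes from, prefill cost grows with prompt length and parameter count. The sketch below uses the common "2 x parameters FLOPs per token" approximation and an assumed compute efficiency; it is a simplification, not the calculator's actual method.

```python
# Rough TTFT (prefill) estimate: processing the prompt costs about
# 2 * params FLOPs per token; divide by usable GPU compute. Illustrative only.
def ttft_seconds(prompt_tokens: int, params: float,
                 gpu_tflops: float, efficiency: float = 0.4) -> float:
    """Approximate time to first token from prompt-processing compute alone."""
    flops_needed = 2.0 * params * prompt_tokens
    return flops_needed / (gpu_tflops * 1e12 * efficiency)

# Example: 1,024-token prompt, 7B-parameter model, GPU with ~150 TFLOPS of FP16 compute.
print(f"~{ttft_seconds(1024, 7e9, 150.0) * 1000:.0f} ms")  # roughly 240 ms under these assumptions
```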

How Calculations Are Made

Memory usage is estimated from the model architecture (parameters, layers, hidden dimensions, active experts, and so on), the chosen quantization, sequence length, and batch size. Performance estimates combine model and hardware analysis with benchmarks, though benchmark accuracy varies. Results are approximate.

Learn more about how VRAM requirements are calculated →
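Putting the pieces together, a much-simplified version of such an estimate sums weight, KV-cache, and activation/workspace memory and compares the total with the GPU's capacity. The flat activation allowance and the example numbers below are assumptions for illustration, not the calculator's internal model.

```python
# Simplified end-to-end check (illustration only): total inference VRAM is
# roughly weights + KV cache + activations/workspace, compared against the GPU.
def fits_on_gpu(weights_gb: float, kv_cache_gb: float,
                activation_gb: float, gpu_vram_gb: float) -> bool:
    """Report the rough memory total and whether it fits in the selected GPU."""
    total = weights_gb + kv_cache_gb + activation_gb
    print(f"~{total:.1f} GB needed of {gpu_vram_gb:.0f} GB available")
    return total <= gpu_vram_gb

# Example: 7B model in FP16 (~14.7 GB), 0.5 GB KV cache, 1 GB workspace, 16 GB GPU.
fits_on_gpu(14.7, 0.5, 1.0, 16.0)  # ~16.2 GB needed, so it does not quite fit in 16 GB
```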

Recent Updates

  • Dec 8, 2025 - Fix per-user speed calculation to properly account for queuing when concurrent users exceed batch size.
  • Dec 5, 2025 - Fix TTFT calculation bug where the Flash Attention optimization was applied incorrectly. Fix TPS calculation for MoE models to account for active experts.
  • Dec 1, 2025 - Add multi-GPU scaling factor configuration. Fix AMD APU RAM availability.
  • Oct 28, 2025 - Add Time to First Token (TTFT) estimation. Update calculations to account for modern optimizations in inference frameworks.
  • Oct 22, 2025 - Fix activation memory partitioning in distributed training, improve large-model (>100B) architecture estimation with log-based scaling, and fix a negative-value bug with heavy offloading.

Frequently Asked Questions