Precision of the model weights during inference. Lower precision uses less VRAM but may reduce output quality.
Select your GPU or set custom VRAM
Devices for parallel inference
Batch Size: 1
Inputs processed simultaneously per step (affects throughput & latency)
Sequence Length: 2,048
Max tokens per input sequence (affects KV cache & activations)
VRAM: 0 GB of 12 GB (0.0%)
Generation Speed: ...
Total Throughput: ...
Mode: Inference | Batch: 1
(FP16) on RTX 3060 (12 GB)
Input sequence length: 2,048 tokens
Configure model and hardware to enable simulation
Memory usage is estimated with models that account for architecture (parameter count, layers, hidden dimensions, active experts, etc.), quantization, sequence length, and batch size. Performance estimates combine model and hardware analysis with published benchmarks, whose accuracy varies. All results are approximate.
Learn more about how VRAM requirements are calculated →
* Newly released models may have more inaccuracies. We will improve the calculation for Qwen 3 once its model code is open-sourced and the architecture is better understood.
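As a rough illustration of the kind of estimate described above, here is a minimal Python sketch. The formulas, default values, and the `estimate_inference_vram_gb` name are simplifying assumptions for illustration only, not the calculator's actual model (which also accounts for details such as grouped-query attention, MoE active experts, and framework overhead):

```python
def estimate_inference_vram_gb(
    params_b: float,                # model size in billions of parameters
    n_layers: int,                  # transformer layer count
    hidden_dim: int,                # model (embedding) dimension
    seq_len: int,                   # tokens per input sequence
    batch_size: int,                # sequences processed per step
    bytes_per_weight: float = 2.0,  # FP16 = 2, INT8 = 1, INT4 = 0.5
    kv_bytes: float = 2.0,          # KV cache precision (often FP16)
) -> float:
    GB = 1024 ** 3
    # Weights: every parameter stored once at the chosen precision.
    weights = params_b * 1e9 * bytes_per_weight
    # KV cache: K and V tensors per layer, one hidden-dim vector per token.
    kv_cache = 2 * n_layers * hidden_dim * seq_len * batch_size * kv_bytes
    # Crude activation allowance: a few hidden-state buffers per token.
    activations = 4 * batch_size * seq_len * hidden_dim * kv_bytes
    return (weights + kv_cache + activations) / GB

# Example: a hypothetical 7B model (32 layers, d_model 4096) at FP16 with
# the defaults shown above (batch 1, sequence length 2,048):
print(f"~{estimate_inference_vram_gb(7, 32, 4096, 2048, 1):.1f} GB")
# ~14.1 GB -- over a 12 GB card at FP16, which is why lower precision helps.
```

The example shows why sequence length and batch size matter: the weight term is fixed, but the KV cache and activation terms scale linearly with both, so long contexts or large batches can dominate VRAM on smaller GPUs.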