Every calculation request requires an API token, which you create in settings. Send it as a Bearer credential in the Authorization request header.
Authorization: Bearer YOUR_API_KEY_HERE

POST /api/v1/vram-calculation/inference

Computes the required GPU memory allocation breakdown and performance metrics for LLM inference workloads.
| Field | Type | Description & Enum Choices |
|---|---|---|
| gpu_key * | string | GPU profile key, such as rtx_4090 or h100_80. See GET /gpus for a full list. |
| llm_key * | string | Model variant slug, such as llama-3-8b-instruct or custom. See GET /llms for a full list. |
| quantization | string | Weight quantization target. Choices: fp32, fp16, fp8, int8, int4, q8, q4_k_m, etc. Default is fp16. |
| kv_cache_quantization | string | KV cache precision. Choices: fp16, fp8, int8, int4. Default is fp16. |
| custom_vram | number | Custom GPU memory capacity in GB. Required if gpu_key is custom_discrete or custom_apple_silicon. |
| num_gpus | integer | Number of GPUs the workload is distributed across. Default is 1. |
| batch_size | integer | Batch size. Default is 1. |
| seq_length | integer | Context sequence length in tokens. Default is 2048. |
| concurrent_users | integer | Number of concurrent users. Default is 1. |
| enable_prefix_caching | boolean | Enable KV cache prefix caching. Default is false. |
| shared_prefix_ratio | number | Fraction of the prompt sequence that is a shared prefix. Range: 0.0 to 1.0. Default is 0.0. |
| enable_continuous_batching | boolean | Enable continuous batching. Default is false. |
| inference_parallelism | string | Parallelism strategy. Choices: pipeline, tensor. Default is pipeline. |
| interconnect_type | string | Interconnect type for multi-GPU setups, e.g. pcie4 or nvlink_gen4. See code for full list. |
| enable_offloading | boolean | Enable offloading of model layers to the offload target. Default is false. |
| offload_target | string | Storage target for offloaded layers. Choices: cpu_ram, nvme. |
| num_offload_layers | integer | Number of model layers to offload. |
| percentage_offload | number | Percentage of layers to offload, from 0 to 100. |
| offload_kv_cache | boolean | Offload the KV cache to the storage target. Default is false. |
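
The constraints listed in the table can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `build_inference_payload` is a hypothetical helper name, and the checks only mirror rules stated in the table (the two required keys, `custom_vram` for the custom GPU profiles, and the documented numeric ranges). The server may enforce further rules that this sketch does not know about.

```python
# GPU profile keys that the table documents as requiring custom_vram.
CUSTOM_GPU_KEYS = {"custom_discrete", "custom_apple_silicon"}


def build_inference_payload(gpu_key, llm_key, **options):
    """Hypothetical helper: assemble a request body and check documented constraints."""
    if not gpu_key or not llm_key:
        raise ValueError("gpu_key and llm_key are required")

    # custom_vram is documented as required for the custom GPU profiles.
    if gpu_key in CUSTOM_GPU_KEYS and "custom_vram" not in options:
        raise ValueError(f"custom_vram is required when gpu_key is {gpu_key!r}")

    # shared_prefix_ratio is documented as a fraction between 0.0 and 1.0.
    ratio = options.get("shared_prefix_ratio", 0.0)
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("shared_prefix_ratio must be between 0.0 and 1.0")

    # percentage_offload is documented as a percentage between 0 and 100.
    percentage = options.get("percentage_offload")
    if percentage is not None and not 0 <= percentage <= 100:
        raise ValueError("percentage_offload must be between 0 and 100")

    # Omitted optional fields fall back to the server-side defaults.
    return {"gpu_key": gpu_key, "llm_key": llm_key, **options}


# Example: a two-GPU tensor-parallel setup on a custom 48 GB card.
payload = build_inference_payload(
    "custom_discrete",
    "llama-3-8b-instruct",
    custom_vram=48,
    quantization="int4",
    num_gpus=2,
    inference_parallelism="tensor",
)
```

The resulting dictionary can be passed as the `json` argument of `requests.post`, in the same way as the payload in the Python script for this endpoint.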
Shell Request
curl -X POST https://apxml.com/api/v1/vram-calculation/inference \
  -H "Authorization: Bearer YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "quantization": "q4_k_m",
    "batch_size": 1,
    "seq_length": 2048
  }'

Python Script
import requests

url = "https://apxml.com/api/v1/vram-calculation/inference"
headers = {
    "Authorization": "Bearer YOUR_API_KEY_HERE",
    "Content-Type": "application/json"
}
payload = {
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "quantization": "q4_k_m",
    "batch_size": 1,
    "seq_length": 2048
}
# Send the calculation request and print the parsed JSON result.
response = requests.post(url, headers=headers, json=payload)
print(response.json())

POST /api/v1/vram-calculation/finetuning

Computes the required GPU memory allocation breakdown, optimization impacts, and dataset training speeds for LLM fine-tuning and training workloads.
| Field | Type | Description & Enum Choices |
|---|---|---|
| gpu_key * | string | GPU profile key, such as rtx_4090 or h100_80. See GET /gpus for a full list. |
| llm_key * | string | Model variant slug, such as llama-3-8b-instruct or custom. See GET /llms for a full list. |
| finetuning_method | string | Fine-tuning method. Choices: full, lora, qlora. Default is lora. |
| fine_tuning_quantization | string | Base model precision for full fine-tuning. Used when the method is full. Choices: fp32, fp16, fp8. Default is fp16. |
| lora_rank | integer | LoRA rank. Used when the method is lora or qlora. Default is 16. |
| gradient_accumulation_steps | integer | Gradient accumulation steps. Default is 1. |
| batch_size | integer | Global batch size. Default is 1. |
| seq_length | integer | Context sequence length in tokens. Default is 2048. |
| num_samples | integer | Dataset size in total samples. |
| tokens_per_sample | integer | Average tokens per sample. Defaults to seq_length. |
| num_epochs | integer | Number of training epochs. Default is 3. |
| optimization_config | object | Advanced optimization flags. Contains boolean flags such as flash_attention, gradient_checkpointing, use_8bit_optimizer, use_paged_optimizer, use_fused_kernels, activation_offloading, etc., plus the integer zero_stage. |
| custom_vram | number | Custom GPU memory capacity in GB. Required if gpu_key is custom_discrete or custom_apple_silicon. |
| num_gpus | integer | Number of GPUs the workload is distributed across. Default is 1. |
| enable_offloading | boolean | Enable offloading to the host system. Default is false. |
| offload_target | string | Storage target for offloaded layers. Choices: cpu_ram, nvme. |
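
Several fine-tuning fields only apply to one method, and `optimization_config` is a nested object, so it may help to see a fuller request body than the basic one. The sketch below is illustrative only: `build_finetuning_payload` is a hypothetical helper, dropping fields that do not apply to the chosen method is a design choice of this sketch rather than documented API behavior, and the `zero_stage` and dataset values are placeholders.

```python
import json

FINETUNING_METHODS = {"full", "lora", "qlora"}


def build_finetuning_payload(gpu_key, llm_key, method="lora", **options):
    """Hypothetical helper: assemble a fine-tuning request body."""
    if method not in FINETUNING_METHODS:
        raise ValueError(f"finetuning_method must be one of {sorted(FINETUNING_METHODS)}")

    payload = {"gpu_key": gpu_key, "llm_key": llm_key, "finetuning_method": method}

    # Assumption of this sketch: send only the fields the table says are used
    # for the chosen method (lora_rank for lora/qlora, precision for full).
    if method == "full":
        options.pop("lora_rank", None)
    else:
        options.pop("fine_tuning_quantization", None)

    payload.update(options)
    return payload


payload = build_finetuning_payload(
    "h100_80",
    "llama-3-8b-instruct",
    method="qlora",
    lora_rank=16,
    fine_tuning_quantization="fp16",  # dropped below because the method is not "full"
    batch_size=4,
    num_samples=5000,                 # placeholder dataset size
    optimization_config={
        "flash_attention": True,
        "gradient_checkpointing": True,
        "zero_stage": 2,              # placeholder value
    },
)

print(json.dumps(payload, indent=2))
```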
Shell Request
curl -X POST https://apxml.com/api/v1/vram-calculation/finetuning \
  -H "Authorization: Bearer YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "finetuning_method": "lora",
    "lora_rank": 16,
    "batch_size": 4,
    "seq_length": 2048
  }'

Python Script
import requests

url = "https://apxml.com/api/v1/vram-calculation/finetuning"
headers = {
    "Authorization": "Bearer YOUR_API_KEY_HERE",
    "Content-Type": "application/json"
}
payload = {
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "finetuning_method": "lora",
    "lora_rank": 16,
    "batch_size": 4,
    "seq_length": 2048
}
# Send the calculation request and print the parsed JSON result.
response = requests.post(url, headers=headers, json=payload)
print(response.json())

Inference Output
{
  "vram_usage": 6.84,
  "vram_percentage": 28.5,
  "actual_vram_percentage": 28.5,
  "memory_status": "okay",
  "static_shared_memory": 5.43,
  "per_user_memory": 0.52,
  "estimated_latency_tps": 45.2,
  "estimated_throughput_tps": 45.2,
  "per_user_tps": 45.2,
  "ms_per_token": 22.1,
  "tftt": 120,
  "estimated_system_ram_required": 16.0,
  "offloaded_memory": 0.0,
  "estimated_power_draw": 320,
  "memory_details": {
    "Base Model Weights": 5.43,
    "KV Cache": 0.52,
    "Activations": 0.35,
    "Framework Overhead": 0.54
  },
  "memory_breakdown": [
    {
      "label": "Base Model Weights",
      "value": 79.4,
      "size_gb": 5.43
    },
    {
      "label": "KV Cache",
      "value": 7.6,
      "size_gb": 0.52
    }
  ]
}

Fine-tuning Output
{
  "vram_usage": 32.4,
  "vram_percentage": 40.5,
  "actual_vram_percentage": 40.5,
  "memory_status": "okay",
  "static_shared_memory": 24.5,
  "training_tps": 12.0,
  "samples_per_second": 1.5,
  "steps_per_second": 0.18,
  "total_tokens": 10000000.0,
  "total_training_time_hours": 4.5,
  "estimated_system_ram_required": 32.0,
  "offloaded_memory": 0.0,
  "estimated_power_draw": 650,
  "active_optimizations": [
    "gradient_checkpointing",
    "flash_attention"
  ],
  "optimization_preset": "custom",
  "memory_details": {
    "Base Model Weights": 24.5,
    "Optimizer States": 29.4,
    "Gradients": 10.8
  },
  "memory_breakdown": [
    {
      "label": "Base Model Weights",
      "value": 38.2,
      "size_gb": 24.5
    },
    {
      "label": "Optimizer States",
      "value": 45.1,
      "size_gb": 29.4
    },
    {
      "label": "Gradients",
      "value": 16.7,
      "size_gb": 10.8
    }
  ]
}

The memory_status field returns one of the following lowercase enum keys:
| Enum Value | Description |
|---|---|
| okay | Fits with a safe margin: usage is 50% or less of VRAM |
| moderate | Fits with moderate usage: 75% or less of VRAM |
| high | Tight fit: usage is 90% or less of VRAM |
| very_high | Very close to capacity: usage exceeds 90% of VRAM |
| insufficient | Exceeds available VRAM capacity |
| error | GPU configuration or sizing calculation error |
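
The thresholds in this table can be written as a small lookup, which is useful when reasoning about how a given `vram_percentage` relates to the returned status. This is a sketch under stated assumptions: `classify_memory_status` is a hypothetical client-side function, it treats anything above 100% as `insufficient`, and it does not model the `error` status. The API computes `memory_status` itself, and its exact boundary handling may differ.

```python
def classify_memory_status(vram_percentage):
    """Hypothetical mirror of the documented thresholds (percent of VRAM used)."""
    if vram_percentage > 100:
        # Assumption: "exceeds available VRAM" means more than 100% usage.
        return "insufficient"
    if vram_percentage <= 50:
        return "okay"
    if vram_percentage <= 75:
        return "moderate"
    if vram_percentage <= 90:
        return "high"
    return "very_high"


def fits_in_vram(result):
    """Return True when a response dict reports a configuration that fits."""
    return result.get("memory_status") in {"okay", "moderate", "high", "very_high"}


# The sample inference output reports 28.5% usage with status "okay".
print(classify_memory_status(28.5))
print(fits_in_vram({"memory_status": "insufficient"}))
```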
GET /api/v1/vram-calculation/gpus

Retrieves the valid gpu_key values along with their hardware model labels and VRAM capacities.
curl https://apxml.com/api/v1/vram-calculation/gpus \
  -H "Authorization: Bearer YOUR_API_KEY_HERE"

[
  {
    "key": "rtx_4090",
    "label": "NVIDIA GeForce RTX 4090",
    "memory": 24
  },
  {
    "key": "h100_80",
    "label": "NVIDIA H100 80GB",
    "memory": 80
  }
]

GET /api/v1/vram-calculation/llms

Retrieves the supported llm_key values, each mapped to its model variant configuration.
curl https://apxml.com/api/v1/vram-calculation/llms \
  -H "Authorization: Bearer YOUR_API_KEY_HERE"

[
  {
    "key": "llama-3-8b-instruct",
    "name": "Llama 3 8B Instruct"
  },
  {
    "key": "llama-3.1-70b-instruct",
    "name": "Llama 3.1 70B Instruct"
  }
]
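
Both listing endpoints return a JSON array of objects with a `key` field, so a client can index them by key and confirm that a `gpu_key` or `llm_key` exists before requesting a calculation. The following is a minimal sketch: `fetch_catalog` and `index_by_key` are hypothetical helper names, the 30-second timeout is an arbitrary choice, and the sample list simply repeats the GPU entries from the example response.

```python
BASE_URL = "https://apxml.com/api/v1/vram-calculation"


def fetch_catalog(name, api_key):
    """Fetch the 'gpus' or 'llms' listing and return the parsed JSON array."""
    import requests  # imported here so the rest of the sketch runs without it

    response = requests.get(
        f"{BASE_URL}/{name}",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP errors such as a rejected token
    return response.json()


def index_by_key(items):
    """Map each catalog entry to its 'key' for quick membership checks."""
    return {item["key"]: item for item in items}


# Offline demonstration using the GPU entries from the example response.
sample_gpus = [
    {"key": "rtx_4090", "label": "NVIDIA GeForce RTX 4090", "memory": 24},
    {"key": "h100_80", "label": "NVIDIA H100 80GB", "memory": 80},
]
gpus = index_by_key(sample_gpus)
print("h100_80" in gpus, gpus["h100_80"]["memory"])
```

Against the live API, `index_by_key(fetch_catalog("gpus", api_key))` would produce the same kind of mapping, assuming the response has the shape shown in the examples.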