REST API DOCUMENTATION

VRAM Calculator API

Integrate programmatic checks of LLM hardware requirements into your dashboard, CI/CD workflows, or custom deployment scripts.

API Reference

Authentication

Calculations require an API token, which you create in settings. Send it as a Bearer credential in the Authorization request header:

Authorization: Bearer YOUR_API_KEY_HERE

POST /api/v1/vram-calculation/inference (Pro)

Computes the required GPU memory breakdown and performance metrics for LLM inference workloads.

Request JSON Body Parameters (Inference)

| Field | Type | Description & Enum Choices |
|---|---|---|
| gpu_key * | string | GPU profile key, such as rtx_4090 or h100_80. See GET /gpus for a full list. |
| llm_key * | string | Model variant slug, such as llama-3-8b-instruct or custom. See GET /llms for a full list. |
| quantization | string | Weight quantization format. Choices include fp32, fp16, fp8, int8, int4, q8, and q4_k_m. Default is fp16. |
| kv_cache_quantization | string | KV cache precision. Choices: fp16, fp8, int8, int4. Default is fp16. |
| custom_vram | number | Custom GPU memory capacity in GB. Required if gpu_key is custom_discrete or custom_apple_silicon. |
| num_gpus | integer | Number of GPUs the workload is distributed across. Default is 1. |
| batch_size | integer | Batch size. Default is 1. |
| seq_length | integer | Context sequence length in tokens. Default is 2048. |
| concurrent_users | integer | Number of concurrent users. Default is 1. |
| enable_prefix_caching | boolean | Enable KV cache prefix caching. Default is false. |
| shared_prefix_ratio | number | Fraction of the prompt sequence that is a shared prefix. Range: 0.0 to 1.0. Default is 0.0. |
| enable_continuous_batching | boolean | Enable continuous batching. Default is false. |
| inference_parallelism | string | Parallelism strategy. Choices: pipeline, tensor. Default is pipeline. |
| interconnect_type | string | Interconnect type for multi-GPU setups, e.g. pcie4 or nvlink_gen4. See code for full list. |
| enable_offloading | boolean | Enable offloading of model layers from GPU memory. Default is false. |
| offload_target | string | Storage target for offloaded layers. Choices: cpu_ram, nvme. |
| num_offload_layers | integer | Number of model layers to offload. |
| percentage_offload | number | Percentage of layers to offload, from 0 to 100. |
| offload_kv_cache | boolean | Offload the KV cache to the storage target. Default is false. |
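
The conditional rules in the table above (custom_vram for the custom GPU keys, the 0.0 to 1.0 range for shared_prefix_ratio, the 0 to 100 range for percentage_offload) can be checked client-side before a request is sent. The following is a minimal sketch, not part of the API: the function name validate_inference_payload is hypothetical, the asterisk is assumed to mark required fields, and the server may enforce additional or different rules.

```python
CUSTOM_GPU_KEYS = {"custom_discrete", "custom_apple_silicon"}


def validate_inference_payload(payload: dict) -> list:
    """Return a list of problems found in an inference request body.

    Hypothetical client-side helper; it only mirrors rules stated in the
    parameter table and is not the server's validation logic.
    """
    problems = []

    # gpu_key and llm_key carry an asterisk, assumed here to mean "required".
    for field in ("gpu_key", "llm_key"):
        if not payload.get(field):
            problems.append(f"{field} is required")

    # custom_vram is required for the two custom GPU profiles.
    if payload.get("gpu_key") in CUSTOM_GPU_KEYS and "custom_vram" not in payload:
        problems.append("custom_vram is required for custom GPU keys")

    ratio = payload.get("shared_prefix_ratio", 0.0)
    if not 0.0 <= ratio <= 1.0:
        problems.append("shared_prefix_ratio must be between 0.0 and 1.0")

    pct = payload.get("percentage_offload")
    if pct is not None and not 0 <= pct <= 100:
        problems.append("percentage_offload must be between 0 and 100")

    return problems


# A custom GPU profile without custom_vram is flagged.
print(validate_inference_payload({"gpu_key": "custom_discrete", "llm_key": "custom"}))
```

Returning a list of messages rather than raising on the first problem lets a CI job report every issue in one run.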

cURL Example (Inference)

curl -X POST https://apxml.com/api/v1/vram-calculation/inference \
  -H "Authorization: Bearer YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "quantization": "q4_k_m",
    "batch_size": 1,
    "seq_length": 2048
  }'

Python Example (Inference)

import requests

url = "https://apxml.com/api/v1/vram-calculation/inference"
headers = {
    "Authorization": "Bearer YOUR_API_KEY_HERE",
    "Content-Type": "application/json"
}
payload = {
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "quantization": "q4_k_m",
    "batch_size": 1,
    "seq_length": 2048
}

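# Send the request and fail loudly on HTTP errors before parsing the body.
response = requests.post(url, headers=headers, json=payload, timeout=30)
response.raise_for_status()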
print(response.json())

POST /api/v1/vram-calculation/finetuning (Pro)

Computes the required GPU memory breakdown, the impact of optimizations, and estimated training speed on the dataset for LLM fine-tuning and training workloads.

Request JSON Body Parameters (Fine-tuning)

| Field | Type | Description & Enum Choices |
|---|---|---|
| gpu_key * | string | GPU profile key, such as rtx_4090 or h100_80. See GET /gpus for a full list. |
| llm_key * | string | Model variant slug, such as llama-3-8b-instruct or custom. See GET /llms for a full list. |
| finetuning_method | string | Fine-tuning method. Choices: full, lora, qlora. Default is lora. |
| fine_tuning_quantization | string | Base model precision for full fine-tuning. Used when finetuning_method is full. Choices: fp32, fp16, fp8. Default is fp16. |
| lora_rank | integer | LoRA rank. Used when finetuning_method is lora or qlora. Default is 16. |
| gradient_accumulation_steps | integer | Gradient accumulation steps. Default is 1. |
| batch_size | integer | Global batch size. Default is 1. |
| seq_length | integer | Context sequence length in tokens. Default is 2048. |
| num_samples | integer | Dataset size in total samples. |
| tokens_per_sample | integer | Average tokens per sample. Defaults to seq_length. |
| num_epochs | integer | Number of training epochs. Default is 3. |
| optimization_config | object | Object of advanced optimization flags. Contains boolean flags such as flash_attention, gradient_checkpointing, use_8bit_optimizer, use_paged_optimizer, use_fused_kernels, and activation_offloading, plus an integer zero_stage. |
| custom_vram | number | Custom GPU memory capacity in GB. Required if gpu_key is custom_discrete or custom_apple_silicon. |
| num_gpus | integer | Number of GPUs the workload is distributed across. Default is 1. |
| enable_offloading | boolean | Enable offloading to the host system. Default is false. |
| offload_target | string | Storage target for offloaded layers. Choices: cpu_ram, nvme. |
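
The optimization_config object is described only in prose above, so it may help to see one assembled. The sketch below builds a QLoRA request body with a nested optimization_config; the helper name build_qlora_payload is hypothetical, the field names come from the table, and the chosen values (including zero_stage) are illustrative assumptions rather than recommended or verified settings.

```python
import json


def build_qlora_payload(gpu_key: str, llm_key: str, lora_rank: int = 16) -> dict:
    """Assemble a fine-tuning request body with an optimization_config object.

    Hypothetical helper: field names come from the parameter table,
    the values are illustrative only.
    """
    return {
        "gpu_key": gpu_key,
        "llm_key": llm_key,
        "finetuning_method": "qlora",
        "lora_rank": lora_rank,
        "batch_size": 4,
        "gradient_accumulation_steps": 8,
        "seq_length": 2048,
        "optimization_config": {
            "flash_attention": True,
            "gradient_checkpointing": True,
            "use_paged_optimizer": True,
            # Assumed value; the accepted range of zero_stage is not documented here.
            "zero_stage": 0,
        },
    }


payload = build_qlora_payload("rtx_4090", "llama-3-8b-instruct")
# Python booleans serialize to JSON true/false, as the API's boolean flags expect.
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as the json argument of requests.post in the same way as the payloads in the examples below.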

cURL Example (Fine-tuning)

curl -X POST https://apxml.com/api/v1/vram-calculation/finetuning \
  -H "Authorization: Bearer YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "finetuning_method": "lora",
    "lora_rank": 16,
    "batch_size": 4,
    "seq_length": 2048
  }'

Python Example (Fine-tuning)

import requests

url = "https://apxml.com/api/v1/vram-calculation/finetuning"
headers = {
    "Authorization": "Bearer YOUR_API_KEY_HERE",
    "Content-Type": "application/json"
}
payload = {
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "finetuning_method": "lora",
    "lora_rank": 16,
    "batch_size": 4,
    "seq_length": 2048
}

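# Send the request and fail loudly on HTTP errors before parsing the body.
response = requests.post(url, headers=headers, json=payload, timeout=30)
response.raise_for_status()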
print(response.json())

Response Body (Inference Example)

{
  "vram_usage": 6.84,
  "vram_percentage": 28.5,
  "actual_vram_percentage": 28.5,
  "memory_status": "okay",
  "static_shared_memory": 5.43,
  "per_user_memory": 0.52,
  "estimated_latency_tps": 45.2,
  "estimated_throughput_tps": 45.2,
  "per_user_tps": 45.2,
  "ms_per_token": 22.1,
  "tftt": 120,
  "estimated_system_ram_required": 16.0,
  "offloaded_memory": 0.0,
  "estimated_power_draw": 320,
  "memory_details": {
    "Base Model Weights": 5.43,
    "KV Cache": 0.52,
    "Activations": 0.35,
    "Framework Overhead": 0.54
  },
  "memory_breakdown": [
    {
      "label": "Base Model Weights",
      "value": 79.4,
      "size_gb": 5.43
    },
    {
      "label": "KV Cache",
      "value": 7.6,
      "size_gb": 0.52
    }
  ]
}
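
To make the relationship between the top-level fields concrete, here is a small sketch that derives the implied GPU capacity and the remaining free VRAM from a response. It assumes that vram_usage is in GB and that vram_percentage is usage divided by total capacity; both readings match the sample above but are not stated in this reference. The function name summarize_vram is hypothetical.

```python
# Abridged copy of the sample inference response shown above.
sample = {
    "vram_usage": 6.84,
    "vram_percentage": 28.5,
    "memory_details": {
        "Base Model Weights": 5.43,
        "KV Cache": 0.52,
        "Activations": 0.35,
        "Framework Overhead": 0.54,
    },
}


def summarize_vram(response: dict) -> dict:
    """Derive capacity and free VRAM from a calculation response.

    Assumes vram_percentage = vram_usage / capacity * 100 (an inference
    from the sample data, not a documented guarantee).
    """
    used = response["vram_usage"]
    capacity = used / (response["vram_percentage"] / 100.0)
    return {
        "details_total_gb": sum(response["memory_details"].values()),
        "capacity_gb": capacity,
        "free_gb": capacity - used,
    }


summary = summarize_vram(sample)
print({name: round(value, 2) for name, value in summary.items()})
```

For the sample, the memory_details entries add up to the reported vram_usage, and the implied capacity is about 24 GB, which is consistent with the rtx_4090 profile used in the request.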

Response Body (Fine-tuning Example)

{
  "vram_usage": 32.4,
  "vram_percentage": 40.5,
  "actual_vram_percentage": 40.5,
  "memory_status": "okay",
  "static_shared_memory": 24.5,
  "training_tps": 12.0,
  "samples_per_second": 1.5,
  "steps_per_second": 0.18,
  "total_tokens": 10000000.0,
  "total_training_time_hours": 4.5,
  "estimated_system_ram_required": 32.0,
  "offloaded_memory": 0.0,
  "estimated_power_draw": 650,
  "active_optimizations": [
    "gradient_checkpointing",
    "flash_attention"
  ],
  "optimization_preset": "custom",
  "memory_details": {
    "Base Model Weights": 24.5,
    "Optimizer States": 29.4,
    "Gradients": 10.8
  },
  "memory_breakdown": [
    {
      "label": "Base Model Weights",
      "value": 38.2,
      "size_gb": 24.5
    },
    {
      "label": "Optimizer States",
      "value": 45.1,
      "size_gb": 29.4
    },
    {
      "label": "Gradients",
      "value": 16.7,
      "size_gb": 10.8
    }
  ]
}
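
The memory_breakdown array is convenient for reporting which components dominate memory use. The sketch below ranks entries by size_gb; the function name largest_components is hypothetical, and the reading of value as a percentage share is an assumption based on the sample data.

```python
# Copy of the memory_breakdown array from the sample fine-tuning response above.
memory_breakdown = [
    {"label": "Base Model Weights", "value": 38.2, "size_gb": 24.5},
    {"label": "Optimizer States", "value": 45.1, "size_gb": 29.4},
    {"label": "Gradients", "value": 16.7, "size_gb": 10.8},
]


def largest_components(breakdown: list, n: int = 2) -> list:
    """Return the labels of the n largest entries, ordered by size_gb descending."""
    ranked = sorted(breakdown, key=lambda item: item["size_gb"], reverse=True)
    return [item["label"] for item in ranked[:n]]


print(largest_components(memory_breakdown))
# ['Optimizer States', 'Base Model Weights']
```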

Memory Status Enum Options

The memory_status field returns one of the following lowercase enum keys:

| Enum Value | Description |
|---|---|
| okay | Fits with a safe margin: uses 50% or less of VRAM |
| moderate | Fits with moderate usage: uses 75% or less of VRAM |
| high | Tight fit: uses 90% or less of VRAM |
| very_high | Very close to capacity: uses more than 90% of VRAM |
| insufficient | Exceeds available VRAM capacity |
| error | GPU configuration or sizing calculation error |
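
In a CI/CD workflow, memory_status can be turned into a pass/fail signal. The following sketch shows one way to do that; the function name gate and the choice of which statuses count as acceptable are assumptions made for this example, not a policy defined by the API.

```python
# Statuses treated as acceptable for deployment. This policy is an assumption
# made for the example; adjust it to your own risk tolerance.
ACCEPTABLE = {"okay", "moderate", "high"}
KNOWN = ACCEPTABLE | {"very_high", "insufficient", "error"}


def gate(memory_status: str) -> int:
    """Map a memory_status value to a process exit code (0 = pass, 1 = fail)."""
    if memory_status not in KNOWN:
        # Unknown values fail closed rather than silently passing.
        return 1
    return 0 if memory_status in ACCEPTABLE else 1


for status in ("okay", "very_high", "insufficient"):
    print(status, gate(status))
```

A pipeline step could pass the returned code to sys.exit so that a configuration that does not fit stops the deployment.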

Hardware & Model Listing Routes

GET /api/v1/vram-calculation/gpus (Pro)

Retrieves the valid gpu_key values together with their hardware labels and VRAM capacities.

curl https://apxml.com/api/v1/vram-calculation/gpus \
  -H "Authorization: Bearer YOUR_API_KEY_HERE"

[
  {
    "key": "rtx_4090",
    "label": "NVIDIA GeForce RTX 4090",
    "memory": 24
  },
  {
    "key": "h100_80",
    "label": "NVIDIA H100 80GB",
    "memory": 80
  }
]

GET /api/v1/vram-calculation/llms (Pro)

Retrieves the supported llm_key values together with the display name of each model variant.

curl https://apxml.com/api/v1/vram-calculation/llms \
  -H "Authorization: Bearer YOUR_API_KEY_HERE"

[
  {
    "key": "llama-3-8b-instruct",
    "name": "Llama 3 8B Instruct"
  },
  {
    "key": "llama-3.1-70b-instruct",
    "name": "Llama 3.1 70B Instruct"
  }
]
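
The listing responses can be cached and used to narrow down hardware before calling the calculation endpoints. As a rough sketch, the function below picks the smallest listed GPU whose capacity covers a given memory requirement. The name smallest_gpu_for is hypothetical, and the code assumes the memory field is the capacity in GB; it is only a pre-filter, since the calculation endpoints remain the source of truth for whether a workload fits.

```python
from typing import Optional

# Sample entries copied from the GET /gpus response above.
gpus = [
    {"key": "rtx_4090", "label": "NVIDIA GeForce RTX 4090", "memory": 24},
    {"key": "h100_80", "label": "NVIDIA H100 80GB", "memory": 80},
]


def smallest_gpu_for(required_gb: float, catalog: list) -> Optional[str]:
    """Return the key of the smallest GPU whose memory covers required_gb.

    Assumes the memory field is capacity in GB; returns None when no
    listed GPU is large enough.
    """
    candidates = [gpu for gpu in catalog if gpu["memory"] >= required_gb]
    if not candidates:
        return None
    return min(candidates, key=lambda gpu: gpu["memory"])["key"]


print(smallest_gpu_for(32.4, gpus))
# h100_80
```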