REST API DOCUMENTATION

VRAM Calculator API

Integrate programmatic LLM model requirements checks into your Dashboard, CI/CD workflows, or custom deployment scripts.

Get API Key Explore Endpoints

API Reference

Authentication

Calculations require an API token created in settings. Include it as a Bearer credential in the request header.

Authorization: Bearer YOUR_API_KEY_HERE

Never expose your API key in client-side applications. Use backend proxies when calling from browsers.

Generate Token

POST

/api/v1/vram-calculation/inference

Pro

Computes required GPU memory allocation breakdown and performance metrics specifically for LLM inference workloads.

Request JSON Body Parameters (Inference)

Field	Type	Description & Enum Choices
`gpu_key` *	string	GPU profile key, such as `rtx_4090` or `h100_80`. See GET /gpus for a full list.
`llm_key` *	string	Model variant slug, such as `llama-3-8b-instruct` or `custom`. See GET /llms for a full list.
`quantization`	string	Weights quantization target. Choices: `fp32`, `fp16`, `fp8`, `int8`, `int4`, `q8`, `q4_k_m`, etc. Default is `fp16`.
`kv_cache_quantization`	string	KV cache sharding/precision. Choices: `fp16`, `fp8`, `int8`, `int4`. Default is `fp16`.
`custom_vram`	number	Custom GPU memory capacity in GB. Required if gpu_key is `custom_discrete` or `custom_apple_silicon`.
`custom_bandwidth`	number	Custom GPU memory bandwidth in GB/s. Required if gpu_key is `custom_discrete` or `custom_apple_silicon`.
`custom_tflops`	number	Custom GPU FP16 compute performance in TFLOPS. Required if gpu_key is `custom_discrete` or `custom_apple_silicon`.
`num_gpus`	integer	Number of GPUs distributed on a single node. Defaults to 1 when cluster parameters are not used; mutually exclusive with `num_nodes` and `gpus_per_node`.
`num_nodes`	integer	Number of physical server chassis/nodes in the cluster. Required when using cluster configuration; mutually exclusive with `num_gpus`.
`gpus_per_node`	integer	Number of GPUs populated per server node in the cluster. Required when using cluster configuration; mutually exclusive with `num_gpus`.
`batch_size`	integer	Batch size. Default is 1.
`seq_length`	integer	Context sequence length in tokens. Default is 2048.
`concurrent_users`	integer	Number of concurrent users. Default is 1.
`enable_prefix_caching`	boolean	Enable KV cache prefix caching. Default is `false`.
`shared_prefix_ratio`	number	Fraction of prompt sequence that is a shared prefix. Range: 0.0 to 1.0. Default is 0.0.
`enable_continuous_batching`	boolean	Enable continuous batching. Default is `false`.
`inference_parallelism`	string	Parallelism strategy. Choices: `pipeline`, `tensor`. Default is `pipeline`.
`interconnect_type`	string	Interconnect type for multi-GPU setups. E.g., `pcie4`, `nvlink_gen4`. See code for full list.
`enable_offloading`	boolean	Enable GPU offloading. Default is `false`.
`offload_target`	string	Storage target for offloaded layers. Choices: `cpu_ram`, `nvme`.
`num_offload_layers`	integer	Number of model layers to offload.
`percentage_offload`	number	Percentage of layers to offload, from 0 to 100.
`offload_kv_cache`	boolean	Offload KV Cache to storage target. Default is `false`.
`tp_degree`	integer	Tensor Parallelism (TP) degree/shards to partition weights across GPUs. Default is 1.
`pp_degree`	integer	Pipeline Parallelism (PP) degree/shards to distribute model layers sequentially across nodes/devices. Default is 1.
`inter_node_interconnect`	string	Network connection interconnect between different nodes in the cluster. E.g., `ethernet100g`, `ethernet400g`, `infiniband_hdr`.

cURL Example (Inference)

Shell Request

curl -X POST https://apxml.com/api/v1/vram-calculation/inference \
  -H "Authorization: Bearer YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "quantization": "q4_k_m",
    "batch_size": 1,
    "seq_length": 2048
  }'

Python Example (Inference)

Python Script

import requests

url = "https://apxml.com/api/v1/vram-calculation/inference"
headers = {
    "Authorization": "Bearer YOUR_API_KEY_HERE",
    "Content-Type": "application/json"
}
payload = {
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "quantization": "q4_k_m",
    "batch_size": 1,
    "seq_length": 2048
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())

POST

/api/v1/vram-calculation/finetuning

Pro

Computes required GPU memory allocation breakdown, optimization impacts, and dataset training speeds for LLM fine-tuning and training workloads.

Request JSON Body Parameters (Fine-tuning)

Field	Type	Description & Enum Choices
`gpu_key` *	string	GPU profile key, such as `rtx_4090` or `h100_80`. See GET /gpus for a full list.
`llm_key` *	string	Model variant slug, such as `llama-3-8b-instruct` or `custom`. See GET /llms for a full list.
`finetuning_method`	string	Finetuning method. Choices: `full`, `lora`, `qlora`. Default is `lora`.
`fine_tuning_quantization`	string	Base model precision for full finetuning. Used when method is `full`. Choices: `fp32`, `fp16`, `fp8`. Default is `fp16`.
`lora_rank`	integer	LoRA rank size. Used when method is `lora` or `qlora`. Default is 16.
`gradient_accumulation_steps`	integer	Gradient accumulation steps. Default is 1.
`batch_size`	integer	Global batch size. Default is 1.
`seq_length`	integer	Context sequence length in tokens. Default is 2048.
`num_samples`	integer	Dataset size in total samples.
`tokens_per_sample`	integer	Average tokens per sample. Defaults to seq_length.
`num_epochs`	integer	Number of training epochs. Default is 3.
`optimization_config`	object	Advanced optimization flags object. Contains boolean flags like `flash_attention`, `gradient_checkpointing`, `use_8bit_optimizer`, `use_paged_optimizer`, `use_fused_kernels`, `activation_offloading`, etc., and integer zero_stage.
`custom_vram`	number	Custom GPU memory capacity in GB. Required if gpu_key is `custom_discrete` or `custom_apple_silicon`.
`custom_bandwidth`	number	Custom GPU memory bandwidth in GB/s. Required if gpu_key is `custom_discrete` or `custom_apple_silicon`.
`custom_tflops`	number	Custom GPU FP16 compute performance in TFLOPS. Required if gpu_key is `custom_discrete` or `custom_apple_silicon`.
`num_gpus`	integer	Number of GPUs distributed on a single node. Defaults to 1 when cluster parameters are not used; mutually exclusive with `num_nodes` and `gpus_per_node`.
`num_nodes`	integer	Number of physical server chassis/nodes in the cluster. Required when using cluster configuration; mutually exclusive with `num_gpus`.
`gpus_per_node`	integer	Number of GPUs populated per server node in the cluster. Required when using cluster configuration; mutually exclusive with `num_gpus`.
`enable_offloading`	boolean	Enable offloading to host system. Default is `false`.
`offload_target`	string	Storage target for offloaded layers. Choices: `cpu_ram`, `nvme`.
`tp_degree`	integer	Tensor Parallelism (TP) degree/shards to partition weights across GPUs. Default is 1.
`pp_degree`	integer	Pipeline Parallelism (PP) degree/shards to distribute model layers sequentially across nodes/devices. Default is 1.
`inter_node_interconnect`	string	Network connection interconnect between different nodes in the cluster. E.g., `ethernet100g`, `ethernet400g`, `infiniband_hdr`.

cURL Example (Fine-tuning)

Shell Request

curl -X POST https://apxml.com/api/v1/vram-calculation/finetuning \
  -H "Authorization: Bearer YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "finetuning_method": "lora",
    "lora_rank": 16,
    "batch_size": 4,
    "seq_length": 2048
  }'

Python Example (Fine-tuning)

Python Script

import requests

url = "https://apxml.com/api/v1/vram-calculation/finetuning"
headers = {
    "Authorization": "Bearer YOUR_API_KEY_HERE",
    "Content-Type": "application/json"
}
payload = {
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "finetuning_method": "lora",
    "lora_rank": 16,
    "batch_size": 4,
    "seq_length": 2048
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())

Response Body (Inference Example)

Inference Output

{
  "vram_usage": 6.84,
  "vram_percentage": 28.5,
  "actual_vram_percentage": 28.5,
  "memory_status": "okay",
  "static_shared_memory": 5.43,
  "per_user_memory": 0.52,
  "estimated_latency_tps": 45.2,
  "estimated_throughput_tps": 45.2,
  "per_user_tps": 45.2,
  "ms_per_token": 22.1,
  "ttft": 120,
  "ttft_ms": 120,
  "estimated_system_ram_required": 16.0,
  "offloaded_memory": 0.0,
  "estimated_power_draw": 320,
  "memory_details": {
    "Base Model Weights": 5.43,
    "KV Cache": 0.52,
    "Activations": 0.35,
    "Framework Overhead": 0.54
  },
  "memory_breakdown": [
    {
      "label": "Base Model Weights",
      "value": 79.4,
      "size_gb": 5.43
    },
    {
      "label": "KV Cache",
      "value": 7.6,
      "size_gb": 0.52
    }
  ]
}

Response Body (Fine-tuning Example)

Fine-tuning Output

{
  "vram_usage": 32.4,
  "vram_percentage": 40.5,
  "actual_vram_percentage": 40.5,
  "memory_status": "okay",
  "static_shared_memory": 24.5,
  "training_tps": 12.0,
  "samples_per_second": 1.5,
  "steps_per_second": 0.18,
  "total_tokens": 10000000.0,
  "total_training_time_hours": 4.5,
  "estimated_system_ram_required": 32.0,
  "offloaded_memory": 0.0,
  "estimated_power_draw": 650,
  "active_optimizations": [
    "gradient_checkpointing",
    "flash_attention"
  ],
  "optimization_preset": "custom",
  "memory_details": {
    "Base Model Weights": 24.5,
    "Optimizer States": 29.4,
    "Gradients": 10.8
  },
  "memory_breakdown": [
    {
      "label": "Base Model Weights",
      "value": 38.2,
      "size_gb": 24.5
    },
    {
      "label": "Optimizer States",
      "value": 45.1,
      "size_gb": 29.4
    },
    {
      "label": "Gradients",
      "value": 16.7,
      "size_gb": 10.8
    }
  ]
}

Memory Status Enum Options

The memory_status field returns one of the following lowercase enum keys:

Enum Value	Description
`okay`	Fits with safe margin representing 50% or less of VRAM
`moderate`	Fits with moderate usage representing 75% or less of VRAM
`high`	Tight fit representing 90% or less of VRAM
`very_high`	Very close to capacity, exceeding 90% of VRAM
`insufficient`	Exceeds available VRAM capacity
`error`	GPU configuration or sizing calculation error

Hardware & Model Listing Routes

GET

/api/v1/vram-calculation/gpus

Pro

Retrieve valid gpu_key parameters, hardware models, and VRAM capacities.

curl https://apxml.com/api/v1/vram-calculation/gpus \
  -H "Authorization: Bearer YOUR_API_KEY_HERE"

JSON Response

[
  {
    "key": "rtx_4090",
    "label": "NVIDIA GeForce RTX 4090",
    "memory": 24
  },
  {
    "key": "h100_80",
    "label": "NVIDIA H100 80GB",
    "memory": 80
  }
]

GET

/api/v1/vram-calculation/llms

Pro

Retrieve supported llm_key parameters mapped to correct model variant configurations.

curl https://apxml.com/api/v1/vram-calculation/llms \
  -H "Authorization: Bearer YOUR_API_KEY_HERE"

JSON Response

[
  {
    "key": "llama-3-8b-instruct",
    "name": "Llama 3 8B Instruct"
  },
  {
    "key": "llama-3.1-70b-instruct",
    "name": "Llama 3.1 70B Instruct"
  }
]