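REST API Documentation

VRAM Calculator API

Integrate LLM requirements planning programmatically into your dashboards, CI/CD workflows, or custom deployment scripts.

API Reference

Authentication

Calculation requests must include an API token created in Settings. Pass it as a Bearer credential in the request header.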

Authorization: Bearer YOUR_API_KEY_HERE

Never expose your API key in client-side applications. When calling the API from a browser, route requests through a backend proxy.
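
One way to keep the key out of source code is to read it from the environment when building the headers. The sketch below is a minimal illustration, not part of the API itself; the environment variable name `APXML_API_KEY` and the helper `build_auth_headers` are hypothetical.

```python
import os


def build_auth_headers(env_var: str = "APXML_API_KEY") -> dict:
    """Build request headers from an API key stored in an environment variable.

    APXML_API_KEY is a hypothetical variable name chosen for this example.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early rather than sending an unauthenticated request
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed as the `headers` argument of `requests.post`.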

Generate Token

POST

/api/v1/vram-calculation/inference

Pro

Calculates the detailed GPU VRAM allocation and performance metrics required for an LLM inference workload.

Request JSON Body Parameters (Inference)

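Field	Type	Description and enum options
`gpu_key` *	string	GPU configuration key, e.g. `rtx_4090` or `h100_80`. See GET /gpus for the full list.
`llm_key` *	string	Model variant identifier, e.g. `llama-3-8b-instruct` or `custom`. See GET /llms for the full list.
`quantization`	string	Weight quantization target. Options: `fp32`, `fp16`, `fp8`, `int8`, `int4`, `q8`, `q4_k_m`, etc. Defaults to `fp16`.
`kv_cache_quantization`	string	KV cache quantization precision. Options: `fp16`, `fp8`, `int8`, `int4`. Defaults to `fp16`.
`custom_vram`	number	Custom GPU VRAM capacity (GB). Required if `gpu_key` is `custom_discrete` or `custom_apple_silicon`.
`custom_bandwidth`	number	Custom GPU memory bandwidth (GB/s). Required if `gpu_key` is `custom_discrete` or `custom_apple_silicon`.
`custom_tflops`	number	Custom GPU FP16 compute throughput (TFLOPS). Required if `gpu_key` is `custom_discrete` or `custom_apple_silicon`.
`num_gpus`	integer	Number of GPUs configured on a single node. Defaults to 1 when cluster parameters are not used; mutually exclusive with `num_nodes` and `gpus_per_node`.
`num_nodes`	integer	Number of physical server nodes/racks in the cluster. Required when using a cluster configuration; mutually exclusive with `num_gpus`.
`gpus_per_node`	integer	Number of GPUs configured on each server node in the cluster. Required when using a cluster configuration; mutually exclusive with `num_gpus`.
`batch_size`	integer	Batch size. Defaults to 1.
`seq_length`	integer	Context sequence length in tokens. Defaults to 2048.
`concurrent_users`	integer	Number of concurrent users. Defaults to 1.
`enable_prefix_caching`	boolean	Enable prefix caching for the KV cache. Defaults to `false`.
`shared_prefix_ratio`	number	Fraction of the prompt sequence taken up by the shared prefix. Range: 0.0 to 1.0. Defaults to 0.0.
`enable_continuous_batching`	boolean	Enable continuous batching. Defaults to `false`.
`inference_parallelism`	string	Parallelism strategy. Options: `pipeline`, `tensor`. Defaults to `pipeline`.
`interconnect_type`	string	Interconnect type for multi-GPU setups, e.g. `pcie4`, `nvlink_gen4`. See the code for the full list.
`enable_offloading`	boolean	Enable offloading from the GPU. Defaults to `false`.
`offload_target`	string	Storage target for offloaded layers. Options: `cpu_ram`, `nvme`.
`num_offload_layers`	integer	Number of model layers to offload.
`percentage_offload`	number	Percentage of layers to offload, from 0 to 100.
`offload_kv_cache`	boolean	Offload the KV cache to the storage target. Defaults to `false`.
`tp_degree`	integer	Tensor parallelism (TP) degree used to split weight shards across multiple GPUs. Defaults to 1.
`pp_degree`	integer	Pipeline parallelism (PP) degree used to distribute model layers sequentially across nodes/devices. Defaults to 1.
`inter_node_interconnect`	string	Network interconnect between nodes in the cluster, e.g. `ethernet100g`, `ethernet400g`, `infiniband_hdr`.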
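
Several of the rules in the table (mutually exclusive topology fields, conditionally required `custom_*` fields, the `shared_prefix_ratio` range) can be checked on the client before a request is sent. The following is a rough sketch of such a pre-flight check; `validate_inference_payload` is a hypothetical helper, and the server's own validation may be stricter or behave differently.

```python
CUSTOM_GPU_KEYS = ("custom_discrete", "custom_apple_silicon")
CUSTOM_FIELDS = ("custom_vram", "custom_bandwidth", "custom_tflops")


def validate_inference_payload(payload: dict) -> list[str]:
    """Return a list of problems found in an inference payload (empty if none)."""
    errors = []

    # num_gpus describes a single node; num_nodes + gpus_per_node describe a cluster
    cluster_fields = [f for f in ("num_nodes", "gpus_per_node") if f in payload]
    if "num_gpus" in payload and cluster_fields:
        errors.append("num_gpus is mutually exclusive with num_nodes and gpus_per_node")
    if len(cluster_fields) == 1:
        errors.append("num_nodes and gpus_per_node are both required for a cluster configuration")

    # Custom GPU keys need all three hardware figures
    if payload.get("gpu_key") in CUSTOM_GPU_KEYS:
        for field in CUSTOM_FIELDS:
            if field not in payload:
                errors.append(f"{field} is required when gpu_key is {payload['gpu_key']}")

    # shared_prefix_ratio is documented as a value between 0.0 and 1.0
    ratio = payload.get("shared_prefix_ratio", 0.0)
    if not 0.0 <= ratio <= 1.0:
        errors.append("shared_prefix_ratio must be between 0.0 and 1.0")

    return errors
```

An empty list means the payload passed these local checks only; it does not guarantee that the API will accept it.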

cURL Example (Inference)

Shell Request

curl -X POST https://apxml.com/api/v1/vram-calculation/inference \
  -H "Authorization: Bearer YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "quantization": "q4_k_m",
    "batch_size": 1,
    "seq_length": 2048
  }'

Python Example (Inference)

Python Script

import requests

url = "https://apxml.com/api/v1/vram-calculation/inference"
headers = {
    "Authorization": "Bearer YOUR_API_KEY_HERE",
    "Content-Type": "application/json"
}
payload = {
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "quantization": "q4_k_m",
    "batch_size": 1,
    "seq_length": 2048
}

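# Send the request with a timeout and raise on HTTP error status codes
response = requests.post(url, headers=headers, json=payload, timeout=30)
response.raise_for_status()
print(response.json())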

POST

/api/v1/vram-calculation/finetuning

Pro

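Calculates the detailed GPU VRAM allocation, optimization impact, and dataset training speed for LLM fine-tuning and training workloads.

Request JSON Body Parameters (Fine-tuning)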

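Field	Type	Description and enum options
`gpu_key` *	string	GPU configuration key, e.g. `rtx_4090` or `h100_80`. See GET /gpus for the full list.
`llm_key` *	string	Model variant identifier, e.g. `llama-3-8b-instruct` or `custom`. See GET /llms for the full list.
`finetuning_method`	string	Fine-tuning method. Options: `full`, `lora`, `qlora`. Defaults to `lora`.
`fine_tuning_quantization`	string	Base model precision for full fine-tuning. Used when the method is `full`. Options: `fp32`, `fp16`, `fp8`. Defaults to `fp16`.
`lora_rank`	integer	LoRA rank. Used when the method is `lora` or `qlora`. Defaults to 16.
`gradient_accumulation_steps`	integer	Number of gradient accumulation steps. Defaults to 1.
`batch_size`	integer	Global batch size. Defaults to 1.
`seq_length`	integer	Context sequence length in tokens. Defaults to 2048.
`num_samples`	integer	Total number of samples in the dataset.
`tokens_per_sample`	integer	Average number of tokens per sample. Defaults to `seq_length`.
`num_epochs`	integer	Number of training epochs. Defaults to 3.
`optimization_config`	object	Advanced optimization configuration. Contains boolean flags such as `flash_attention`, `gradient_checkpointing`, `use_8bit_optimizer`, `use_paged_optimizer`, `use_fused_kernels`, `activation_offloading`, etc., plus the integer `zero_stage`.
`custom_vram`	number	Custom GPU VRAM capacity (GB). Required if `gpu_key` is `custom_discrete` or `custom_apple_silicon`.
`custom_bandwidth`	number	Custom GPU memory bandwidth (GB/s). Required if `gpu_key` is `custom_discrete` or `custom_apple_silicon`.
`custom_tflops`	number	Custom GPU FP16 compute throughput (TFLOPS). Required if `gpu_key` is `custom_discrete` or `custom_apple_silicon`.
`num_gpus`	integer	Number of GPUs configured on a single node. Defaults to 1 when cluster parameters are not used; mutually exclusive with `num_nodes` and `gpus_per_node`.
`num_nodes`	integer	Number of physical server nodes/racks in the cluster. Required when using a cluster configuration; mutually exclusive with `num_gpus`.
`gpus_per_node`	integer	Number of GPUs configured on each server node in the cluster. Required when using a cluster configuration; mutually exclusive with `num_gpus`.
`enable_offloading`	boolean	Enable offloading to the host system. Defaults to `false`.
`offload_target`	string	Storage target for offloaded layers. Options: `cpu_ram`, `nvme`.
`tp_degree`	integer	Tensor parallelism (TP) degree used to split weight shards across multiple GPUs. Defaults to 1.
`pp_degree`	integer	Pipeline parallelism (PP) degree used to distribute model layers sequentially across nodes/devices. Defaults to 1.
`inter_node_interconnect`	string	Network interconnect between nodes in the cluster, e.g. `ethernet100g`, `ethernet400g`, `infiniband_hdr`.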
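
Because `lora_rank` applies only to `lora` and `qlora` while `fine_tuning_quantization` applies only to `full`, a small builder can keep payloads consistent with the chosen method. This is an illustrative sketch; `build_finetuning_payload` is a hypothetical helper, and it is an assumption that the API would otherwise ignore fields that do not apply.

```python
VALID_METHODS = ("full", "lora", "qlora")


def build_finetuning_payload(gpu_key, llm_key, method="lora", *,
                             lora_rank=16, base_precision="fp16", **extra):
    """Assemble a fine-tuning request body that only carries method-relevant fields."""
    if method not in VALID_METHODS:
        raise ValueError(f"finetuning_method must be one of {VALID_METHODS}")

    payload = {"gpu_key": gpu_key, "llm_key": llm_key, "finetuning_method": method}
    if method == "full":
        # Base model precision is only used for full fine-tuning
        payload["fine_tuning_quantization"] = base_precision
    else:
        # LoRA rank is only used for lora and qlora
        payload["lora_rank"] = lora_rank

    # Any other documented field (batch_size, optimization_config, ...) passes through
    payload.update(extra)
    return payload


example = build_finetuning_payload(
    "rtx_4090",
    "llama-3-8b-instruct",
    "qlora",
    batch_size=4,
    optimization_config={"flash_attention": True, "gradient_checkpointing": True},
)
```

The resulting `example` dictionary can be sent as the JSON body of a POST to the fine-tuning endpoint.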

cURL Example (Fine-tuning)

Shell Request

curl -X POST https://apxml.com/api/v1/vram-calculation/finetuning \
  -H "Authorization: Bearer YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "finetuning_method": "lora",
    "lora_rank": 16,
    "batch_size": 4,
    "seq_length": 2048
  }'

Python Example (Fine-tuning)

Python Script

import requests

url = "https://apxml.com/api/v1/vram-calculation/finetuning"
headers = {
    "Authorization": "Bearer YOUR_API_KEY_HERE",
    "Content-Type": "application/json"
}
payload = {
    "gpu_key": "rtx_4090",
    "llm_key": "llama-3-8b-instruct",
    "finetuning_method": "lora",
    "lora_rank": 16,
    "batch_size": 4,
    "seq_length": 2048
}

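# Send the request with a timeout and raise on HTTP error status codes
response = requests.post(url, headers=headers, json=payload, timeout=30)
response.raise_for_status()
print(response.json())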

Response Body (Inference Example)

Inference Output

{
  "vram_usage": 6.84,
  "vram_percentage": 28.5,
  "actual_vram_percentage": 28.5,
  "memory_status": "okay",
  "static_shared_memory": 5.43,
  "per_user_memory": 0.52,
  "estimated_latency_tps": 45.2,
  "estimated_throughput_tps": 45.2,
  "per_user_tps": 45.2,
  "ms_per_token": 22.1,
  "ttft": 120,
  "ttft_ms": 120,
  "estimated_system_ram_required": 16.0,
  "offloaded_memory": 0.0,
  "estimated_power_draw": 320,
  "memory_details": {
    "Base Model Weights": 5.43,
    "KV Cache": 0.52,
    "Activations": 0.35,
    "Framework Overhead": 0.54
  },
  "memory_breakdown": [
    {
      "label": "Base Model Weights",
      "value": 79.4,
      "size_gb": 5.43
    },
    {
      "label": "KV Cache",
      "value": 7.6,
      "size_gb": 0.52
    }
  ]
}
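
To show how the `memory_details` object in a response like the one above might be consumed, the sketch below finds the largest memory component and totals the components. The helper `summarize_memory` is hypothetical. In this sample the components happen to add up to `vram_usage`; treat that as a property of the sample rather than a documented guarantee.

```python
def summarize_memory(response: dict) -> dict:
    """Summarize the memory_details section of a calculation response."""
    details = response["memory_details"]
    largest = max(details, key=details.get)
    return {
        "largest_component": largest,
        "largest_size_gb": details[largest],
        # Rounded to two decimals to avoid floating-point noise in the sum
        "components_total_gb": round(sum(details.values()), 2),
    }


# Values copied from the sample inference response
sample = {
    "vram_usage": 6.84,
    "memory_details": {
        "Base Model Weights": 5.43,
        "KV Cache": 0.52,
        "Activations": 0.35,
        "Framework Overhead": 0.54,
    },
}
summary = summarize_memory(sample)
```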

Response Body (Fine-tuning Example)

Fine-tuning Output

{
  "vram_usage": 32.4,
  "vram_percentage": 40.5,
  "actual_vram_percentage": 40.5,
  "memory_status": "okay",
  "static_shared_memory": 24.5,
  "training_tps": 12.0,
  "samples_per_second": 1.5,
  "steps_per_second": 0.18,
  "total_tokens": 10000000.0,
  "total_training_time_hours": 4.5,
  "estimated_system_ram_required": 32.0,
  "offloaded_memory": 0.0,
  "estimated_power_draw": 650,
  "active_optimizations": [
    "gradient_checkpointing",
    "flash_attention"
  ],
  "optimization_preset": "custom",
  "memory_details": {
    "Base Model Weights": 24.5,
    "Optimizer States": 29.4,
    "Gradients": 10.8
  },
  "memory_breakdown": [
    {
      "label": "Base Model Weights",
      "value": 38.2,
      "size_gb": 24.5
    },
    {
      "label": "Optimizer States",
      "value": 45.1,
      "size_gb": 29.4
    },
    {
      "label": "Gradients",
      "value": 16.7,
      "size_gb": 10.8
    }
  ]
}

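Memory Status Enum Options

The memory_status field returns one of the following lowercase enum keys:

Enum value	Description
`okay`	Within the safe VRAM range: 50% of VRAM or less
`moderate`	Moderate VRAM usage: 75% of VRAM or less
`high`	High usage: 90% of VRAM or less
`very_high`	Very close to maximum VRAM capacity: above 90%
`insufficient`	Exceeds the available GPU VRAM capacity
`error`	GPU configuration or specification calculation error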
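
The thresholds in the table can be mirrored on the client, for example to color-code a dashboard from `vram_percentage`. This is a minimal sketch under two assumptions: that the boundaries are inclusive as the "or less" wording suggests, and that `insufficient` corresponds to more than 100%. The `error` status cannot be derived from a percentage and is not covered; the function name is hypothetical.

```python
def classify_memory_status(vram_percentage: float) -> str:
    """Map a VRAM usage percentage to the documented memory_status keys."""
    if vram_percentage > 100:
        # Assumption: exceeding capacity means more than 100% of VRAM
        return "insufficient"
    if vram_percentage > 90:
        return "very_high"
    if vram_percentage > 75:
        return "high"
    if vram_percentage > 50:
        return "moderate"
    return "okay"
```

When the API returns `memory_status` directly, prefer that value over a locally computed one.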

Hardware and Model List Routes

GET

/api/v1/vram-calculation/gpus

Pro

Retrieves the valid gpu_key values, hardware models, and VRAM sizes.

curl https://apxml.com/api/v1/vram-calculation/gpus \
  -H "Authorization: Bearer YOUR_API_KEY_HERE"

JSON Response

[
  {
    "key": "rtx_4090",
    "label": "NVIDIA GeForce RTX 4090",
    "memory": 24
  },
  {
    "key": "h100_80",
    "label": "NVIDIA H100 80GB",
    "memory": 80
  }
]
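
A list in this shape can be filtered locally to shortlist candidate GPUs before running calculations. The sketch below assumes the `memory` field is expressed in GB, which matches the sample values but is not stated explicitly; `gpus_with_at_least` is a hypothetical helper.

```python
def gpus_with_at_least(gpus: list[dict], min_memory_gb: float) -> list[str]:
    """Return the keys of GPUs whose memory meets the given minimum."""
    # Assumption: the "memory" field is in GB
    return [gpu["key"] for gpu in gpus if gpu["memory"] >= min_memory_gb]


# Entries copied from the sample response
sample_gpus = [
    {"key": "rtx_4090", "label": "NVIDIA GeForce RTX 4090", "memory": 24},
    {"key": "h100_80", "label": "NVIDIA H100 80GB", "memory": 80},
]
large_gpus = gpus_with_at_least(sample_gpus, 48)
```

Each returned key can then be used as the `gpu_key` of a calculation request.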

GET

/api/v1/vram-calculation/llms

Pro

Retrieves the supported llm_key values and their corresponding model variant configurations.

curl https://apxml.com/api/v1/vram-calculation/llms \
  -H "Authorization: Bearer YOUR_API_KEY_HERE"

JSON Response

[
  {
    "key": "llama-3-8b-instruct",
    "name": "Llama 3 8B Instruct"
  },
  {
    "key": "llama-3.1-70b-instruct",
    "name": "Llama 3.1 70B Instruct"
  }
]