Integrating Quantized Models into Production Pipelines
Machine Learning Engineering, Andriy Burkov, 2020 (True Positive Inc.) - Provides practical guidance on MLOps practices, including model deployment, monitoring, and lifecycle management for machine learning systems.
NVIDIA TensorRT-LLM Documentation, NVIDIA, 2024 (NVIDIA) - Official documentation for optimizing and deploying large language models on NVIDIA GPUs, covering advanced techniques such as quantization and efficient inference kernels.
vLLM Documentation, vLLM Project Contributors, 2024 - Official documentation for vLLM, a high-throughput and memory-efficient LLM inference and serving engine, detailing API design and deployment considerations.