llama.cpp GitHub Repository, Georgi Gerganov and llama.cpp contributors, 2024 - The official repository for llama.cpp, a C/C++ LLM inference engine optimized for CPU execution, supporting the GGUF model format and a range of quantization schemes for efficient on-device inference.
NVIDIA TensorRT Developer Guide, NVIDIA Corporation, 2024 - The official documentation for NVIDIA TensorRT, a software development kit for high-performance deep learning inference, supporting quantized models on NVIDIA GPUs.
TensorFlow Lite Developer Guide, Google, 2024 - The official developer guide for TensorFlow Lite, providing tools and methods for deploying machine learning models, including quantized LLMs, on mobile, edge, and embedded devices.
Deploy models for inference with Amazon SageMaker, Amazon Web Services, 2024 - The official AWS documentation on deploying models for inference with Amazon SageMaker, a managed service for hosting and managing ML models in the cloud.