Designing Machine Learning Systems, Chip Huyen, 2022 (O'Reilly Media) - A comprehensive guide covering the full lifecycle of machine learning systems, including architecture design for model serving, hardware selection, and optimization techniques.
NVIDIA Triton Inference Server Documentation, NVIDIA Corporation, 2023 - Official documentation for NVIDIA's open-source inference serving software, detailing how to deploy and optimize models on a range of hardware and describing GPU batching strategies.
AWS Inferentia and AWS Neuron SDK Documentation, Amazon Web Services, 2023 - Official resources explaining AWS Inferentia accelerators, their architecture, and how to use the AWS Neuron SDK for compiling and deploying models for cost-efficient inference at scale.
Designing and deploying a machine learning prediction service, Google Cloud, 2023 - An architectural guide from Google Cloud presenting considerations and best practices for building scalable, reliable machine learning prediction services, covering a range of deployment options and hardware choices.