Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications, Chip Huyen, 2022 (O'Reilly Media) - A comprehensive guide to building, deploying, and monitoring machine learning systems, including discussions of essential operational metrics and infrastructure considerations for production ML.
Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy, 2016 (O'Reilly Media) - Introduces fundamental concepts of site reliability engineering, covering service level objectives (SLOs), metrics, monitoring, and incident response, which are applicable to robust ML model deployment.
LLM Serving: A Holistic View, Yongqiang Tian, Huadong Wang, Qizhen Zhang, Zhihao Zhang, Haoran Xu, Yimin Zhang, Bowen Huang, Xuanwei Zhang, Menglu Yu, Weijian Xu, Yongqiang Yao, Kaiyu Li, 2023 (arXiv preprint arXiv:2305.15854) DOI: 10.48550/arXiv.2305.15854 - Provides a survey of techniques and challenges for serving large language models, with insights into latency, throughput, and resource optimization that are relevant for large generative models like diffusion models.
Continuous Delivery for Machine Learning: Principles and Patterns for Productive ML Systems, Annika Backes, Emily Gorcenski, Gareth Jones, Daniel Lopez-Portillo, Mark Treveil, 2020 (O'Reilly Media) - Offers patterns and practices for deploying and maintaining machine learning systems, with specific sections on monitoring system health, model performance, and data quality in production environments.