Site Reliability Engineering: How Google Runs Production Systems, Niall Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff, 2016 (O'Reilly Media, Inc.) - A foundational text on operating large-scale systems, with detailed discussions of release engineering, canary deployments, and incident management, all of which apply to advanced LLM deployment.
MLOps: Continuous delivery and automation pipelines in machine learning, Google Cloud, 2024 (Google Cloud) - A guide from Google Cloud outlining MLOps principles, including strategies for automated model deployment, testing, and monitoring, which are directly relevant to implementing advanced deployment patterns for LLMs.