Site Reliability Engineering: How Google Runs Production Systems, Benjamin Treynor Sloss, Betsy Beyer, 2016 (O'Reilly Media) - Provides fundamental principles and practices for system reliability, monitoring, alerting, and incident response, applicable to complex production systems like feature stores.
Tecton: A Modern Feature Store for ML, Mike Zadeh, Sam Steingold, Ben Zaitlen, Matt Gormley, Jeremy Hyrkas, Max Hjelm, Benji Cooper, Andrew Lee, Adarsha Nadig, Josh Rosen, Ryan Smith, Kevin Tian, Justin Trevor, Xiaoyong Wang, Mike Williams, David Li, 2021, Proceedings of the VLDB Endowment, Vol. 14 (VLDB Endowment), DOI: 10.14778/3476311.3476344 - Describes a modern feature store architecture, including components that require operational monitoring, and touches on reliability for ML systems.
Reliable Machine Learning: Applying SRE Principles to ML in Production, Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd Underwood, 2022 (O'Reilly Media) - Focuses on applying Site Reliability Engineering (SRE) principles specifically to machine learning systems, including data quality monitoring, model monitoring, and ensuring operational reliability of ML pipelines.