Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy, 2016 (O'Reilly Media) - This foundational book lays out principles and practices for operating large-scale distributed systems, with guidance on monitoring, defining Service Level Objectives (SLOs), capacity planning, and alerting, all of which apply to running vector search in production.
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, Martin Kleppmann, 2017 (O'Reilly Media) - A thorough resource on the design and operation of distributed data systems. It covers the fundamentals of scalability, reliability, and maintainability, including performance metrics such as latency and throughput, as well as resource management, all directly applicable to scaling vector search.