Designing for High Availability and Fault Tolerance
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, Martin Kleppmann, 2017 (O'Reilly Media) - A comprehensive overview of the fundamental concepts behind data systems. It discusses replication, partitioning (sharding), consistency, fault tolerance, and distributed transactions in detail, which makes it relevant to building resilient RAG components such as vector databases and data pipelines.
Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Murphy, 2016 (O'Reilly Media) - A foundational text on Site Reliability Engineering (SRE) practices, detailing Google's approach to achieving high availability, fault tolerance, and operational excellence in large-scale distributed systems. It covers service level objectives (SLOs), monitoring, incident response, and the day-to-day operational work of keeping systems reliable, all of which apply to production RAG deployments.