Designing for High Availability and Fault Tolerance
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, Martin Kleppmann, 2017 (O'Reilly Media) - This book provides a comprehensive overview of the fundamental concepts behind building reliable, scalable, and maintainable data systems. It includes detailed discussions on replication, partitioning (sharding), consistency, fault tolerance, and distributed transaction management, making it relevant for understanding resilient RAG components like vector databases and data pipelines.
Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Murphy, 2016 (O'Reilly Media) - A foundational text on Site Reliability Engineering (SRE) practices, detailing Google's approach to achieving high availability, fault tolerance, and operational excellence for large-scale distributed systems. It covers critical topics like service level objectives (SLOs), monitoring, incident response, and the operational work of keeping systems reliable, all of which apply to production RAG deployments.