In Search of an Understandable Consensus Algorithm, Diego Ongaro and John Ousterhout, 2014, USENIX Annual Technical Conference (USENIX Association), DOI: 10.5555/2645852.2645867 - This paper introduces the Raft consensus algorithm, widely used for leader election and managing replicated logs in distributed systems, providing a clear explanation for understanding failover in HA systems.
Distributed Systems: Concepts and Design, George Coulouris, Jean Dollimore, Tim Kindberg, and Gordon Blair, 2011 (Addison-Wesley/Pearson) - A classic textbook providing an academic treatment of distributed systems design, covering fundamental concepts of replication, consistency, fault tolerance, and distributed algorithms relevant to building reliable production systems.