Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications, Chip Huyen, 2022 (O'Reilly Media) - Covers essential principles for serving machine learning models, including performance metrics such as latency and throughput and the challenges of deploying models in production, all directly relevant to LLM performance management.
A Survey of Large Language Models, Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen, 2023 (arXiv preprint arXiv:2303.18223), DOI: 10.48550/arXiv.2303.18223 - Offers a comprehensive overview of LLMs, including a section on inference optimization techniques (e.g., quantization, efficient decoding, batching) that directly affect the latency and throughput metrics discussed.