Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, 2023, NeurIPS 2023 Datasets and Benchmarks Track, DOI: 10.48550/arXiv.2306.05685 - This paper details the effectiveness and limitations of using large language models as judges to evaluate the quality of other LLM outputs, directly relevant to semantic correctness testing.
Building LLM-Powered Applications: From Prompt Engineering to Production, Josh Harrison, Andrew Ng, Jon Krohn, Sinan Ozdemir, 2023 (O'Reilly Media) - This book offers practical guidance on the end-to-end development of LLM applications, covering design, testing strategies, and deployment considerations for building robust systems.