Holistic Evaluation of Language Models, Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, Yuta Koreeda, 2023. Transactions on Machine Learning Research (TMLR). DOI: 10.48550/arXiv.2211.09110 - Presents a broad framework for evaluating large language models (LLMs) across a wide range of criteria and tasks, providing context for automated testing strategies.
BERTScore: Evaluating Text Generation with BERT, Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi, 2020. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1904.09675 - Introduces a widely used metric for evaluating text generation by semantic similarity over contextual embeddings, relevant for automated evaluation (a usage sketch follows this list).
LangChain Evaluation, LangChain Team, 2024 (LangChain) - Official documentation for a popular LLM framework's evaluation capabilities, demonstrating practical implementation of automated testing workflows.
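As a minimal illustration of the BERTScore entry above, the sketch below scores model outputs against reference answers with the open-source `bert-score` package and applies a pass/fail check, as one might in an automated test. The example strings, the 0.85 threshold, and the test structure are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch: semantic-similarity check with the bert-score package
# (pip install bert-score). Threshold and examples are assumptions for
# illustration only.
from bert_score import score

candidates = ["The cat sat on the mat."]        # model outputs under test
references = ["A cat was sitting on the mat."]  # expected reference answers

# score() returns per-pair precision, recall, and F1 tensors.
precision, recall, f1 = score(candidates, references, lang="en", verbose=False)

for cand, f in zip(candidates, f1.tolist()):
    status = "PASS" if f >= 0.85 else "FAIL"    # assumed pass threshold
    print(f"{status}  F1={f:.3f}  {cand}")
```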