Python toolkit for building production-ready LLM applications. Modular utilities for prompts, RAG, agents, structured outputs, and multi-provider support.
Was this section helpful?
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, 2023NeurIPS 2023 Datasets and Benchmarks TrackDOI: 10.48550/arXiv.2306.05685 - Examines the effectiveness of employing large language models as evaluators for other LLMs, sharing insights into this evaluation method.