BLEU: a Method for Automatic Evaluation of Machine Translation, Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, 2002Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics)DOI: 10.3115/1073083.1073135 - 这篇有影响力的论文介绍了BLEU,一种广泛使用的自动指标,通过与参考译文比较来评估机器生成文本的质量。它与需要高精度匹配参考文本的任务相关。
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, 2023NeurIPS 2023 Datasets and Benchmarks TrackDOI: 10.48550/arXiv.2306.05685 - 这项研究探讨了使用大型语言模型作为评估器(“LLM即评委”)评估其他大型语言模型的可靠性。它为这种模型评估方法引入了基准和方案。