ReAct: Synergizing Reasoning and Acting in Language Models, Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao, 2023. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2210.03629 - This paper introduces the ReAct framework, which interleaves explicit reasoning traces (Thoughts) with actions and observations, making the agent's internal decision process inspectable for evaluation.
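A minimal sketch of the Thought/Action/Observation loop the annotation describes, assuming hypothetical `call_llm` and `run_tool` callables standing in for a model API and a tool executor; this is not the paper's released code.

```python
from typing import Callable

def react_loop(question: str,
               call_llm: Callable[[str], str],   # hypothetical model wrapper
               run_tool: Callable[[str], str],   # hypothetical tool executor
               max_steps: int = 8) -> str:
    """Interleave Thought/Action/Observation steps until a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model emits a free-text Thought plus either an Action or a Final Answer.
        step = call_llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:
            action = step.split("Action:")[-1].strip()
            # Ground the next Thought in what the tool actually returned.
            observation = run_tool(action)
            transcript += f"Observation: {observation}\n"
    return transcript  # no final answer within the step budget
```

Because the full transcript accumulates every Thought, Action, and Observation, an evaluator can inspect the trace directly rather than judging only the final answer.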
Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan, 2023. Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.2305.10601 - This work details a reasoning framework that explores multiple reasoning paths and self-evaluates intermediate steps, which is directly relevant to evaluating structured planning and hypothesis generation.
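A schematic of the breadth-first variant of this search: propose several candidate next steps, score each partial solution with a self-evaluation call, and keep only the best few. The `propose` and `score` callables are assumed model wrappers, not the authors' implementation.

```python
from typing import Callable, List

def tot_bfs(problem: str,
            propose: Callable[[str], List[str]],  # LLM proposes candidate next steps
            score: Callable[[str], float],        # LLM self-evaluates a partial solution
            depth: int = 3,
            beam: int = 5) -> str:
    """Breadth-first Tree-of-Thoughts-style search over partial solutions."""
    frontier = [problem]
    for _ in range(depth):
        # Expand every surviving partial solution with several continuations.
        candidates = [state + "\n" + step
                      for state in frontier
                      for step in propose(state)]
        # Prune: keep only the `beam` highest-scoring partial solutions.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]
    return max(frontier, key=score)
```

The intermediate `score` calls are what make the framework useful for evaluation: they expose how the model rates its own partial plans, not just its final output.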
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, 2023. NeurIPS Datasets and Benchmarks Track. DOI: 10.48550/arXiv.2306.05685 - This paper examines the effectiveness of using large language models as evaluators, a core automated evaluation technique, and offers a framework for robust assessment.
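A sketch of pairwise LLM-as-a-judge grading in the spirit of this paper; the prompt template and `call_llm` are illustrative assumptions. Querying in both answer orders addresses the position bias the paper documents.

```python
from typing import Callable

# Illustrative judge prompt, not the paper's exact template.
JUDGE_PROMPT = """You are an impartial judge. Compare the two answers below.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with exactly one of: A, B, or tie."""

def judge_pair(question: str, answer_a: str, answer_b: str,
               call_llm: Callable[[str], str]) -> str:
    """Ask a judge model for a verdict in both answer orders to reduce position bias."""
    first = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b)).strip()
    # Re-ask with the answers swapped; a consistent verdict is more trustworthy.
    swapped = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_b, answer_b=answer_a)).strip()
    swapped = {"A": "B", "B": "A"}.get(swapped, swapped)  # map back to original labels
    return first if first == swapped else "tie"  # disagreement falls back to a tie
```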