Representation Engineering: A Top-Down Approach to AI Transparency, Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks, 2023, arXiv, DOI: 10.48550/arXiv.2310.01405 - Introduces representation engineering, which identifies and manipulates the representations of concepts inside LLMs in order to improve safety and steer model behavior. Highly relevant to the use of probes in alignment work.