Representation Engineering: A Top-Down Approach to AI Transparency, Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks, 2023. arXiv. DOI: 10.48550/arXiv.2310.01405 - Introduces representation engineering, which identifies and manipulates concept representations within LLMs to improve safety and steer model behavior; highly relevant to applying probing to alignment.