Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small, Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt, 2022, arXiv preprint, DOI: 10.48550/arXiv.2211.00593 - Identifies and analyzes a circuit of attention heads responsible for indirect object identification in GPT-2 small, providing a template for understanding a model's internal computations through circuits and causal interventions such as path patching.
In-context Learning and Induction Heads, Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah, 2022, arXiv preprint, DOI: 10.48550/arXiv.2209.11895 - Analyzes 'induction heads', a type of attention-head circuit in Transformers, and uses causal interventions such as ablations to argue that they are a major mechanism behind in-context learning.