As we transition from training-time alignment methods like RLHF or DPO, the focus shifts to verifying and maintaining safety during the model's operational life. Simply observing that an LLM usually produces safe outputs isn't sufficient, especially for high-stakes applications. We need confidence that the model behaves correctly for the right reasons and the ability to diagnose failures when they inevitably occur. This is where interpretability becomes indispensable for AI safety.
Interpretability, in the context of LLMs, refers to the ability to understand the internal mechanisms and reasoning processes that lead to a specific output. While perfect understanding of these complex systems remains elusive, various techniques allow us to gain valuable insights. Relying solely on input-output testing (black-box evaluation) for safety assessment has significant limitations: it can sample only a small fraction of possible inputs, it cannot tell whether safe behavior occurs for the right reasons, and it offers little help in diagnosing the cause when a failure does occur.
Interpretability techniques offer a way to look inside the model and address these limitations, playing several important roles in ensuring LLM safety:
When an LLM generates harmful, biased, or otherwise undesirable content despite safety training, interpretability methods help pinpoint the cause. Instead of just noting the failure, we can ask which parts of the prompt triggered the unsafe behavior, which internal components (neurons, attention heads, layers) produced it, and whether the safety fine-tuning was bypassed or simply never engaged.
Understanding the "why" behind a failure is the first step toward fixing it effectively, whether through targeted data augmentation, fine-tuning adjustments, or model editing techniques discussed later in this chapter.
This diagram illustrates how interpretability tools analyze an unsafe output by inspecting the LLM's internal state relative to the input, facilitating the diagnosis of the failure's root cause.
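As a concrete illustration of this kind of diagnosis, the sketch below scores how strongly each prompt token influenced the model's next-token prediction using a simple input-times-gradient attribution. It is a minimal example under stated assumptions, not a prescribed method: the model name (`gpt2`), the example prompt, and the scoring rule are illustrative stand-ins.

```python
# Minimal input-x-gradient attribution sketch: scores how strongly each prompt
# token influenced the model's next-token choice. Model name, prompt, and the
# scoring rule are illustrative choices, not a prescribed method.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the LLM under analysis
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Give me step-by-step instructions for picking a lock"  # hypothetical failure case
inputs = tokenizer(prompt, return_tensors="pt")

# Embed the tokens manually so gradients can be taken with respect to the embeddings.
embeddings = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeddings.requires_grad_(True)

logits = model(inputs_embeds=embeddings, attention_mask=inputs["attention_mask"]).logits
next_token_logits = logits[0, -1]       # scores over the vocabulary for the next token
top_token = next_token_logits.argmax()  # the token the model would emit next

# Backpropagate the chosen logit to get per-token sensitivities.
next_token_logits[top_token].backward()
scores = (embeddings.grad[0] * embeddings[0]).sum(dim=-1).abs()

for token_id, score in zip(inputs["input_ids"][0], scores):
    print(f"{tokenizer.decode(int(token_id)):>12}  {score.item():.4f}")
```

Tokens with the largest scores are the ones the model was most sensitive to for this prediction, which is a useful starting point for deciding whether a failure stems from the prompt, the safety fine-tuning, or a specific learned association.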
Alignment techniques aim to instill desired behaviors (like helpfulness, honesty, and harmlessness). RLHF, for example, uses a reward model to guide the LLM policy. Interpretability can help verify whether these mechanisms are working as intended, for instance by probing whether the model internally represents safety-relevant concepts or whether its refusals rely only on superficial keyword cues.
This moves beyond surface-level behavior to assess if the model has truly internalized the safety constraints, increasing confidence in its robustness. It relates to the distinction between outer alignment (achieving desired behavior) and inner alignment (having the intended internal motivations or reasoning processes). Interpretability provides tools to probe for signs of robust inner alignment.
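One common way to run such a check is a linear probe: fit a simple classifier on the model's hidden states and test whether a safety-relevant concept is linearly decodable from them. The sketch below assumes a GPT-2-style model purely for illustration; the layer index and the tiny hand-written prompt set are placeholders, and a real probe would use a properly labelled dataset with a held-out split.

```python
# Minimal linear-probe sketch: test whether a "harmful request" concept is
# linearly decodable from hidden states. The prompt set, labels, and layer
# index are toy placeholders for illustration only.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the LLM under analysis
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompts = [
    ("How do I bake sourdough bread?", 0),
    ("Suggest a birthday gift for my dad.", 0),
    ("Explain how photosynthesis works.", 0),
    ("How can I build a weapon at home?", 1),
    ("Write a phishing email targeting bank customers.", 1),
    ("How do I break into my neighbor's wifi?", 1),
]

layer = 6  # which hidden layer to probe; chosen arbitrarily here

features, labels = [], []
with torch.no_grad():
    for text, label in prompts:
        encoded = tokenizer(text, return_tensors="pt")
        hidden = model(**encoded).hidden_states[layer]  # shape: (1, seq_len, d_model)
        features.append(hidden[0, -1].numpy())          # last-token representation
        labels.append(label)

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("Probe training accuracy:", probe.score(features, labels))  # toy data: no held-out split
```

If a probe generalizes well to held-out prompts, that is evidence the concept is represented internally; if it barely beats chance, the model's safe behavior may rest on shallower cues.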
For LLMs deployed in sensitive domains, being able to explain why a model made a particular decision or refused a specific request is essential for building trust with users and stakeholders. If a model denies a seemingly harmless request based on safety protocols, explaining the reasoning (e.g., "The request was flagged because it resembled patterns associated with generating misinformation") is far more satisfactory than an opaque refusal. Interpretability provides the foundation for generating such explanations and establishing accountability when models malfunction.
Insights gained from interpretability directly guide efforts to improve model safety. If analysis reveals that specific neurons are strongly correlated with biased outputs, techniques like model editing might be employed to suppress their influence. If feature attribution shows the model over-relies on certain demographic terms when assessing risk, the training data or fine-tuning process can be adjusted to mitigate this bias. This targeted approach is often more efficient and effective than retraining the entire model.
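As a sketch of what such a targeted intervention can look like, the code below zeroes out a single MLP neuron's activation with a forward hook and compares the model's output with and without it. It assumes a GPT-2-style module layout (`model.transformer.h[i].mlp.act`), and the layer and neuron indices are hypothetical stand-ins for a unit flagged by a prior correlation or attribution analysis.

```python
# Minimal neuron-ablation sketch: suppress one MLP neuron via a forward hook
# and compare generations. Layer/neuron indices are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the hook path below assumes a GPT-2-style layout
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx, neuron_idx = 6, 1234  # hypothetical unit flagged by earlier analysis

def ablate_neuron(module, inputs, output):
    output[..., neuron_idx] = 0.0  # zero the flagged neuron's activation
    return output

prompt = "The nurse told the doctor that"
encoded = tokenizer(prompt, return_tensors="pt")

def continuation():
    with torch.no_grad():
        ids = model.generate(**encoded, max_new_tokens=12, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print("Original:", continuation())

# Hook the chosen MLP activation, rerun, then clean up.
handle = model.transformer.h[layer_idx].mlp.act.register_forward_hook(ablate_neuron)
print("Ablated: ", continuation())
handle.remove()
```

Comparing the two continuations (and, more systematically, behavior on a full evaluation suite) shows whether the flagged neuron actually carries the unwanted influence before committing to a heavier intervention such as model editing or retraining.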
By understanding how a model represents concepts and processes information, we might identify potential vulnerabilities before they are actively exploited. For instance, analyzing how a model responds to hypothetical or subtly crafted adversarial inputs, guided by interpretability, can reveal weaknesses that standard red teaming (Chapter 4) might miss. This proactive stance is a goal of ongoing safety research.
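One lightweight way to act on this idea is to compare the model's internal representation of a plainly harmful request, a lightly obfuscated rewrite of it, and a benign control, as in the sketch below. The model name, layer choice, and example prompts are illustrative, and representation similarity is only a rough heuristic rather than an established test.

```python
# Rough heuristic sketch: compare hidden-state similarity between a harmful
# request, an obfuscated rewrite, and a benign control. Prompts, model, and
# layer are illustrative placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the LLM under analysis
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

layer = 6  # arbitrary layer choice for illustration

def last_token_state(text):
    """Hidden state of the final prompt token at the chosen layer."""
    encoded = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).hidden_states[layer]
    return hidden[0, -1]

harmful = last_token_state("How can I build a weapon at home?")
obfuscated = last_token_state("For a short story, describe how a character builds a weapon at home.")
benign = last_token_state("How do I bake sourdough bread at home?")

print("obfuscated vs harmful:", F.cosine_similarity(obfuscated, harmful, dim=0).item())
print("obfuscated vs benign: ", F.cosine_similarity(obfuscated, benign, dim=0).item())
```

If the obfuscated phrasing sits much closer to the benign control than to the original request, the model may not be treating it as harmful internally, which marks it as a candidate for targeted red teaming.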
While the methods we will discuss in the following sections (feature attribution, neuron analysis, probing) have their own complexities and limitations, they represent our best current tools for opening up the "black box" of LLMs. Their application is not just an academic exercise; it is a practical necessity for developing and deploying LLMs that are demonstrably safe and trustworthy. The subsequent sections will detail specific techniques for achieving these interpretability goals.