Universal and Transferable Adversarial Attacks on Aligned Language Models, Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson, 2023, arXiv preprint arXiv:2307.15043, DOI: 10.48550/arXiv.2307.15043 - This paper demonstrates how adversarial attacks, often relying on context manipulation, can bypass LLM safety alignment. It addresses instruction persistence and prompt injection via conversation history.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks, Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas, 2023, arXiv preprint arXiv:2310.03684, DOI: 10.48550/arXiv.2310.03684 - This paper proposes a defense against jailbreaking attacks that randomly perturbs multiple copies of an input prompt and aggregates the resulting responses; such attacks are a direct consequence of poorly managed context and malicious instruction persistence.
Re-evaluating the Safety of Retrieval-Augmented Generation for Large Language Models, Malik Kaddour, Sarah E. Allen, Jonathan Pilault, Pascal Vincent, Ryan Cotterell, Sarath Chandar, 2024, arXiv preprint arXiv:2403.01258 - This paper examines the safety implications of Retrieval-Augmented Generation (RAG), a technique commonly used for long-term memory, analyzing how retrieved information can introduce new safety risks.