While previous chapters focused on steering Large Language Models (LLMs) towards desired behaviors, this chapter addresses the security aspects inherent in deploying these powerful systems. Even models aligned through techniques like RLHF can exhibit vulnerabilities when faced with specifically crafted inputs designed to circumvent safety measures or cause unintended actions. Understanding these potential failure points is essential for building dependable AI applications.
Here, you will learn to:
We will examine the mechanisms behind these attacks and the practical steps involved in constructing defenses, equipping you to develop more secure LLM systems.
5.1 Taxonomy of Adversarial Attacks on LLMs
5.2 Jailbreaking Techniques and Examples
5.3 Prompt Injection Attacks
5.4 Data Poisoning Attacks during Training/Fine-tuning
5.5 Membership Inference and Privacy Attacks
5.6 Adversarial Training for LLM Robustness
5.7 Input Sanitization and Output Filtering Defenses
5.8 Formal Verification Approaches (Limitations and Potential)
5.9 Practice: Crafting and Defending Against Basic Jailbreaks
© 2025 ApX Machine Learning