This chapter moves into more sophisticated methods attackers may use to evade detection or extract information from Large Language Models (LLMs). We will look at techniques that often require a deeper understanding of how the model works, or more intricate manipulation of inputs and system interactions.
You will learn about gradient-based attacks, which can be particularly effective when an attacker has some knowledge of the model's architecture and parameters, often written as M(x; θ), where θ denotes the model's parameters. We will also cover transfer attacks using substitute models, membership inference attacks that identify training data, and model stealing techniques. The chapter then addresses methods for bypassing input filters and output sanitizers, chaining multiple attack techniques for increased impact, and strategies for low-resource or black-box scenarios.
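To preview the gradient-based idea before Section 4.1, the sketch below shows the core mechanic: with white-box access to M(x; θ), an attacker differentiates a loss with respect to the input embeddings and perturbs them along the gradient sign (an FGSM-style step). This is a minimal illustration only; the toy embedding-plus-classifier model, the token IDs, and the epsilon value are all placeholders standing in for a real LLM, not code from this chapter's exercises.

```python
import torch
import torch.nn as nn

# Toy stand-in for M(x; θ): an embedding layer followed by a small classifier.
# The same idea applies to an LLM's input embeddings, assuming white-box access.
vocab_size, embed_dim, num_classes = 100, 16, 2
embedding = nn.Embedding(vocab_size, embed_dim)
classifier = nn.Sequential(nn.Flatten(), nn.Linear(embed_dim * 4, num_classes))

token_ids = torch.tensor([[5, 23, 42, 7]])   # input sequence x (illustrative)
target = torch.tensor([1])                   # output the attacker wants to disrupt

# Work in embedding space so the input is differentiable.
embeds = embedding(token_ids).detach().requires_grad_(True)
loss = nn.functional.cross_entropy(classifier(embeds), target)
loss.backward()

# FGSM-style step: nudge the embeddings along the gradient sign to increase the
# loss. A real attack would then map the perturbed embeddings back to nearby
# discrete tokens, e.g. by nearest-neighbor search over the vocabulary.
epsilon = 0.1
adv_embeds = embeds + epsilon * embeds.grad.sign()
print(classifier(adv_embeds).argmax(dim=-1))
```

Later sections build on this mechanic, including how to approximate it when gradients are not directly available, as in the black-box strategies of Section 4.7.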
Understanding these advanced offensive tactics is key to developing more resilient defenses and anticipating a wider range of potential threats to LLM systems. The chapter includes a practical exercise simulating an information exfiltration scenario to help apply these concepts.
4.1 Gradient-Based Attack Methods: An Overview
4.2 Transfer Attacks: Using Substitute Models
4.3 Membership Inference Attacks Against LLMs
4.4 Model Inversion and Stealing Techniques for LLMs
4.5 Bypassing Input Filters and Output Sanitizers
4.6 Chaining Multiple Attack Techniques
4.7 Low-Resource and Black-Box Attack Strategies
4.8 Practice: Simulating an Information Exfiltration Scenario