Having reviewed recurrent architectures in the previous chapter, we now turn to the Transformer model. Recurrent models process sequences token by token, which hinders parallel computation and makes long-distance relationships difficult to model. The Transformer architecture overcomes these limitations by replacing recurrence entirely with attention mechanisms.
This chapter dissects the Transformer's structure. You will learn how attention replaces recurrence, how scaled dot-product attention is computed and extended to multiple heads, how positional encodings reintroduce order information, and how the encoder and decoder stacks are assembled together with residual connections and layer normalization.
By the end of this chapter, you will understand the fundamental components that enable the Transformer's effectiveness in sequence modeling tasks.
4.1 Overcoming Recurrence with Attention
4.2 Scaled Dot-Product Attention
4.3 Multi-Head Attention Mechanism
4.4 Positional Encoding Techniques
4.5 Encoder and Decoder Stacks
4.6 The Role of Layer Normalization and Residual Connections