To understand the motivation behind the Transformer architecture, we first revisit the methods that previously dominated sequence modeling. Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs) represented significant steps in handling sequential data. However, their inherent structure imposes several limitations that hinder progress, especially as sequences grow longer.
This chapter examines these specific challenges. We will discuss:
1.1 Sequential Computation in Recurrent Networks
1.2 The Vanishing and Exploding Gradient Problems
1.3 Long Short-Term Memory (LSTM) Gating Mechanisms
1.4 Gated Recurrent Units (GRUs) Architecture
1.5 Challenges with Long-Range Dependencies
1.6 Parallelization Constraints in Recurrent Models

Recognizing these constraints provides the necessary background for appreciating the architectural innovations introduced by the Transformer model in subsequent chapters.