You've learned about feedforward neural networks, also known as Multi-Layer Perceptrons (MLPs), and how they can learn complex, non-linear relationships in data. Their structure, typically involving densely connected layers where each neuron connects to every neuron in the previous layer, makes them powerful function approximators when the input is a fixed-size vector of features.
However, this very structure introduces limitations when dealing with specific kinds of data that possess inherent structure not easily captured by a simple vector representation. Let's examine two significant areas where standard feedforward networks often fall short: processing grid-structured data like images, and handling sequential data like text or time series.
Images are not just arbitrary collections of pixels; they have a strong spatial structure. Pixels close to each other are often related, forming textures, edges, and objects. When you feed an image into a standard MLP, the typical first step is to flatten the image array (e.g., height x width x channels) into a single, long vector.
Figure: A 2D image is flattened into a 1D vector before being fed into an MLP. Pixels that were vertically adjacent in the grid (e.g., pixels 1 and 4 in a 3-pixel-wide image) end up separated in the vector.
This flattening process discards the explicit 2D spatial arrangement. Pixels that were close neighbors in the image (like one directly above another) end up far apart in the input vector. Because the first layer of an MLP has no built-in notion of which inputs were neighbors, the network must discover local spatial patterns from scratch, which is inefficient.
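To make this concrete, here is a small NumPy sketch (the 3x3 image and its pixel numbering are purely illustrative) showing how row-major flattening separates vertical neighbors:

```python
import numpy as np

# A tiny 3x3 "image" whose pixels are numbered 1..9 for readability.
image = np.arange(1, 10).reshape(3, 3)
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]

flat = image.flatten()  # row-major (C-order) flattening
# [1 2 3 4 5 6 7 8 9]

# Pixels 1 and 4 are vertical neighbors in the image (distance 1),
# but in the flat vector they sit a full image-width apart.
row_a, col_a = 0, 0   # pixel 1
row_b, col_b = 1, 0   # pixel 4
flat_distance = abs((row_a * 3 + col_a) - (row_b * 3 + col_b))
print(flat_distance)  # 3 -- separated by the image width
```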
Furthermore, consider the connections. In a densely connected layer, every neuron connects to every input pixel. For even moderately sized images (e.g., 256x256 pixels), this results in an enormous number of parameters (weights) in the very first hidden layer. This parameter explosion leads to several problems:

- High memory and compute cost: storing and updating hundreds of millions of weights is expensive.
- Overfitting risk: with far more parameters than most datasets can constrain, the network tends to memorize training examples rather than generalize.
- Data hunger: more parameters generally require more labeled data and more training time to fit effectively.
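A quick back-of-the-envelope calculation makes the scale clear (the 1,000-unit hidden layer is an arbitrary illustrative choice):

```python
# Parameters in the first dense layer for a flattened RGB image.
height, width, channels = 256, 256, 3
inputs = height * width * channels     # 196,608 input values
hidden_units = 1_000                   # illustrative hidden layer size

weights = inputs * hidden_units        # one weight per (input, neuron) pair
biases = hidden_units
print(f"{weights + biases:,}")         # 196,609,000 parameters in one layer
```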
Another challenge is the lack of translation invariance. An object (like a cat) should still be recognizable as a cat whether it appears in the top-left corner or the bottom-right corner of the image. MLPs have no such invariance built in: because each input pixel is connected through its own unique weights, the network essentially needs to learn to detect the features of a cat separately for every possible position it might appear in. This is highly inefficient.
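The sketch below (using random weights purely for illustration) shows why position matters to a dense layer: shifting an object changes which weights it meets, so nothing learned about the object at one location transfers to another.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 5x5 image with a small 2x2 "object" in the top-left corner...
img_a = np.zeros((5, 5))
img_a[0:2, 0:2] = 1.0

# ...and the same object shifted to the bottom-right corner.
img_b = np.zeros((5, 5))
img_b[3:5, 3:5] = 1.0

# One first-layer neuron of an MLP: a single weight per input pixel.
w = rng.normal(size=25)

# The neuron's response to the two images involves entirely different
# weights, because different flat positions are active in each case.
print(w @ img_a.flatten())  # uses weights at flat positions 0, 1, 5, 6
print(w @ img_b.flatten())  # uses weights at flat positions 18, 19, 23, 24
```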
Data like natural language text, speech audio, or financial time series are inherently sequential. The order of elements matters significantly. For example, "dog bites man" has a very different meaning from "man bites dog".
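One way to see the problem: if a sentence is collapsed into an order-free, fixed-size representation such as a bag of words (a deliberately simplistic encoding, chosen here only for illustration), the two sentences become indistinguishable:

```python
from collections import Counter

sentence_a = "dog bites man"
sentence_b = "man bites dog"

# A bag-of-words encoding counts tokens but discards their order.
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a == bag_b)  # True -- the order-free encodings are identical
```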
Standard MLPs run into two main difficulties with sequences:

1. Fixed input size. An MLP expects a fixed-size input vector, but sequences naturally vary in length: sentences contain different numbers of words, and time series span different numbers of steps. Padding or truncating every sequence to one length either wastes computation or throws information away.
2. No parameter sharing across time steps. Similar to the image case, if a specific pattern or feature is important regardless of where it appears in the sequence (e.g., detecting a specific phrase), an MLP would need different weights to detect that pattern at the beginning, middle, or end of the sequence. This is inefficient and requires more data to learn effectively (see the sketch after this list).
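The arithmetic below (sequence length and pattern size are illustrative) contrasts detecting one pattern at every position with position-specific dense weights versus a single shared set of weights:

```python
# Detecting a length-k pattern anywhere in a length-L sequence.
L, k = 100, 5

# Dense approach: a separate detector (k weights) for each of the
# L - k + 1 positions where the pattern could start.
dense_weights = (L - k + 1) * k
print(dense_weights)   # 480 position-specific weights

# Shared-weight approach (the idea behind convolutions and recurrence):
# one detector of k weights is reused at every position.
shared_weights = k
print(shared_weights)  # 5 weights, shared across all positions
```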
These limitations highlight the need for network architectures specifically designed to handle spatial hierarchies (like in images) and temporal dependencies (like in sequences). Convolutional Neural Networks (CNNs) address the challenges of spatial data, while Recurrent Neural Networks (RNNs) are tailored for sequential data. We will explore the fundamentals of these architectures next.