Having established the fundamental components of neural networks (neurons, layers, and activation functions), we now address a core design question: how do we structure these components to create an effective feedforward network? The choice of architecture, specifically the number of hidden layers and the number of neurons within each layer, significantly impacts a model's ability to learn and generalize from data. There is no single magic formula; architecture design is guided by heuristics, problem complexity, and empirical experimentation.
How Many Hidden Layers?
The "depth" of a network refers to the number of hidden layers it contains.
- Zero Hidden Layers: This configuration results in a single-layer perceptron (or logistic/softmax regression if using appropriate activations). As discussed in Chapter 1, these models can only learn linearly separable patterns.
- One Hidden Layer: A network with a single hidden layer is theoretically powerful. The Universal Approximation Theorem states that a feedforward network with one hidden layer containing a finite number of neurons and a suitable non-linear activation function can approximate any continuous function to an arbitrary degree of accuracy, given enough neurons. The theorem only guarantees that such an approximation exists, however; it says nothing about how many neurons are required or whether training will actually find the right weights. In practice, a single hidden layer is often sufficient for many simpler problems.
- Two or More Hidden Layers (Deep Networks): While one hidden layer can theoretically approximate any function, deep networks (those with multiple hidden layers) often learn complex patterns more efficiently. Deeper architectures allow the network to learn hierarchical features. Early layers might learn simple features (like edges or textures in an image), while later layers combine these to detect more complex structures (like shapes or objects). For a given level of accuracy on complex tasks, a deep network might require fewer total neurons (and thus parameters) than a wide, shallow network with only one hidden layer.
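To make this efficiency point concrete, the following sketch (assuming PyTorch; the layer sizes 2048 and 256/128/64 are arbitrary choices for illustration, not recommendations) counts the trainable parameters of a wide, shallow network and a deeper, narrower one that map the same input and output dimensions:

import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Sum the elements of every trainable tensor (weights and biases).
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Wide, shallow network: a single hidden layer with 2048 neurons.
shallow = nn.Sequential(
    nn.Linear(784, 2048), nn.ReLU(),
    nn.Linear(2048, 10),
)

# Deeper, narrower network: three hidden layers (256 -> 128 -> 64).
deep = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

print(f"Shallow parameters: {count_parameters(shallow):,}")  # roughly 1.63 million
print(f"Deep parameters:    {count_parameters(deep):,}")     # roughly 0.24 million

Of course, a smaller parameter count does not by itself guarantee equal accuracy; whether the deeper network matches the shallow one depends on the task and on how well it trains.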
However, increasing depth comes with challenges:
- Vanishing/Exploding Gradients: Training very deep networks can be difficult due to gradients becoming extremely small or large during backpropagation (we'll discuss this more later). Techniques like careful weight initialization, non-saturating activation functions (like ReLU), and batch normalization help mitigate this; a short sketch of these mitigations follows this list.
- Computational Cost: More layers mean more computations during both training (forward and backward passes) and inference.
- Overfitting: Deeper networks have more parameters and capacity, increasing the risk of overfitting the training data if not properly regularized.
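As a concrete illustration of the mitigations mentioned in the first item above, here is a minimal sketch of a deeper stack that combines ReLU activations, batch normalization, and He/Kaiming weight initialization. The helper make_deep_block and the sizes passed to it are hypothetical choices for illustration only:

import torch.nn as nn

def make_deep_block(in_features: int, hidden: int, depth: int, out_features: int) -> nn.Sequential:
    # Stack `depth` hidden layers, each as Linear -> BatchNorm -> ReLU.
    layers = []
    prev = in_features
    for _ in range(depth):
        linear = nn.Linear(prev, hidden)
        # He/Kaiming initialization is designed with ReLU-family activations in mind.
        nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
        nn.init.zeros_(linear.bias)
        layers += [linear, nn.BatchNorm1d(hidden), nn.ReLU()]
        prev = hidden
    layers.append(nn.Linear(prev, out_features))
    return nn.Sequential(*layers)

# Six hidden layers of width 256; all sizes here are arbitrary for the example.
model = make_deep_block(in_features=784, hidden=256, depth=6, out_features=10)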
Guideline: Start simple. Begin with one or two hidden layers. If the model underfits (fails to capture the patterns in the training data), consider gradually increasing the depth or width.
How Many Neurons per Hidden Layer?
The "width" of a layer refers to the number of neurons it contains. This determines the layer's representational capacity at that level of abstraction.
- Too Few Neurons: If a hidden layer has too few neurons, the network might lack the capacity to learn the underlying complexities of the data. This leads to underfitting, where the model performs poorly even on the training data.
- Too Many Neurons: Using significantly more neurons than necessary can make the network prone to overfitting. The model might start memorizing the training examples, including their noise, instead of learning the generalizable patterns. This results in good performance on the training set but poor performance on unseen data. Having too many neurons also increases computational requirements and training time. While techniques like regularization (Chapter 6) can help control overfitting even in wider networks, it's still often inefficient to use an excessive number of neurons.
Common Heuristics and Patterns:
- Input/Output Relationship: A common practice is to choose the number of neurons in hidden layers to be somewhere between the size of the input layer and the size of the output layer.
- Powers of 2: Often, layer sizes are chosen as powers of 2 (e.g., 32, 64, 128, 256, 512), partly due to computational efficiencies on hardware like GPUs, though this is not a strict requirement.
- Funnel Structure: A popular approach is to gradually decrease the number of neurons in successive hidden layers (e.g., Input -> 512 -> 256 -> 128 -> Output). The idea is to progressively compress the information down to the essential features needed for the final prediction (a code sketch of this pattern appears below).
- Constant Width: Sometimes, all hidden layers might have the same number of neurons.
Guideline: The optimal number of neurons is highly dependent on the specific dataset and problem. It's one of the main hyperparameters you'll need to tune. Start with a reasonable number based on input/output sizes or common practices (e.g., a funnel structure) and experiment. Monitor validation performance to check for underfitting or overfitting and adjust the layer sizes accordingly.
Figure: A feedforward network with a "funnel" structure, in which the number of neurons decreases in successive hidden layers.
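To make the funnel pattern concrete, the sketch below builds a feedforward network from an explicit list of layer sizes. The helper make_funnel_mlp and the sizes are illustrative choices, assuming PyTorch:

import torch.nn as nn

def make_funnel_mlp(layer_sizes):
    # Build an MLP from a list of sizes, e.g. [784, 512, 256, 128, 10].
    layers = []
    for i in range(len(layer_sizes) - 1):
        layers.append(nn.Linear(layer_sizes[i], layer_sizes[i + 1]))
        # Apply ReLU after every layer except the final (output) layer.
        if i < len(layer_sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Funnel structure: Input (784) -> 512 -> 256 -> 128 -> Output (10).
model = make_funnel_mlp([784, 512, 256, 128, 10])
print(model)

Expressing the architecture as a single list of sizes makes it easy to try alternatives, for example a constant-width [784, 256, 256, 256, 10], without touching the model code.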
Output Layer Configuration
Remember that the output layer's design is dictated entirely by the task (a sketch pairing each configuration with a typical PyTorch loss follows this list):
- Regression: Typically one neuron with a linear (or no) activation function.
- Binary Classification: One neuron with a Sigmoid activation function.
- Multi-class Classification: N neurons (where N is the number of classes) with a Softmax activation function.
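The sketch below pairs each of these configurations with the loss functions commonly used alongside them in PyTorch. In practice the Sigmoid and Softmax are usually folded into the loss, since nn.BCEWithLogitsLoss and nn.CrossEntropyLoss both expect raw logits and are more numerically stable that way; the feature and class counts here are arbitrary:

import torch.nn as nn

num_features, num_classes = 64, 5  # arbitrary sizes for illustration

# Regression: one output neuron, linear (no) activation, mean squared error.
regression_head = nn.Linear(num_features, 1)
regression_loss = nn.MSELoss()

# Binary classification: one output neuron. BCEWithLogitsLoss applies the
# Sigmoid internally, so the head itself stays linear.
binary_head = nn.Linear(num_features, 1)
binary_loss = nn.BCEWithLogitsLoss()

# Multi-class classification: N output neurons. CrossEntropyLoss applies
# LogSoftmax internally, so no explicit Softmax layer is needed either.
multiclass_head = nn.Linear(num_features, num_classes)
multiclass_loss = nn.CrossEntropyLoss()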
Iteration and Experimentation
Designing network architecture is rarely a one-shot process. It typically involves:
- Starting with a reasonable baseline architecture based on the problem type and heuristics.
- Training the model and evaluating its performance on a validation set.
- Observing if the model is underfitting or overfitting.
- Iteratively adjusting the architecture (adding/removing layers, changing layer widths) and hyperparameters (like learning rate, discussed later) based on the evaluation results.
- Employing techniques like regularization (Chapter 6) to improve generalization.
Automated hyperparameter tuning techniques can systematically explore different architectures, but understanding these design principles provides a solid foundation for making informed choices and guiding the search process.
For instance, defining a simple feedforward network in PyTorch might look like this, explicitly showing the layer dimensions:
import torch
import torch.nn as nn

class SimpleMLP(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(SimpleMLP, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size1)
        self.relu1 = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size1, hidden_size2)
        self.relu2 = nn.ReLU()
        self.output_layer = nn.Linear(hidden_size2, output_size)
        # Output activation (e.g., nn.Sigmoid() or nn.Softmax(dim=1))
        # would typically be applied after this layer or handled by the loss function.

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu1(x)
        x = self.layer2(x)
        x = self.relu2(x)
        x = self.output_layer(x)
        return x

# Example instantiation:
input_dim = 784   # e.g., for flattened 28x28 MNIST images
h1_dim = 128
h2_dim = 64
output_dim = 10   # e.g., for 10 digit classes

model = SimpleMLP(input_dim, h1_dim, h2_dim, output_dim)
print(model)
This example defines a network with an input layer, two hidden layers (128 and 64 neurons respectively) using ReLU activations, and an output layer. The specific number of neurons (128, 64) represents design choices based on the principles discussed. As you progress, you'll develop intuition and utilize systematic methods to refine these architectural decisions for your specific problems.
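As a sketch of the iterative process described above, the loop below enumerates a few candidate widths for the two hidden layers of SimpleMLP (reusing input_dim and output_dim from the previous example) and builds a model for each. The candidate sizes are arbitrary, and train_and_evaluate is a hypothetical placeholder for whatever training-and-validation routine you use; it is left commented out because it is not defined here:

from itertools import product

# Candidate widths for the two hidden layers (arbitrary choices).
candidate_h1 = [64, 128, 256]
candidate_h2 = [32, 64, 128]

results = {}
for h1, h2 in product(candidate_h1, candidate_h2):
    model = SimpleMLP(input_dim, h1, h2, output_dim)
    # val_accuracy = train_and_evaluate(model, train_loader, val_loader)
    # results[(h1, h2)] = val_accuracy

# After the search, keep the architecture with the best validation score:
# best_arch = max(results, key=results.get)

Even a small grid like this makes the trade-offs discussed above visible: wider layers increase capacity (and the risk of overfitting), narrower ones risk underfitting, and validation performance is the arbiter.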