Okay, you've grasped the core mechanics: how individual neurons compute, how data flows forward to make predictions, and how backpropagation combined with gradient descent enables the network to learn from errors. Now, it's time to act as the architect, designing the blueprint for your neural network before construction (training) begins. Setting up the network architecture involves making specific choices about its structure, which fundamentally defines how it will process information and learn.
Think of this step like deciding how many floors and rooms a building should have. These decisions depend on the building's purpose. Similarly, your network's architecture depends heavily on the specific problem you're trying to solve and the nature of your data.
The primary components you need to define are:
- The number of layers: How many layers will your network have? At a minimum, you need an input layer and an output layer. Often, one or more hidden layers sit between them.
- The number of neurons (or units) in each layer: How computationally "wide" should each layer be?
- The activation function for each layer: Which function will introduce non-linearity after the linear computations in each layer?
Let's break down the considerations for each part.
Input Layer
The input layer is the network's entry point. Unlike the other layers, it performs no computation of its own (no weighted sums or activation functions in the traditional sense); its role is simply to receive the input data and pass it forward.
- Number of Neurons: The number of neurons in the input layer is determined directly by the number of features in your dataset. If you're working with tabular data that has 15 features per sample, your input layer will have 15 neurons. If you're processing grayscale images of size 28×28 pixels, you'd typically flatten the image into a vector of 28×28 = 784 pixel values, meaning your input layer would need 784 neurons. You've already encountered how to prepare data (scaling, encoding) in Chapter 2; the result of that preprocessing dictates the shape of the input layer.
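As a quick illustration, here is a minimal sketch of that flattening step using NumPy and a hypothetical batch of random images (the batch size of 32 is just an example, not something prescribed by the text):

```python
import numpy as np

# A hypothetical batch of 32 grayscale images, each 28x28 pixels
images = np.random.rand(32, 28, 28)

# Flatten each image into a 784-element vector; the input layer
# therefore needs 28 * 28 = 784 neurons
flattened = images.reshape(32, -1)
print(flattened.shape)  # (32, 784)
```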
Hidden Layers
Hidden layers are the intermediate layers between the input and output layers. They are where most of the complex feature learning happens.
- Number of Layers (Depth): Adding more hidden layers (making the network "deeper") allows the network to potentially learn more complex, hierarchical features. Early layers might learn simple patterns (like edges or textures in an image), while deeper layers combine these to recognize more abstract concepts (like objects or faces). However, deeper networks can be more challenging to train (e.g., vanishing gradients, discussed later) and computationally more expensive. For many standard problems, starting with one or two hidden layers is a reasonable baseline.
- Number of Neurons per Layer (Width): Increasing the number of neurons in a hidden layer (making it "wider") gives the network more capacity to learn patterns within that layer. Too few neurons might lead to underfitting (the network can't capture the data's complexity), while too many can increase computational cost and the risk of overfitting (the network learns the training data too well, including noise, and performs poorly on new data). There's no single magic formula; choosing the width often involves some experimentation, but common practice is to have hidden layers with neuron counts related to the input/output sizes, sometimes decreasing in size towards the output layer.
- Activation Functions: Hidden layers almost always use non-linear activation functions. As discussed in Chapter 1, this non-linearity is essential for the network to learn complex mappings beyond simple linear relationships. ReLU (Rectified Linear Unit) is currently the most popular choice for hidden layers due to its computational efficiency and effectiveness in mitigating the vanishing gradient problem. Tanh or Sigmoid are also possibilities, though less common in modern deep hidden layers. You generally use the same activation function for all neurons within a single hidden layer.
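For reference, here is a minimal NumPy sketch of the three activations mentioned above, using their standard definitions; ReLU simply zeroes out negative inputs, which is part of why it is so cheap to compute:

```python
import numpy as np

def relu(z):
    # Passes positive values through unchanged, zeroes out negatives
    return np.maximum(0, z)

def sigmoid(z):
    # Squashes values into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # Squashes values into the range (-1, 1)
    return np.tanh(z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))     # [0.  0.  0.  0.5 2. ]
print(sigmoid(z))  # values between 0 and 1
print(tanh(z))     # values between -1 and 1
```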
Output Layer
The output layer is the final layer, producing the network's prediction. Its configuration is heavily dependent on the type of task:
- Regression: typically a single neuron with no activation (a linear output), producing a continuous value.
- Binary classification: a single neuron with a Sigmoid activation, producing a probability between 0 and 1.
- Multi-class classification: one neuron per class with a Softmax activation, producing a probability distribution over the classes.
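To make the multi-class case concrete, here is a minimal NumPy sketch of the Softmax computation applied to some hypothetical raw output scores (logits); the result is a set of probabilities that sum to 1:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical raw scores from a 10-neuron output layer
logits = np.array([2.0, 1.0, 0.1, -1.2, 0.5, 0.0, 3.1, -0.3, 1.7, 0.2])
probs = softmax(logits)
print(probs.round(3))  # ten probabilities, one per class
print(probs.sum())     # 1.0
```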
Example: Designing a Simple Classifier Architecture
Let's say we want to build a network to classify images from a dataset like MNIST (handwritten digits 0-9). The images are grayscale, 28×28 pixels each.
- Input Layer: The images are 28×28=784 pixels. After flattening, our input layer needs 784 neurons.
- Hidden Layers: We could start with a simple architecture, perhaps two hidden layers. Let's choose 128 neurons for the first hidden layer and 64 for the second. This is a common heuristic, decreasing the width as we go deeper. For activation, we'll use ReLU for both hidden layers.
- Output Layer: We have 10 possible classes (digits 0 through 9). So, the output layer needs 10 neurons. Since it's a multi-class classification problem, we'll use the Softmax activation function.
Here is a visualization of this architecture:
A simple feedforward neural network architecture for MNIST classification. Input layer receives flattened pixel data, passes through two hidden ReLU layers of decreasing width, and finally to a 10-neuron Softmax output layer for class probabilities. Dotted lines indicate full connectivity between adjacent layers.
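To show how little code such a definition takes in practice, here is a minimal sketch of this exact architecture in PyTorch (one of the frameworks discussed at the end of this section); the layer sizes follow the choices above:

```python
import torch
from torch import nn

# 784 inputs -> 128 ReLU -> 64 ReLU -> 10-class Softmax output
model = nn.Sequential(
    nn.Flatten(),          # 28x28 image -> 784-element vector
    nn.Linear(784, 128),   # first hidden layer
    nn.ReLU(),
    nn.Linear(128, 64),    # second hidden layer
    nn.ReLU(),
    nn.Linear(64, 10),     # output layer: one neuron per digit class
    nn.Softmax(dim=1),     # class probabilities (in practice often folded
                           # into the loss, e.g. CrossEntropyLoss)
)

# A single fake 28x28 grayscale image as a sanity check
x = torch.rand(1, 1, 28, 28)
print(model(x).shape)  # torch.Size([1, 10])
```

The sanity check at the end simply confirms that a 28×28 input flows through the network and comes out as 10 class probabilities, matching the design we laid out.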
This defines our network's structure. The next step, which we'll cover shortly, is initializing the weights and biases for all the connections between these neurons. Remember that architecture design is often an iterative process. You might start with a simple structure like this, train it, evaluate its performance, and then revisit the architecture (e.g., change layer sizes, add/remove layers, try different activations) to improve results. Deep learning frameworks like TensorFlow and PyTorch provide convenient ways to define these layers and experiment with different structures, abstracting away much of the manual setup.