Convolutional and pooling layers act as powerful feature extractors. These layers process input images (or other grid-like data) and produce feature maps, multi-dimensional tensors representing detected patterns like edges, textures, or more complex shapes. For instance, after several Conv2D and MaxPooling2D layers, one might have a tensor with a shape like (height,width,channels), say (7,7,64). This tensor retains spatial information; the values in the 64 channels correspond to specific features detected at different locations within the downsampled 7×7 grid.
However, for tasks like classification, we typically need to make a final prediction based on the entire set of extracted features. Standard fully connected (Dense) layers expect their input as a one-dimensional vector, where each element represents a single feature value, without any inherent 2D or 3D spatial structure. The output of our convolutional base, being a multi-dimensional tensor like (7,7,64), isn't directly compatible with Dense layers.
This is where the Flatten layer comes in. Its job is simple but essential: it takes the multi-dimensional output from the convolutional base and reshapes, or "flattens," it into a single, long one-dimensional vector. It does this by essentially unstacking the elements row by row, channel by channel.
For example, if the input tensor has shape (height,width,channels), the Flatten layer will produce a vector of length height×width×channels. Using our previous example of a (7,7,64) tensor, the Flatten layer would transform it into a vector with 7×7×64=3136 elements.
Diagram: the Flatten layer reshaping the multi-dimensional output of the convolutional base into a 1D vector.
The Flatten layer itself doesn't learn anything; it contains no trainable weights. It's purely a structural transformation required to connect the feature extraction part of the CNN (the convolutional base) to the classification or regression part (the head).
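Because Flatten is just a reshape with no learned parameters, its effect can be sketched in a few lines of NumPy. This is an illustrative stand-in for the layer, not Keras's implementation; the (7, 7, 64) shape comes from the running example above.

```python
import numpy as np

# One (7, 7, 64) feature map, standing in for the output
# of the convolutional base in the example above.
feature_map = np.arange(7 * 7 * 64).reshape(7, 7, 64)

# Flatten is a pure reshape: height * width * channels values
# become a single 1D vector, with no weights involved.
flat = feature_map.reshape(-1)

print(flat.shape)  # (3136,) since 7 * 7 * 64 = 3136
```

Every value from the feature maps is preserved; only the arrangement changes, which is why the layer adds zero trainable parameters to the model.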
Once the feature maps are flattened into a vector, we can feed this vector into one or more standard Dense (fully connected) layers. These layers work just like the ones you encountered in basic neural networks. Each neuron in a Dense layer receives input from all neurons in the previous layer (in this case, all elements of the flattened vector).
The purpose of these Dense layers in a CNN architecture is to learn combinations of the features extracted by the convolutional base. While the convolutional layers learned local patterns (edges, textures in small patches), the Dense layers learn global patterns across the entire input image, based on which features were activated where. They combine these high-level features to make the final prediction.
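To make the "combinations of features" idea concrete, here is a minimal NumPy sketch of what a single Dense layer computes on the flattened vector: a weighted sum of all inputs per unit, plus a bias, passed through an activation. The 3136-element input and the 128-unit size are illustrative choices, not taken from the model in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Flattened feature vector: 3136 values (7 * 7 * 64).
x = rng.standard_normal(3136)

# A hypothetical Dense layer with 128 units: every unit sees every
# element of x via the weight matrix W, then adds a bias and applies ReLU.
W = rng.standard_normal((3136, 128)) * 0.01
b = np.zeros(128)
hidden = np.maximum(0.0, x @ W + b)  # ReLU(x @ W + b)

print(hidden.shape)  # (128,)
```

Each of the 128 outputs depends on all 3136 input features, which is exactly how the Dense head can learn global combinations of the locally detected patterns.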
Typically, a CNN includes one or more Dense layers after the Flatten layer:
- One or more hidden Dense layers with a ReLU activation (activation="relu") to introduce non-linearity and learn complex feature combinations. The number of units in these layers is a hyperparameter to tune (e.g., 128, 256, 512).
- A final output Dense layer whose configuration depends on the task: for binary classification, a single unit with a sigmoid activation; for multi-class classification, N units (where N is the number of classes) with a softmax activation.

Adding Flatten and Dense layers to a Keras model is straightforward. You simply add them after the last convolutional or pooling layer. Here's how it looks in a Sequential model context:
import keras
from keras import layers
# A small Sequential model: a convolutional base followed by a classifier head.
# Example input shape: (28, 28, 1) for MNIST.
model = keras.Sequential(
[
keras.Input(shape=(28, 28, 1)),
# --- Convolutional Base ---
layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
layers.MaxPooling2D(pool_size=(2, 2)),
layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
layers.MaxPooling2D(pool_size=(2, 2)),
# --- Classifier Head ---
layers.Flatten(), # Flatten the 3D feature map to 1D
layers.Dropout(0.5), # Optional: Dropout for regularization
layers.Dense(10, activation="softmax"), # Output layer for 10 classes (e.g., MNIST digits)
]
)
model.summary()
In this example:
- layers.Flatten() takes the output of the last MaxPooling2D layer and converts it into a 1D vector.
- The layers.Dropout(0.5) layer is optionally added for regularization (we'll cover this in Chapter 6).
- The layers.Dense(10, activation="softmax") layer performs the classification, outputting probabilities for each of the 10 classes.

The combination of the convolutional base for feature extraction and the Flatten plus Dense layers for classification forms the standard architecture for many successful CNNs used in image recognition and other domains.
A Note on Alternatives: While Flatten followed by Dense is common, other techniques like GlobalAveragePooling2D or GlobalMaxPooling2D exist. These layers also bridge the gap between convolutional maps and the final output, often reducing the number of parameters and potentially improving generalization. They work by taking the average or maximum value across the spatial dimensions (height, width) of each feature map, resulting in a single value per channel, thus creating a 1D vector directly. We focus on Flatten here as it's a fundamental concept, but be aware of these alternatives for more advanced architectures.
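The pooling alternatives can also be sketched in NumPy: averaging (or taking the maximum) over the two spatial axes collapses each feature map to one number per channel. The (7, 7, 64) shape is again just the running example.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.random((7, 7, 64))

# GlobalAveragePooling2D: average over height and width (axes 0 and 1),
# leaving one value per channel -- a length-64 vector instead of 3136.
avg_pooled = feature_maps.mean(axis=(0, 1))

# GlobalMaxPooling2D: same idea, but taking the maximum per channel.
max_pooled = feature_maps.max(axis=(0, 1))

print(avg_pooled.shape)  # (64,)
print(max_pooled.shape)  # (64,)
```

Feeding a 64-element vector into the Dense head instead of a 3136-element one shrinks the first Dense layer's weight matrix by a factor of 49, which is where the parameter savings mentioned above come from.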