After defining the structure of layers like Dense, the next important consideration is how information flows and transforms between these layers. If we simply stacked Dense layers without any modification to their outputs, the entire network, no matter how deep, would behave like a single, large linear transformation. This is because composing linear functions results in another linear function. To model complex, non-linear relationships present in most real-world data (like images, text, or structured datasets), we need to introduce non-linearity into our network. This is precisely the role of activation functions.
An activation function takes the output signal from the preceding layer (often a weighted sum of inputs plus a bias) and applies a fixed mathematical operation to it, element-wise for most common activations. This transformed output then becomes the input for the next layer.
Let's look at some of the most frequently used activation functions and how to implement them in Keras.
Imagine trying to separate two classes of data points that are not linearly separable (e.g., points forming concentric circles). A simple linear model (y=wx+b) can only draw straight lines (or hyperplanes in higher dimensions). No matter how many linear layers you stack, you'll still only be able to produce a linear decision boundary. Activation functions break this linearity, allowing networks to learn complex curves and shapes, effectively warping the data representation space at each layer to make patterns more separable.
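To make this point concrete, here is a minimal sketch (assuming standard Keras and NumPy imports; the variable names are just illustrative) showing that two stacked Dense layers without activations collapse into a single linear map:

import numpy as np
import keras
from keras import layers

# Two stacked Dense layers with no activation, i.e., purely linear
linear_stack = keras.Sequential([
    layers.Dense(8, activation=None),
    layers.Dense(1, activation=None),
])
linear_stack.build(input_shape=(None, 4))

# Compose the two linear maps into one:
# (x @ W1 + b1) @ W2 + b2  ==  x @ (W1 @ W2) + (b1 @ W2 + b2)
W1, b1 = linear_stack.layers[0].get_weights()
W2, b2 = linear_stack.layers[1].get_weights()
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2

x = np.random.rand(5, 4).astype("float32")
print(np.allclose(linear_stack.predict(x, verbose=0),
                  x @ W_combined + b_combined, atol=1e-5))  # True

Adding a non-linear activation between the two Dense layers breaks this equivalence, which is exactly what gives the network its extra expressive power.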
The sigmoid function squashes its input into a range between 0 and 1. Its mathematical form is:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Historically popular, its output range (0 to 1) makes it suitable for the output layer of a binary classification model, where the output can be interpreted as a probability. However, it has fallen out of favor for hidden layers due to a couple of drawbacks:

- Vanishing gradients: for inputs far from zero the function saturates, so its gradient approaches zero and learning in earlier layers slows or stalls.
- Non-zero-centered outputs: because all outputs are positive, gradient updates for a layer's weights tend to share the same sign, which can make optimization less efficient.
In Keras, you can use it directly within a layer:
import keras
from keras import layers
# Example usage in a Dense layer
output_layer = layers.Dense(1, activation='sigmoid')
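As a quick illustration (assuming Keras 3, where the keras.ops module exposes these functions), you can apply sigmoid directly to a few sample values:

from keras import ops

# Sigmoid maps any real number into the open interval (0, 1)
logits = ops.convert_to_tensor([-4.0, -1.0, 0.0, 1.0, 4.0])
print(ops.sigmoid(logits))  # approximately [0.018, 0.269, 0.5, 0.731, 0.982]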
The hyperbolic tangent function, or tanh, is closely related to the sigmoid but squashes its input into the range between -1 and 1.
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\sigma(2x) - 1$$

Being zero-centered (its outputs range from -1 to 1) often helps with model convergence, which made tanh the preferred choice over sigmoid for hidden layers in the past. However, it still suffers from the vanishing gradient problem for large positive or negative inputs.
Usage in Keras:
# Example usage in a Dense layer
hidden_layer = layers.Dense(64, activation='tanh')
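If you want to verify the relationship to sigmoid numerically (again assuming Keras 3's keras.ops module), a quick check:

from keras import ops

x = ops.convert_to_tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(ops.tanh(x))                       # zero-centered outputs in (-1, 1)
print(2.0 * ops.sigmoid(2.0 * x) - 1.0)  # matches ops.tanh(x)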
The Rectified Linear Unit, or ReLU, has become the default activation function for hidden layers in most deep learning applications. It's simple yet remarkably effective.
$$\text{ReLU}(x) = \max(0, x)$$

It outputs the input directly if it is positive, and zero otherwise.
Advantages:

- Computational efficiency: it requires only a simple comparison and thresholding, with no exponentials.
- No saturation for positive inputs: the gradient is 1 whenever the input is positive, which mitigates the vanishing gradient problem and typically speeds up training.

Disadvantage:

- The "dying ReLU" problem: a unit that only receives negative inputs outputs zero and has a zero gradient, so it may stop updating entirely during training.
Usage in Keras:
# Example usage in a Dense layer (most common for hidden layers)
hidden_layer = layers.Dense(128, activation='relu')
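A quick element-wise check (assuming Keras 3's keras.ops module) shows the thresholding behavior, including the zero output for negative inputs that underlies the dying ReLU issue:

from keras import ops

x = ops.convert_to_tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
print(ops.relu(x))  # [0., 0., 0., 0.5, 3.]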
The Softmax function is typically used in the output layer of a multi-class classification network. Unlike the previous functions that operate element-wise independently, Softmax operates on the entire vector of outputs from the final layer. It converts a vector of raw scores (logits) into a probability distribution, where each element is between 0 and 1, and all elements sum up to 1.
For a vector of inputs x, the Softmax output for the i-th element is:
$$\text{Softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

where the sum in the denominator is over all elements in the input vector x. This ensures the outputs represent probabilities for each class.
Usage in Keras (typically on the final layer for multi-class classification):
# Example for a 10-class classification problem
output_layer = layers.Dense(10, activation='softmax')
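A small numerical check (assuming Keras 3's keras.ops module and made-up logits) shows how softmax turns raw scores into a probability distribution:

from keras import ops

logits = ops.convert_to_tensor([[2.0, 1.0, 0.1]])  # raw scores for 3 classes
probs = ops.softmax(logits, axis=-1)
print(probs)           # approximately [[0.659, 0.242, 0.099]]
print(ops.sum(probs))  # 1.0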
The following chart illustrates the shapes of the Sigmoid, Tanh, and ReLU functions:
Comparison of Sigmoid (blue), Tanh (green), and ReLU (red) activation functions across a range of input values. Note the different output ranges and shapes. Softmax is not shown as it operates on a vector rather than a single scalar input.
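If you want to reproduce a plot like the one described above, a minimal sketch using NumPy and Matplotlib (both assumed to be installed) is:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)
plt.plot(x, 1 / (1 + np.exp(-x)), "b", label="Sigmoid")
plt.plot(x, np.tanh(x), "g", label="Tanh")
plt.plot(x, np.maximum(0, x), "r", label="ReLU")
plt.xlabel("Input")
plt.ylabel("Activation output")
plt.legend()
plt.show()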
Here are some general guidelines:

- Hidden layers: start with ReLU; it is the standard default for most feedforward and convolutional networks.
- Output layer, binary classification: use sigmoid with a single unit, interpreted as a probability.
- Output layer, multi-class classification: use softmax with one unit per class.
- Output layer, regression: use no activation (a linear output), so predictions are not squashed into a fixed range.

These choices are illustrated in the short sketch below.
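As a compact sketch (the layer sizes here are arbitrary), the output-layer choices above translate to Keras as follows:

# Output-layer configurations for common tasks (sizes are illustrative)
binary_output = layers.Dense(1, activation='sigmoid')       # binary classification
multiclass_output = layers.Dense(10, activation='softmax')  # 10-class classification
regression_output = layers.Dense(1, activation=None)        # regression (linear output)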
As seen in the examples, the most common way to add an activation function is via the activation argument within a layer definition:
model = keras.Sequential([
layers.Dense(128, activation='relu', input_shape=(784,)), # First hidden layer with ReLU (declares the 784-feature input shape)
layers.Dense(64, activation='relu'), # Hidden layer with ReLU
layers.Dense(10, activation='softmax') # Output layer for 10 classes
])
Alternatively, especially when using the Functional API, or if you want to apply an activation after a layer that does not take an activation argument (as is sometimes done after batch normalization), you can use the Activation layer:
from keras.layers import Activation
# Functional API example snippet
inputs = keras.Input(shape=(784,))
x = layers.Dense(128)(inputs)
x = Activation('relu')(x) # Apply ReLU using the Activation layer
x = layers.Dense(64)(x)
x = Activation('relu')(x)
outputs = layers.Dense(10, activation='softmax')(x) # Use argument here
model = keras.Model(inputs=inputs, outputs=outputs)
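Beyond string identifiers and the Activation layer, Keras also accepts callables from keras.activations and provides dedicated activation layers such as layers.ReLU. The following lines are equivalent ways to request ReLU:

# Equivalent ways to specify ReLU on a Dense layer
layers.Dense(64, activation='relu')                  # string identifier
layers.Dense(64, activation=keras.activations.relu)  # callable from keras.activations
# Or place a dedicated layers.ReLU() layer after the Dense layer,
# in place of Activation('relu')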
Understanding activation functions is fundamental to building effective neural networks. They are the source of non-linearity that enables networks to learn complex mappings from inputs to outputs. Choosing the right activation function, especially for the output layer, depends heavily on the specific task (classification, regression) you are addressing. In the next sections, we will continue assembling these building blocks to construct and examine complete Keras models.