When constructing neural networks from layers such as Dense layers, an important consideration is how information flows and transforms between them. If these Dense layers were simply stacked without any modification to their outputs, the entire network, regardless of its depth, would behave as a single, large linear transformation, because composing linear functions yields another linear function. To model the complex, non-linear relationships present in most data (images, text, or structured datasets), the network needs a source of non-linearity. This is precisely the role of activation functions.
An activation function takes the output signal from the preceding layer (often a weighted sum of inputs plus a bias) and applies a fixed mathematical operation to it, element-wise for most common activations. This transformed output then becomes the input for the next layer.
Let's look at some of the most frequently used activation functions and how to implement them in Keras.
Imagine trying to separate two classes of data points that are not linearly separable (e.g., points forming concentric circles). A simple linear model (y=wx+b) can only draw straight lines (or hyperplanes in higher dimensions). No matter how many linear layers you stack, you'll still only be able to produce a linear decision boundary. Activation functions break this linearity, allowing networks to learn complex curves and shapes, effectively warping the data representation space at each layer to make patterns more separable.
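To make this concrete, here is a minimal NumPy sketch (purely illustrative, separate from the Keras examples below) showing that two stacked linear transformations collapse into a single one:
import numpy as np
# Two "layers" as plain linear transforms: y = W x + b
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)
# Applying the two linear layers in sequence...
two_layers = W2 @ (W1 @ x + b1) + b2
# ...matches a single linear layer with weights W2 @ W1 and bias W2 @ b1 + b2
single_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, single_layer))  # True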
The sigmoid function squashes its input into a range between 0 and 1. Its mathematical form is:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Historically popular, its output range (0 to 1) makes it suitable for the output layer of a binary classification model, where the output can be interpreted as a probability. However, it has fallen out of favor for hidden layers due to a couple of drawbacks:
- Vanishing gradients: the function saturates for inputs of large magnitude, so its gradient approaches zero and learning in earlier layers slows dramatically.
- Non-zero-centered outputs: all outputs are positive, which can make gradient updates less efficient during training.
In Keras, you can use it directly within a layer:
import keras
from keras import layers
# Example usage in a Dense layer
output_layer = layers.Dense(1, activation='sigmoid')
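If you want to see the formula itself in action, a quick NumPy check (illustrative only, separate from the Keras layer) confirms the (0, 1) output range:
import numpy as np
# Sigmoid applied element-wise; outputs always fall strictly between 0 and 1
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # approx [0.018, 0.5, 0.982]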
The hyperbolic tangent function, or tanh, is closely related to the sigmoid but squashes its input into the range between -1 and 1.
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\sigma(2x) - 1$$
Being zero-centered (its outputs range from -1 to 1) often helps with model convergence compared to sigmoid, which made tanh the preferred choice for hidden layers in the past. However, it still suffers from the vanishing gradient problem for large positive or negative inputs.
Usage in Keras:
# Example usage in a Dense layer
hidden_layer = layers.Dense(64, activation='tanh')
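As a quick sanity check of the identity above, the following NumPy snippet (illustrative only) verifies that tanh(x) equals 2σ(2x) − 1:
import numpy as np
# Compare tanh against the sigmoid-based form on a few sample points
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True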
The Rectified Linear Unit, or ReLU, has become the default activation function for hidden layers in most deep learning applications. It's simple yet remarkably effective.
$$\mathrm{ReLU}(x) = \max(0, x)$$
It outputs the input directly if it is positive, and zero otherwise.
Advantages:
- Computationally cheap: it only requires a comparison with zero, no exponentials.
- It does not saturate for positive inputs, which mitigates the vanishing gradient problem and usually speeds up training.
- It produces sparse activations, since negative inputs are mapped exactly to zero.
Disadvantage:
- "Dying ReLU": a unit whose inputs are consistently negative always outputs zero, receives zero gradient, and may stop learning entirely.
Usage in Keras:
# Example usage in a Dense layer (most common for hidden layers)
hidden_layer = layers.Dense(128, activation='relu')
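The element-wise behavior is easy to verify directly; a small NumPy sketch (not needed for Keras usage) shows negative values being zeroed out while positive values pass through unchanged:
import numpy as np
# ReLU: max(0, x) applied element-wise
x = np.array([-3.0, -0.5, 0.0, 2.0, 5.0])
print(np.maximum(0, x))  # [0. 0. 0. 2. 5.]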
The Softmax function is typically used in the output layer of a multi-class classification network. Unlike the previous functions that operate element-wise independently, Softmax operates on the entire vector of outputs from the final layer. It converts a vector of raw scores (logits) into a probability distribution, where each element is between 0 and 1, and all elements sum up to 1.
For a vector of inputs x, the Softmax output for the i-th element is:
$$\mathrm{Softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
The sum in the denominator runs over all elements of the input vector x, which ensures the outputs represent probabilities for each class.
Usage in Keras (typically on the final layer for multi-class classification):
# Example for a 10-class classification problem
output_layer = layers.Dense(10, activation='softmax')
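To see what softmax does to a concrete vector of raw scores, here is a small NumPy sketch (illustrative only); the outputs form a valid probability distribution:
import numpy as np
# Softmax converts logits into probabilities that sum to 1
logits = np.array([2.0, 1.0, 0.1])
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs)        # approx [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0 (up to floating-point rounding)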
The following chart illustrates the shapes of the Sigmoid, Tanh, and ReLU functions:
Comparison of Sigmoid (blue), Tanh (green), and ReLU (red) activation functions across a range of input values. Note the different output ranges and shapes. Softmax is not shown as it operates on a vector rather than a single scalar input.
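If you would like to reproduce a similar comparison plot yourself, a short sketch along these lines works, assuming NumPy and Matplotlib are available (neither is required by Keras itself):
import numpy as np
import matplotlib.pyplot as plt
# Plot the three scalar activations over the same input range
x = np.linspace(-5, 5, 200)
plt.plot(x, 1 / (1 + np.exp(-x)), color='blue', label='Sigmoid')
plt.plot(x, np.tanh(x), color='green', label='Tanh')
plt.plot(x, np.maximum(0, x), color='red', label='ReLU')
plt.xlabel('Input')
plt.ylabel('Activation output')
plt.legend()
plt.show()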
Here are some general guidelines:
- Hidden layers: start with ReLU; it is the standard default for most architectures.
- Binary classification output: a single unit with sigmoid, interpreted as a probability.
- Multi-class classification output: one unit per class with softmax.
- Regression output: no activation (a linear output), so the network can predict any real value.
- Sigmoid and tanh in hidden layers are generally avoided in deep networks because of the vanishing gradient problem.
As seen in the examples, the most common way to add an activation function is via the activation argument within a layer definition:
model = keras.Sequential([
layers.Dense(128, activation='relu', input_shape=(784,)), # First hidden layer with ReLU; input_shape declares the expected input size
layers.Dense(64, activation='relu'), # Hidden layer with ReLU
layers.Dense(10, activation='softmax') # Output layer for 10 classes
])
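As a follow-up beyond the snippet above, a model ending in a softmax layer is usually compiled with a matching cross-entropy loss. The sketch below is one common choice, assuming integer class labels and the Adam optimizer:
# One typical compile step for a 10-class softmax output
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',  # assumes integer class labels
    metrics=['accuracy'],
)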
Alternatively, especially when using the Functional API, or when you want to apply an activation after a layer that does not take an activation argument (for example, after a BatchNormalization layer), you can use the Activation layer:
from keras.layers import Activation
# Functional API example snippet
inputs = keras.Input(shape=(784,))
x = layers.Dense(128)(inputs)
x = Activation('relu')(x) # Apply ReLU using the Activation layer
x = layers.Dense(64)(x)
x = Activation('relu')(x)
outputs = layers.Dense(10, activation='softmax')(x) # Use argument here
model = keras.Model(inputs=inputs, outputs=outputs)
Understanding activation functions is fundamental to building effective neural networks. They are the source of non-linearity that enables networks to learn complex mappings from inputs to outputs. Choosing the right activation function, especially for the output layer, depends heavily on the specific task (classification, regression) you are addressing. In the next sections, we will continue assembling these building blocks to construct and examine complete Keras models.