In the previous sections, you learned how to stack layers using Keras's Sequential and Functional APIs. However, simply stacking layers like Dense often isn't enough. If we only used layers that perform linear operations (like matrix multiplication followed by adding a bias), stacking them would still result in a linear function overall. A composition of linear functions is just another linear function. To model complex, real-world patterns, neural networks need to introduce non-linearity. This is where activation functions come into play.
An activation function is applied element-wise to the output of a layer (often referred to as the pre-activation or logits), transforming it before it's passed to the next layer. This non-linear transformation allows the network to learn much more complex mappings between inputs and outputs.
Consider a simple network with only linear layers. Each layer computes $\text{output} = W \cdot \text{input} + b$, where $W$ is the weight matrix and $b$ is the bias vector. If you stack two such layers, the output becomes:

$$\text{output}_2 = W_2 \cdot (W_1 \cdot \text{input} + b_1) + b_2 = (W_2 W_1) \cdot \text{input} + (W_2 b_1 + b_2)$$

This is still in the form $W' \cdot \text{input} + b'$, which is a linear transformation. No matter how many linear layers you stack, the network can only represent linear relationships. Activation functions break this linearity, enabling networks to approximate arbitrarily complex functions.
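To make this concrete, here is a minimal NumPy sketch (the matrix shapes and random values are arbitrary, chosen only for illustration) showing that two stacked linear layers compute exactly the same function as a single linear layer with combined weights:

import numpy as np

# Two "linear layers": y = W x + b (shapes and values are arbitrary)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2          # output of layer 2 applied to layer 1's output

# The equivalent single linear layer: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

print(np.allclose(two_layers, one_layer))     # True: stacking added no expressive power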
Keras provides several built-in activation functions. Let's look at some of the most frequently used ones.
The Rectified Linear Unit, or ReLU, is one of the most popular activation functions in deep learning, especially for hidden layers. It's computationally efficient and generally performs well.
Its definition is simple: it returns the input directly if the input is positive, and returns zero otherwise.
$$f(x) = \max(0, x)$$

ReLU is often the default choice for hidden layers in feedforward and convolutional neural networks.
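As a quick check (the input values below are arbitrary), you can apply ReLU directly to a tensor and see negative entries clipped to zero:

import tensorflow as tf

x = tf.constant([-3.0, -1.0, 0.0, 2.0, 5.0])
print(tf.nn.relu(x).numpy())   # [0. 0. 0. 2. 5.] -- negatives become zero, positives pass through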
The sigmoid function squashes its input into the range (0, 1).
$$f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$$

Sigmoid is primarily used in the output layer of a binary classification model, where the output needs to be interpreted as a probability. It's less common in hidden layers nowadays due to the prevalence of ReLU and its variants.
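For example (using arbitrary logit values), sigmoid maps any real number into (0, 1), which is why it suits producing a single probability:

import tensorflow as tf

logits = tf.constant([-4.0, 0.0, 4.0])
print(tf.nn.sigmoid(logits).numpy())   # approx [0.018, 0.5, 0.982] -- all values lie in (0, 1)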
The softmax function is a generalization of the sigmoid function used for multi-class classification problems. It takes a vector of arbitrary real-valued scores (logits) as input and transforms them into a vector of values between 0 and 1 that sum to 1. These outputs can be interpreted as probabilities for each class.
For an input vector $x = [x_1, x_2, \dots, x_N]$, the softmax output for the $i$-th element is:

$$f(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$

Softmax is the standard activation function for the final layer in a multi-class classification network.
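A small sketch (with made-up logits) shows the two defining properties: each output lies between 0 and 1, and the outputs sum to 1:

import tensorflow as tf

logits = tf.constant([2.0, 1.0, 0.1])
probs = tf.nn.softmax(logits)
print(probs.numpy())                    # approx [0.659, 0.242, 0.099]
print(tf.reduce_sum(probs).numpy())     # ~1.0 -- a valid probability distribution over classes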
The hyperbolic tangent, or tanh, function squashes its input into the range (-1, 1).
$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Tanh was previously popular for hidden layers but has largely been replaced by ReLU and its variants. It is still sometimes used, particularly in recurrent neural networks (RNNs), though modern RNN architectures often use other gating mechanisms.
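As with the others, you can evaluate tanh directly (the inputs here are illustrative); note the zero-centered output bounded in (-1, 1):

import tensorflow as tf

x = tf.constant([-3.0, 0.0, 3.0])
print(tf.nn.tanh(x).numpy())   # approx [-0.995, 0.0, 0.995] -- zero-centered and bounded in (-1, 1)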
Visualization of ReLU, Sigmoid, and Tanh activation functions. Note their different output ranges and shapes, especially around x=0.
The choice of activation function depends on the layer's position and the specific task: ReLU is the usual default for hidden layers, sigmoid suits the output layer of a binary classifier, softmax suits the output layer of a multi-class classifier, and a linear (identity) activation is typical for regression outputs.
You can specify the activation function for most Keras layers using the activation argument:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Using the activation argument within a Dense layer
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(784,)),  # ReLU for hidden layer
    layers.Dense(10, activation='softmax')                    # Softmax for output layer (multi-class)
])

# Equivalent using an Activation layer explicitly
model_explicit = keras.Sequential([
    layers.Dense(64, input_shape=(784,)),
    layers.Activation('relu'),      # Apply ReLU separately
    layers.Dense(10),
    layers.Activation('softmax')    # Apply Softmax separately
])

# You can also pass the function object directly
model_object = keras.Sequential([
    layers.Dense(64, activation=tf.nn.relu, input_shape=(784,)),
    layers.Dense(10, activation=tf.nn.softmax)
])

model.summary()
Keras recognizes activation functions by their string names (e.g., 'relu', 'sigmoid', 'softmax', 'tanh', 'linear'). Using the activation argument is the most common and concise way. The separate layers.Activation layer provides flexibility, especially when using the Functional API or custom architectures where you might want to apply an activation independently.
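As a sketch of that flexibility (the layer sizes mirror the Sequential example above and are otherwise arbitrary), here is how an activation can be applied as its own step in the Functional API:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(784,))
x = layers.Dense(64)(inputs)               # linear pre-activation (logits) from the hidden layer
x = layers.Activation('relu')(x)           # non-linearity applied as a separate layer
outputs = layers.Dense(10, activation='softmax')(x)

functional_model = keras.Model(inputs=inputs, outputs=outputs)
functional_model.summary()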
Understanding activation functions is fundamental to building effective neural networks. They are the key components that introduce the necessary non-linearity, allowing models to learn complex patterns beyond simple linear relationships. As you build more sophisticated models using the Functional API or custom layers, you'll see how strategically placing these non-linear transformations enables powerful computations.