Now that we understand the core mechanisms of DQN, including experience replay and target networks, let's consider how to actually build the neural network that approximates the action-value function, Q(s,a;θ). The design of this network is highly dependent on the nature of the state representation provided by the environment.
The first step is determining the input layer's structure. What does the agent "see"?
Vector-Based States: If the state is represented as a vector of numerical features (e.g., positions, velocities, sensor readings), a standard Multilayer Perceptron (MLP), also known as a fully connected network, is often sufficient. The input layer will have a number of neurons equal to the number of features in the state vector.
Image-Based States: For environments where the state is represented by images (like pixels from a screen), Convolutional Neural Networks (CNNs) are the standard choice. CNNs are adept at recognizing spatial hierarchies and patterns in grid-like data. The input layer will typically accept the dimensions of the image (height, width, color channels). Often, multiple consecutive frames are stacked together as input to provide the network with temporal information, like velocity or direction of movement.
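To make the frame-stacking idea concrete, here is a minimal sketch using NumPy and a deque. The 84x84 grayscale frames and a stack depth of four are illustrative assumptions (matching the classic Atari setup), not requirements of the method.

```python
from collections import deque

import numpy as np

# Illustrative choices: 84x84 grayscale frames, a stack of 4 consecutive frames.
STACK_SIZE = 4
frames = deque(maxlen=STACK_SIZE)

def reset_stack(first_frame: np.ndarray) -> np.ndarray:
    """At the start of an episode, fill the stack with copies of the first frame."""
    frames.clear()
    for _ in range(STACK_SIZE):
        frames.append(first_frame)
    return np.stack(frames, axis=0)  # shape: (4, 84, 84), channels-first

def push_frame(new_frame: np.ndarray) -> np.ndarray:
    """Append the newest frame; the oldest frame is dropped automatically by the deque."""
    frames.append(new_frame)
    return np.stack(frames, axis=0)  # the stacked array becomes the network input
```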
Following the input layer, one or more hidden layers process the information.
MLPs: Hidden layers in an MLP are typically fully connected layers. Each neuron in a layer receives input from all neurons in the previous layer. The number of layers and the number of neurons in each layer are hyperparameters you need to choose. Start simple (e.g., one or two hidden layers with a moderate number of neurons) and increase complexity if needed. The Rectified Linear Unit (ReLU) activation function (f(x)=max(0,x)) is a common and effective choice for hidden layers in DQNs.
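Here is a minimal PyTorch sketch of such an MLP Q-network. The state dimension, hidden width, and number of actions are placeholder values you would replace for your own environment; the output layer (one linear unit per action) is included so the sketch is complete and is discussed in more detail below.

```python
import torch
import torch.nn as nn

class MlpQNetwork(nn.Module):
    """Fully connected Q-network for vector-valued states."""

    def __init__(self, state_dim: int, n_actions: int, hidden_size: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),  # one Q-value per action, no activation
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: a 4-dimensional state vector with 2 discrete actions (placeholder sizes).
q_net = MlpQNetwork(state_dim=4, n_actions=2)
q_values = q_net(torch.randn(1, 4))  # shape: (1, 2)
```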
CNNs: For image inputs, the initial hidden layers are typically convolutional layers followed by pooling layers. Convolutional layers apply filters to detect local patterns (edges, textures), while pooling layers reduce the spatial dimensions, making the representation more manageable and invariant to small translations. After several convolutional and pooling layers, the resulting feature maps are usually flattened into a vector and fed into one or more fully connected layers, similar to an MLP. Again, ReLU is a standard activation function for these layers. The famous DQN paper that mastered Atari games used a CNN architecture consisting of several convolutional layers followed by fully connected layers.
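Below is a PyTorch sketch in the spirit of that architecture, assuming a stack of four 84x84 frames as input. The filter sizes and strides mirror the Atari DQN network, which relied on strided convolutions rather than explicit pooling layers; treat these values as a starting point rather than a fixed recipe.

```python
import torch
import torch.nn as nn

class CnnQNetwork(nn.Module):
    """Convolutional Q-network for stacked image states (assumes 4 frames of 84x84 pixels)."""

    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),  # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),           # 20x20 -> 9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),           # 9x9 -> 7x7
            nn.ReLU(),
            nn.Flatten(),                                          # 64 * 7 * 7 = 3136 features
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(frames))

# Example: one stacked observation (4 frames of 84x84 pixels) and 6 discrete actions.
q_net = CnnQNetwork(n_actions=6)
q_values = q_net(torch.randn(1, 4, 84, 84))  # shape: (1, 6)
```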
The final layer of the network is crucial: it must output the estimated Q-values. For a discrete action space, the standard design is a fully connected output layer with one neuron per action and no activation function (a linear output), because Q-values are unbounded real numbers rather than probabilities. A single forward pass then produces the Q-value of every action for the given state, so greedy action selection reduces to an argmax over the outputs.
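As a brief usage note, the snippet below (reusing the hypothetical MlpQNetwork sketch above) shows how greedy action selection follows directly from this output layout:

```python
import torch

# Reusing the hypothetical MlpQNetwork sketch above: 4-dimensional state, 2 actions.
q_net = MlpQNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)  # a single state, batched

with torch.no_grad():                # no gradients needed when selecting actions
    q_values = q_net(state)          # shape: (1, 2), one Q-value per action
    greedy_action = int(q_values.argmax(dim=1).item())
```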
Here are conceptual diagrams illustrating common structures:
1. MLP for Vector States:
An MLP takes a state vector s = (s₁, ..., sₙ) and passes it through fully connected (FC) hidden layers (often with ReLU activations) to produce Q-value estimates for each action a₁, ..., aₘ.
2. CNN for Image States (Simplified):
A CNN processes input image(s) through convolutional and pooling layers to extract spatial features. These features are then flattened and passed through fully connected layers to output Q-values for each action.
Choosing the right architecture is a significant part of applying DQN successfully. By considering the nature of your state space and leveraging common network design patterns, you can build effective function approximators for your reinforcement learning agents.