In Reinforcement Learning, the learning process revolves around the interaction between two primary components: the agent and the environment. Think of the agent as the learner or decision-maker, and the environment as the system it interacts with, encompassing everything outside the agent.
The agent is the entity we are training. Its goal is typically to maximize some notion of cumulative reward over time. It perceives the environment's current situation, referred to as its state, and based on this state, it selects an action to perform.
Examples of agents include a program learning to play chess, a robot learning to navigate a warehouse, and a controller learning to regulate the temperature of a room.
The agent's internal mechanism for choosing actions based on states is called its policy, which we will discuss in more detail later. For now, understand that the agent is the core learning component.
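To make the idea of an agent and its policy concrete, here is a minimal sketch in Python. The `Agent` class, its dictionary-based policy, and the random fallback are illustrative assumptions for this section, not part of any particular library.

```python
import random

class Agent:
    """A minimal agent: it maps states to actions through a simple policy."""

    def __init__(self, actions):
        self.actions = actions   # the set of actions available to the agent
        self.policy = {}         # illustrative: a lookup table from state to preferred action

    def select_action(self, state):
        """Choose an action for the given state.

        If the policy has no entry for this state yet, fall back to a
        random (exploratory) choice.
        """
        return self.policy.get(state, random.choice(self.actions))

    def update_policy(self, state, action):
        """Record a preferred action for a state (how learning happens is out of scope here)."""
        self.policy[state] = action
```

The only thing this class does is turn states into actions; everything else, including how states change and how rewards arise, belongs to the environment.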
The environment represents everything the agent interacts with. It receives the agent's chosen action and responds by transitioning to a new state and providing a numerical reward signal. The environment defines the "rules of the game" or the physics of the world the agent operates within.
Following the examples above, the corresponding environments would be the chess board together with the rules of the game and the opponent, the physical warehouse with its layout and obstacles, and the room whose temperature responds to heating and cooling actions.
The environment is responsible for receiving the agent's chosen action, determining the resulting state transition, and emitting the reward signal that evaluates the outcome of that action.
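As a sketch of these responsibilities, the toy environment below is a one-dimensional corridor: the state is the agent's position, actions move left or right, and the reward is computed by the environment when the goal cell is reached. The `GridCorridor` class, its size, and its reward scheme are illustrative assumptions.

```python
class GridCorridor:
    """A toy environment: positions 0..size-1, the goal is the rightmost cell."""

    def __init__(self, size=5):
        self.size = size
        self.state = 0          # the agent always starts at position 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply the agent's action ('left' or 'right') and return (next_state, reward, done)."""
        if action == "right":
            self.state = min(self.state + 1, self.size - 1)
        elif action == "left":
            self.state = max(self.state - 1, 0)

        done = self.state == self.size - 1   # the episode ends at the goal cell
        reward = 1.0 if done else 0.0        # the environment, not the agent, computes the reward
        return self.state, reward, done
```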
The core of RL is the continuous loop of interaction between the agent and the environment. At each discrete time step, denoted by t, the following sequence occurs:
1. The agent observes the environment's current state.
2. Based on this state, the agent selects an action according to its policy.
3. The environment receives the action, transitions to a new state, and computes a reward.
4. The agent observes the new state and the reward, and the next time step begins.
This loop repeats, allowing the agent to learn through trial and error, associating actions in particular states with the rewards they tend to produce.
Figure: The fundamental interaction loop in Reinforcement Learning. The agent takes actions, and the environment responds with new states and rewards.
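Putting the two sketches together, the loop below runs a single episode of interaction. It assumes the hypothetical `Agent` and `GridCorridor` classes defined earlier in this section.

```python
env = GridCorridor(size=5)
agent = Agent(actions=["left", "right"])

state = env.reset()
done = False
total_reward = 0.0

# The agent-environment loop: observe, act, receive new state and reward, repeat.
while not done:
    action = agent.select_action(state)           # agent chooses an action from the current state
    next_state, reward, done = env.step(action)   # environment transitions and emits a reward
    total_reward += reward                        # cumulative reward the agent tries to maximize
    state = next_state                            # the new state becomes the current state

print(f"Episode finished with total reward {total_reward}")
```

Every RL algorithm, however sophisticated, ultimately runs some version of this loop; what differs is how the agent updates its policy from the states and rewards it observes.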
It's important to establish a clear boundary between the agent and the environment. The boundary is typically drawn at the edge of what the agent can directly control. For example, in robotics, the agent might control the voltages sent to the robot's motors, but the physics of how those voltages translate into movement, the friction involved, and the resulting sensor readings are all part of the environment. The agent cannot change the laws of physics; it can only choose actions within the constraints imposed by the environment. Similarly, the reward computation mechanism is considered part of the environment, not the agent. The agent's objective is to maximize the rewards generated by the environment.
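The same boundary appears in widely used RL toolkits. As an illustration, the sketch below uses the Gymnasium package (assumed to be installed) and its CartPole-v1 task: the agent side only picks actions, while the physics and the reward computation live entirely inside the environment's step function. A random policy stands in for a real agent here.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()   # agent side: only chooses an action
    # environment side: physics, the next observation, and the reward all come from step()
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

env.close()
print(f"Total reward collected: {total_reward}")
```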
Understanding this separation is fundamental. The agent learns a policy to interact optimally with the dynamics and reward structure defined by the environment. In the subsequent sections, we will formalize these concepts further, starting with states, actions, and rewards.