
Reinforcement Learning

Rollout

Definition: In Reinforcement Learning, a Rollout refers to the process where an Agent interacts with an Environment for a sequence of time steps by following a specific Policy ($\pi$). It is the act of "running" the agent to see what happens, effectively generating a sample of experience.

1. The Rollout Process

A rollout starts from a given state, written $s_0$ below. The agent then performs a series of actions until a fixed time horizon is reached or a terminal state (e.g., "Game Over") occurs. The resulting sequence of events is called a Trajectory ($\tau$): $$\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \dots, s_T)$$
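As a minimal sketch, the rollout loop can be written as below. The `rollout` helper and the toy `env_step` corridor environment are illustrative assumptions, not from any particular library:

```python
import random

def rollout(policy, env_step, s0, horizon=100):
    """Run one rollout from state s0 under `policy`, stopping at the
    horizon or at a terminal state. Returns the trajectory as a list
    of (state, action, reward) steps, plus the final state."""
    trajectory, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        s_next, r, done = env_step(s, a)
        trajectory.append((s, a, r))
        s = s_next
        if done:  # terminal state (e.g., "Game Over") reached early
            break
    return trajectory, s

# Toy corridor: states 0..4, actions -1/+1, reward 1 for reaching state 4.
def env_step(s, a):
    s_next = max(0, min(4, s + a))
    return s_next, (1.0 if s_next == 4 else 0.0), s_next == 4

traj, final_state = rollout(lambda s: random.choice([-1, 1]), env_step, s0=0)
```

Each entry of `traj` is one $(s_t, a_t, r_{t+1})$ step of the trajectory $\tau$.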

2. Why Rollouts Matter

  • Data Collection (On-Policy): Algorithms like PPO (Proximal Policy Optimization) perform rollouts to collect fresh data. The agent plays a few rounds, uses that data to update its brain (policy), and then discards the old data to perform new rollouts.
  • Evaluation & Estimation: By performing multiple rollouts from the same state, we can calculate the Empirical Return (the sum of rewards). This average helps estimate the Value Function $V(s)$—essentially answering, "How good is this position in the long run?"
  • Lookahead Search: In Monte Carlo Tree Search (MCTS)—famously used in AlphaGo—rollouts (also called "simulations") are used to play out a game to the end very quickly to see which move leads to a win.
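The second bullet, Monte Carlo value estimation, can be sketched in a few lines. The corridor environment and helper names here are hypothetical, chosen only to keep the example self-contained:

```python
import random

# Toy corridor: states 0..4, actions -1/+1, reward 1 for reaching state 4.
def env_step(s, a):
    s_next = max(0, min(4, s + a))
    return s_next, (1.0 if s_next == 4 else 0.0), s_next == 4

def episode_return(policy, s0, gamma=0.99, horizon=100):
    """One rollout from s0; returns the discounted sum of rewards."""
    G, discount, s = 0.0, 1.0, s0
    for _ in range(horizon):
        s, r, done = env_step(s, policy(s))
        G += discount * r
        discount *= gamma
        if done:
            break
    return G

def estimate_value(policy, s0, n_rollouts=1000, gamma=0.99):
    """Empirical estimate of V(s0): the average return over many
    rollouts started from the same state s0."""
    return sum(episode_return(policy, s0, gamma)
               for _ in range(n_rollouts)) / n_rollouts
```

By the law of large numbers, `estimate_value` converges to $V^\pi(s_0)$ as `n_rollouts` grows.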

3. Terminology

| Term | Description |
| --- | --- |
| Step | A single interaction (Action $\rightarrow$ Reward $\rightarrow$ Next State). |
| Episode | A complete rollout from start to finish. |
| Trajectory | The specific data sequence recorded during a rollout. |
| Horizon ($H$) | The maximum number of steps allowed in a single rollout. |
4. Analogy: The "Practice Round"

Think of a Rollout as a practice round in a video game.

1. You start the level (Initial State).
2. You play based on your current skill (Policy).
3. You see how many points you get and where you end up (Rewards/Transitions).
4. You use that experience to play better next time (Optimization).

TIS

Trajectory Importance Sampling (TIS) is a technique used to estimate the performance of a target policy ($\pi$) using data (trajectories) collected by a different behavior policy ($\beta$).

It is the foundation of Off-Policy Learning, allowing an agent to learn from its own past experiences or from demonstrations by others.

1. The Core Problem

In RL, we want to calculate the expected return $J(\pi)$ of a policy. Usually, this requires "rolling out" the policy $\pi$ many times. However, if we only have data from policy $\beta$, the rewards we see are biased because $\beta$ visits states and chooses actions differently than $\pi$ would.

2. The TIS Mechanism

To correct this bias, we weight each trajectory $\tau$ by the Importance Sampling Ratio ($\rho$). This ratio represents how much more (or less) likely a specific trajectory is under the target policy compared to the behavior policy.

For a trajectory $\tau = (s_0, a_0, \dots, s_T, a_T)$, the weight is calculated as:

$$\rho_{0:T} = \prod_{t=0}^{T} \frac{\pi(a_t | s_t)}{\beta(a_t | s_t)}$$

  • If $\rho > 1$: The trajectory is more likely under $\pi$; we give it more weight.
  • If $\rho < 1$: The trajectory is less likely under $\pi$; we give it less weight.
  • If $\rho = 0$: The trajectory is impossible under $\pi$; it is ignored.
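The ratio above can be computed directly. This is a minimal sketch with hypothetical two-action policies, where `pi(a, s)` and `beta(a, s)` return action probabilities:

```python
def trajectory_is_weight(traj, pi, beta):
    """rho_{0:T}: the product over steps of pi(a_t|s_t) / beta(a_t|s_t).
    `traj` is a list of (state, action) pairs."""
    rho = 1.0
    for s, a in traj:
        rho *= pi(a, s) / beta(a, s)
    return rho

# Two-action example: pi prefers action 1, beta explores uniformly.
pi = lambda a, s: 0.8 if a == 1 else 0.2
beta = lambda a, s: 0.5
rho = trajectory_is_weight([(0, 1), (1, 1), (2, 0)], pi, beta)
# (0.8/0.5) * (0.8/0.5) * (0.2/0.5) = 1.024
```

Here $\rho > 1$: this particular trajectory is slightly more likely under $\pi$ than under $\beta$, so its return gets up-weighted.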

3. Why is TIS Important?

  • Sample Efficiency: It allows the agent to reuse "old" data stored in a Replay Buffer instead of throwing it away after every policy update.
  • Safety: You can evaluate a potentially dangerous new policy ($\pi$) using data safely collected by a known, stable policy ($\beta$).
  • Learning from Others: It enables an agent to learn from human demonstrations or other agents.

4. The Challenges of TIS

While mathematically sound, TIS has a major practical drawback: High Variance.

| Problem | Description |
| --- | --- |
| Vanishing/exploding weights | Because the ratio is a product of many per-step fractions, the weight of a long trajectory can quickly shrink to near zero or blow up to an extreme value. |
| Policy divergence | If the target policy $\pi$ drifts too far from the behavior policy $\beta$, the weights become unstable, leading to noisy training. |
| Coverage requirement | $\beta$ must assign non-zero probability to every action $\pi$ might take: $\pi(a \mid s) > 0 \Rightarrow \beta(a \mid s) > 0$. |
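The variance blow-up is easy to see numerically. In this sketch (a hypothetical two-action target policy, with a uniform behavior policy), the mean weight stays near 1, since the estimator is unbiased, while its spread grows by orders of magnitude with trajectory length:

```python
import random
import statistics

random.seed(0)

def sample_weight(T, p_pi=0.8):
    """Draw T actions from a uniform behavior policy and accumulate the
    product of per-step ratios pi/beta for a two-action target policy."""
    rho = 1.0
    for _ in range(T):
        a = random.random() < 0.5               # action sampled from beta
        rho *= (p_pi if a else 1 - p_pi) / 0.5  # pi(a) / beta(a)
    return rho

for T in (1, 10, 50):
    ws = [sample_weight(T) for _ in range(10_000)]
    print(T, round(statistics.mean(ws), 3), round(statistics.pstdev(ws), 3))
```

The standard deviation grows roughly geometrically in $T$, which is exactly the "vanishing/exploding weights" problem in the table above.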

5. Common Solutions

To fix the variance issues in TIS, researchers use:

  • Step-wise Importance Sampling: Weighting individual rewards at each step instead of the whole trajectory.
  • Clipping: Limiting the maximum value of the ratio (used in PPO).
  • Normalized Importance Sampling: Ensuring the weights sum to 1 to reduce extreme fluctuations.
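Two of these fixes can be sketched in a few lines. These are illustrative helpers, not any library's API; note also that PPO clips a single per-step ratio inside its surrogate loss, not the full trajectory product:

```python
def clipped_ratios(ratios, clip=2.0):
    """Cap each per-step ratio before any product is taken, trading a
    small bias for a large reduction in variance."""
    return [min(r, clip) for r in ratios]

def self_normalized_estimate(weights, returns):
    """Weighted average of returns with weights normalized to sum to 1,
    so a single huge weight cannot dominate the scale of the estimate."""
    total = sum(weights)
    return sum(w * g for w, g in zip(weights, returns)) / total
```

For example, `self_normalized_estimate([1.0, 3.0], [0.0, 1.0])` gives 0.75: the second trajectory counts three times as much, but the result stays on the scale of the returns.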

R2, R3, & GSPO

In modern Reinforcement Learning—especially in Multi-Agent RL (MARL) and Large Language Model (LLM) alignment—standard PPO is often augmented by these iterative frameworks.

1. R2: Regret-based Learning

  • Concept: Focuses on Regret Minimization rather than simple Reward Maximization.
  • Mechanism: The agent evaluates the difference between the reward obtained from the chosen action and the reward that could have been obtained from the optimal action in hindsight.
  • Use Case: Essential for solving Imperfect Information Games (e.g., Poker) where achieving a Nash Equilibrium is the goal.
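A minimal illustration of the "in hindsight" comparison is external regret against the best fixed action. This is a hypothetical helper for intuition only; real imperfect-information solvers use counterfactual regret minimization over game trees, which is far more involved:

```python
def cumulative_regret(obtained_rewards, reward_table):
    """External regret after N rounds: what the best single fixed action
    would have earned in hindsight, minus what was actually earned.
    reward_table[t][a] = reward action a would have given at round t."""
    n_actions = len(reward_table[0])
    best_fixed = max(sum(round_rewards[a] for round_rewards in reward_table)
                     for a in range(n_actions))
    return best_fixed - sum(obtained_rewards)

# Three rounds, two actions; the agent earned 1 in total, but always
# playing action 0 would have earned 2, so the regret is 1.
regret = cumulative_regret([0, 0, 1], [[1, 0], [1, 0], [0, 1]])
```

Driving average regret to zero is what connects this style of learning to Nash Equilibria in two-player zero-sum games.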

2. R3: Relative Reward Regret

  • Concept: An evolution of R2 that introduces Relative Comparison.
  • Mechanism: Instead of looking at absolute scores, R3 measures performance relative to a baseline or a pool of previous agent versions.
  • Advantage: It filters out environmental noise and prevents "strategy cycling," where an agent forgets how to beat an old version while learning to beat a new one.

3. GSPO: Guided Self-Play Optimization

  • Concept: A hybrid framework combining Self-Play (agent vs. agent) with External Guidance (Human feedback or Teacher models).
  • The "Self-Play" Part: The agent generates data by playing against itself or its previous iterations to explore complex strategies.
  • The "Guided" Part: A reward model (or a set of constraints) acts as a guide to ensure the self-play doesn't drift into nonsensical or "cheating" behaviors.
  • Application: Heavily used in LLM Reasoning (RLHF/RLAIF) to help models self-correct their logic while staying aligned with human truthfulness.

Summary Comparison Table

| Concept | Primary Focus | Best For... |
| --- | --- | --- |
| R2 | Minimizing "could-have-beens" (regret) | Game theory and Poker |
| R3 | Relative performance vs. a baseline | Stable MARL training |
| GSPO | Self-evolution with alignment | LLM reasoning and complex strategy |