Reward Tracing

`coax.reward_tracing.NStep`	A short-term cache for $n$-step bootstrapping.
`coax.reward_tracing.MonteCarlo`	A short-term cache for episodic Monte Carlo sampling.
`coax.reward_tracing.TransitionBatch`	A container object for a batch of MDP transitions.

The term reward tracing refers to the process of turning raw experience into TransitionBatch objects. These TransitionBatch objects are then used to learn, i.e. to update our function approximators.

Reward tracing typically entails keeping some episodic cache in order to relate a state $S_t$ or state-action pair $(S_t, A_t)$ to a collection of objects that can be used to construct a target (feedback signal):

\[\left(R^{(n)}_t, I^{(n)}_t, S_{t+n}, A_{t+n}\right)\]

where

\[\begin{split}R^{(n)}_t\ &=\ \sum_{k=0}^{n-1}\gamma^kR_{t+k} \\ I^{(n)}_t\ &=\ \left\{\begin{matrix} 0 & \text{if $S_{t+n}$ is a terminal state} \\ \gamma^n & \text{otherwise} \end{matrix}\right.\end{split}\]

For example, in $n$-step SARSA the target is constructed as:

\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,q(S_{t+n}, A_{t+n})\]
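The formulas above can be checked with a few lines of plain Python. This is a minimal sketch for a single time step $t$, not the library's implementation; the function name `n_step_target` and its arguments are hypothetical and chosen only for illustration.

```python
def n_step_target(rewards, gamma, n, next_is_terminal, q_next):
    """Return (Rn, In, Gn) for a single time step t.

    rewards:          the n rewards R_t, ..., R_{t+n-1}
    next_is_terminal: whether S_{t+n} is a terminal state
    q_next:           the estimate q(S_{t+n}, A_{t+n})
    """
    assert len(rewards) == n
    # partial discounted return: sum_k gamma^k R_{t+k}
    Rn = sum(gamma ** k * r for k, r in enumerate(rewards))
    # bootstrap factor: 0 at a terminal state, gamma^n otherwise
    In = 0.0 if next_is_terminal else gamma ** n
    # n-step SARSA target
    Gn = Rn + In * q_next
    return Rn, In, Gn


Rn, In, Gn = n_step_target(
    [1.0, 0.0, 2.0], gamma=0.5, n=3, next_is_terminal=False, q_next=4.0)
print(Rn, In, Gn)  # 1.5 0.125 2.0
```

When $S_{t+n}$ is terminal the bootstrap factor is zero, so the value of `q_next` no longer affects the target.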

Object Reference

class coax.reward_tracing.NStep(n, gamma, record_extra_info=False)

A short-term cache for $n$-step bootstrapping.

Parameters:

n (positive int) – The number of steps over which to bootstrap.
gamma (float between 0 and 1) – The amount by which to discount future rewards.
record_extra_info (bool, optional) – Store all states, actions and rewards in the extra_info field of the TransitionBatch, e.g. for coax.regularizers.NStepEntropyRegularizer.

add(s, a, r, done, logp=0.0, w=1.0)

Add a transition to the experience cache.

Parameters:

s (state observation) – A single state observation.
a (action) – A single action.
r (float) – A single observed reward.
done (bool) – Whether the episode has finished.
logp (float, optional) – The log-propensity $\log\pi(a|s)$.
w (float, optional) – Sample weight associated with the given state-action pair.

flush()

Flush all transitions from the cache.

Returns: transitions (TransitionBatch) – A TransitionBatch object.

pop()

Pop a single transition from the cache.

Returns: transition (TransitionBatch) – A TransitionBatch object with batch_size=1.

reset()

Reset the cache to its initial state.
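To make the add / pop / flush / reset cycle concrete, here is a minimal sketch of how an $n$-step cache could work. It is not coax's implementation: the class `TinyNStepCache`, its `ready()` helper and the plain-tuple return value are assumptions made for illustration, and log-propensities and sample weights are left out.

```python
from collections import deque


class TinyNStepCache:
    """Illustrative n-step cache (hypothetical, not the coax class)."""

    def __init__(self, n, gamma):
        self.n = n
        self.gamma = gamma
        self.reset()

    def reset(self):
        self._sa = deque()  # pending (s, a) pairs
        self._r = deque()   # rewards, aligned with self._sa
        self._done = False

    def add(self, s, a, r, done):
        if self._done and self._sa:
            raise RuntimeError("episode finished; flush the cache first")
        self._sa.append((s, a))
        self._r.append(r)
        self._done = bool(done)

    def ready(self):
        # We can emit a transition once we can look n steps ahead,
        # or once the episode has ended.
        return bool(self._sa) and (self._done or len(self._sa) > self.n)

    def pop(self):
        if not self.ready():
            raise RuntimeError("not enough transitions cached")
        s, a = self._sa.popleft()
        # partial return over (at most) the next n rewards
        rewards = list(self._r)[:self.n]
        Rn = sum(self.gamma ** k * r for k, r in enumerate(rewards))
        self._r.popleft()
        if len(self._sa) >= self.n:
            # S_{t+n} was observed, so we bootstrap from it
            s_next, a_next = self._sa[self.n - 1]
            In = self.gamma ** self.n
        else:
            # S_{t+n} lies at or beyond the terminal state
            s_next, a_next = None, None
            In = 0.0
        return s, a, Rn, In, s_next, a_next

    def flush(self):
        out = []
        while self.ready():
            out.append(self.pop())
        return out


cache = TinyNStepCache(n=2, gamma=0.5)
cache.add("s0", "a0", 1.0, False)
cache.add("s1", "a1", 2.0, False)
cache.add("s2", "a2", 4.0, True)
transitions = cache.flush()
```

In this sketch only the first transition bootstraps (from `"s2"`, with factor $\gamma^2 = 0.25$); the last two reach the end of the episode within $n$ steps, so their bootstrap factor is zero.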

class coax.reward_tracing.MonteCarlo(gamma)

A short-term cache for episodic Monte Carlo sampling.

Parameters:

gamma (float between 0 and 1) – The amount by which to discount future rewards.

add(s, a, r, done, logp=0.0, w=1.0)

Add a transition to the experience cache.

Parameters:

s (state observation) – A single state observation.
a (action) – A single action.
r (float) – A single observed reward.
done (bool) – Whether the episode has finished.
logp (float, optional) – The log-propensity $\log\pi(a|s)$.
w (float, optional) – Sample weight associated with the given state-action pair.

flush()

Flush all transitions from the cache.

Returns: transitions (TransitionBatch) – A TransitionBatch object.

pop()

Pop a single transition from the cache.

Returns: transition (TransitionBatch) – A TransitionBatch object with batch_size=1.

reset()

Reset the cache to its initial state.
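In episodic Monte Carlo sampling the target is the full discounted return up to the end of the episode, so there is no bootstrapping term. The sketch below shows the standard backward recursion $G_t = R_t + \gamma\,G_{t+1}$ for one finished episode. The helper `discounted_returns` is hypothetical and only illustrates the computation; it is not how the library is necessarily implemented.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for one finished episode."""
    returns = []
    g = 0.0  # the return after the terminal state is zero
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()  # restore chronological order
    return returns


returns = discounted_returns([1.0, 2.0, 4.0], gamma=0.5)
print(returns)  # [3.0, 4.0, 4.0]
```

Because the returns are accumulated from the last step backwards, they only become available once the episode has finished.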

class coax.reward_tracing.TransitionBatch(S, A, logP, Rn, In, S_next, A_next=None, logP_next=None, W=None, idx=None, extra_info=None)

A container object for a batch of MDP transitions.

Parameters:

S (pytree with ndarray leaves) – A batch of state observations $S_t$.
A (ndarray) – A batch of actions $A_t$.
logP (ndarray) – A batch of log-propensities $\log\pi(A_t|S_t)$.
Rn (ndarray) –
A batch of partial ($\gamma$-discounted) returns. For instance, in $n$-step bootstrapping these are given by:

\[\begin{split}R^{(n)}_t\ &=\ \sum_{k=0}^{n-1}\gamma^kR_{t+k} \\\end{split}\]

In other words, it’s the part of the $n$-step return without the bootstrapping term.
In (ndarray) –
A batch of bootstrap factors. For instance, in $n$-step bootstrapping these are given by $I^{(n)}_t=\gamma^n$ when bootstrapping and $I^{(n)}_t=0$ otherwise. Bootstrap factors are used in constructing the $n$-step bootstrapped target:

\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,q(S_{t+n}, A_{t+n})\]
S_next (pytree with ndarray leaves) – A batch of next-state observations $S_{t+n}$. This is typically used to construct the TD target in $n$-step bootstrapping.
A_next (ndarray, optional) – A batch of next-actions $A_{t+n}$. This is typically used to construct the TD target in $n$-step bootstrapping when using SARSA updates.
logP_next (ndarray, optional) – A batch of log-propensities $\log\pi(A_{t+n}|S_{t+n})$.
W (ndarray, optional) – A batch of importance weights associated with the sampling procedure that generated each transition. For example, we need these values when we sample transitions from a PrioritizedReplayBuffer.
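As a rough illustration of how the fields Rn, In and W might be combined, the sketch below builds bootstrapped targets elementwise and then computes an importance-weighted squared error. Plain Python lists stand in for the ndarrays, the function names are hypothetical, and the weighted mean squared error is an assumed loss chosen for the example rather than something prescribed by the library.

```python
def bootstrapped_targets(Rn, In, q_next):
    """Elementwise G = Rn + In * q_next over a batch."""
    return [r + i * q for r, i, q in zip(Rn, In, q_next)]


def weighted_mse(targets, predictions, W):
    """Importance-weighted mean squared error over the batch."""
    total = sum(w * (g - p) ** 2
                for g, p, w in zip(targets, predictions, W))
    return total / len(targets)


Rn = [2.0, 4.0]        # partial returns
In = [0.25, 0.0]       # second transition ends in a terminal state
q_next = [8.0, 100.0]  # value estimates at (S_next, A_next)

targets = bootstrapped_targets(Rn, In, q_next)
loss = weighted_mse(targets, predictions=[3.0, 4.0], W=[1.0, 0.5])
print(targets, loss)  # [4.0, 4.0] 0.5
```

Note that the large value estimate of the second transition has no effect on its target, because its bootstrap factor is zero.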

copy(deep=False)

Create a copy of the current instance.

Parameters: deep (bool, optional) – Whether the copy should be a deep copy.
Returns: copy – A copy of the current instance (a deep copy if deep=True, a shallow copy otherwise).

classmethod from_single(s, a, logp, r, done, gamma, s_next=None, a_next=None, logp_next=None, w=1, idx=None, extra_info=None)

Create a TransitionBatch (with batch_size=1) from a single transition.

Parameters:

s (state observation) – A single state observation $S_t$.
a (action) – A single action $A_t$.
logp (non-positive float) – The log-propensity $\log\pi(A_t|S_t)$.
r (float or array of floats) – A single reward $R_t$.
done (bool) – Whether the episode has finished.
gamma (float between 0 and 1) – The amount by which to discount future rewards.
info (dict or None) – Some additional info about the current time step.
s_next (state observation) – A single next-state observation $S_{t+1}$.
a_next (action) – A single next-action $A_{t+1}$.
logp_next (non-positive float) – The log-propensity $\log\pi(A_{t+1}|S_{t+1})$.
w (positive float, optional) – The importance weight associated with the sampling procedure that generated this transition.
idx (int, optional) – The identifier of this particular transition.

to_singles()

Get an iterator of single transitions.

Returns:

transition_batches (iterator of TransitionBatch) – An iterator of TransitionBatch objects with batch_size=1.

Note: The iterator walks through the individual transitions in reverse order.
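The reverse-order behaviour described in the note can be pictured with a small stand-in. In this sketch a batch is modelled as a dict of equally long lists, which is an assumption for illustration only; `iter_singles_reversed` is a hypothetical helper, not the library method.

```python
def iter_singles_reversed(batch):
    """Yield size-1 slices of a dict-of-lists batch, last item first."""
    size = len(batch["Rn"])
    for i in reversed(range(size)):
        # slicing with i:i+1 keeps the leading batch axis (length 1)
        yield {name: column[i:i + 1] for name, column in batch.items()}


batch = {
    "S": ["s0", "s1", "s2"],
    "Rn": [2.0, 4.0, 4.0],
    "In": [0.25, 0.0, 0.0],
}
singles = list(iter_singles_reversed(batch))
print(singles[0])  # {'S': ['s2'], 'Rn': [4.0], 'In': [0.0]}
```

Code that needs the transitions in their original order would have to reverse the result again, for example with `reversed(list(...))`.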