Reward Tracing
coax.reward_tracing.NStep | A short-term cache for \(n\)-step bootstrapping.
coax.reward_tracing.MonteCarlo | A short-term cache for episodic Monte Carlo sampling.
coax.reward_tracing.TransitionBatch | A container object for a batch of MDP transitions.
The term reward tracing refers to the process of turning raw experience into TransitionBatch objects. These TransitionBatch objects are then used to learn, i.e. to update our function approximators.
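Reward tracing typically entails keeping some episodic cache in order to relate a state \(S_t\) or state-action pair \((S_t, A_t)\) to a collection of objects that can be used to construct a target (feedback signal):
\[\left(R^{(n)}_t,\ I^{(n)}_t,\ S_{t+n},\ A_{t+n}\right)\]
where
\[\begin{split}R^{(n)}_t\ &=\ \sum_{k=0}^{n-1}\gamma^kR_{t+k} \\ I^{(n)}_t\ &=\ \left\{\begin{matrix}0 & \text{if the episode has ended before } S_{t+n} \\ \gamma^n & \text{otherwise}\end{matrix}\right.\end{split}\]
For example, in \(n\)-step SARSA the target is constructed as:
\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\]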
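To make the target concrete, here is a minimal pure-Python sketch; it is not part of coax. It assumes the partial return is \(\sum_{k=0}^{n-1}\gamma^k R_{t+k}\) and that the bootstrap factor is \(\gamma^n\) unless the episode ended inside the window, in which case it is 0. The function name `n_step_target` and the argument `q_next` (standing in for an estimate of \(Q(S_{t+n}, A_{t+n})\)) are hypothetical.

```python
def n_step_target(rewards, gamma, q_next, terminal):
    """Return (Rn, In, G) for one n-step window.

    rewards  -- [R_t, ..., R_{t+n-1}], the rewards inside the window
    q_next   -- assumed estimate of Q(S_{t+n}, A_{t+n})
    terminal -- True if the episode ended inside the window
    """
    n = len(rewards)
    # Partial discounted return: the n-step return without the bootstrap term.
    Rn = sum(gamma ** k * r for k, r in enumerate(rewards))
    # Bootstrap factor: gamma^n when bootstrapping, 0 otherwise.
    In = 0.0 if terminal else gamma ** n
    return Rn, In, Rn + In * q_next


Rn, In, G = n_step_target([1.0, 1.0, 1.0], gamma=0.5, q_next=8.0, terminal=False)
print(Rn, In, G)  # 1.75 0.125 2.75
```

Setting `terminal=True` drops the bootstrap term, so the target reduces to the partial return alone.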
Object Reference
- class coax.reward_tracing.NStep(n, gamma, record_extra_info=False)
A short-term cache for \(n\)-step bootstrapping.
- Parameters:
n (positive int) – The number of steps over which to bootstrap.
gamma (float between 0 and 1) – The amount by which to discount future rewards.
record_extra_info (bool, optional) – Store all states, actions and rewards in the extra_info field of the TransitionBatch, e.g. for coax.regularizers.NStepEntropyRegularizer.
- add(s, a, r, done, logp=0.0, w=1.0)
Add a transition to the experience cache.
- Parameters:
s (state observation) – A single state observation.
a (action) – A single action.
r (float) – A single observed reward.
done (bool) – Whether the episode has finished.
logp (float, optional) – The log-propensity \(\log\pi(a|s)\).
w (float, optional) – Sample weight associated with the given state-action pair.
- flush()
Flush all transitions from the cache.
- Returns:
transitions (TransitionBatch) – A TransitionBatch object.
- pop()
Pop a single transition from the cache.
- Returns:
transition (TransitionBatch) – A TransitionBatch object with batch_size=1.
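The add / pop / flush cycle described above can be illustrated with a toy cache written in plain Python. This is a sketch under stated assumptions, not coax's implementation: the class `ToyNStep`, its `ready()` method and the tuple it returns are all hypothetical, and it ignores `logp`, `w` and `record_extra_info`. It assumes a transition becomes available once \(S_{t+n}\) has been observed or the episode has ended.

```python
from collections import deque


class ToyNStep:
    """Hypothetical n-step cache for illustration; not coax's implementation."""

    def __init__(self, n, gamma):
        self.n, self.gamma = n, gamma
        self._steps = deque()  # pending (s, a, r) triples, oldest first
        self._done = False

    def add(self, s, a, r, done):
        self._steps.append((s, a, r))
        self._done = done

    def ready(self):
        # A transition can be popped once S_{t+n} is known or the episode ended.
        return len(self._steps) > self.n or (self._done and len(self._steps) > 0)

    def pop(self):
        if not self.ready():
            raise ValueError("not enough steps cached yet")
        window = list(self._steps)[: self.n]  # shorter than n near episode end
        rn = sum(self.gamma ** k * r for k, (_, _, r) in enumerate(window))
        s, a, _ = self._steps.popleft()
        if len(self._steps) >= self.n:
            # (S_{t+n}, A_{t+n}) was observed: bootstrap with factor gamma^n.
            s_next, a_next, _ = self._steps[self.n - 1]
            i_n = self.gamma ** self.n
        else:
            # The episode ended inside the window: no bootstrapping.
            s_next, a_next, i_n = None, None, 0.0
        return s, a, rn, i_n, s_next, a_next

    def flush(self):
        # Pop everything that is currently available.
        out = []
        while self.ready():
            out.append(self.pop())
        return out


cache = ToyNStep(n=2, gamma=0.5)
for t in range(3):
    cache.add(f"s{t}", f"a{t}", 1.0, done=False)
print(cache.pop())  # ('s0', 'a0', 1.5, 0.25, 's2', 'a2')
```

In this sketch, the transitions popped after `done=True` have a bootstrap factor of 0 whenever fewer than \(n\) further steps remain in the episode.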
- class coax.reward_tracing.MonteCarlo(gamma)
A short-term cache for episodic Monte Carlo sampling.
- Parameters:
gamma (float between 0 and 1) – The amount by which to discount future rewards.
- add(s, a, r, done, logp=0.0, w=1.0)
Add a transition to the experience cache.
- Parameters:
s (state observation) – A single state observation.
a (action) – A single action.
r (float) – A single observed reward.
done (bool) – Whether the episode has finished.
logp (float, optional) – The log-propensity \(\log\pi(a|s)\).
w (float, optional) – Sample weight associated with the given state-action pair.
- flush()
Flush all transitions from the cache.
- Returns:
transitions (TransitionBatch) – A TransitionBatch object.
- pop()
Pop a single transition from the cache.
- Returns:
transition (TransitionBatch) – A TransitionBatch object with batch_size=1.
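As an illustration of episodic Monte Carlo sampling, the sketch below computes the full \(\gamma\)-discounted return for every step of a finished episode by walking backwards through the rewards. It is plain Python and does not use coax; the function name `monte_carlo_returns` is hypothetical, and it is an assumption that the cache's returns follow the usual recursion \(G_t = R_t + \gamma G_{t+1}\).

```python
def monte_carlo_returns(rewards, gamma):
    """Discounted return G_t for each step of one finished episode."""
    returns, g = [], 0.0
    # Walk backwards so each return reuses the one that follows it.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()  # restore chronological order
    return returns


print(monte_carlo_returns([1.0, 2.0, 4.0], gamma=0.5))  # [3.0, 4.0, 4.0]
```

Because the whole episode is summed, no bootstrap term is needed, which corresponds to a bootstrap factor of 0 for every transition.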
- class coax.reward_tracing.TransitionBatch(S, A, logP, Rn, In, S_next, A_next=None, logP_next=None, W=None, idx=None, extra_info=None)
A container object for a batch of MDP transitions.
- Parameters:
S (pytree with ndarray leaves) – A batch of state observations \(S_t\).
A (ndarray) – A batch of actions \(A_t\).
logP (ndarray) – A batch of log-propensities \(\log\pi(A_t|S_t)\).
Rn (ndarray) – A batch of partial (\(\gamma\)-discounted) returns. For instance, in \(n\)-step bootstrapping these are given by:
\[R^{(n)}_t\ =\ \sum_{k=0}^{n-1}\gamma^kR_{t+k}\]
In other words, it’s the part of the \(n\)-step return without the bootstrapping term.
In (ndarray) – A batch of bootstrap factors. For instance, in \(n\)-step bootstrapping these are given by \(I^{(n)}_t=\gamma^n\) when bootstrapping and \(I^{(n)}_t=0\) otherwise. Bootstrap factors are used in constructing the \(n\)-step bootstrapped target:
\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\]
S_next (pytree with ndarray leaves) – A batch of next-state observations \(S_{t+n}\). This is typically used to construct the TD target in \(n\)-step bootstrapping.
A_next (ndarray, optional) – A batch of next-actions \(A_{t+n}\). This is typically used to construct the TD target in \(n\)-step bootstrapping when using SARSA updates.
logP_next (ndarray, optional) – A batch of log-propensities \(\log\pi(A_{t+n}|S_{t+n})\).
W (ndarray, optional) – A batch of importance weights associated with the sampling procedure that generated each transition. For example, we need these values when we sample transitions from a PrioritizedReplayBuffer.
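The way the Rn, In and W fields might be combined in an update can be sketched with plain Python lists standing in for the ndarray fields. This is an illustration only: the helper names `bootstrapped_targets` and `weighted_mse`, the `q_next` and `q_pred` values, and the choice of a weighted squared-error loss are all assumptions, not something the library prescribes.

```python
def bootstrapped_targets(Rn, In, q_next):
    """Elementwise G = Rn + In * Q(S_next, A_next) over a batch."""
    return [r + i * q for r, i, q in zip(Rn, In, q_next)]


def weighted_mse(targets, q_pred, W):
    """Mean of W * (G - Q)^2, one hypothetical use of importance weights."""
    return sum(w * (g - q) ** 2 for g, q, w in zip(targets, q_pred, W)) / len(W)


Rn = [1.75, 1.5]      # partial returns
In = [0.125, 0.0]     # second transition does not bootstrap
q_next = [8.0, 8.0]   # assumed estimates of Q(S_next, A_next)

G = bootstrapped_targets(Rn, In, q_next)
print(G)  # [2.75, 1.5]
print(weighted_mse(G, q_pred=[2.25, 2.0], W=[1.0, 0.5]))  # 0.1875
```

Note how the zero bootstrap factor removes the contribution of `q_next` for the second transition, so no special-casing of terminal states is needed in the update itself.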
- copy(deep=False)
Create a copy of the current instance.
- Parameters:
deep (bool, optional) – Whether the copy should be a deep copy.
- Returns:
copy – A copy of the current instance (a deep copy if deep=True).
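The practical difference between a shallow and a deep copy can be shown with Python's standard `copy` module. This sketch uses a plain dict as a stand-in for a batch; whether `TransitionBatch.copy` behaves exactly like `copy.copy` and `copy.deepcopy` is an assumption here.

```python
import copy

batch = {"Rn": [1.75, 1.5], "In": [0.125, 0.0]}  # stand-in for a batch

shallow = copy.copy(batch)   # new container, shared leaf objects
deep = copy.deepcopy(batch)  # new container and new leaf objects

batch["Rn"][0] = 0.0         # mutate a leaf in place

print(shallow["Rn"][0])  # 0.0  (the shallow copy sees the change)
print(deep["Rn"][0])     # 1.75 (the deep copy does not)
```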
- classmethod from_single(s, a, logp, r, done, gamma, s_next=None, a_next=None, logp_next=None, w=1, idx=None, extra_info=None)
Create a TransitionBatch (with batch_size=1) from a single transition.
- Parameters:
s (state observation) – A single state observation \(S_t\).
a (action) – A single action \(A_t\).
logp (non-positive float) – The log-propensity \(\log\pi(A_t|S_t)\).
r (float or array of floats) – A single reward \(R_t\).
done (bool) – Whether the episode has finished.
info (dict or None) – Some additional info about the current time step.
s_next (state observation) – A single next-state observation \(S_{t+1}\).
a_next (action) – A single next-action \(A_{t+1}\).
logp_next (non-positive float) – The log-propensity \(\log\pi(A_{t+1}|S_{t+1})\).
w (positive float, optional) – The importance weight associated with the sampling procedure that generated this transition.
idx (int, optional) – The identifier of this particular transition.
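A single transition can be read as the \(n=1\) case of the batch fields. The sketch below builds a dict-based stand-in with a batch dimension of 1; it is not the library's code. The function name `single_transition` is hypothetical, and it is an assumption that the partial return equals the reward and that the bootstrap factor is `gamma` unless `done` is true.

```python
def single_transition(s, a, r, done, gamma, s_next=None, a_next=None):
    """Hypothetical batch-of-one container for a single transition (n = 1)."""
    return {
        "S": [s],
        "A": [a],
        "Rn": [float(r)],                       # one-step partial return
        "In": [0.0 if done else float(gamma)],  # gamma^1 when bootstrapping
        "S_next": [s_next],
        "A_next": [a_next],
    }


t = single_transition("s0", "a0", r=1.0, done=False, gamma=0.9,
                      s_next="s1", a_next="a1")
print(t["Rn"], t["In"])  # [1.0] [0.9]
```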
- to_singles()
Get an iterator of single transitions.
- Returns:
transition_batches (iterator of TransitionBatch) – An iterator of TransitionBatch objects with batch_size=1.
Note: The iterator walks through the individual transitions in reverse order.
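The reverse iteration order can be pictured with a small generator over a dict of lists. This is a sketch under the assumption that every field shares the same leading batch dimension; the function below is a hypothetical stand-in and not the method's actual implementation.

```python
def to_singles(batch):
    """Yield batch-of-one slices of a dict-of-lists batch, last item first."""
    size = len(batch["Rn"])
    for i in reversed(range(size)):
        # Slicing with i:i+1 keeps the batch dimension (length 1).
        yield {key: values[i:i + 1] for key, values in batch.items()}


batch = {"Rn": [1.75, 1.5, 1.0], "In": [0.25, 0.0, 0.0]}
print([single["Rn"] for single in to_singles(batch)])  # [[1.0], [1.5], [1.75]]
```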