As shown in nnagent.py, the author uses the average return of a batch as the loss function. However, it seems that such a loss function only contains the instantaneous reward, not the average cumulated reward. To be specific, suppose we have a batch of experience as follows: mini_batch = $(s_t, a_t, r_t, \ldots, s_{t+T}, a_{t+T}, r_{t+T})$
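For concreteness, here is a minimal sketch of the kind of loss being described: the negative mean of the per-step log-returns over the mini-batch. The function name, shapes, and toy data below are illustrative assumptions, not the actual code in nnagent.py.

```python
import numpy as np

# Illustrative shapes: T steps in the mini-batch, m assets.
# weights[t] is the portfolio vector a_t chosen by the network,
# price_relatives[t] is the price-relative vector y_t for step t.
def batch_average_log_return_loss(weights, price_relatives):
    """Negative mean log-return over the batch; minimizing it maximizes
    the average per-step (instantaneous) reward."""
    step_returns = np.sum(weights * price_relatives, axis=1)  # w_t . y_t
    return -np.mean(np.log(step_returns))

# Toy usage: T = 3 steps, m = 2 assets.
w = np.array([[0.5, 0.5], [0.6, 0.4], [0.7, 0.3]])
y = np.array([[1.01, 0.99], [1.02, 1.00], [0.98, 1.03]])
print(batch_average_log_return_loss(w, y))
```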
> However, it seems that such a loss function only contains the instantaneous reward, not the average cumulated reward.
If there is no commission fee and the action does not affect the state transition, optimizing the immediate reward is equivalent to optimizing the long-term value.
This property, together with the differentiable reward function, gives superior sample efficiency compared with general-purpose RL.
To deal with the commission fee, we treat it as a regularization term.
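As an illustration of the commission-as-regularization idea, the sketch below adds a turnover-based penalty to the batch-average loss. The penalty form, rate, and names are assumptions for illustration; the actual handling in nnagent.py may differ.

```python
import numpy as np

def loss_with_commission_regularizer(weights, price_relatives,
                                     commission_rate=0.0025):
    """Batch-average negative log-return plus a turnover penalty that
    stands in for transaction costs."""
    step_returns = np.sum(weights * price_relatives, axis=1)
    # Turnover between consecutive portfolio vectors: the penalty depends
    # only on how much the action changes, so it behaves like a
    # regularizer on the policy rather than a multi-step return.
    turnover = np.sum(np.abs(weights[1:] - weights[:-1]), axis=1)
    return -np.mean(np.log(step_returns)) + commission_rate * np.mean(turnover)

# Toy usage with the same shapes as above.
w = np.array([[0.5, 0.5], [0.6, 0.4], [0.7, 0.3]])
y = np.array([[1.01, 0.99], [1.02, 1.00], [0.98, 1.03]])
print(loss_with_commission_regularizer(w, y))
```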