Backward View and Reward Redistribution for Delayed Rewards

Activity: Talk or presentation (Contributed talk, science-to-science)

Description

Most reinforcement learning approaches rely on a forward view to predict the expected return. Examples are Monte Carlo methods and temporal difference methods such as SARSA or Q-learning, which estimate state- or action-value functions, policy gradients using value or advantage functions, and Monte Carlo Tree Search. However, the forward view faces problems with probabilistic environments and with high branching factors of the state transitions, since it has to average over all possible futures. These problems become more severe for delayed rewards: the number of paths to the reward grows exponentially with the number of delay steps, the reward information must be propagated further back, averaging becomes more difficult, and the variance of many state-action values increases. We suggest avoiding these problems with a backward view, in which observed episodes are analyzed. This avoids probabilities and guesses about possible futures while identifying the key events and important states that led to a reward. The backward view allows for a reward redistribution that largely reduces the delays of the rewards while leaving the expected return of a policy unchanged. An optimal reward redistribution via return decomposition gives the agent immediate feedback about each executed action: if the expected return increases, a positive reward is given; if the expected return decreases, a negative reward is given. We introduce RUDDER, a return decomposition method, which creates a new MDP with the same optimal policies as the original MDP but with redistributed rewards whose delays are largely reduced. If the return decomposition is optimal, the new MDP has no delayed rewards and TD estimates are unbiased. In this case, the rewards track the Q-values, so that the expected future reward is always zero. On artificial tasks with different lengths of reward delays, we show that RUDDER is exponentially faster than TD, MC, and Monte Carlo Tree Search (MCTS).
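
For illustration, below is a minimal NumPy sketch of the reward-redistribution idea described above. The names redistribute_rewards and predict_return are hypothetical, and the return predictor (an LSTM trained on observed episodes in RUDDER) is assumed to be trained separately; this is a sketch of the general idea, not the implementation presented in the talk.

import numpy as np

def redistribute_rewards(episode, predict_return):
    """Redistribute a delayed episodic return as immediate per-step rewards.

    `episode` is a list of (state, action) pairs; `predict_return` is an
    assumed callable g that maps the prefix (s_0, a_0, ..., s_t, a_t) to a
    prediction of the episode return. In RUDDER this predictor is a sequence
    model (an LSTM) trained on observed episodes; here it is a black box.

    The redistributed reward at step t is the change in the predicted return
    once (s_t, a_t) is known,
        r'_t = g(tau_{0:t}) - g(tau_{0:t-1}),
    so an action that raises the predicted return receives a positive reward
    and one that lowers it receives a negative reward, while the redistributed
    rewards sum to the predicted episode return.
    """
    predictions = np.array(
        [predict_return(episode[: t + 1]) for t in range(len(episode))],
        dtype=float,
    )
    return np.diff(predictions, prepend=0.0)

# Toy example (hypothetical): the return is triggered by a single key action,
# "pick_up_key". The toy predictor jumps to 1.0 as soon as that action appears
# in the prefix, so the redistributed reward is +1 at exactly that step.
toy_episode = [("s0", "noop"), ("s1", "pick_up_key"), ("s2", "noop")]
toy_predictor = lambda prefix: float(any(a == "pick_up_key" for _, a in prefix))
print(redistribute_rewards(toy_episode, toy_predictor))  # [0. 1. 0.]

In this toy case the delayed reward is moved entirely onto the key action, which is the immediate-feedback behaviour the description above attributes to an optimal return decomposition.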
Period: 14 Jul 2018
Event title: Credit assignment in Deep Learning and Deep Reinforcement Learning Workshop at ICML 2018
Event type: Conference
Location: Sweden

Fields of science

  • 305 Other Human Medicine, Health Sciences
  • 102019 Machine learning
  • 304 Medical Biotechnology
  • 303 Health Sciences
  • 302 Clinical Medicine
  • 301 Medical-Theoretical Sciences, Pharmacy
  • 102 Computer Sciences
  • 106005 Bioinformatics
  • 106007 Biostatistics
  • 304003 Genetic engineering
  • 106041 Structural biology
  • 102010 Database systems
  • 101018 Statistics
  • 106023 Molecular biology
  • 106002 Biochemistry
  • 102001 Artificial intelligence
  • 102015 Information systems
  • 101004 Biomathematics
  • 102004 Bioinformatics

JKU Focus areas

  • Health System Research
  • Computation in Informatics and Mathematics
  • Clinical Research on Aging
  • Nano-, Bio- and Polymer-Systems: From Structure to Function
  • Medical Sciences (in general)