Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding › peer-review

Abstract

Reinforcement Learning algorithms require a large number of samples to solve complex tasks with sparse and delayed rewards. Complex tasks can often be hierarchically decomposed into sub-tasks. A step in the Q-function can be associated with solving a sub-task, where the expectation of the return increases. RUDDER has been introduced to identify these steps and then redistribute reward to them, thus immediately giving reward if sub-tasks are solved. Since the problem of delayed rewards is mitigated, learning is considerably sped up. However, for complex tasks, current exploration strategies as deployed in RUDDER struggle with discovering episodes with high rewards. Therefore, we assume that episodes with high rewards are given as demonstrations and do not have to be discovered by exploration. Typically the number of demonstrations is small and RUDDER's LSTM model as a deep learning method does not learn well. Hence, we introduce Align-RUDDER, which is RUDDER with two major modifications. First, Align-RUDDER assumes that episodes with high rewards are given as demonstrations, replacing RUDDER's safe exploration and lessons replay buffer. Second, we replace RUDDER's LSTM model by a profile model that is obtained from multiple sequence alignment of demonstrations. Profile models can be constructed from as few as two demonstrations, as known from bioinformatics. Align-RUDDER inherits the concept of reward redistribution, which considerably reduces the delay of rewards, thus speeding up learning. Align-RUDDER outperforms competitors on complex artificial tasks with delayed reward and few demonstrations. On the Minecraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently.
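The reward-redistribution idea described above can be illustrated with a minimal sketch: given a model's per-step prediction of the expected return along an episode, each step receives the increase in that prediction as its redistributed reward, and any residual is assigned to the final step so the episode's total reward is preserved. The function name `redistribute_reward` and the use of a generic `predicted_returns` array are illustrative assumptions, not code from the paper; in Align-RUDDER the predictions would come from the profile model obtained by multiple sequence alignment.

```python
import numpy as np

def redistribute_reward(predicted_returns, episode_return):
    """Sketch of return-equivalent reward redistribution.

    predicted_returns: expected-return predictions g_0, ..., g_T along
    the episode (hypothetical stand-in for the profile-model scores).
    episode_return: the actual delayed return observed at episode end.
    """
    g = np.asarray(predicted_returns, dtype=float)
    # Each step's redistributed reward is the increase in the prediction,
    # i.e. reward is moved to steps where a sub-task appears to be solved.
    r = np.diff(g)
    # Residual correction on the last step keeps the sum of redistributed
    # rewards equal to the true episode return (return equivalence).
    r[-1] += episode_return - (g[-1] - g[0])
    return r
```

For example, a prediction sequence that jumps at two steps, such as `[0, 0, 1, 1, 2]` with an episode return of 2, yields nonzero redistributed reward exactly at those two jumps, which is what mitigates the delayed-reward problem.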
Original language: English
Title of host publication: Proceedings of the 39th International Conference on Machine Learning
Number of pages: 42
Publication status: Published - 2022

Fields of science

  • 305907 Medical statistics
  • 202017 Embedded systems
  • 202036 Sensor systems
  • 101004 Biomathematics
  • 101014 Numerical mathematics
  • 101015 Operations research
  • 101016 Optimisation
  • 101017 Game theory
  • 101018 Statistics
  • 101019 Stochastics
  • 101024 Probability theory
  • 101026 Time series analysis
  • 101027 Dynamical systems
  • 101028 Mathematical modelling
  • 101029 Mathematical statistics
  • 101031 Approximation theory
  • 102 Computer Sciences
  • 102001 Artificial intelligence
  • 102003 Image processing
  • 102004 Bioinformatics
  • 102013 Human-computer interaction
  • 102018 Artificial neural networks
  • 102019 Machine learning
  • 102032 Computational intelligence
  • 102033 Data mining
  • 305901 Computer-aided diagnosis and therapy
  • 305905 Medical informatics
  • 202035 Robotics
  • 202037 Signal processing
  • 103029 Statistical physics
  • 106005 Bioinformatics
  • 106007 Biostatistics

JKU Focus areas

  • Digital Transformation
  • JKU LIT SAL eSPML Lab

    Baumgartner, S. (Researcher), Bognar, G. (Researcher), Hochreiter, S. (Researcher), Hofmarcher, M. (Researcher), Kovacs, P. (Researcher), Schmid, S. (Researcher), Shtainer, A. (Researcher), Springer, A. (Researcher), Wille, R. (Researcher) & Huemer, M. (PI)

    01.07.2020 – 31.12.2023

    Project: Other project
