Paper Title
Orchestrated Value Mapping for Reinforcement Learning
Paper Authors
Paper Abstract
We present a general convergent class of reinforcement learning algorithms that is founded on two distinct principles: (1) mapping value estimates to a different space using arbitrary functions from a broad class, and (2) linearly decomposing the reward signal into multiple channels. The first principle enables incorporating specific properties into the value estimator that can enhance learning. The second principle, on the other hand, allows for the value function to be represented as a composition of multiple utility functions. This can be leveraged for various purposes, e.g. dealing with highly varying reward scales, incorporating a priori knowledge about the sources of reward, and ensemble learning. Combining the two principles yields a general blueprint for instantiating convergent algorithms by orchestrating diverse mapping functions over multiple reward channels. This blueprint generalizes and subsumes algorithms such as Q-Learning, Log Q-Learning, and Q-Decomposition. In addition, our convergence proof for this general class relaxes certain required assumptions in some of these algorithms. Based on our theory, we discuss several interesting configurations as special cases. Finally, to illustrate the potential of the design space that our theory opens up, we instantiate a particular algorithm and evaluate its performance on the Atari suite.
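To make the blueprint in the abstract concrete, the following is a minimal tabular sketch (not the authors' exact algorithm) of the two principles combined: the reward is linearly decomposed into channels, each channel learns a value estimate in its own mapped space via a mapping function and its inverse, and the ordinary action value is recomposed as the sum of the inverted channel estimates. All names here (`OrchestratedValueAgent`, `channels`, `reward_fn`, `f`, `f_inv`) are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

class OrchestratedValueAgent:
    """Sketch: per-channel value tables learned in mapped spaces, recomposed for control."""

    def __init__(self, n_states, n_actions, channels, gamma=0.99, lr=0.1):
        # `channels` is a list of (reward_fn, f, f_inv) triples:
        #   reward_fn extracts one linear component of the reward (the components
        #   should sum back to the full reward),
        #   f maps values into the channel's space, f_inv maps them back
        #   (e.g. identity for a Q-Learning-like channel, a log-based map for a
        #   Log-Q-Learning-like channel on suitably bounded rewards).
        self.channels = channels
        self.gamma = gamma
        self.lr = lr
        # One value table per channel, stored in that channel's mapped space.
        self.q_tilde = [np.zeros((n_states, n_actions)) for _ in channels]

    def q_values(self, state):
        # Recompose the ordinary value: sum of inverse-mapped channel estimates.
        return sum(f_inv(q[state]) for (_, _, f_inv), q in zip(self.channels, self.q_tilde))

    def act(self, state):
        # Greedy action with respect to the recomposed value.
        return int(np.argmax(self.q_values(state)))

    def update(self, state, action, reward, next_state, done):
        a_star = self.act(next_state)  # greedy successor action under the recomposed value
        for (reward_fn, f, f_inv), q in zip(self.channels, self.q_tilde):
            r_j = reward_fn(reward)  # this channel's share of the decomposed reward
            bootstrap = 0.0 if done else self.gamma * f_inv(q[next_state, a_star])
            target = f(r_j + bootstrap)  # map the TD target into the channel's space
            q[state, action] += self.lr * (target - q[state, action])
```

As a sanity check on the claim that the blueprint subsumes simpler algorithms: configuring a single channel with the identity map reduces the sketch to standard tabular Q-Learning.

```python
# Hypothetical single-channel configuration: identity reward split and identity mapping.
identity_channel = (lambda r: r, lambda x: x, lambda x: x)
agent = OrchestratedValueAgent(n_states=10, n_actions=4, channels=[identity_channel])
```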