Dynamic Decision Frequency with Continuous Options

Paper title

Dynamic Decision Frequency with Continuous Options

Authors

Amirmohammad Karimi, Jun Jin, Jun Luo, A. Rupam Mahmood, Martin Jagersand, Samuele Tosatto

Abstract

In classic reinforcement learning algorithms, agents make decisions at discrete and fixed time intervals. The duration between decisions becomes a crucial hyperparameter: setting it too short may increase the problem's difficulty by requiring the agent to make numerous decisions to achieve its goal, while setting it too long can result in the agent losing control over the system. However, physical systems do not necessarily require a constant control frequency, and for learning agents, it is often preferable to operate with a low frequency when possible and a high frequency when necessary. We propose a framework called Continuous-Time Continuous-Options (CTCO), where the agent chooses options as sub-policies of variable durations. These options are time-continuous and can interact with the system at any desired frequency, providing a smooth change of actions. We demonstrate the effectiveness of CTCO by comparing its performance to classical RL and temporal-abstraction RL methods on simulated continuous control tasks with various action-cycle times. We show that our algorithm's performance is not affected by the choice of environment interaction frequency. Furthermore, we demonstrate the efficacy of CTCO in facilitating exploration in a real-world visual reaching task for a 7-DOF robotic arm with sparse rewards.
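To make the idea of a variable-duration, time-continuous option concrete, the following is a minimal illustrative sketch and not the authors' implementation. The `Option` class, its cosine blend between a start and an end action, and the `run_option` helper are all hypothetical names and design choices introduced here for illustration only. The sketch shows one property the abstract describes: a single high-level decision can be sampled at different interaction frequencies without changing the underlying action trajectory.

```python
import math
from dataclasses import dataclass


@dataclass
class Option:
    """Hypothetical open-loop option: a duration plus start and end actions."""
    duration: float      # seconds the option stays active (chosen by the agent)
    start_action: float  # action emitted when the option begins
    end_action: float    # action reached when the option expires

    def action(self, t: float) -> float:
        """Action at time t since the option started (smooth cosine blend)."""
        phase = min(max(t / self.duration, 0.0), 1.0)
        blend = 0.5 * (1.0 - math.cos(math.pi * phase))
        return self.start_action + blend * (self.end_action - self.start_action)


def run_option(option: Option, dt: float) -> list[float]:
    """Sample the option's action every dt seconds until the option expires."""
    steps = max(1, round(option.duration / dt))
    return [option.action(i * dt) for i in range(steps)]


# One high-level decision, executed at two different interaction frequencies.
option = Option(duration=0.5, start_action=0.0, end_action=1.0)
coarse = run_option(option, dt=0.1)   # low-frequency interaction
fine = run_option(option, dt=0.01)    # high-frequency interaction
```

In this sketch both runs correspond to a single agent decision lasting 0.5 seconds. Only the number of low-level actions sent to the system differs, and the two sampled trajectories agree wherever their sample times coincide.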
