Paper Title
Learning Guidance Rewards with Trajectory-space Smoothing
Paper Authors
Paper Abstract
Long-term temporal credit assignment is an important challenge in deep reinforcement learning (RL). It refers to the ability of the agent to attribute actions to consequences that may occur after a long time interval. Existing policy-gradient and Q-learning algorithms typically rely on dense environmental rewards that provide rich short-term supervision and help with credit assignment. However, they struggle to solve tasks with delays between an action and the corresponding rewarding feedback. To make credit assignment easier, recent works have proposed algorithms to learn dense "guidance" rewards that could be used in place of the sparse or delayed environmental rewards. This paper is in the same vein -- starting with a surrogate RL objective that involves smoothing in the trajectory-space, we arrive at a new algorithm for learning guidance rewards. We show that the guidance rewards have an intuitive interpretation, and can be obtained without training any additional neural networks. Due to the ease of integration, we use the guidance rewards in a few popular algorithms (Q-learning, Actor-Critic, Distributional-RL) and present results in single-agent and multi-agent tasks that elucidate the benefit of our approach when the environmental rewards are sparse or delayed.
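For concreteness, below is a minimal sketch of the general idea described in the abstract: deriving a dense per-step "guidance" reward from episodic returns alone, with no additional neural networks, and substituting it for the sparse or delayed environment reward in a standard Q-learning update. This is not the paper's reference implementation; the particular choice of giving every transition its trajectory's return normalized to [0, 1], and all names such as GuidanceRewardBuffer and q_update, are illustrative assumptions.

```python
# Hypothetical sketch: replace a sparse/delayed episodic reward with a dense
# per-step guidance reward computed purely from observed episodic returns,
# then use it in an ordinary tabular Q-learning update.
import numpy as np


class GuidanceRewardBuffer:
    """Tracks episodic returns and converts them into dense per-step rewards."""

    def __init__(self):
        self.r_min, self.r_max = np.inf, -np.inf

    def update(self, episodic_return):
        # Maintain running bounds over the episodic returns seen so far.
        self.r_min = min(self.r_min, episodic_return)
        self.r_max = max(self.r_max, episodic_return)

    def guidance_reward(self, episodic_return):
        # Dense proxy reward shared by every transition of the trajectory:
        # the trajectory's return rescaled to [0, 1] relative to past returns.
        denom = max(self.r_max - self.r_min, 1e-8)
        return (episodic_return - self.r_min) / denom


def q_update(Q, trajectory, episodic_return, buffer, alpha=0.1, gamma=0.99):
    """One pass of tabular Q-learning over a finished trajectory, using the
    guidance reward in place of the delayed environment reward."""
    buffer.update(episodic_return)
    r_hat = buffer.guidance_reward(episodic_return)
    for s, a, s_next, done in trajectory:
        target = r_hat + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Because the guidance reward is computed from quantities the agent already observes (episodic returns), the same substitution can in principle be dropped into other value-based or actor-critic learners, which is consistent with the abstract's claim of easy integration.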