Paper Title

PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm

Authors

Toygun Basaklar, Suat Gumussoy, Umit Y. Ogras

Abstract

Multi-objective reinforcement learning (MORL) approaches have emerged to tackle many real-world problems with multiple conflicting objectives by maximizing a joint objective function weighted by a preference vector. These approaches find fixed customized policies corresponding to preference vectors specified during training. However, the design constraints and objectives typically change dynamically in real-life scenarios. Furthermore, storing a policy for each potential preference is not scalable. Hence, obtaining a set of Pareto front solutions for the entire preference space in a given domain with a single training is critical. To this end, we propose a novel MORL algorithm that trains a single universal network to cover the entire preference space scalable to continuous robotic tasks. The proposed approach, Preference-Driven MORL (PD-MORL), utilizes the preferences as guidance to update the network parameters. It also employs a novel parallelization approach to increase sample efficiency. We show that PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches.
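
The abstract relies on two technical ingredients: scalarizing a vector-valued reward with a preference vector (the "joint objective function weighted by a preference vector"), and the hypervolume metric used to compare Pareto fronts. The following is a minimal Python sketch of these two ideas under simple assumptions, not the authors' implementation; the function names (scalarize, hypervolume_2d), the weighted-sum scalarization, and the reference-point convention are illustrative choices.

# Minimal sketch (illustrative, not PD-MORL's actual code) of:
#   (1) preference-weighted scalarization of a vector reward, and
#   (2) the hypervolume of a 2-objective Pareto front (maximization).
import numpy as np

def scalarize(vector_reward: np.ndarray, preference: np.ndarray) -> float:
    """Weighted-sum scalarization: joint objective = preference-weighted
    combination of the per-objective rewards."""
    preference = preference / preference.sum()  # preferences lie on the simplex
    return float(np.dot(preference, vector_reward))

def hypervolume_2d(front: np.ndarray, ref_point: np.ndarray) -> float:
    """Area dominated by a 2-objective Pareto front (to be maximized) and
    bounded below by ref_point."""
    pts = front[np.argsort(-front[:, 0])]  # sort by first objective, descending
    hv, prev_y = 0.0, ref_point[1]
    for x, y in pts:
        if y > prev_y:  # accumulate only the non-dominated horizontal strips
            hv += (x - ref_point[0]) * (y - prev_y)
            prev_y = y
    return hv

if __name__ == "__main__":
    # The same 2-objective reward scalarized under two different preferences.
    r = np.array([1.0, 3.0])
    print(scalarize(r, np.array([0.8, 0.2])))  # favors objective 1
    print(scalarize(r, np.array([0.2, 0.8])))  # favors objective 2

    # Hypervolume of a small Pareto front relative to the origin (equals 8.0).
    front = np.array([[1.0, 4.0], [2.0, 3.0], [3.0, 1.0]])
    print(hypervolume_2d(front, ref_point=np.array([0.0, 0.0])))

Because the preference vector lives on a simplex, a single network conditioned on it (as the abstract describes) can in principle cover the entire preference space, rather than storing one policy per preference; the hypervolume then summarizes how well the resulting set of solutions spans the Pareto front.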
