Paper Title

PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm

Authors

Toygun Basaklar, Suat Gumussoy, Umit Y. Ogras

Abstract

Multi-objective reinforcement learning (MORL) approaches have emerged to tackle many real-world problems with multiple conflicting objectives by maximizing a joint objective function weighted by a preference vector. These approaches find fixed customized policies corresponding to preference vectors specified during training. However, the design constraints and objectives typically change dynamically in real-life scenarios. Furthermore, storing a policy for each potential preference is not scalable. Hence, obtaining a set of Pareto front solutions for the entire preference space in a given domain with a single training is critical. To this end, we propose a novel MORL algorithm that trains a single universal network to cover the entire preference space scalable to continuous robotic tasks. The proposed approach, Preference-Driven MORL (PD-MORL), utilizes the preferences as guidance to update the network parameters. It also employs a novel parallelization approach to increase sample efficiency. We show that PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches.
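
The abstract relies on two technical ingredients: scalarizing a vector-valued reward with a preference vector (the "joint objective function weighted by a preference vector"), and the hypervolume metric used to compare Pareto fronts. The following is a minimal Python sketch of these two ideas under simple assumptions, not the authors' implementation; the function names (scalarize, hypervolume_2d), the weighted-sum scalarization, and the reference-point convention are illustrative choices.

# Minimal sketch (illustrative, not PD-MORL's actual code) of:
#   (1) preference-weighted scalarization of a vector reward, and
#   (2) the hypervolume of a 2-objective Pareto front (maximization).
import numpy as np

def scalarize(vector_reward: np.ndarray, preference: np.ndarray) -> float:
    """Weighted-sum scalarization: joint objective = preference-weighted
    combination of the per-objective rewards."""
    preference = preference / preference.sum()  # preferences lie on the simplex
    return float(np.dot(preference, vector_reward))

def hypervolume_2d(front: np.ndarray, ref_point: np.ndarray) -> float:
    """Area dominated by a 2-objective Pareto front (to be maximized) and
    bounded below by ref_point."""
    pts = front[np.argsort(-front[:, 0])]  # sort by first objective, descending
    hv, prev_y = 0.0, ref_point[1]
    for x, y in pts:
        if y > prev_y:  # accumulate only the non-dominated horizontal strips
            hv += (x - ref_point[0]) * (y - prev_y)
            prev_y = y
    return hv

if __name__ == "__main__":
    # The same 2-objective reward scalarized under two different preferences.
    r = np.array([1.0, 3.0])
    print(scalarize(r, np.array([0.8, 0.2])))  # favors objective 1
    print(scalarize(r, np.array([0.2, 0.8])))  # favors objective 2

    # Hypervolume of a small Pareto front relative to the origin (equals 8.0).
    front = np.array([[1.0, 4.0], [2.0, 3.0], [3.0, 1.0]])
    print(hypervolume_2d(front, ref_point=np.array([0.0, 0.0])))

Because the preference vector lives on a simplex, a single network conditioned on it (as the abstract describes) can in principle cover the entire preference space, rather than storing one policy per preference; the hypervolume then summarizes how well the resulting set of solutions spans the Pareto front.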
