Paper Title

Multi-Objective Coordination Graphs for the Expected Scalarised Returns with Generative Flow Models

Paper Authors

Hayes, Conor F., Verstraeten, Timothy, Roijers, Diederik M., Howley, Enda, Mannion, Patrick

Paper Abstract

Many real-world problems contain multiple objectives and agents, where a trade-off exists between objectives. Key to solving such problems is to exploit sparse dependency structures that exist between agents. For example, in wind farm control a trade-off exists between maximising power and minimising stress on the system's components. Dependencies between turbines arise due to the wake effect. We model such sparse dependencies between agents as a multi-objective coordination graph (MO-CoG). In multi-objective reinforcement learning a utility function is typically used to model a user's preferences over objectives, which may be unknown a priori. In such settings a set of optimal policies must be computed. Which policies are optimal depends on which optimality criterion applies. If the utility function of a user is derived from multiple executions of a policy, the scalarised expected returns (SER) criterion must be optimised. If the utility of a user is derived from a single execution of a policy, the expected scalarised returns (ESR) criterion must be optimised. For example, wind farms are subject to constraints and regulations that must be adhered to at all times, therefore the ESR criterion must be optimised. For MO-CoGs, the state-of-the-art algorithms can only compute a set of optimal policies for the SER criterion, leaving the ESR criterion understudied. To compute a set of optimal policies under the ESR criterion, also known as the ESR set, distributions over the returns must be maintained. Therefore, to compute a set of optimal policies under the ESR criterion for MO-CoGs, we present a novel distributional multi-objective variable elimination (DMOVE) algorithm. We evaluate DMOVE in realistic wind farm simulations. Given that the returns in real-world wind farm settings are continuous, we utilise a model known as Real-NVP to learn the continuous return distributions to calculate the ESR set.
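The SER/ESR distinction in the abstract can be made concrete with a small numeric sketch. SER scalarises the expected return, u(E[R]), while ESR takes the expected utility of each single-execution return, E[u(R)]; for a nonlinear utility u these generally differ. The two-objective returns, probabilities, and utility function below are invented purely for illustration and are not from the paper:

```python
import numpy as np

# Two equally likely vector returns (power, negative stress) of a single
# policy execution -- hypothetical values for illustration only.
returns = np.array([[10.0, -2.0],
                    [2.0, -10.0]])
probs = np.array([0.5, 0.5])

# A nonlinear (concave) utility over objectives, e.g. penalising high stress.
def u(r):
    power, neg_stress = r
    return power + neg_stress - 0.1 * neg_stress**2

# SER: utility of the expected vector return, u(E[R]).
ser = u(probs @ returns)

# ESR: expected utility over single-execution returns, E[u(R)].
esr = probs @ np.array([u(r) for r in returns])

print(f"SER value: {ser:.2f}")  # -3.60
print(f"ESR value: {esr:.2f}")  # -5.20
```

Because the two criteria assign different values to the same stochastic policy, the sets of optimal policies under SER and ESR differ in general, which is why ESR requires maintaining return distributions rather than expected return vectors.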
