Paper Title


Socially Fair Reinforcement Learning

Paper Authors

Debmalya Mandal, Jiarui Gan

Abstract


We consider the problem of episodic reinforcement learning where there are multiple stakeholders with different reward functions. Our goal is to output a policy that is socially fair with respect to these different reward functions. Prior works have proposed different objectives that a fair policy must optimize, including minimum welfare and generalized Gini welfare. We first take an axiomatic view of the problem and propose four axioms that any such fair objective must satisfy. We show that the Nash social welfare is the unique objective that satisfies all four axioms, whereas prior objectives fail to satisfy all of them. We then consider the learning version of the problem, where the underlying model, i.e., the Markov decision process, is unknown. We consider the problem of minimizing regret with respect to the fair policies maximizing three different fair objectives -- minimum welfare, generalized Gini welfare, and Nash social welfare. Based on optimistic planning, we propose a generic learning algorithm and derive its regret bound with respect to the three different policies. For the objective of Nash social welfare, we also derive a lower bound on regret that grows exponentially with $n$, the number of agents. Finally, we show that for the objective of minimum welfare, one can improve the regret by a factor of $O(H)$ under a weaker notion of regret.
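As a quick illustration of the three fairness objectives mentioned in the abstract, the following minimal Python sketch computes minimum welfare, generalized Gini welfare, and Nash social welfare from a vector of per-agent values. The example numbers and the weight vector are hypothetical and only reflect the standard textbook definitions, not the paper's exact formulation or algorithm.

```python
import numpy as np

def min_welfare(values):
    """Minimum (egalitarian) welfare: the value of the worst-off agent."""
    return np.min(values)

def generalized_gini_welfare(values, weights):
    """Generalized Gini welfare: weighted sum of values sorted in increasing
    order, with non-increasing weights so worse-off agents count more."""
    return float(np.dot(np.sort(values), weights))

def nash_social_welfare(values):
    """Nash social welfare: product of the agents' (non-negative) values."""
    return float(np.prod(values))

# Hypothetical per-agent expected returns of a candidate policy.
v = np.array([0.9, 0.4, 0.7])
w = np.array([0.5, 0.3, 0.2])  # non-increasing weights for the Gini objective
print(min_welfare(v), generalized_gini_welfare(v, w), nash_social_welfare(v))
```

Under these definitions, a socially fair policy would be chosen to maximize one of these welfare functions over the vector of per-stakeholder expected returns.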
