Paper Title

Improved Regret for Differentially Private Exploration in Linear MDP

Authors

Dung Daniel Ngo, Giuseppe Vietri, Zhiwei Steven Wu

Abstract

We study privacy-preserving exploration in sequential decision-making for environments that rely on sensitive data such as medical records. In particular, we focus on solving the problem of reinforcement learning (RL) subject to the constraint of (joint) differential privacy in the linear MDP setting, where both dynamics and rewards are given by linear functions. Prior work on this problem due to Luyo et al. (2021) achieves a regret rate that has a dependence of $O(K^{3/5})$ on the number of episodes $K$. We provide a private algorithm with an improved regret rate with an optimal dependence of $O(\sqrt{K})$ on the number of episodes. The key recipe for our stronger regret guarantee is the adaptivity in the policy update schedule, in which an update only occurs when sufficient changes in the data are detected. As a result, our algorithm benefits from low switching cost and only performs $O(\log(K))$ updates, which greatly reduces the amount of privacy noise. Finally, in the most prevalent privacy regimes where the privacy parameter $\varepsilon$ is a constant, our algorithm incurs negligible privacy cost -- in comparison with the existing non-private regret bounds, the additional regret due to privacy appears in lower-order terms.
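The abstract attributes the improvement to an adaptive policy update schedule that triggers only when the data has changed sufficiently. Below is a minimal sketch of one standard way such a low-switching schedule is realized in linear MDPs: a determinant-doubling check on the running Gram matrix of features, which yields at most $O(d \log K)$ policy switches. The names (`Lambda`, `d`, the doubling factor 2) are illustrative assumptions; the paper's exact switching condition and its privacy mechanism are not specified in the abstract.

```python
import numpy as np

# Illustrative sketch only (assumed mechanism, not the paper's exact algorithm):
# trigger a policy update when the Gram matrix determinant has doubled since the
# last update, a common recipe for O(d log K) switches in linear MDPs.

d = 10                       # feature dimension (assumed for illustration)
Lambda = np.eye(d)           # running Gram matrix of observed features
Lambda_last = np.eye(d)      # Gram matrix at the time of the last policy update
num_updates = 0

def observe_transition(phi: np.ndarray) -> None:
    """Accumulate one feature vector and update the policy only if data changed enough."""
    global Lambda, Lambda_last, num_updates
    Lambda = Lambda + np.outer(phi, phi)
    # Determinant-doubling trigger: update rarely, so less privacy noise is injected.
    if np.linalg.det(Lambda) > 2.0 * np.linalg.det(Lambda_last):
        num_updates += 1
        Lambda_last = Lambda.copy()
        # ... recompute the (privatized) least-squares value estimates / policy here ...
```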
