Paper Title
Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality
Paper Authors
Paper Abstract
This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall outcomes for a given population. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities for certain actions. In this paper, we propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) of policy values rather than their point estimates. The LCBs are constructed using knowledge of the behavior policies that collected the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound on the suboptimality of our algorithm that depends only on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class we optimize over. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized-type concentration inequality for inverse-propensity-weighting estimators, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data. We complement our theory with an efficient optimization algorithm via Majorization-Minimization and policy tree search, as well as extensive simulation studies and real-world applications that demonstrate the efficacy of PPL.
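As a rough illustration of the pessimism principle described in the abstract, the sketch below selects, from a finite policy class, the policy maximizing an inverse-propensity-weighted value estimate minus an empirical-Bernstein-style penalty. This is a minimal sketch under stated assumptions: the function names (ipw_lcb, pessimistic_policy_learning) and the exact penalty form are illustrative choices, not the paper's self-normalized bound, and the behavior-policy propensities are assumed known, as in the PPL setting.

```python
import numpy as np

def ipw_lcb(policy, X, A, R, propensities, delta=0.05):
    """Lower confidence bound on the value of `policy` from logged data.

    policy:       callable mapping a covariate vector to an action.
    X, A, R:      covariates, logged actions, observed rewards.
    propensities: behavior-policy probabilities of the logged actions
                  (assumed known, as in the PPL setting).
    The penalty uses a generic empirical-Bernstein form (variance term plus a
    lower-order range term); the paper's self-normalized bound differs in detail.
    """
    n = len(R)
    # IPW terms are nonzero only where the target policy matches the logged action.
    match = np.array([policy(x) == a for x, a in zip(X, A)], dtype=float)
    terms = match * np.asarray(R, dtype=float) / np.asarray(propensities, dtype=float)

    mean = terms.mean()
    var = terms.var(ddof=1)
    rng = terms.max() - terms.min()

    log_term = np.log(3.0 / delta)
    penalty = np.sqrt(2.0 * var * log_term / n) + 3.0 * rng * log_term / n
    return mean - penalty


def pessimistic_policy_learning(policy_class, X, A, R, propensities, delta=0.05):
    """Return the policy in a finite class with the largest lower confidence bound."""
    return max(policy_class, key=lambda pi: ipw_lcb(pi, X, A, R, propensities, delta))
```

Maximizing the LCB rather than the point estimate penalizes policies whose value estimates are highly variable, such as those relying on actions with vanishing propensities, which is why the returned policy's guarantee depends only on the overlap of the optimal policy rather than on a uniform overlap condition.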