Title
Reinforcement Learning with Heterogeneous Data: Estimation and Inference
Authors
Abstract
Reinforcement Learning (RL) has the promise of providing data-driven support for decision-making in a wide range of problems in healthcare, education, business, and other domains. Classical RL methods focus on the mean of the total return and, thus, may provide misleading results in the setting of the heterogeneous populations that commonly underlie large-scale datasets. We introduce the K-Heterogeneous Markov Decision Process (K-Hetero MDP) to address sequential decision problems with population heterogeneity. We propose the Auto-Clustered Policy Evaluation (ACPE) for estimating the value of a given policy, and the Auto-Clustered Policy Iteration (ACPI) for estimating the optimal policy in a given policy class. Our auto-clustered algorithms can automatically detect and identify homogeneous sub-populations, while estimating the Q function and the optimal policy for each sub-population. We establish convergence rates and construct confidence intervals for the estimators obtained by the ACPE and ACPI. We present simulations to support our theoretical findings, and we conduct an empirical study on the standard MIMIC-III dataset. The latter analysis shows evidence of value heterogeneity and confirms the advantages of our new method.
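To make the heterogeneity claim concrete, here is a minimal sketch in our own notation (the symbols V^{(k)}, P^{(k)}, r^{(k)}, and w_k are illustrative assumptions, not taken from the paper). Suppose trajectories are drawn from K latent sub-populations, each with its own transition kernel and reward, so each group k has its own value under a policy \pi:

% Group-specific value of a policy \pi (illustrative notation)
\[
  V^{(k)}(\pi) \;=\; \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r^{(k)}(S_t, A_t) \;\middle|\; A_t \sim \pi(\cdot \mid S_t),\; S_{t+1} \sim P^{(k)}(\cdot \mid S_t, A_t)\right].
\]
% A classical mean-return estimator targets only the population mixture
\[
  \bar V(\pi) \;=\; \sum_{k=1}^{K} w_k\, V^{(k)}(\pi), \qquad w_k = \Pr(\text{trajectory belongs to group } k),
\]

which can sit far from every group's value: with K = 2, w_1 = w_2 = 1/2, V^{(1)}(\pi) = +1 and V^{(2)}(\pi) = -1, the mean \bar V(\pi) = 0 describes neither sub-population, which is the sense in which the abstract says mean-focused methods "may provide misleading results."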