Paper Title

Policy-Adaptive Estimator Selection for Off-Policy Evaluation

Authors

Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno

Abstract

Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual policies using only offline logged data. Although many estimators have been developed, there is no single estimator that dominates the others, because the estimators' accuracy can vary greatly depending on a given OPE task such as the evaluation policy, number of actions, and noise level. Thus, the data-driven estimator selection problem is becoming increasingly important and can have a significant impact on the accuracy of OPE. However, identifying the most accurate estimator using only the logged data is quite challenging because the ground-truth estimation accuracy of estimators is generally unavailable. This paper studies this challenging problem of estimator selection for OPE for the first time. In particular, we enable an estimator selection that is adaptive to a given OPE task, by appropriately subsampling available logged data and constructing pseudo policies useful for the underlying estimator selection task. Comprehensive experiments on both synthetic and real-world company data demonstrate that the proposed procedure substantially improves the estimator selection compared to a non-adaptive heuristic.
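
To make the estimator selection problem concrete, here is a minimal sketch in Python. It is not the procedure proposed in the paper. It assumes a context-free synthetic bandit in which the true expected rewards are known, so the mean squared error (MSE) of each estimator can be computed directly; with real logged data this ground truth is unavailable, which is the difficulty the abstract describes. The two estimators compared, inverse propensity scoring (IPS) and self-normalized IPS (SNIPS), are standard. All function names and parameter values (`sample_logged_data`, `estimator_mse`, `select_estimator`, the policies, and the reward vector) are hypothetical and chosen only for illustration.

```python
import random


def true_value(pi_e, q):
    """Ground-truth value of policy pi_e; known only because the data is synthetic."""
    return sum(p * r for p, r in zip(pi_e, q))


def sample_logged_data(pi_b, q, n, rng):
    """Draw n (action, reward) pairs under the behavior policy pi_b.

    Rewards are Bernoulli with success probability q[action].
    """
    actions = rng.choices(range(len(pi_b)), weights=pi_b, k=n)
    return [(a, 1.0 if rng.random() < q[a] else 0.0) for a in actions]


def ips(data, pi_e, pi_b):
    """Inverse propensity scoring: importance-weighted mean reward."""
    return sum(pi_e[a] / pi_b[a] * r for a, r in data) / len(data)


def snips(data, pi_e, pi_b):
    """Self-normalized IPS: normalize by the sum of importance weights."""
    weights = [pi_e[a] / pi_b[a] for a, _ in data]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * r for w, (_, r) in zip(weights, data)) / total


def estimator_mse(estimator, pi_e, pi_b, q, n, n_trials, seed=0):
    """Monte Carlo estimate of an estimator's MSE against the true policy value."""
    rng = random.Random(seed)
    target = true_value(pi_e, q)
    errors = []
    for _ in range(n_trials):
        data = sample_logged_data(pi_b, q, n, rng)
        errors.append((estimator(data, pi_e, pi_b) - target) ** 2)
    return sum(errors) / n_trials


def select_estimator(estimators, pi_e, pi_b, q, n=200, n_trials=300):
    """Return the name of the lowest-MSE estimator and all MSE values.

    This oracle selection relies on the true rewards q, so it only works
    in simulation. It illustrates the target of estimator selection, not
    a method that can be run on real logged data.
    """
    mses = {
        name: estimator_mse(est, pi_e, pi_b, q, n, n_trials)
        for name, est in estimators.items()
    }
    return min(mses, key=mses.get), mses


# Illustrative (assumed) task: the evaluation policy favors the action
# that the behavior policy rarely chose.
pi_b = [0.7, 0.2, 0.1]   # behavior (logging) policy
pi_e = [0.1, 0.2, 0.7]   # evaluation policy
q = [0.2, 0.5, 0.8]      # true expected reward per action

best, mses = select_estimator({"IPS": ips, "SNIPS": snips}, pi_e, pi_b, q)
print(best, mses)
```

Changing `pi_e`, the number of actions, or the reward vector `q` changes the MSE of each estimator, which is one way to see in simulation why the choice of estimator depends on the OPE task.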
