Title
Filtered Poisson Process Bandit on a Continuum
Authors
Abstract
We consider a version of the continuum armed bandit where an action induces a filtered realisation of a non-homogeneous Poisson process. Point data in the filtered sample are then revealed to the decision-maker, whose reward is the total number of revealed points. Using knowledge of the function governing the filtering, but without knowledge of the Poisson intensity function, the decision-maker seeks to maximise the expected number of revealed points over T rounds. We propose an upper confidence bound algorithm for this problem utilising data-adaptive discretisation of the action space. This approach enjoys O(T^(2/3)) regret under a Lipschitz assumption on the reward function. We provide lower bounds on the regret of any algorithm for the problem, via new lower bounds for related finite-armed bandits, and show that the orders of the upper and lower bounds match up to a logarithmic factor.
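To make the setting concrete, the following is a minimal Python sketch of one possible instance. The intensity function, the filtering function, and the confidence-bonus scaling are all invented for illustration, and a fixed K-arm discretisation with a standard UCB rule stands in for the paper's data-adaptive discretisation; it is a sketch of the problem structure, not the paper's algorithm.

```python
import math

import numpy as np

# Assumed upper bound on the intensity, used both for thinning and
# (heuristically) to scale the UCB exploration bonus.
LAMBDA_MAX = 10.0


def intensity(t):
    """Non-homogeneous Poisson intensity lambda(t) on [0, 1] (illustrative choice)."""
    return 5.0 * (1.0 + math.sin(2.0 * math.pi * t))


def filter_prob(action, t):
    """Probability that a point at location t is revealed under `action` (illustrative)."""
    return math.exp(-3.0 * abs(action - t))


def pull(action, rng):
    """One round: sample the process by thinning, apply the filter, count revealed points."""
    n = rng.poisson(LAMBDA_MAX)                 # dominating homogeneous process
    points = rng.uniform(0.0, 1.0, size=n)
    kept = [t for t in points if rng.random() < intensity(t) / LAMBDA_MAX]
    return sum(1 for t in kept if rng.random() < filter_prob(action, t))


def ucb(T=2000, K=8, seed=0):
    """UCB over a fixed K-arm discretisation of the action space [0, 1]."""
    rng = np.random.default_rng(seed)
    centres = [(k + 0.5) / K for k in range(K)]
    counts = [0] * K
    sums = [0.0] * K
    total = 0
    for t in range(1, T + 1):
        if t <= K:                              # play every arm once
            k = t - 1
        else:                                   # pick the arm maximising the upper bound
            k = max(
                range(K),
                key=lambda i: sums[i] / counts[i]
                + LAMBDA_MAX * math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = pull(centres[k], rng)
        counts[k] += 1
        sums[k] += reward
        total += reward
    return total, counts
```

Under the Lipschitz assumption on the reward function, tuning K against T (roughly K of order T^(1/3)) is what yields the O(T^(2/3)) regret order mentioned above; the data-adaptive discretisation refines this fixed grid where the observed rewards warrant it.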