Stochastic Conservative Contextual Linear Bandits

Paper title

Stochastic Conservative Contextual Linear Bandits

Authors

Lin, Jiabin; Lee, Xian Yeow; Jubery, Talukder; Moothedath, Shana; Sarkar, Soumik; Ganapathysubramanian, Baskar

Abstract

Many physical systems have underlying safety considerations that require the deployed strategy to satisfy a set of constraints. Further, we often have only partial information on the state of the system. We study the problem of safe real-time decision making under uncertainty. In this paper, we formulate a conservative stochastic contextual bandit problem for real-time decision making in which an adversary chooses a distribution on the set of possible contexts and the learner is subject to certain safety/performance constraints. The learner observes only the context distribution, while the exact context is unknown. The goal is to develop an algorithm that selects a sequence of optimal actions to maximize the cumulative reward without violating the safety constraints at any time step. By leveraging the UCB algorithm for this setting, we propose a conservative linear UCB algorithm for stochastic bandits with context distribution. We prove an upper bound on the regret of the algorithm and show that it can be decomposed into three terms: (i) an upper bound for the regret of the standard linear UCB algorithm, (ii) a constant term (independent of the time horizon) that accounts for the loss of being conservative in order to satisfy the safety constraint, and (iii) a constant term (independent of the time horizon) that accounts for the loss due to the contexts being unknown and only the distribution being known. To validate the performance of our approach, we perform extensive simulations on synthetic data and on real-world maize data collected through the Genomes to Fields (G2F) initiative.
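
To make the decision rule in the abstract concrete, the following is a minimal sketch of one round of a conservative linear UCB step with a context distribution. It is not the authors' algorithm. It is a simplified illustration under several assumptions that do not come from the abstract: the context set is finite with known probabilities, the baseline policy has a known, constant expected reward `r_base` per round, the confidence radius `beta` is a fixed number, and the safety constraint requires cumulative reward to stay above a fraction `1 - alpha` of the baseline's cumulative reward. All function and parameter names are hypothetical.

```python
import numpy as np

def expected_features(features, probs):
    """Average feature vectors over the observed context distribution.

    features: (n_contexts, n_actions, d) array, probs: (n_contexts,) array.
    Returns an (n_actions, d) array of expected features per action.
    """
    return np.tensordot(probs, features, axes=1)

def conservative_ucb_step(psi, theta_hat, V_inv, beta, z, n_base, t, r_base, alpha):
    """Pick the optimistic action if it is provably safe, else fall back.

    psi:       (n_actions, d) expected features for this round
    theta_hat: (d,) current ridge-regression estimate (assumed given)
    V_inv:     (d, d) inverse of the regularized design matrix
    z:         (d,) sum of expected features of earlier non-baseline plays
    n_base:    number of earlier rounds in which the baseline was played
    t:         current round index, starting at 1
    Returns the chosen action index, or None to signal "play the baseline".
    """
    # Optimistic (UCB) choice over the expected features.
    means = psi @ theta_hat
    widths = beta * np.sqrt(np.einsum("ad,de,ae->a", psi, V_inv, psi))
    a = int(np.argmax(means + widths))

    # Pessimistic (LCB) estimate of cumulative reward if action a is played now.
    cand = z + psi[a]
    lcb = cand @ theta_hat - beta * np.sqrt(cand @ V_inv @ cand)

    # Assumed safety test: stay above a (1 - alpha) fraction of the baseline.
    safe = lcb + n_base * r_base >= (1.0 - alpha) * t * r_base
    return a if safe else None
```

A return value of `None` means the learner plays the baseline action for that round. In this sketch the learner acts conservatively while its confidence widths are large and switches to the optimistic action once the lower bound clears the threshold, which is the intuition behind the constant "cost of being conservative" term mentioned in the abstract.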
