Paper Title
Learning a Discrete Set of Optimal Allocation Rules in a Queueing System with Unknown Service Rate
Paper Authors
Paper Abstract
Motivated by the wide range of modern applications of the Erlang-B blocking model beyond communication networks and call centers, to sizing and pricing in the design of production systems, messaging systems, and app-based parking systems, we study admission control for such a system but with unknown arrival and service rates. In our model, at every job arrival, a dispatcher decides either to assign the job to an available server or to block it. Every served job yields a fixed reward for the dispatcher but also incurs a cost per unit time of service. Our goal is to design a dispatching policy that maximizes the long-term average reward for the dispatcher based on observing only the arrival times and the state of the system at each arrival, which reflects a realistic sampling of such systems. Critically, the dispatcher observes neither the service times nor the departure times, so standard reinforcement-learning approaches that rely on reward signals do not apply. Hence, we develop our learning-based dispatch scheme as a parametric learning problem à la self-tuning adaptive control. In our problem, certainty-equivalent control switches between an "always admit if room" policy (which explores infinitely often) and a "never admit" policy (which immediately terminates learning), a dichotomy distinct from the adaptive control literature. Our learning scheme therefore uses the "always admit if room" policy judiciously so that learning does not stall. We prove that, for all service rates, the proposed policy asymptotically learns to take the optimal action, and we present finite-time regret guarantees. The extreme contrast between the certainty-equivalent optimal control policies leads to learning difficulties that show up in our regret bounds across parameter regimes: constant regret in one regime versus logarithmically growing regret in the other.
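
To make the decision structure concrete, below is a minimal, self-contained Python sketch. It is not the paper's algorithm or estimator; it only illustrates the kind of certainty-equivalent admission rule the abstract describes: admit when the estimated expected net value of serving a job, R - c/mu_hat, is positive; estimate the unknown service rate using only arrival times and the occupancy observed at arrivals; and force occasional "always admit if room" epochs so that learning does not stall. The parameter names (lam, mu, N, R, c), the crude moment-matching estimator, and the exploration schedule are illustrative assumptions, not the paper's choices.

```python
import random

def ce_admit_decision(mu_hat, R, c):
    """Certainty-equivalent rule (illustrative): admit when the estimated
    expected net value of serving one job, R - c / mu_hat, is positive."""
    return R - c / mu_hat > 0


def simulate(lam=2.0, mu=1.5, N=5, R=1.0, c=1.2, horizon=20_000, seed=0):
    """Toy simulation of certainty-equivalent admission control with forced
    exploration. The service rate mu is never observed directly; it is
    estimated only from arrival times and the occupancy seen at arrivals."""
    rng = random.Random(seed)
    explore = {2 ** k for k in range(20)}   # sparse "always admit if room" epochs
    busy = []                               # residual (hidden) service times
    t = 0.0
    arrivals = 0
    occ_sum, occ_samples = 0.0, 0           # occupancy observed at arrivals while admitting
    admitting = True                        # start in an exploratory mode
    reward = 0.0
    for i in range(horizon):
        dt = rng.expovariate(lam)           # time until next arrival
        t += dt
        busy = [s - dt for s in busy if s - dt > 0]   # hidden departures in between
        arrivals += 1
        if admitting:                       # occupancy is informative only if we have been admitting
            occ_sum += len(busy)
            occ_samples += 1
        lam_hat = arrivals / t
        # Crude moment-matching estimate (assumption): mean occupancy ~ lam/mu
        # when blocking is rare, so mu_hat ~ lam_hat / average occupancy.
        mu_hat = lam_hat * occ_samples / occ_sum if occ_sum > 0 else 1.0
        admitting = (i in explore) or ce_admit_decision(mu_hat, R, c)
        if admitting and len(busy) < N:
            s = rng.expovariate(mu)         # true service time, never seen by the policy
            busy.append(s)
            reward += R - c * s             # fixed reward minus cost per unit service time
    return reward / t                        # long-run average reward estimate


if __name__ == "__main__":
    print(f"estimated long-run average reward: {simulate():.3f}")
```

With the default (hypothetical) values R = 1.0, c = 1.2, and mu = 1.5, the expected net value per served job is R - c/mu = 0.2 > 0, so the rule should eventually settle into admitting whenever a server is free; lowering mu below c/R flips it toward blocking, which is exactly the regime where the forced-exploration epochs keep the estimate from stalling.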