Paper Title
Understanding Curriculum Learning in Policy Optimization for Online Combinatorial Optimization
Paper Authors
Paper Abstract
In recent years, reinforcement learning (RL) has started to show promising results in tackling combinatorial optimization (CO) problems, in particular when coupled with curriculum learning to facilitate training. Despite the emerging empirical evidence, theoretical understanding of why RL helps is still in its early stages. This paper presents the first systematic study of policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds for natural policy gradient (NPG) on LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, the Best Choice Problem (BCP), we formally prove that the distribution shift is reduced exponentially with curriculum learning, even if the curriculum is a randomly generated BCP on a smaller scale. Our theory also shows that we can simplify the curriculum learning scheme used in prior work from multi-step to single-step. Lastly, we provide extensive experiments on the Best Choice Problem, Online Knapsack, and AdWords to verify our findings.
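To make the setup concrete, below is a minimal sketch, not the paper's implementation: it casts the Best Choice Problem as an episodic environment and trains a tabular softmax policy with plain REINFORCE (rather than the NPG analyzed in the paper), first on a small problem size and then reusing those parameters at a larger size, loosely mirroring the single-step curriculum idea. The state features (binned fraction of candidates seen plus a best-so-far flag), the bin count, and all hyperparameters are assumptions made purely for illustration.

```python
# Illustrative sketch only: BCP (secretary problem) with a single-step
# curriculum over problem size. Not the paper's code; all design choices
# below (state binning, REINFORCE, hyperparameters) are assumptions.
import numpy as np

rng = np.random.default_rng(0)
N_BINS = 10  # discretization of "fraction of candidates seen so far"


def run_episode(theta, n):
    """Play one BCP episode of length n with a softmax policy over {reject, accept}.

    Returns the visited (state, action) pairs and the terminal reward
    (1 if the accepted candidate is the best overall, else 0).
    """
    values = rng.permutation(n)          # latent candidate ranks
    best_so_far = -1
    trajectory = []
    for t, v in enumerate(values):
        is_best = int(v > best_so_far)
        best_so_far = max(best_so_far, v)
        state = (min(int(N_BINS * t / n), N_BINS - 1), is_best)
        logits = theta[state]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        action = rng.choice(2, p=probs)  # 0 = reject, 1 = accept
        trajectory.append((state, action))
        if action == 1:
            return trajectory, float(v == n - 1)
    return trajectory, 0.0               # never accepted: reward 0


def reinforce(theta, n, episodes=20000, lr=0.1):
    """Plain REINFORCE updates on the binned tabular softmax policy."""
    for _ in range(episodes):
        traj, reward = run_episode(theta, n)
        for state, action in traj:
            logits = theta[state]
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            grad = -probs                 # d/dtheta log softmax(action)
            grad[action] += 1.0
            theta[state] += lr * reward * grad
    return theta


def success_rate(theta, n, trials=5000):
    return np.mean([run_episode(theta, n)[1] for _ in range(trials)])


# Single-step curriculum: train on a small BCP, then warm-start the large one.
theta = np.zeros((N_BINS, 2, 2))          # indexed by (bin, is_best, action)
theta = reinforce(theta, n=10)            # small-scale curriculum stage
print("after curriculum, n=100 success:", success_rate(theta, n=100))
theta = reinforce(theta, n=100)           # continue training at full scale
print("after fine-tuning, n=100 success:", success_rate(theta, n=100))
```

Because the state only encodes the fraction of candidates seen (not the absolute index), the policy learned on the small instance transfers directly to the larger one, which is what makes the warm start meaningful in this toy version.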