Paper Title

Stochastic convex optimization for provably efficient apprenticeship learning

Authors

Angeliki Kamoutsi, Goran Banjac, John Lygeros

Abstract

We consider large-scale Markov decision processes (MDPs) with an unknown cost function and employ stochastic convex optimization tools to address the problem of imitation learning, which consists of learning a policy from a finite set of expert demonstrations. We adopt the apprenticeship learning formalism, which carries the assumption that the true cost function can be represented as a linear combination of some known features. Existing inverse reinforcement learning algorithms come with strong theoretical guarantees, but are computationally expensive because they use reinforcement learning or planning algorithms as a subroutine. On the other hand, state-of-the-art policy-gradient-based algorithms (such as IM-REINFORCE, IM-TRPO, and GAIL) achieve significant empirical success in challenging benchmark tasks, but are not well understood in terms of theory. With an emphasis on non-asymptotic guarantees of performance, we propose a method that directly learns a policy from expert demonstrations, bypassing the intermediate step of learning the cost function, by formulating the problem as a single convex optimization problem over occupancy measures. We develop a computationally efficient algorithm and derive high-confidence regret bounds on the quality of the extracted policy, utilizing results from stochastic convex optimization and recent work on approximate linear programming for solving forward MDPs.
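To make the formulation concrete, the kind of convex program the abstract alludes to can be sketched as follows. This is a standard occupancy-measure formulation of apprenticeship learning, not necessarily the paper's exact program; the notation (occupancy measure μ, discount γ, initial distribution ν₀, transition kernel P, feature matrix Φ, empirical expert occupancy μ̂_E) is assumed here.

```latex
% Sketch of the occupancy-measure program (assumed notation, not verbatim from the paper)
\begin{align*}
\min_{\mu \ge 0} \; \max_{\|w\| \le 1} \;& w^\top\!\left(\Phi^\top \mu - \Phi^\top \hat{\mu}_E\right)
  \;=\; \min_{\mu \ge 0} \; \big\|\Phi^\top \mu - \Phi^\top \hat{\mu}_E\big\|_* \\
\text{s.t.} \;\; & \sum_{a} \mu(s',a)
  \;=\; (1-\gamma)\,\nu_0(s') \;+\; \gamma \sum_{s,a} P(s' \mid s,a)\,\mu(s,a)
  \qquad \forall s' ,
\end{align*}
```

where ‖·‖_* denotes the dual norm. Because the true cost is assumed linear in the features, the inner maximization over cost weights w collapses into a norm of the feature-expectation mismatch, leaving a single convex problem in μ; a stationary policy is then recovered as π(a|s) = μ(s,a)/Σ_{a'} μ(s,a'). A minimal numerical sketch of this program on a small tabular MDP, assuming the cvxpy modeling library, is below; it illustrates the formulation only and is not the paper's stochastic, large-scale algorithm.

```python
# A minimal sketch, assuming a small tabular MDP and cvxpy. This illustrates
# the occupancy-measure program above; it is not the paper's stochastic
# algorithm, and all names and sizes here are illustrative assumptions.
import numpy as np
import cvxpy as cp

S, A, K = 5, 2, 3                      # number of states, actions, features
gamma = 0.9                            # discount factor
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a, s'] = P(s' | s, a)
nu0 = np.ones(S) / S                                # initial-state distribution
Phi = rng.standard_normal((K, S, A))                # K feature maps over (s, a)
mu_E = rng.dirichlet(np.ones(S * A)).reshape(S, A)  # stand-in expert occupancy

mu = cp.Variable((S, A), nonneg=True)               # occupancy measure mu(s, a)

# Bellman flow constraints:
# sum_a mu(s', a) = (1 - gamma) nu0(s') + gamma sum_{s,a} P(s' | s, a) mu(s, a)
flow = [cp.sum(mu[sp, :]) ==
        (1 - gamma) * nu0[sp] + gamma * cp.sum(cp.multiply(P[:, :, sp], mu))
        for sp in range(S)]

# Feature expectations of the candidate occupancy measure and of the expert
feat = cp.hstack([cp.sum(cp.multiply(Phi[k], mu)) for k in range(K)])
feat_E = np.array([(Phi[k] * mu_E).sum() for k in range(K)])

# min_mu max_{||w||_2 <= 1} w^T (feat - feat_E)  =  min_mu ||feat - feat_E||_2
problem = cp.Problem(cp.Minimize(cp.norm(feat - feat_E, 2)), flow)
problem.solve()

# Extract a stationary policy from the optimal occupancy measure
mu_opt = mu.value
policy = mu_opt / mu_opt.sum(axis=1, keepdims=True)
print(np.round(policy, 3))
```

The paper's contribution, per the abstract, is to solve a problem of this type efficiently at scale with stochastic convex optimization and to bound the regret of the extracted policy with high confidence; the exact constraints, norm, and parameterization may differ from this sketch.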
