Paper Title
Sufficient Exploration for Convex Q-learning
Paper Authors
Paper Abstract
In recent years there has been a collective research effort to find new formulations of reinforcement learning that are simultaneously more efficient and more amenable to analysis. This paper concerns one approach that builds on Manne's linear programming (LP) formulation of optimal control. A primal version is called logistic Q-learning, and a dual variant is convex Q-learning. This paper focuses on the latter while building bridges with the former. The main contributions are as follows: (i) The dual of convex Q-learning is not precisely Manne's LP or a version of logistic Q-learning, but it has a similar structure that reveals the need for regularization to avoid over-fitting. (ii) A sufficient condition is obtained for a bounded solution to the Q-learning LP. (iii) Simulation studies reveal numerical challenges when addressing sampled-data systems based on a continuous-time model; the challenge is addressed using state-dependent sampling. The theory is illustrated with applications to examples from OpenAI Gym. It is shown that convex Q-learning is successful in cases where standard Q-learning diverges, such as the LQR problem.
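For context, the sketch below illustrates the kind of sample-based convex program suggested by the abstract's description of convex Q-learning: a parameterized Q-function, the relaxed Bellman inequality from Manne's LP enforced along observed transitions, and an explicit regularizer (the abstract notes that regularization is needed to avoid over-fitting). It is a minimal illustration under assumed conventions, not the paper's algorithm: the function name convex_q_sketch, the feature arrays phi and phi_next, the l2 penalty, the finite action set, and the linear function class are all hypothetical choices made for the example.

import numpy as np
import cvxpy as cp

def convex_q_sketch(phi, phi_next, cost, beta=0.95, reg=1e-3):
    # phi:      (N, d) features psi(x_k, u_k) at observed state-action pairs
    # phi_next: (N, A, d) features psi(x_{k+1}, a) for each of A candidate actions
    # cost:     (N,) observed one-step costs
    N, A, d = phi_next.shape
    theta = cp.Variable(d)
    q = phi @ theta  # Q_theta(x_k, u_k), affine in theta
    constraints = []
    for k in range(N):
        # The minimum over actions of affine expressions is concave in theta,
        # so the relaxed Bellman inequality below is a valid convex constraint.
        q_next_min = cp.min(phi_next[k] @ theta)
        constraints.append(q[k] <= cost[k] + beta * q_next_min)
    # Maximize the average Q value subject to the inequalities; the l2 penalty
    # stands in for the regularization the abstract says is needed.
    objective = cp.Maximize(cp.sum(q) / N - reg * cp.sum_squares(theta))
    cp.Problem(objective, constraints).solve()
    return theta.value

# Toy usage on random data (shapes only; not a meaningful control problem).
rng = np.random.default_rng(0)
N, A, d = 200, 3, 5
theta_hat = convex_q_sketch(rng.normal(size=(N, d)),
                            rng.normal(size=(N, A, d)),
                            rng.uniform(size=N))

The structural point the sketch relies on is standard: with a linear function class, the minimum over actions is concave in the parameter, so the Bellman inequality remains a convex constraint and the whole fit is a convex program, in contrast to the nonconvex fixed-point iteration of standard Q-learning.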