Paper Title
Enhanced Scene Specificity with Sparse Dynamic Value Estimation
Paper Authors
Paper Abstract
Multi-scene reinforcement learning involves training the RL agent across multiple scenes/levels of the same task, and has become essential for many generalization applications. However, the inclusion of multiple scenes increases the sample variance of policy gradient computations, often resulting in suboptimal performance when traditional methods (e.g., PPO, A3C) are applied directly. One strategy for variance reduction is to treat each scene as a distinct Markov decision process (MDP) and learn a joint value function dependent on both the state (s) and the MDP (M). However, this is non-trivial because in multi-scene RL the agent is usually unaware of the underlying level at train/test time. Recently, Singh et al. [1] addressed this by proposing a dynamic value estimation approach that models the true joint value function distribution as a Gaussian mixture model (GMM). In this paper, we argue that the error between the true scene-specific value function and the predicted dynamic estimate can be further reduced by progressively enforcing sparse cluster assignments once the agent has explored most of the state space. The resulting agents not only show significant improvements in final reward score across a range of OpenAI Procgen environments, but also exhibit increased navigation efficiency while completing a game level.
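To make the idea concrete, the following is a minimal sketch, not the authors' implementation, of a GMM-style dynamic value head with progressively sparsified cluster assignments. It assumes a PyTorch actor-critic setup; names such as ValueGMMHead, n_clusters, temperature, and sparsity_penalty are illustrative assumptions rather than identifiers from the paper.

# Minimal sketch (assumed PyTorch conventions, not the paper's code): a critic
# head that models the joint value V(s, M) as a mixture over K scene clusters,
# plus an entropy penalty that, when ramped up, pushes each state toward a
# single (scene-specific) mixture component, i.e. a sparse cluster assignment.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ValueGMMHead(nn.Module):
    """Predicts a mixture-of-K value estimate from a state embedding."""

    def __init__(self, feat_dim: int, n_clusters: int = 8):
        super().__init__()
        self.means = nn.Linear(feat_dim, n_clusters)   # per-cluster value means
        self.logits = nn.Linear(feat_dim, n_clusters)  # soft cluster assignments

    def forward(self, features: torch.Tensor, temperature: float = 1.0):
        mu = self.means(features)                                    # (B, K)
        w = F.softmax(self.logits(features) / temperature, dim=-1)   # (B, K)
        value = (w * mu).sum(dim=-1)                                 # mixture value estimate
        return value, w


def sparsity_penalty(weights: torch.Tensor) -> torch.Tensor:
    """Mean entropy of the cluster assignments; minimizing it sharpens
    each state's assignment toward a single mixture component."""
    return -(weights * torch.log(weights + 1e-8)).sum(dim=-1).mean()


if __name__ == "__main__":
    head = ValueGMMHead(feat_dim=256, n_clusters=8)
    feats = torch.randn(32, 256)              # stand-in for CNN state features
    value, w = head(feats, temperature=0.5)   # lower temperature -> sparser w
    loss_sparse = 0.01 * sparsity_penalty(w)  # added to the usual critic loss
    print(value.shape, loss_sparse.item())

In an actual PPO/A3C training loop, the penalty coefficient (or a lower softmax temperature) would only be increased after the agent has explored most of the state space, mirroring the progressive sparsification described in the abstract.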