Paper Title
What About Inputing Policy in Value Function: Policy Representation and Policy-extended Value Function Approximator
Paper Authors
Paper Abstract
We study the Policy-extended Value Function Approximator (PeVFA) in Reinforcement Learning (RL), which extends the conventional value function approximator (VFA) to take as input not only the state (and action) but also an explicit policy representation. Such an extension enables PeVFA to preserve the values of multiple policies at the same time and brings an appealing characteristic, i.e., \emph{value generalization among policies}. We formally analyze value generalization under Generalized Policy Iteration (GPI). Through theoretical and empirical lenses, we show that the generalized value estimates offered by PeVFA may have lower initial approximation error with respect to the true values of successive policies, which is expected to improve consecutive value approximation during GPI. Based on the above clues, we introduce a new form of GPI with PeVFA which leverages value generalization along the policy improvement path. Moreover, we propose a representation learning framework for RL policies, providing several approaches to learn effective policy embeddings from policy network parameters or state-action pairs. In our experiments, we evaluate the efficacy of the value generalization offered by PeVFA and of policy representation learning in several OpenAI Gym continuous control tasks. For a representative instance of algorithm implementation, Proximal Policy Optimization (PPO) re-implemented under the paradigm of GPI with PeVFA achieves about 40\% performance improvement over its vanilla counterpart in most environments.
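To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch of a PeVFA that conditions a state-value network on a policy embedding computed from state-action pairs (one of the two embedding sources mentioned in the abstract). The class names (`PolicyEncoder`, `PeVFA`), network sizes, and pooling choice are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch (not the authors' code) of a policy-extended value function
# approximator V(s, chi_pi): a state-value network augmented with a policy
# embedding chi_pi. All names and hyperparameters here are illustrative.
import torch
import torch.nn as nn


class PolicyEncoder(nn.Module):
    """Encodes state-action pairs sampled from a policy into a fixed-size embedding."""

    def __init__(self, state_dim, action_dim, embed_dim=64):
        super().__init__()
        self.point_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, states, actions):
        # states: (N, state_dim), actions: (N, action_dim) sampled from the policy.
        pairs = torch.cat([states, actions], dim=-1)
        # Mean-pool per-pair features into a permutation-invariant policy embedding.
        return self.point_net(pairs).mean(dim=0)


class PeVFA(nn.Module):
    """State-value network that additionally conditions on the policy embedding."""

    def __init__(self, state_dim, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, policy_embedding):
        # Broadcast the single policy embedding across a batch of states.
        chi = policy_embedding.expand(state.shape[0], -1)
        return self.net(torch.cat([state, chi], dim=-1))


if __name__ == "__main__":
    state_dim, action_dim = 8, 2
    encoder, pevfa = PolicyEncoder(state_dim, action_dim), PeVFA(state_dim)
    # Fake rollout data standing in for state-action pairs from the current policy.
    s, a = torch.randn(32, state_dim), torch.randn(32, action_dim)
    chi_pi = encoder(s, a)                               # policy representation chi(pi)
    values = pevfa(torch.randn(16, state_dim), chi_pi)   # generalized value estimates
    print(values.shape)                                  # torch.Size([16, 1])
```

Because the value network takes the policy embedding as an explicit input, a single set of weights can, in principle, represent values of many policies, which is what the abstract refers to as value generalization among policies.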