Paper Title

Iterative Amortized Policy Optimization

Paper Authors

Joseph Marino, Alexandre Piché, Alessandro Davide Ialongo, Yisong Yue

Paper Abstract

Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control, enabling the estimation and sampling of high-value actions. From the variational inference perspective on RL, policy networks, when used with entropy or KL regularization, are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly. However, direct amortized mappings can yield suboptimal policy estimates and restricted distributions, limiting performance and exploration. Given this perspective, we consider the more flexible class of iterative amortized optimizers. We demonstrate that the resulting technique, iterative amortized policy optimization, yields performance improvements over direct amortization on benchmark continuous control tasks.
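
To illustrate the contrast the abstract draws, the sketch below (PyTorch) places a direct amortized policy, which maps a state to distribution parameters in a single feed-forward pass, next to an iterative amortized optimizer that repeatedly refines those parameters using gradients of an entropy-regularized objective. This is a minimal sketch, not the authors' implementation: the Q-network, layer sizes, entropy weight `alpha`, tanh squashing, and number of refinement steps are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): direct vs. iterative
# amortized policy optimization under an entropy-regularized objective.
import math
import torch
import torch.nn as nn

state_dim, action_dim, alpha = 8, 2, 0.2

# Stand-in critic; in practice this would be a learned Q-function.
q_net = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1)
)

def objective(state, mean, log_std):
    """Entropy-regularized objective: E_pi[Q(s, a)] + alpha * H(pi).

    Uses a reparameterized Gaussian sample squashed by tanh; the tanh
    correction to the entropy is omitted for brevity.
    """
    std = log_std.exp()
    action = torch.tanh(mean + std * torch.randn_like(mean))
    q_value = q_net(torch.cat([state, action], dim=-1))
    entropy = (log_std + 0.5 * math.log(2 * math.pi * math.e)).sum(-1, keepdim=True)
    return q_value + alpha * entropy

# Direct amortization: state -> (mean, log_std) in a single pass.
direct_policy = nn.Sequential(
    nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim)
)

# Iterative amortization: an update network maps the current estimate and
# the objective's gradients to a refinement of (mean, log_std).
update_net = nn.Sequential(
    nn.Linear(4 * action_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim)
)

def iterative_policy(state, num_steps=5):
    batch = state.shape[0]
    mean = torch.zeros(batch, action_dim)
    log_std = torch.zeros(batch, action_dim)
    for _ in range(num_steps):
        mean = mean.detach().requires_grad_(True)
        log_std = log_std.detach().requires_grad_(True)
        obj = objective(state, mean, log_std).sum()
        grad_mean, grad_log_std = torch.autograd.grad(obj, (mean, log_std))
        delta = update_net(
            torch.cat([mean, log_std, grad_mean, grad_log_std], dim=-1)
        )
        mean = mean + delta[:, :action_dim]
        log_std = log_std + delta[:, action_dim:]
    return mean, log_std

if __name__ == "__main__":
    state = torch.randn(4, state_dim)
    # Direct amortization: a single mapping to the policy parameters.
    direct_mean, direct_log_std = direct_policy(state).chunk(2, dim=-1)
    # Iterative amortization: refined over several gradient-informed steps.
    iter_mean, iter_log_std = iterative_policy(state)
    print(direct_mean.shape, iter_mean.shape)
```

The key difference is that the iterative optimizer can correct its own estimate at inference time using feedback from the objective, whereas the direct mapping commits to whatever its single forward pass produces.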
