Paper Title

Iterative Amortized Policy Optimization

Paper Authors

Joseph Marino, Alexandre Piché, Alessandro Davide Ialongo, Yisong Yue

Paper Abstract

Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control, enabling the estimation and sampling of high-value actions. From the variational inference perspective on RL, policy networks, when used with entropy or KL regularization, are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly. However, direct amortized mappings can yield suboptimal policy estimates and restricted distributions, limiting performance and exploration. Given this perspective, we consider the more flexible class of iterative amortized optimizers. We demonstrate that the resulting technique, iterative amortized policy optimization, yields performance improvements over direct amortization on benchmark continuous control tasks.
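
To illustrate the contrast the abstract draws, the sketch below (PyTorch) places a direct amortized policy, which maps a state to distribution parameters in a single feed-forward pass, next to an iterative amortized optimizer that repeatedly refines those parameters using gradients of an entropy-regularized objective. This is a minimal sketch, not the authors' implementation: the Q-network, layer sizes, entropy weight `alpha`, tanh squashing, and number of refinement steps are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): direct vs. iterative
# amortized policy optimization under an entropy-regularized objective.
import math
import torch
import torch.nn as nn

state_dim, action_dim, alpha = 8, 2, 0.2

# Stand-in critic; in practice this would be a learned Q-function.
q_net = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1)
)

def objective(state, mean, log_std):
    """Entropy-regularized objective: E_pi[Q(s, a)] + alpha * H(pi).

    Uses a reparameterized Gaussian sample squashed by tanh; the tanh
    correction to the entropy is omitted for brevity.
    """
    std = log_std.exp()
    action = torch.tanh(mean + std * torch.randn_like(mean))
    q_value = q_net(torch.cat([state, action], dim=-1))
    entropy = (log_std + 0.5 * math.log(2 * math.pi * math.e)).sum(-1, keepdim=True)
    return q_value + alpha * entropy

# Direct amortization: state -> (mean, log_std) in a single pass.
direct_policy = nn.Sequential(
    nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim)
)

# Iterative amortization: an update network maps the current estimate and
# the objective's gradients to a refinement of (mean, log_std).
update_net = nn.Sequential(
    nn.Linear(4 * action_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim)
)

def iterative_policy(state, num_steps=5):
    batch = state.shape[0]
    mean = torch.zeros(batch, action_dim)
    log_std = torch.zeros(batch, action_dim)
    for _ in range(num_steps):
        mean = mean.detach().requires_grad_(True)
        log_std = log_std.detach().requires_grad_(True)
        obj = objective(state, mean, log_std).sum()
        grad_mean, grad_log_std = torch.autograd.grad(obj, (mean, log_std))
        delta = update_net(
            torch.cat([mean, log_std, grad_mean, grad_log_std], dim=-1)
        )
        mean = mean + delta[:, :action_dim]
        log_std = log_std + delta[:, action_dim:]
    return mean, log_std

if __name__ == "__main__":
    state = torch.randn(4, state_dim)
    # Direct amortization: a single mapping to the policy parameters.
    direct_mean, direct_log_std = direct_policy(state).chunk(2, dim=-1)
    # Iterative amortization: refined over several gradient-informed steps.
    iter_mean, iter_log_std = iterative_policy(state)
    print(direct_mean.shape, iter_mean.shape)
```

The key difference is that the iterative optimizer can correct its own estimate at inference time using feedback from the objective, whereas the direct mapping commits to whatever its single forward pass produces.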
