Paper Title
Imitation Learning by State-Only Distribution Matching
Paper Authors
Paper Abstract
Imitation learning from observation describes policy learning in a way similar to human learning: an agent's policy is trained by observing an expert performing a task. While many state-only imitation learning approaches are based on adversarial imitation learning, one main drawback is that adversarial training is often unstable and lacks a reliable convergence estimator. If the true environment reward is unknown and cannot be used to select the best-performing model, this can result in poor real-world policy performance. We propose a non-adversarial learning-from-observation approach, together with an interpretable convergence and performance metric. Our training objective minimizes the Kullback-Leibler divergence (KLD) between the policy and expert state-transition trajectories, which can be optimized in a non-adversarial fashion. This approach shows improved robustness when learned density models guide the optimization. We further improve sample efficiency by rewriting the KLD minimization as a Soft Actor-Critic objective based on a modified reward, using additional density models that estimate the environment's forward and backward dynamics. Finally, we evaluate the effectiveness of our approach on well-known continuous control environments and show state-of-the-art performance together with a reliable performance estimator, compared to several recent learning-from-observation methods.
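To make the abstract's objective concrete, the following is a minimal sketch in standard notation; the symbols (the state-transition distributions \rho^{\pi} and \rho^{E}, and the density estimates \hat{p}^{E} and \hat{p}^{\pi}) and the exact form of the modified reward are illustrative assumptions, not taken verbatim from the paper. The matching objective described in the abstract can be written as

\[
  \min_{\pi}\; D_{\mathrm{KL}}\!\left( \rho^{\pi}(s_t, s_{t+1}) \,\middle\|\, \rho^{E}(s_t, s_{t+1}) \right),
\]

where \rho^{\pi} and \rho^{E} denote the distributions over state transitions induced by the learned policy and the expert, respectively. Under these assumptions, rewriting the KLD minimization as a Soft Actor-Critic objective would correspond to optimizing a modified reward of a form such as

\[
  \tilde{r}(s_t, a_t, s_{t+1}) = \log \hat{p}^{E}(s_{t+1} \mid s_t) - \log \hat{p}^{\pi}(s_{t+1} \mid s_t),
\]

where \hat{p}^{E} is a learned density model of expert state transitions and \hat{p}^{\pi} is estimated from the learned forward and backward dynamics models mentioned in the abstract; the paper's actual derivation and reward decomposition may differ.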