Paper Title
Modular Transfer Learning with Transition Mismatch Compensation for Excessive Disturbance Rejection
Paper Authors
Paper Abstract
Underwater robots in shallow waters usually suffer from strong wave forces, which may frequently exceed the robot's control constraints. Learning-based controllers are suitable for disturbance rejection control, but excessive disturbances heavily affect the state transitions in a Markov Decision Process (MDP) or Partially Observable Markov Decision Process (POMDP). Moreover, pure learning procedures on the target system may encounter damaging exploratory actions or unpredictable system variations, while training exclusively on a prior model usually cannot address the model mismatch with the target system. In this paper, we propose a transfer learning framework that adapts a control policy for excessive disturbance rejection of an underwater robot under dynamics model mismatch. A modular network of learning policies is applied, composed of a Generalized Control Policy (GCP) and an Online Disturbance Identification Model (ODI). The GCP is first trained over a wide array of disturbance waveforms. The ODI then learns to use past states and actions of the system to predict the disturbance waveform, which is provided as input to the GCP (along with the system state). A transfer reinforcement learning algorithm using Transition Mismatch Compensation (TMC) is developed on top of the modular architecture; it learns an additional compensatory policy by minimizing the mismatch between transitions predicted by the dynamics models of the source and target tasks. We demonstrate on a pose regulation task in simulation that TMC successfully rejects the disturbances and stabilizes the robot under an empirical model of the robot system, while also improving sample efficiency.
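The modular architecture and the TMC objective described above can be sketched as follows. This is a minimal illustration only: the learned GCP, ODI, and both dynamics models are replaced with linear stand-ins, and all dimensions and matrices are invented for the sketch, not taken from the paper. The compensatory correction is computed here as a closed-form least-squares solve that aligns the target-model transition with the source-model prediction, whereas the paper learns a compensatory policy via reinforcement learning.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACT_DIM, WAVE_DIM = 4, 2, 3  # illustrative dimensions

# Linear stand-ins for the learned modules and dynamics models.
W_odi = rng.normal(size=(WAVE_DIM, STATE_DIM + ACT_DIM))   # stands in for ODI
W_gcp = rng.normal(size=(ACT_DIM, STATE_DIM + WAVE_DIM))   # stands in for GCP
A_src = 0.9 * np.eye(STATE_DIM)        # source-task dynamics (prior model)
A_tgt = 0.8 * np.eye(STATE_DIM)        # target-task dynamics (mismatched)
B = 0.1 * rng.normal(size=(STATE_DIM, ACT_DIM))  # shared control matrix

def odi(state, prev_action):
    """Estimate the disturbance waveform from past state and action."""
    return W_odi @ np.concatenate([state, prev_action])

def gcp(state, wave):
    """Generalized control policy: acts on state plus estimated waveform."""
    return W_gcp @ np.concatenate([state, wave])

def predict(A, s, a):
    """One-step transition prediction under dynamics matrix A."""
    return A @ s + B @ a

def tmc_compensation(s, a_gcp):
    """Compensatory correction minimizing the transition mismatch:
    solve B @ delta ≈ (A_src - A_tgt) @ s in the least-squares sense,
    so the target-model transition matches the source-model prediction."""
    gap = (A_src - A_tgt) @ s
    delta, *_ = np.linalg.lstsq(B, gap, rcond=None)
    return delta

s = rng.normal(size=STATE_DIM)
a = gcp(s, odi(s, np.zeros(ACT_DIM)))      # modular forward pass
delta = tmc_compensation(s, a)
before = np.linalg.norm(predict(A_tgt, s, a) - predict(A_src, s, a))
after = np.linalg.norm(predict(A_tgt, s, a + delta) - predict(A_src, s, a))
print(f"transition mismatch: {before:.4f} -> {after:.4f}")
```

Because the uncompensated action (`delta = 0`) is a feasible point of the least-squares problem, the compensated mismatch can never exceed the uncompensated one; the residual that remains reflects the part of the dynamics gap outside the span of the control matrix.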