Paper Title

Hyperparameter Auto-tuning in Self-Supervised Robotic Learning

Paper Authors

Jiancong Huang, Juan Rojas, Matthieu Zimmer, Hongmin Wu, Yisheng Guan, Paul Weng

Paper Abstract

Policy optimization in reinforcement learning requires the selection of numerous hyperparameters across different environments. Fixing them incorrectly may negatively impact optimization performance, leading notably to insufficient or redundant learning. Insufficient learning (due to convergence to local optima) results in under-performing policies, whilst redundant learning wastes time and resources. The effects are further exacerbated when using a single policy to solve multi-task learning problems. Observing that the Evidence Lower Bound (ELBO) used in Variational Auto-Encoders correlates with the diversity of image samples, we propose an auto-tuning technique based on the ELBO for self-supervised reinforcement learning. Our approach can auto-tune three hyperparameters: the replay buffer size, the number of policy gradient updates during each epoch, and the number of exploration steps during each epoch. We use a state-of-the-art self-supervised robot learning framework (Reinforcement Learning with Imagined Goals (RIG) using Soft Actor-Critic) as the baseline for experimental verification. Experiments show that our method can auto-tune online and yields the best performance at a fraction of the time and computational resources. Code, video, and appendix for simulated and real-robot experiments can be found at the project page \url{www.JuanRojas.net/autotune}.
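The abstract only states that the ELBO of the goal-generating VAE is used to auto-tune the replay buffer size, the number of policy gradient updates per epoch, and the number of exploration steps per epoch; the precise schedule is given in the paper and the project code. Below is a minimal, hypothetical Python/PyTorch sketch of the idea: a small fully-connected VAE exposes its ELBO, and an illustrative linear mapping converts that ELBO into the three hyperparameters. The VAE architecture, the ELBO bounds, and the hyperparameter ranges here are assumptions for illustration, not the authors' implementation.

# Hypothetical sketch of ELBO-based hyperparameter auto-tuning.
# The mapping from ELBO to hyperparameters below is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallVAE(nn.Module):
    """Minimal fully-connected VAE over flattened image observations (assumed architecture)."""
    def __init__(self, obs_dim=48 * 48 * 3, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, obs_dim), nn.Sigmoid())

    def elbo(self, x):
        # ELBO = E[log p(x|z)] - KL(q(z|x) || p(z)), averaged over the batch.
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        recon = self.decoder(z)
        log_px = -F.binary_cross_entropy(recon, x, reduction="none").sum(dim=1)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
        return (log_px - kl).mean()

def auto_tune(elbo, elbo_min=-4000.0, elbo_max=-500.0):
    """Map the current ELBO to the three tuned hyperparameters.

    Assumption: a lower ELBO is taken to indicate more diverse, harder-to-model
    samples, so the schedule allots a larger replay buffer, more gradient
    updates, and more exploration steps. The linear interpolation and all
    numeric ranges are placeholders, not the values used in the paper.
    """
    t = min(max((elbo_max - elbo) / (elbo_max - elbo_min), 0.0), 1.0)
    replay_buffer_size = int(10_000 + t * 90_000)        # e.g. 10k .. 100k transitions
    policy_updates_per_epoch = int(100 + t * 900)        # e.g. 100 .. 1000 updates
    exploration_steps_per_epoch = int(500 + t * 4_500)   # e.g. 500 .. 5000 steps
    return replay_buffer_size, policy_updates_per_epoch, exploration_steps_per_epoch

if __name__ == "__main__":
    vae = SmallVAE()
    batch = torch.rand(32, 48 * 48 * 3)  # stand-in for image observations in [0, 1]
    with torch.no_grad():
        current_elbo = vae.elbo(batch).item()
    print(auto_tune(current_elbo))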
