Reinforcement learning reward function in unmanned aerial vehicle control tasks

Title

Reinforcement learning reward function in unmanned aerial vehicle control tasks

Authors

Mikhail S. Tovarnov, Nikita V. Bykov

Abstract

This paper presents a new reward function for deep reinforcement learning in unmanned aerial vehicle (UAV) control and navigation problems. The reward function is based on constructing simplified trajectories to the target, which are third-order Bezier curves, and estimating the time needed to traverse them. The reward function can be applied unchanged to problems in both two-dimensional and three-dimensional virtual environments. Its effectiveness was tested in a newly developed virtual environment: a simplified two-dimensional environment that describes the dynamics of UAV control and flight, taking into account thrust, inertia, gravity, and aerodynamic drag. In this formulation, three UAV control and navigation tasks were successfully solved: flying to a given point in space, avoiding interception by another UAV, and intercepting one UAV with another. The three most relevant modern deep reinforcement learning algorithms were used: Soft Actor-Critic (SAC), Deep Deterministic Policy Gradient (DDPG), and Twin Delayed Deep Deterministic Policy Gradient (TD3). All three algorithms performed well, indicating the effectiveness of the selected reward function.
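The abstract gives no formulas, so the following is only a rough sketch of how a reward built on the estimated travel time along a third-order Bezier curve might look. Everything specific here is an assumption made for illustration rather than the authors' method: the function names, the choice of control points (the first handle placed along the current velocity, the second on the target), the constant `cruise_speed`, and the definition of the reward as the per-step decrease in estimated time. Because the code works on plain coordinate vectors, the same functions accept 2D or 3D inputs.

```python
import numpy as np


def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a third-order Bezier curve at each parameter in t (shape (n,))."""
    t = np.asarray(t, dtype=float)[:, None]
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)


def estimate_time_to_target(pos, vel, target, cruise_speed=10.0, n=64):
    """Estimate travel time along a simplified Bezier path to the target.

    Hypothetical construction: the curve leaves `pos` tangent to `vel`
    and ends at `target`; time is arc length over an assumed constant speed.
    """
    pos, vel, target = (np.asarray(a, dtype=float) for a in (pos, vel, target))
    dist = np.linalg.norm(target - pos)
    speed = np.linalg.norm(vel)
    # Unit heading, or zero when the vehicle is (almost) at rest.
    heading = vel / speed if speed > 1e-9 else np.zeros_like(pos)
    p1 = pos + heading * dist / 3.0   # assumed handle along current velocity
    p2 = target                       # assumed: no preferred arrival direction
    pts = cubic_bezier(pos, p1, p2, target, np.linspace(0.0, 1.0, n))
    # Approximate arc length by summing the polyline segments.
    length = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
    return length / cruise_speed


def reward(prev_time, new_time):
    """Assumed shaping: positive when the estimated time to target shrinks."""
    return prev_time - new_time
```

Under these assumptions, a UAV already heading toward the target gets a shorter estimated time than one heading away from it, so turning toward the target earns a positive reward even before the distance starts to decrease.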
