Paper Title
Joint Learning of Reward Machines and Policies in Environments with Partially Known Semantics
Paper Authors
Paper Abstract
We study the problem of reinforcement learning for a task encoded by a reward machine. The task is defined over a set of properties of the environment, called atomic propositions, which are represented by Boolean variables. One unrealistic assumption commonly made in the literature is that the truth values of these propositions are accurately known. In real situations, however, these truth values are uncertain, since they come from sensors that suffer from imperfections. At the same time, reward machines can be difficult to model explicitly, especially when they encode complicated tasks. We develop a reinforcement-learning algorithm that infers a reward machine encoding the underlying task while learning how to execute it, despite the uncertainty of the propositions' truth values. To address this uncertainty, the algorithm maintains a probabilistic estimate of the truth values of the atomic propositions; it updates this estimate as new sensory measurements arrive during the exploration of the environment. Additionally, the algorithm maintains a hypothesis reward machine, which serves as an estimate of the reward machine that encodes the task to be learned. As the agent explores the environment, the algorithm updates the hypothesis reward machine according to the obtained rewards and the estimated truth values of the atomic propositions. Finally, the algorithm runs a Q-learning procedure over the states of the hypothesis reward machine to determine a policy that accomplishes the task. We prove that the algorithm successfully infers the reward machine and asymptotically learns a policy that accomplishes the respective task.
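To make the two estimation steps described in the abstract concrete, the following is a minimal Python sketch, under illustrative assumptions: the sensor's true-positive and false-positive rates for an atomic proposition are known constants, and the hypothesis reward machine is given as a fixed transition table rather than re-inferred online. All names here (update_belief, q_update, P_TRUE_POS, and so on) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Assumed sensor error rates for one atomic proposition (hypothetical values):
# the sensor reads True with probability P_TRUE_POS when the proposition holds
# and with probability P_FALSE_POS when it does not.
P_TRUE_POS, P_FALSE_POS = 0.9, 0.1

def update_belief(belief, reading):
    """Bayesian update of P(proposition holds) after one noisy Boolean reading."""
    like_true = P_TRUE_POS if reading else 1.0 - P_TRUE_POS
    like_false = P_FALSE_POS if reading else 1.0 - P_FALSE_POS
    post_true = like_true * belief
    return post_true / (post_true + like_false * (1.0 - belief))

def q_update(Q, s, u, a, r, s_next, u_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step over the product state (env state s, RM state u)."""
    best_next = max(Q[(s_next, u_next, b)] for b in actions)
    Q[(s, u, a)] += alpha * (r + gamma * best_next - Q[(s, u, a)])

# The belief about a proposition sharpens as consistent readings accumulate.
belief = 0.5  # uninformative prior
for reading in [True, True, False, True]:
    belief = update_belief(belief, reading)
print(f"P(proposition holds) = {belief:.3f}")

# Q is keyed by (env_state, rm_state, action); missing entries default to 0.
Q = defaultdict(float)
q_update(Q, s=0, u="u0", a="right", r=1.0, s_next=1, u_next="u1",
         actions=["left", "right"])
print(Q[(0, "u0", "right")])  # 0.1 after one update
```

In the full algorithm, the hypothesis reward machine would be revised whenever the observed rewards contradict its predictions; keying the Q-function by the product of environment and reward-machine states is what lets a memoryless Q-learner handle a temporally extended task.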