Paper Title
Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes
Paper Authors
Paper Abstract
Most hard attention models initially observe a complete scene in order to locate and sense informative glimpses, and then predict the class label of the scene based on those glimpses. However, in many applications (e.g., aerial imaging), observing an entire scene is not always feasible due to the limited time and resources available for acquisition. In this paper, we develop a Sequential Transformers Attention Model (STAM) that only partially observes a complete image and predicts informative glimpse locations solely on the basis of past glimpses. We design our agent using DeiT-distilled and train it with a one-step actor-critic algorithm. Furthermore, to improve classification performance, we introduce a novel training objective that enforces consistency between the class distribution predicted by a teacher model from the complete image and the class distribution predicted by our agent from glimpses. When the agent senses only 4% of the total image area, the inclusion of the proposed consistency loss in our training objective yields 3% and 8% higher accuracy on the ImageNet and fMoW datasets, respectively. Moreover, our agent outperforms the previous state of the art while observing nearly 27% and 42% fewer pixels in glimpses on ImageNet and fMoW, respectively.
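The consistency objective described in the abstract lends itself to a distillation-style implementation. The PyTorch sketch below illustrates one plausible form of such a loss: the KL divergence between the teacher's class distribution over the full image and the agent's glimpse-based distribution. Note that the function name, the temperature parameter, and the choice of KL divergence are assumptions made here for illustration; the abstract does not specify the exact form of the consistency term used in the paper.

    # A minimal sketch of the consistency term, assuming a PyTorch setup.
    # The teacher/agent logits and the KL form are illustrative stand-ins,
    # not the authors' released code.
    import torch
    import torch.nn.functional as F

    def consistency_loss(teacher_logits: torch.Tensor,
                         agent_logits: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
        """KL divergence between the teacher's class distribution
        (computed from the complete image) and the agent's distribution
        (computed from glimpses only)."""
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
        agent_log_probs = F.log_softmax(agent_logits / temperature, dim=-1)
        # 'batchmean' matches the mathematical definition of KL divergence
        # averaged over the batch; F.kl_div expects log-probabilities as
        # input and probabilities as target.
        return F.kl_div(agent_log_probs, teacher_probs, reduction="batchmean")

In a training loop, a term of this kind would typically be added to the standard cross-entropy classification loss with a weighting coefficient, so that the agent both matches the ground-truth labels and tracks the teacher's full-image predictions.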