Paper Title
Teaching Matters: Investigating the Role of Supervision in Vision Transformers
Paper Authors
Paper Abstract
Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, their behavior under different learning paradigms is not well explored. We compare ViTs trained through different methods of supervision, and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads. These are self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that to the best of our knowledge has not been highlighted in any prior work. Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features, and they can even be superior for part-level tasks. We also find that the representations of reconstruction-based models show non-trivial similarity to contrastive self-supervised models. Project website (https://www.cs.umd.edu/~sakshams/vit_analysis) and code (https://www.github.com/mwalmer-umd/vit_analysis) are publicly available.
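To make the abstract's definition of an Offset Local Attention Head concrete, here is a minimal Python sketch. It is not from the paper's released code: the function name, the CLS-token-at-index-0 layout, the row-major patch order, and the 0.8 agreement threshold are all illustrative assumptions. It flags a head as offset-local when most query patches place their strongest attention on one fixed adjacent spatial offset.

import numpy as np

def is_offset_local_head(attn, grid, min_frac=0.8):
    # attn: (1 + grid*grid, 1 + grid*grid) post-softmax attention map for
    # one head, with the CLS token at index 0 and patches in row-major order.
    patch_attn = attn[1:, 1:]  # drop the CLS row and column
    counts = {}
    for q in range(grid * grid):
        qy, qx = divmod(q, grid)
        k = int(patch_attn[q].argmax())  # key patch this query attends to most
        ky, kx = divmod(k, grid)
        off = (ky - qy, kx - qx)
        counts[off] = counts.get(off, 0) + 1
    (dy, dx), n = max(counts.items(), key=lambda kv: kv[1])
    # Offset-local: a nonzero offset to an adjacent patch in a fixed
    # direction, shared by most queries. Edge patches lack a neighbor at
    # the offset, which the min_frac threshold tolerates.
    return (dy, dx) != (0, 0) and max(abs(dy), abs(dx)) == 1 \
        and n / (grid * grid) >= min_frac

For a ViT-B/16 at 224x224 input resolution, grid would be 14 (a 14x14 patch grid); running such a check over every layer and head would be one plausible way to count heads of this kind across training methods.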