Paper Title
Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding
Paper Authors
Paper Abstract
Pre-trained language models (PLMs) have demonstrated their effectiveness for a broad range of information retrieval and natural language processing tasks. As the core component of PLMs, multi-head self-attention is appealing for its ability to jointly attend to information from different positions. However, researchers have found that PLMs always exhibit fixed attention patterns regardless of the input (e.g., excessively attending to [CLS] or [SEP]), which we argue might cause them to neglect important information in other positions. In this work, we propose a simple yet effective attention guiding mechanism that improves the performance of PLMs by encouraging attention towards established goals. Specifically, we propose two kinds of attention guiding methods, i.e., map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG). The former explicitly encourages diversity among multiple self-attention heads so that they jointly attend to information from different representation subspaces, while the latter encourages self-attention to attend to as many different positions of the input as possible. We conduct experiments with multiple general pre-trained models (i.e., BERT, ALBERT, and RoBERTa) and domain-specific pre-trained models (i.e., BioBERT, ClinicalBERT, BlueBERT, and SciBERT) on three benchmark datasets (i.e., MultiNLI, MedNLI, and Cross-genre-IR). Extensive experimental results demonstrate that the proposed MDG and PDG bring stable performance improvements on all datasets with high efficiency and low cost.
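The abstract does not give the concrete loss formulations, so the following PyTorch sketch is only one plausible instantiation for illustration: MDG is written as an InfoNCE-style discrimination over the heads' attention maps (pushing different heads apart), and PDG as a concentration penalty on the average attention each position receives (pushing it toward a uniform spread). All names here (map_discrimination_guiding, pattern_decorrelation_guiding, attn, tau, lambda_mdg, lambda_pdg) are hypothetical and not taken from the paper.

import torch
import torch.nn.functional as F

def map_discrimination_guiding(attn: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # Hypothetical MDG sketch: treat each head's flattened attention map as an
    # instance and discriminate it from the other heads' maps, which encourages
    # heads to attend to different representation subspaces.
    # attn: attention probabilities of shape (batch, heads, seq_len, seq_len).
    b, h, s, _ = attn.shape
    maps = F.normalize(attn.reshape(b, h, -1), dim=-1)        # unit-norm map per head
    sim = torch.matmul(maps, maps.transpose(1, 2)) / tau      # (b, h, h) head-to-head similarity
    labels = torch.arange(h, device=attn.device).repeat(b)    # each head's positive is itself
    return F.cross_entropy(sim.reshape(b * h, h), labels)

def pattern_decorrelation_guiding(attn: torch.Tensor) -> torch.Tensor:
    # Hypothetical PDG sketch: average the attention each position receives over
    # heads and query positions; minimizing its squared norm pushes the
    # distribution toward uniform, i.e., away from over-attending to a few
    # tokens such as [CLS] or [SEP]. Padding is ignored for simplicity.
    received = attn.mean(dim=(1, 2))                          # (batch, seq_len), rows sum to 1
    return (received ** 2).sum(dim=-1).mean()

# The guiding terms would be added to the task loss with small weights, e.g.:
# loss = task_loss + lambda_mdg * map_discrimination_guiding(attn) \
#                  + lambda_pdg * pattern_decorrelation_guiding(attn)

In practice such guiding losses would typically be averaged over the Transformer layers whose attention maps are regularized; lambda_mdg and lambda_pdg are placeholder weights to be tuned.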