Paper Title
DA-Transformer: Distance-aware Transformer
Paper Authors
Paper Abstract
Transformer has achieved great success in the NLP field, serving as the backbone of various advanced models such as BERT and GPT. However, Transformer and its existing variants may not be optimal at capturing token distances, because the position or distance embeddings used by these methods usually cannot preserve the precise information of real distances, which may be detrimental to modeling the order and relations of contexts. In this paper, we propose DA-Transformer, a distance-aware Transformer that can exploit real token distances. We propose to incorporate the real distances between tokens to re-scale the raw self-attention weights, which are computed from the relevance between the attention query and key. Concretely, in different self-attention heads, the relative distance between each pair of tokens is weighted by a different learnable parameter, which controls that head's preference for long- or short-term information. Since the raw weighted real distances may not be optimal for adjusting the self-attention weights, we propose a learnable sigmoid function to map them into re-scaled coefficients with proper ranges. We first clip the raw self-attention weights via the ReLU function to keep non-negativity and introduce sparsity, and then multiply them by the re-scaled coefficients to encode real distance information into self-attention. Extensive experiments on five benchmark datasets show that DA-Transformer can effectively improve performance on many tasks and outperform the vanilla Transformer and several of its variants.
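For concreteness, below is a minimal PyTorch sketch of the distance-aware self-attention described in the abstract; it is not the authors' released code. The module and parameter names (`DistanceAwareSelfAttention`, `w`, `v`), the exact parameterization of the learnable sigmoid, and the final softmax normalization are assumptions made for illustration. Only the overall recipe follows the abstract: per-head weighted real token distances are mapped through a learnable sigmoid into re-scaling coefficients, which multiply the ReLU-clipped raw attention scores.

```python
# A minimal sketch of distance-aware self-attention (assumptions noted in comments).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistanceAwareSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One distance weight w and one sigmoid parameter v per head; w controls the
        # head's preference for long- vs. short-term information (per the abstract).
        self.w = nn.Parameter(torch.zeros(n_heads))
        self.v = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):  # -> (batch, heads, seq_len, d_head)
            return t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        # Raw attention scores, clipped by ReLU for non-negativity and sparsity.
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.d_head ** 0.5
        scores = F.relu(scores)

        # Real token distances |i - j|, weighted per head by w.
        pos = torch.arange(n, device=x.device, dtype=x.dtype)
        dist = (pos[None, :] - pos[:, None]).abs()          # (n, n)
        wdist = self.w.view(-1, 1, 1) * dist                 # (heads, n, n)

        # "Learnable sigmoid" re-scaling; this exact form is an assumption:
        # it equals 1 at zero distance and saturates at 1 + exp(v).
        v_h = self.v.view(-1, 1, 1)
        coef = (1 + torch.exp(v_h)) / (1 + torch.exp(v_h - wdist))

        # Encode distance information into self-attention; the softmax
        # normalization here is an assumption, not specified by the abstract.
        attn = F.softmax(scores * coef.unsqueeze(0), dim=-1)
        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).reshape(b, n, self.n_heads * self.d_head)
        return self.out(out)
```

Note on the sketch: a negative learned `w` makes the coefficient decay with distance (a short-range head), while a positive `w` makes it grow toward its saturation value (a long-range head), which matches the abstract's description of per-head distance preferences.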