Addressing Token Uniformity in Transformers via Singular Value Transformation

Paper title

Addressing Token Uniformity in Transformers via Singular Value Transformation

Authors

Hanqi Yan, Lin Gui, Wenjie Li, Yulan He

Abstract

Token uniformity is commonly observed in transformer-based models: after passing through multiple stacked self-attention layers, different tokens share a large proportion of similar information. In this paper, we propose to use the distribution of singular values of the outputs of each transformer layer to characterise the phenomenon of token uniformity, and we empirically illustrate that a less skewed singular value distribution can alleviate the 'token uniformity' problem. Based on our observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. We show that, apart from alleviating token uniformity, the transformation function should preserve the local neighbourhood structure in the original embedding space. Our proposed singular value transformation function is applied to a range of transformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT, and improved performance is observed in semantic textual similarity evaluation and a range of GLUE tasks. Our source code is available at https://github.com/hanqi-qi/tokenUni.git.
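
The link between token uniformity and a skewed singular value distribution can be illustrated with a small sketch. It is not the authors' transformation function, nor the skewness measure used in the paper; it only assumes a simple proxy, the share of spectral energy held by the largest singular value, estimated by power iteration. The function name `top_singular_energy_share` and the toy matrices are hypothetical.

```python
import math
import random


def top_singular_energy_share(rows, iters=200, seed=0):
    """Return sigma_1^2 / sum_i sigma_i^2 for a token matrix.

    rows: list of token vectors (one list of floats per token).
    A value near 1 means one direction dominates, i.e. the tokens
    are nearly identical (an assumed proxy for token uniformity).
    """
    dim = len(rows[0])
    # The squared Frobenius norm equals the sum of squared singular values.
    total = sum(x * x for row in rows for x in row)
    if total == 0.0:
        return 0.0
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iters):
        # One power-iteration step on X^T X: v <- X^T (X v), then normalise.
        xv = [sum(a * b for a, b in zip(row, v)) for row in rows]
        w = [sum(rows[i][j] * xv[i] for i in range(len(rows)))
             for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    xv = [sum(a * b for a, b in zip(row, v)) for row in rows]
    sigma1_sq = sum(x * x for x in xv)  # ||X v||^2 approximates sigma_1^2
    return sigma1_sq / total


# Toy "layer outputs": three tokens in a 3-dimensional embedding space.
near_uniform = [[1.0, 1.0, 1.0], [1.01, 0.99, 1.0], [0.99, 1.0, 1.01]]
well_spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

share_uniform = top_singular_energy_share(near_uniform)
share_spread = top_singular_energy_share(well_spread)
print(share_uniform > share_spread)
```

With nearly identical tokens, almost all of the energy falls on one singular value, whereas orthogonal tokens spread it evenly. In practice one would more likely take the hidden states of a real model and call a library routine such as `numpy.linalg.svd`; pure Python is used here only to keep the sketch dependency-free.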
