Paper Title

Revisiting Over-smoothing in BERT from the Perspective of Graph

Paper Authors

Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, James T. Kwok

Paper Abstract

Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields. However, no existing work has delved deeper to investigate the main cause of this phenomenon. In this work, we attempt to analyze the over-smoothing problem from the perspective of graphs, where this problem was first discovered and explored. Intuitively, the self-attention matrix can be seen as the normalized adjacency matrix of a corresponding graph. Based on this connection, we provide theoretical analysis and find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models. Specifically, if the standard deviation of layer normalization is sufficiently large, the output of the Transformer stack will converge to a specific low-rank subspace and result in over-smoothing. To alleviate the over-smoothing problem, we consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse. Extensive experimental results on various datasets illustrate the effect of our fusion methods.
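
The following is a minimal NumPy sketch, not the authors' implementation, illustrating two points from the abstract: (1) a softmax self-attention matrix is row-stochastic, so it can be read as the normalized adjacency matrix of a dense weighted graph over tokens, and (2) stacking such averaging operators with layer normalization drives token representations toward one another (over-smoothing), which fusing representations from different layers can partially offset. The simplified block (no residual or feed-forward sublayers), the random projections, and the uniform fusion weights are all illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_layers = 16, 32, 12

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit standard deviation.
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def token_similarity(x):
    # Mean pairwise cosine similarity between token vectors;
    # values near 1 indicate the representations have over-smoothed.
    xn = x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = xn @ xn.T
    off_diag = ~np.eye(len(x), dtype=bool)
    return sims[off_diag].mean()

x = rng.standard_normal((n_tokens, d_model))
layer_outputs = [x]
for layer in range(n_layers):
    # Attention weights A = softmax(QK^T / sqrt(d)): every row sums to 1, so A acts
    # like the degree-normalized adjacency matrix of a dense weighted token graph.
    wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    attn = softmax((x @ wq) @ (x @ wk).T / np.sqrt(d_model))
    assert np.allclose(attn.sum(axis=1), 1.0)
    # Simplified block: attention mixing followed by layer norm
    # (residual and feed-forward sublayers omitted for clarity).
    x = layer_norm(attn @ x)
    layer_outputs.append(x)
    print(f"layer {layer + 1:2d}: mean token cosine similarity = {token_similarity(x):.3f}")

# Hierarchical fusion (illustrative, uniform weights): combining representations from
# all layers instead of using only the last one keeps the final output more diverse.
fused = sum(layer_outputs) / len(layer_outputs)
print("last layer similarity:", round(float(token_similarity(layer_outputs[-1])), 3))
print("fused output similarity:", round(float(token_similarity(fused)), 3))
```

Running the sketch typically shows the mean token cosine similarity rising across layers and dropping again for the fused output, which is the qualitative behavior the abstract describes.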
