Paper Title
An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification
Paper Authors
Paper Abstract
Non-hierarchical sparse-attention Transformer-based models, such as Longformer and Big Bird, are popular approaches for working with long documents. These approaches have clear efficiency benefits over the original Transformer, but Hierarchical Attention Transformer (HAT) models are a vastly understudied alternative. We develop and release fully pre-trained HAT models that use segment-wise encoders followed by cross-segment encoders, and compare them with Longformer models and partially pre-trained HATs. On several long document downstream classification tasks, our best HAT model outperforms equally-sized Longformer models while using 10-20% less GPU memory and processing documents 40-45% faster. In a series of ablation studies, we find that HATs perform better with cross-segment contextualization throughout the model than with alternative configurations that implement either early or late cross-segment contextualization. Our code is on GitHub: https://github.com/coastalcph/hierarchical-transformers.
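To illustrate the segment-wise followed by cross-segment structure described above, the following is a minimal sketch in PyTorch. It is not the authors' implementation (see their GitHub repository for that); the layer counts, hidden size, first-token segment pooling, and mean document pooling are assumptions chosen for brevity.

```python
# Illustrative sketch of a hierarchical attention encoder:
# (1) a segment-wise encoder attends within each segment,
# (2) a cross-segment encoder attends across segment representations.
# All hyperparameters and pooling choices here are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class HATSketch(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_heads=4,
                 seg_layers=2, cross_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        seg_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        cross_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Segment-wise encoder: tokens attend only within their own segment.
        self.segment_encoder = nn.TransformerEncoder(seg_layer, seg_layers)
        # Cross-segment encoder: segment summaries attend to each other.
        self.cross_encoder = nn.TransformerEncoder(cross_layer, cross_layers)

    def forward(self, input_ids):
        # input_ids: (batch, num_segments, seg_len) -- the document pre-split into segments
        b, s, l = input_ids.shape
        x = self.embed(input_ids.view(b * s, l))       # (b*s, seg_len, d_model)
        x = self.segment_encoder(x)                    # within-segment attention
        seg_repr = x[:, 0, :].view(b, s, -1)           # first token as segment summary (assumption)
        doc = self.cross_encoder(seg_repr)             # attention across segments
        return doc.mean(dim=1)                         # pooled document representation (assumption)

# Usage: a toy batch of 2 documents, each with 8 segments of 128 tokens.
model = HATSketch()
ids = torch.randint(0, 30522, (2, 8, 128))
print(model(ids).shape)  # torch.Size([2, 256])
```

Because self-attention within a segment is quadratic only in the segment length (and cross-segment attention quadratic only in the number of segments), this layout avoids full-document quadratic attention, which is the efficiency advantage the abstract reports over Longformer-style sparse attention.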