Paper Title

A Study on the Autoregressive and non-Autoregressive Multi-label Learning

Paper Authors

Barezi, Elham J., Calixto, Iacer, Cho, Kyunghyun, Fung, Pascale

Paper Abstract

Extreme classification tasks are multi-label tasks with an extremely large number of labels (tags). These tasks are hard because the label space is usually (i) very large, e.g. thousands or millions of labels, (ii) very sparse, i.e. very few labels apply to each input document, and (iii) highly correlated, meaning that the existence of one label changes the likelihood of predicting all other labels. In this work, we propose a self-attention based variational encoder model to jointly extract the label-label and label-feature dependencies and to predict labels for a given input. In more detail, we propose a non-autoregressive latent variable model and compare it to a strong autoregressive baseline that predicts each label conditioned on all previously generated labels. Our model can therefore predict all labels in parallel while still capturing both label-label and label-feature dependencies through latent variables, and it compares favourably to the autoregressive baseline. We apply our models to four standard extreme classification natural language datasets, and one news video dataset for automated label detection from a lexicon of semantic concepts. Experimental results show that although autoregressive models, which use a given order of the labels for chain-order label prediction, work well for small label sets or for predicting the highest-ranked labels, our non-autoregressive model surpasses them by around 2% to 6% when more labels must be predicted or when the dataset has a larger number of labels.
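To make the contrast in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of the two decoding styles: an autoregressive decoder that predicts labels one at a time, conditioning each step on previously generated labels, and a non-autoregressive decoder that predicts all labels in parallel, routing label-label dependencies through a latent variable. The module names, shapes, GRU choice, and Gaussian latent are illustrative assumptions.

```python
# Illustrative sketch only; all names and architectural choices are assumptions.
import torch
import torch.nn as nn

class AutoregressiveLabelDecoder(nn.Module):
    """Chain-order prediction: each label is conditioned on all
    previously generated labels via a recurrent state."""
    def __init__(self, num_labels, hidden_dim):
        super().__init__()
        self.label_emb = nn.Embedding(num_labels + 1, hidden_dim)  # +1 for <bos>
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_labels)

    def forward(self, doc_feature, prev_labels):
        # doc_feature: (B, H) encoder summary; prev_labels: (B, T) label ids
        h0 = doc_feature.unsqueeze(0)        # init decoder state from the input
        x = self.label_emb(prev_labels)      # (B, T, H)
        out, _ = self.rnn(x, h0)             # each step sees all earlier labels
        return self.out(out)                 # (B, T, num_labels): next-label logits

class NonAutoregressiveLabelDecoder(nn.Module):
    """Parallel prediction: all label logits are produced at once;
    label-label dependencies are captured by a latent variable z."""
    def __init__(self, num_labels, hidden_dim, latent_dim):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.out = nn.Linear(latent_dim, num_labels)

    def forward(self, doc_feature):
        # doc_feature: (B, H) encoder summary
        mu = self.to_mu(doc_feature)
        logvar = self.to_logvar(doc_feature)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.out(z)                   # (B, num_labels): all logits in parallel
```

The key design difference the paper studies: the autoregressive decoder must commit to a label order and decode sequentially, while the non-autoregressive decoder needs only one forward pass, which matters as the number of labels to predict grows.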
