Paper title
Blind signal decomposition of various word embeddings based on joint and individual variation explained
Paper authors
Paper abstract
In recent years, natural language processing (NLP) has become one of the most important areas of research, with applications throughout human life. As its most fundamental task, word embedding still requires more attention and study. Existing work on word embeddings focuses on proposing novel embedding algorithms and on dimension-reduction techniques applied to well-trained embeddings. In this paper, we propose to use a joint signal separation method, JIVE (Joint and Individual Variation Explained), to jointly decompose various trained word embeddings into joint and individual components. Through this decomposition framework, we can easily investigate the similarities and differences among different word embeddings. We conducted an extensive empirical study on word2vec, FastText, and GloVe embeddings trained on different corpora and with different dimensions. We compared the performance of the decomposed components on sentiment analysis over Twitter data and the Stanford Sentiment Treebank. We found that mapping different word embeddings into the joint component greatly improves sentiment performance for the original embeddings that performed worse. Moreover, concatenating different components together allows the same model to achieve better performance. These findings provide insights into word embeddings, and our work offers a new way of generating word embeddings by fusion.
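To make the decomposition concrete, the following is a minimal, one-pass sketch of a JIVE-style split of several embedding matrices (rows aligned by vocabulary) into joint and individual low-rank parts. The published JIVE algorithm iterates these steps to convergence and estimates ranks from the data; here `jive_sketch`, the fixed ranks, and the single SVD pass are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def jive_sketch(blocks, joint_rank, individual_ranks):
    """One-pass JIVE-style decomposition (simplified sketch; the
    real algorithm alternates joint/individual estimation until
    convergence).

    blocks: list of (n_words, d_i) embedding matrices whose rows
            correspond to the same vocabulary.
    Returns (joint_parts, individual_parts), one matrix per block.
    """
    # Joint structure: a low-rank basis for the row space shared
    # by the column-wise concatenation of all blocks.
    X = np.hstack(blocks)                        # (n_words, sum d_i)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    P = U[:, :joint_rank]                        # shared basis vectors

    joint_parts, individual_parts = [], []
    for Xi, r in zip(blocks, individual_ranks):
        Ji = P @ (P.T @ Xi)                      # projection onto joint space
        Ri = Xi - Ji                             # residual for this block
        Ur, sr, Vr = np.linalg.svd(Ri, full_matrices=False)
        Ai = (Ur[:, :r] * sr[:r]) @ Vr[:r]       # individual low-rank part
        joint_parts.append(Ji)
        individual_parts.append(Ai)
    return joint_parts, individual_parts
```

Downstream, each block can then be represented by its joint part alone, or by concatenating joint and individual parts along the feature axis, mirroring the two comparisons described in the abstract.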