Paper Title
Norm of Word Embedding Encodes Information Gain
Paper Authors
Paper Abstract
Distributed representations of words encode lexical semantic information, but what type of information is encoded and how? Focusing on the skip-gram with negative-sampling method, we found that the squared norm of static word embedding encodes the information gain conveyed by the word; the information gain is defined by the Kullback-Leibler divergence of the co-occurrence distribution of the word to the unigram distribution. Our findings are explained by the theoretical framework of the exponential family of probability distributions and confirmed through precise experiments that remove spurious correlations arising from word frequency. This theory also extends to contextualized word embeddings in language models or any neural networks with the softmax output layer. We also demonstrate that both the KL divergence and the squared norm of embedding provide a useful metric of the informativeness of a word in tasks such as keyword extraction, proper-noun discrimination, and hypernym discrimination.
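To make the central quantity of the abstract concrete, below is a minimal sketch (our own illustration, not code from the paper) of the information gain KL(p(·|w) ‖ p(·)), i.e., the Kullback-Leibler divergence of a word's co-occurrence distribution from the unigram distribution, computed from raw counts. The function name, variable names, and toy counts are assumptions for illustration only.

```python
import numpy as np

def information_gain(cooc_counts: np.ndarray, unigram_counts: np.ndarray) -> float:
    """KL(p(. | w) || p(.)): divergence of a word's co-occurrence
    distribution from the corpus unigram distribution, in nats."""
    p_w = cooc_counts / cooc_counts.sum()      # co-occurrence distribution of word w
    p = unigram_counts / unigram_counts.sum()  # unigram distribution
    mask = p_w > 0                             # terms with p_w = 0 contribute nothing
    return float(np.sum(p_w[mask] * np.log(p_w[mask] / p[mask])))

# Toy example with a 4-word context vocabulary (hypothetical counts).
unigram = np.array([50.0, 30.0, 15.0, 5.0])   # corpus-wide counts
cooc_w = np.array([2.0, 3.0, 10.0, 25.0])     # counts of contexts co-occurring with word w
print(information_gain(cooc_w, unigram))      # larger value => more informative word
```

Per the abstract, this KL divergence is what the squared norm of the word's skip-gram embedding encodes, so either quantity can serve as an informativeness score once spurious correlations with word frequency are controlled.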