Paper Title

TaxoCom: Topic Taxonomy Completion with Hierarchical Discovery of Novel Topic Clusters

Authors

Dongha Lee, Jiaming Shen, SeongKu Kang, Susik Yoon, Jiawei Han, Hwanjo Yu

Abstract

Topic taxonomies, which represent the latent topic (or category) structure of document collections, provide valuable knowledge of contents in many applications such as web search and information filtering. Recently, several unsupervised methods have been developed to automatically construct the topic taxonomy from a text corpus, but it is challenging to generate the desired taxonomy without any prior knowledge. In this paper, we study how to leverage the partial (or incomplete) information about the topic structure as guidance to find out the complete topic taxonomy. We propose a novel framework for topic taxonomy completion, named TaxoCom, which recursively expands the topic taxonomy by discovering novel sub-topic clusters of terms and documents. To effectively identify novel topics within a hierarchical topic structure, TaxoCom devises its embedding and clustering techniques to be closely-linked with each other: (i) locally discriminative embedding optimizes the text embedding space to be discriminative among known (i.e., given) sub-topics, and (ii) novelty adaptive clustering assigns terms into either one of the known sub-topics or novel sub-topics. Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom not only generates the high-quality topic taxonomy in terms of term coherency and topic coverage but also outperforms all other baselines for a downstream task.
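
To make the idea of assigning terms to known or novel sub-topics concrete, here is a minimal sketch. It is not TaxoCom's actual algorithm: the abstract does not specify how novelty is scored, so this sketch assumes a plain cosine-similarity threshold against known sub-topic centers. The function names, the threshold value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_terms(term_vecs, known_centers, threshold=0.8):
    """Assign each term to its closest known sub-topic, or to "novel".

    Assumption: a term is treated as novel when its best similarity to
    every known sub-topic center falls below `threshold`. The real
    framework may use a different novelty criterion.
    """
    assignments = {}
    for term, vec in term_vecs.items():
        best_topic, best_sim = None, -1.0
        for topic, center in known_centers.items():
            sim = cosine(vec, center)
            if sim > best_sim:
                best_topic, best_sim = topic, sim
        assignments[term] = best_topic if best_sim >= threshold else "novel"
    return assignments


# Toy 2-D embeddings, purely illustrative.
known_centers = {"sports": [1.0, 0.0], "politics": [0.0, 1.0]}
term_vecs = {
    "soccer": [0.9, 0.1],
    "election": [0.1, 0.95],
    "recipe": [0.7, 0.7],  # equally far from both known centers
}
result = assign_terms(term_vecs, known_centers)
print(result)
```

In a recursive setting, the terms labeled "novel" would then be clustered among themselves to propose new sub-topic nodes, and the procedure would repeat one level down for each sub-topic.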
