Paper title
BERTopic: Neural topic modeling with a class-based TF-IDF procedure
Paper authors
Paper abstract
Topic models can be useful tools for discovering latent topics in collections of documents. Recent studies have shown the feasibility of approaching topic modeling as a clustering task. We present BERTopic, a topic model that extends this process by extracting coherent topic representations through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embeddings with pre-trained transformer-based language models, clusters these embeddings, and finally generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach to topic modeling.
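The last stage of the pipeline described in the abstract, the class-based TF-IDF step, can be illustrated with a minimal sketch. The abstract does not state the weighting formula, so the form used here, W(t, c) = tf(t, c) * log(1 + A / tf(t)) with A the average number of words per class, is an assumption. The embedding and clustering stages are not reproduced: the cluster labels are supplied by hand, and the helper names `class_based_tfidf` and `top_terms`, the whitespace tokenizer, and the toy documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_based_tfidf(docs, labels):
    """Return {label: {term: weight}} using an assumed class-based TF-IDF form."""
    # Treat all documents in one cluster as a single class-level bag of words.
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        class_counts[label].update(doc.lower().split())

    # Frequency of each term summed over every class.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)

    # Assumed weighting: tf(t, c) * log(1 + A / tf(t)).
    return {
        label: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }


def top_terms(weights, n=3):
    """Pick the n highest-weighted terms per class; ties break alphabetically."""
    return {
        label: [t for t, _ in sorted(w.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for label, w in weights.items()
    }


# Toy corpus; in the real pipeline the labels would come from clustering
# document embeddings, which this sketch does not attempt.
docs = [
    "the cat chased the mouse",
    "the cat and the dog",
    "the stock market fell",
    "the market and the stock price",
]
labels = [0, 0, 1, 1]

weights = class_based_tfidf(docs, labels)
print(top_terms(weights, n=2))
```

Because term counts are pooled per cluster rather than per document, a word that is frequent inside one cluster but rare elsewhere (such as "cat" here) outranks a word that is frequent everywhere (such as "the"), which is the intuition behind using the highest-weighted terms as a topic representation.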