Paper Title
Topological Data Analysis in Text Classification: Extracting Features with Additive Information
Authors
Abstract
While the strength of Topological Data Analysis (TDA) has been explored in many studies on high-dimensional numeric data, applying it to text remains a challenging task. The primary goal of topological data analysis is to define and quantify shapes in numeric data, and defining shapes in text is much more challenging, even though the geometries of vector spaces and conceptual spaces are clearly relevant for information retrieval and semantics. In this paper, we examine two different methods for extracting topological features from text, using the two most popular underlying representations of words, namely word embeddings and TF-IDF vectors. To extract topological features from the word embedding space, we interpret the embedding of a text document as a high-dimensional time series, and we analyze the topology of the underlying graph whose vertices correspond to different embedding dimensions. For topological data analysis with the TF-IDF representations, we analyze the topology of the graph whose vertices come from the TF-IDF vectors of different blocks in the textual document. In both cases, we apply homological persistence to reveal the geometric structures at different distance resolutions. Our results show that these topological features carry some exclusive information that is not captured by conventional text mining methods. In our experiments, we observe that adding topological features to the conventional features in ensemble models improves the classification results (by up to 5%). On the other hand, as expected, topological features by themselves may not be sufficient for effective classification. It remains an open question whether TDA features from word embeddings alone might be sufficient, as they seem to perform within a few points of the top results obtained with a linear support vector classifier.
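To make the word-embedding pipeline described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It treats each embedding dimension as a time series over the tokens of a document, builds a distance matrix between dimensions, and computes 0-dimensional persistence (the scales at which connected components merge). Several details are assumptions made for illustration only: the distance `1 - |Pearson correlation|`, the restriction to 0-dimensional homology, the random stand-in for real embeddings, and the helper names `dimension_distances` and `h0_persistence`.

```python
import numpy as np


def h0_persistence(dist):
    """Death scales of 0-dimensional classes in a Vietoris-Rips filtration.

    Every vertex is born at scale 0. A connected component dies when an edge
    merges it into another one, so the death scales are exactly the edge
    weights of a minimum spanning tree (Kruskal's algorithm with union-find).
    """
    n = dist.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All undirected edges (i < j), processed from shortest to longest.
    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(float(weight))
    return deaths  # n - 1 finite deaths; one component never dies


def dimension_distances(doc_embeddings):
    """Pairwise distances between embedding dimensions viewed as time series.

    doc_embeddings has shape (num_tokens, embedding_dim).
    Assumption (not from the paper): distance = 1 - |Pearson correlation|.
    A dimension that is constant over the document would give NaN here.
    """
    corr = np.corrcoef(doc_embeddings.T)  # shape (embedding_dim, embedding_dim)
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)
    return dist


# Hypothetical document: 50 tokens with 8-dimensional embeddings (random stand-in).
rng = np.random.default_rng(0)
doc = rng.normal(size=(50, 8))

deaths = h0_persistence(dimension_distances(doc))
features = np.array(deaths)  # fixed-length vector of embedding_dim - 1 merge scales
```

The resulting `features` vector could then be concatenated with conventional features (for example TF-IDF) before training a classifier, which is the kind of combination the abstract reports as helpful. Higher-dimensional persistence would need a dedicated TDA library and is left out of this sketch.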