Paper title
Is there something I'm missing? Topic Modeling in eDiscovery
Paper authors
Paper abstract
In legal eDiscovery, the parties are required to search their electronically stored information for documents relevant to a specific case. Negotiations over the scope of these searches are often driven by a fear that something will be missed. This paper continues an argument that discovery should be based on identifying the facts of a case. A search process that is less than complete (Recall below 100%) may still be complete in presenting all of the relevant available topics. In this study, Latent Dirichlet Allocation was used to identify 100 topics from all of the known relevant documents. The documents were then categorized to about 80% Recall (i.e., 80% of the relevant documents were found by the categorizer, designated the hit set, and 20% were missed, designated the missed set). Although the categorizer identified fewer than all of the relevant documents, the documents it did identify contained all of the topics derived from the full set of documents. The same pattern held whether the categorizer was a naïve Bayes categorizer trained on a random selection of documents or a Support Vector Machine trained with Continuous Active Learning (which focuses evaluation on the documents most likely to be relevant). No topics were identified in either categorizer's missed set that were not already seen in the hit set. A computer-assisted search process is thus not only reasonable (as required by the Federal Rules of Civil Procedure) but also complete when measured by topics.
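The hit-set versus missed-set comparison described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the study's implementation: the function names (`recall`, `topics_in`, `topics_only_in_missed`), the toy document IDs, and the toy topic assignments are all hypothetical. It also assumes each document has already been mapped to a set of topic indices (for example, by thresholding per-document topic proportions from a fitted topic model), a step the abstract does not specify.

```python
from typing import Dict, Iterable, Set


def recall(hit_ids: Set[str], relevant_ids: Set[str]) -> float:
    """Fraction of the known relevant documents that the categorizer found."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    return len(hit_ids & relevant_ids) / len(relevant_ids)


def topics_in(doc_ids: Iterable[str], doc_topics: Dict[str, Set[int]]) -> Set[int]:
    """Union of the topic indices assigned to the given documents."""
    covered: Set[int] = set()
    for doc_id in doc_ids:
        covered |= doc_topics.get(doc_id, set())
    return covered


def topics_only_in_missed(
    hit_ids: Set[str],
    relevant_ids: Set[str],
    doc_topics: Dict[str, Set[int]],
) -> Set[int]:
    """Topics that appear in the missed set but in no document of the hit set."""
    found_ids = hit_ids & relevant_ids   # relevant documents the categorizer found
    missed_ids = relevant_ids - hit_ids  # relevant documents it did not find
    return topics_in(missed_ids, doc_topics) - topics_in(found_ids, doc_topics)


# Toy data invented for illustration; these are NOT results from the study.
# Assumption: each relevant document carries a precomputed set of topic indices.
doc_topics = {
    "d1": {0, 1},
    "d2": {1, 2},
    "d3": {2},
    "d4": {0, 3},
    "d5": {3},
}
relevant = set(doc_topics)
hits = {"d1", "d2", "d3", "d4"}  # 4 of 5 relevant documents found

print(recall(hits, relevant))
print(topics_only_in_missed(hits, relevant, doc_topics))
```

An empty result from `topics_only_in_missed` corresponds to the outcome the abstract reports: every topic present in the missed set is also present in the hit set. A non-empty result would list the topics a reviewer could only learn about from the missed documents.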