Paper title
Shedding New Light on the Language of the Dark Web
Paper authors
Abstract
The hidden nature and limited accessibility of the Dark Web, combined with the lack of public datasets in this domain, make it difficult to study its inherent characteristics, such as linguistic properties. Previous work on text classification in the Dark Web domain has suggested that deep neural models may be ineffective, potentially due to linguistic differences between the Dark Web and the Surface Web. However, little work has been done to uncover the linguistic characteristics of the Dark Web. This paper introduces CoDA, a publicly available Dark Web dataset consisting of 10,000 web documents tailored towards text-based Dark Web analysis. By leveraging CoDA, we conduct a thorough linguistic analysis of the Dark Web and examine the textual differences between the Dark Web and the Surface Web. We also assess the performance of various methods of Dark Web page classification. Finally, we compare CoDA with an existing public Dark Web dataset and evaluate their suitability for various use cases.