Multi-features based Semantic Augmentation Networks for Named Entity Recognition in Threat Intelligence

Paper Title

Multi-features based Semantic Augmentation Networks for Named Entity Recognition in Threat Intelligence

Authors

Peipei Liu, Hong Li, Zuoguang Wang, Jie Liu, Yimo Ren, Hongsong Zhu

Abstract

Extracting cybersecurity entities such as attackers and vulnerabilities from unstructured network texts is an important part of security analysis. However, the sparsity of intelligence data, which results from the high-frequency variation and randomness of cybersecurity entity names, makes it difficult for current methods to perform well in extracting security-related concepts and entities. To this end, we propose a semantic augmentation method that incorporates different linguistic features to enrich the representation of input tokens, in order to detect and classify cybersecurity names in unstructured text. In particular, we encode and aggregate the constituent, morphological, and part-of-speech features of each input token to improve the robustness of the method. In addition, each token receives augmented semantic information from two sources: its K most similar words in a cybersecurity domain corpus, where an attentive module is used to weigh the differences among those words, and contextual clues based on a large-scale general-domain corpus. We have conducted experiments on the cybersecurity datasets DNRTI and MalwareTextDB, and the results demonstrate the effectiveness of the proposed method.
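
The "K most similar words plus attentive module" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how similarity is scored or how the augmented information is combined with the token representation, so the cosine-similarity lookup, dot-product attention, and concatenation below are all assumptions. The toy vocabulary and its vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k_similar(token, embeddings, k):
    """Return the k words closest to `token` by cosine similarity (assumed metric)."""
    query = embeddings[token]
    scored = [(cosine(query, vec), word)
              for word, vec in embeddings.items() if word != token]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:k]]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def augment(token, embeddings, k=2):
    """Concatenate a token vector with an attention-weighted sum of its neighbours.

    Dot-product attention and concatenation are assumptions, not the paper's design.
    """
    query = embeddings[token]
    neighbours = top_k_similar(token, embeddings, k)
    scores = [sum(a * b for a, b in zip(query, embeddings[w])) for w in neighbours]
    weights = softmax(scores)
    context = [sum(wt * embeddings[w][i] for wt, w in zip(weights, neighbours))
               for i in range(len(query))]
    return query + context, dict(zip(neighbours, weights))

# Hypothetical 3-dimensional embeddings for a tiny domain vocabulary.
toy_embeddings = {
    "wannacry": [0.9, 0.1, 0.0],
    "ransomware": [0.8, 0.2, 0.1],
    "petya": [0.85, 0.15, 0.05],
    "firewall": [0.0, 0.1, 0.9],
}

augmented, attention = augment("wannacry", toy_embeddings, k=2)
```

Here `augmented` has twice the dimensionality of the input vector, and `attention` maps each selected neighbour to its weight. In a real system the embeddings would come from a model trained on a cybersecurity corpus, and the attention parameters would be learned rather than fixed.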
