Paper title
POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis
Paper authors
Paper abstract
Authorship verification (AV) is a fundamental research task in digital text forensics that addresses the question of whether two texts were written by the same person. In recent years, a variety of AV methods have been proposed, which can be divided into two categories. The first category comprises methods based on explicitly defined features, where one has full control over which features are considered and what they actually represent. The second category comprises methods based on implicitly defined features, where no control mechanism is involved, so that any character sequence in a text can serve as a potential feature. However, AV methods in the second category bear the risk that the topic of the texts biases their classification predictions, which in turn may lead to misleading conclusions about their results. To tackle this problem, we propose a preprocessing technique called POSNoise, which effectively masks topic-related content in a given text. In this way, AV methods are forced to focus on text units that are more related to writing style. Our empirical evaluation, based on six AV methods (all falling into the second category) and seven corpora, shows that POSNoise leads to better results than a well-known topic masking approach in 34 out of 42 cases, with an increase in accuracy of up to 10%.
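To make the idea of masking topic-related content concrete, the following is a minimal sketch of POS-based masking in general, not the authors' implementation. The abstract does not specify how POSNoise decides which tokens to mask or what it replaces them with, so everything here is an assumption for illustration: the toy lexicon `TOY_POS`, the placeholder symbols in `PLACEHOLDER`, and the function name `mask_topic_words` are all hypothetical. A realistic pipeline would presumably rely on a trained part-of-speech tagger rather than a hand-written word list.

```python
import re

# Hypothetical toy lexicon mapping words to coarse POS tags.
# Assumption for illustration only; it stands in for a real POS tagger.
TOY_POS = {
    "doctor": "NOUN", "patient": "NOUN", "medicine": "NOUN",
    "prescribed": "VERB", "took": "VERB",
    "new": "ADJ", "quickly": "ADV",
}

# Hypothetical placeholder symbols, one per masked POS category.
PLACEHOLDER = {"NOUN": "#", "VERB": "%", "ADJ": "@", "ADV": "&"}


def mask_topic_words(text: str) -> str:
    """Replace words found in the toy lexicon with a POS placeholder.

    Words that are not in the lexicon (here: function words such as
    "the" or "a") and punctuation marks are kept unchanged.
    """
    # Split into word tokens and single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    masked = []
    for token in tokens:
        tag = TOY_POS.get(token.lower())          # None if the word is unknown
        masked.append(PLACEHOLDER.get(tag, token))  # keep token if no placeholder
    return " ".join(masked)


if __name__ == "__main__":
    print(mask_topic_words("The doctor quickly prescribed a new medicine."))
```

In this sketch the content words that carry the medical topic are replaced, while the function words and punctuation that surround them survive, which is the kind of residue a style-focused AV method could still use.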