Paper Title

Towards Understanding the Impacts of Textual Dissimilarity on Duplicate Bug Report Detection

Authors

Sigma Jahan, Mohammad Masudur Rahman

Abstract

About 40% of software bug reports are duplicates of one another, which poses a major overhead during software maintenance. Traditional techniques often focus on detecting duplicate bug reports that are textually similar. However, in bug tracking systems, many duplicate bug reports might not be textually similar, for which the traditional techniques might fall short. In this paper, we conduct a large-scale empirical study to better understand the impacts of textual dissimilarity on the detection of duplicate bug reports. First, we collect a total of 92,854 bug reports from three open-source systems and construct two datasets containing textually similar and textually dissimilar duplicate bug reports. Then, we determine the performance of three existing techniques in detecting duplicate bug reports and show that their performance is significantly poor for textually dissimilar duplicate reports. Second, we analyze the two groups of bug reports using a combination of descriptive analysis, word embedding visualization, and manual analysis. We find that textually dissimilar duplicate bug reports often miss important components (e.g., expected behaviors and steps to reproduce), which could lead to their textual differences and poor performance by the existing techniques. Finally, we apply domain-specific embedding to duplicate bug report detection problems, which shows mixed results. All these findings warrant further investigation and more effective solutions for detecting textually dissimilar duplicate bug reports.
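To make the distinction between textually similar and textually dissimilar duplicates concrete, the following is a minimal sketch of how known duplicate pairs could be partitioned by a lexical similarity score. It is not the authors' procedure: the abstract does not state which similarity measure or cut-off the study uses, so the cosine similarity over raw term frequencies, the `threshold` value of 0.5, the function names, and the example reports are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    freq_a, freq_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty report shares nothing with any other report
    return dot / (norm_a * norm_b)


def split_duplicate_pairs(pairs, threshold=0.5):
    """Partition known duplicate pairs into similar and dissimilar groups.

    `threshold` is a hypothetical cut-off, not a value taken from the paper.
    """
    similar, dissimilar = [], []
    for report_a, report_b in pairs:
        score = cosine_similarity(report_a, report_b)
        bucket = similar if score >= threshold else dissimilar
        bucket.append((report_a, report_b, score))
    return similar, dissimilar


# Made-up duplicate pairs, for illustration only.
pairs = [
    ("Browser crashes when opening a PDF file",
     "Crash when opening a PDF file in the browser"),
    ("Browser crashes when opening a PDF file",
     "Application quits unexpectedly after I click a document link"),
]
similar, dissimilar = split_duplicate_pairs(pairs)
```

In this toy input, the first pair shares most of its vocabulary and lands in `similar`, while the second pair describes the same failure in different words and lands in `dissimilar`. A purely lexical detector would struggle with the second kind of pair, which is the situation the study examines.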
