Paper Title
Predicting Flaky Tests Categories using Few-Shot Learning
Authors
Abstract
Flaky tests are tests that yield different outcomes when run on the same version of a program. This non-deterministic behaviour plagues continuous integration with false signals, wasting developers' time and reducing their trust in test suites. Studies have highlighted the importance of keeping tests free of flakiness. Recently, the research community has advanced the detection of flaky tests by proposing many static and dynamic approaches. While promising, those approaches mainly focus on classifying tests as flaky or not and, even when high performance is reported, it remains challenging to understand the cause of flakiness. Understanding the cause is crucial for researchers and developers who aim to fix flaky tests. To help with the comprehension of a given flaky test, we propose FlakyCat, the first approach for classifying flaky tests based on their root cause category. FlakyCat relies on CodeBERT for code representation and leverages a Siamese network-based few-shot learning method to train a multi-class classifier with little data. We train and evaluate FlakyCat on a set of 343 flaky tests collected from open-source Java projects. Our evaluation shows that FlakyCat categorises flaky tests accurately, with a weighted F1 score of 70%. Furthermore, we investigate the performance of our approach for each category, revealing that Async waits, Unordered collections and Time-related flaky tests are accurately classified, while Concurrency-related flaky tests are more challenging to predict. Finally, to facilitate the comprehension of FlakyCat's predictions, we present a new technique for CodeBERT-based model interpretability that highlights code statements influencing the categorisation.
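To make the few-shot setting and the reported metric more concrete, the following is a minimal sketch and not FlakyCat's implementation. It assumes that each test has already been mapped to an embedding vector (in the paper this role is played by CodeBERT and a trained Siamese network; here the vectors are hand-written toy values). As a stand-in for the learned similarity, it assigns a query test to the category whose support-set centroid is closest under cosine similarity, and it computes a support-weighted F1 score. The function names (cosine, class_centroids, predict, weighted_f1) and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def class_centroids(support):
    """Average the embeddings of the few labelled examples available per category.

    support: list of (embedding, label) pairs. Returns {label: mean embedding}.
    """
    grouped = defaultdict(list)
    for vec, label in support:
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(query, centroids):
    """Return the category whose centroid is most similar to the query embedding."""
    return max(centroids, key=lambda label: cosine(query, centroids[label]))


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to each class's true count."""
    total = len(y_true)
    score = 0.0
    for label in set(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += f1 * (tp + fn) / total
    return score


# Toy 2-D "embeddings" (assumption: real embeddings would be high-dimensional).
support = [
    ([1.0, 0.0], "Async wait"),
    ([0.9, 0.1], "Async wait"),
    ([0.0, 1.0], "Time"),
    ([0.1, 0.9], "Time"),
]
centroids = class_centroids(support)
queries = [[0.8, 0.2], [0.2, 0.8]]
predictions = [predict(q, centroids) for q in queries]
print(predictions)
print(weighted_f1(["Async wait", "Time"], predictions))
```

Weighting each class's F1 by its number of true examples means that categories with more flaky tests contribute more to the overall score, which matters when categories are unevenly represented in a small dataset.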