Paper Title

When and Why Test Generators for Deep Learning Produce Invalid Inputs: an Empirical Study

Paper Authors

Vincenzo Riccio, Paolo Tonella

Abstract

Testing Deep Learning (DL) based systems inherently requires large and representative test sets to evaluate whether DL systems generalise beyond their training datasets. Diverse Test Input Generators (TIGs) have been proposed to produce artificial inputs that expose issues of the DL systems by triggering misbehaviours. Unfortunately, such generated inputs may be invalid, i.e., not recognisable as part of the input domain, thus providing an unreliable quality assessment. Automated validators can ease the burden of manually checking the validity of inputs for human testers, although input validity is a concept difficult to formalise and, thus, automate. In this paper, we investigate to what extent TIGs can generate valid inputs, according to both automated and human validators. We conduct a large empirical study, involving 2 different automated validators, 220 human assessors, 5 different TIGs and 3 classification tasks. Our results show that 84% of artificially generated inputs are valid, according to automated validators, but their expected label is not always preserved. Automated validators reach a good consensus with humans (78% accuracy), but still have limitations when dealing with feature-rich datasets.
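For intuition, below is a minimal, hypothetical sketch of what an automated validity check for generated inputs could look like: it fits a Gaussian to feature vectors of in-domain training inputs and flags generated inputs whose Mahalanobis distance exceeds a percentile threshold. The feature representation, the Gaussian model, and the 99th-percentile cutoff are illustrative assumptions; this is not one of the validators evaluated in the paper.

```python
# Illustrative sketch of an automated validity check (an assumption, not the
# paper's validators): model the in-domain training distribution and reject
# generated inputs that fall far outside it.
import numpy as np


def fit_domain_model(train_features: np.ndarray):
    """Estimate mean and regularised inverse covariance of in-domain features."""
    mean = train_features.mean(axis=0)
    cov = np.cov(train_features, rowvar=False) + 1e-6 * np.eye(train_features.shape[1])
    return mean, np.linalg.inv(cov)


def distance_scores(features: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray):
    """Mahalanobis distance of each input to the training distribution."""
    diff = features - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(1000, 8))                 # stand-in for in-domain features
    generated = np.vstack([
        rng.normal(size=(5, 8)),                       # near the training distribution
        rng.normal(loc=6.0, size=(5, 8)),              # far from it -> likely invalid
    ])
    mean, inv_cov = fit_domain_model(train)
    threshold = np.percentile(distance_scores(train, mean, inv_cov), 99)
    is_valid = distance_scores(generated, mean, inv_cov) <= threshold
    print("valid:", is_valid)
```

Such a check only approximates membership in the input domain; as the abstract notes, it says nothing about whether a valid-looking input still preserves its expected label, which is why human assessment remains part of the study.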
