Paper Title

Who Evaluates the Evaluators? On Automatic Metrics for Assessing AI-based Offensive Code Generators

Paper Authors

Pietro Liguori, Cristina Improta, Roberto Natella, Bojan Cukic, Domenico Cotroneo

Abstract

AI-based code generators are an emerging solution for automatically writing programs starting from descriptions in natural language, by using deep neural networks (Neural Machine Translation, NMT). In particular, code generators have been used for ethical hacking and offensive security testing by generating proof-of-concept attacks. Unfortunately, the evaluation of code generators still faces several issues. The current practice uses output similarity metrics, i.e., automatic metrics that compute the textual similarity of the generated code with ground-truth references. However, it is not clear which metric to use and which metric is most suitable for specific contexts. This work analyzes a large set of output similarity metrics on offensive code generators. We apply the metrics to two state-of-the-art NMT models using two datasets containing offensive assembly and Python code with their descriptions in the English language. We compare the estimates from the automatic metrics with human evaluation and provide practical insights into their strengths and limitations.
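To give a concrete sense of what an output similarity metric computes, the sketch below scores a generated code snippet against a ground-truth reference with sentence-level BLEU via NLTK. The code snippets, the whitespace tokenization, and the smoothing choice are illustrative assumptions, not the paper's actual evaluation pipeline, which covers a broader set of metrics on assembly and Python datasets.

```python
# Minimal sketch: token-level BLEU between generated code and a
# ground-truth reference (illustrative example, not the paper's setup).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "proc = subprocess.Popen(cmd, shell=True)"   # ground-truth code
candidate = "proc = subprocess.Popen(cmd)"                # model-generated code

# Naive whitespace tokenization; real evaluations typically use a
# code-aware tokenizer.
ref_tokens = reference.split()
cand_tokens = candidate.split()

# sentence_bleu expects a list of reference token lists; smoothing avoids
# zero scores when higher-order n-grams have no overlap.
score = sentence_bleu([ref_tokens], cand_tokens,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU similarity: {score:.3f}")
```

A higher score indicates closer textual overlap with the reference, but, as the paper discusses, textual similarity does not necessarily imply that the generated code is correct or functionally equivalent.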
