Marvolo: Programmatic Data Augmentation for Practical ML-Driven Malware Detection

Paper Title

Marvolo: Programmatic Data Augmentation for Practical ML-Driven Malware Detection

Authors

Michael D. Wong, Edward Raff, James Holt, Ravi Netravali

Abstract

Data augmentation has been rare in the cyber security domain due to technical difficulties in altering data in a manner that is semantically consistent with the original data. This shortfall is particularly onerous given the unique difficulty of acquiring benign and malicious training data that runs into copyright restrictions, and that institutions like banks and governments receive targeted malware that will never exist in large quantities. We present MARVOLO, a binary mutator that programmatically grows malware (and benign) datasets in a manner that boosts the accuracy of ML-driven malware detectors. MARVOLO employs semantics-preserving code transformations that mimic the alterations that malware authors and defensive benign developers routinely make in practice, allowing us to generate meaningful augmented data. Crucially, semantics-preserving transformations also enable MARVOLO to safely propagate labels from original to newly-generated data samples without mandating expensive reverse engineering of binaries. Further, MARVOLO embeds several key optimizations that keep costs low for practitioners by maximizing the density of diverse data samples generated within a given time (or resource) budget. Experiments using wide-ranging commercial malware datasets and a recent ML-driven malware detector show that MARVOLO boosts accuracies by up to 5%, while operating on only a small fraction (15%) of the potential input binaries.
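
The idea of label-preserving augmentation described in the abstract can be illustrated with a toy sketch. This is not MARVOLO's mutator, which operates on real binaries; it assumes a made-up accumulator machine, and every name in it (`run`, `insert_nop`, `substitute_add_sub`, `augment`, the `budget` parameter) is hypothetical. It shows only why a semantics-preserving transformation lets a variant inherit the label of the sample it came from, and how a budget can cap the number of distinct variants generated.

```python
import random

# Toy accumulator machine (an assumption for illustration only):
# a "binary" is a list of (opcode, operand) pairs applied to one integer.
def run(program, x):
    acc = x
    for op, arg in program:
        if op == "add":
            acc += arg
        elif op == "sub":
            acc -= arg
        elif op == "mul":
            acc *= arg
        elif op == "nop":
            pass
        else:
            raise ValueError(f"unknown opcode: {op}")
    return acc

def insert_nop(program, rng):
    # Semantics-preserving: a no-op at any position changes nothing.
    i = rng.randrange(len(program) + 1)
    return program[:i] + [("nop", 0)] + program[i:]

def substitute_add_sub(program, rng):
    # Semantics-preserving: "add k" is equivalent to "sub -k" and vice versa.
    idxs = [i for i, (op, _) in enumerate(program) if op in ("add", "sub")]
    if not idxs:
        return list(program)
    i = rng.choice(idxs)
    op, arg = program[i]
    flipped = ("sub" if op == "add" else "add", -arg)
    return program[:i] + [flipped] + program[i + 1:]

TRANSFORMS = [insert_nop, substitute_add_sub]

def augment(dataset, budget, seed=0):
    """Return up to `budget` distinct variants, each carrying its source label."""
    rng = random.Random(seed)
    seen = {tuple(p) for p, _ in dataset}
    out = []
    attempts = 0
    # The attempt cap guards against looping forever on tiny datasets.
    while len(out) < budget and attempts < 50 * budget:
        attempts += 1
        program, label = rng.choice(dataset)
        variant = rng.choice(TRANSFORMS)(program, rng)
        key = tuple(variant)
        if key not in seen:  # keep only variants not produced before
            seen.add(key)
            out.append((variant, label))  # label propagated unchanged
    return out

if __name__ == "__main__":
    data = [
        ([("add", 3), ("mul", 2), ("sub", 1)], "malicious"),
        ([("mul", 5), ("add", 7)], "benign"),
    ]
    for variant, label in augment(data, budget=4):
        print(label, variant)
```

Because each transformation leaves the program's behavior unchanged, no re-analysis of a variant is needed to decide its label; the deduplication step is a crude stand-in for the abstract's goal of maximizing diverse samples within a budget.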
