A Differentially Private Text Perturbation Method Using a Regularized Mahalanobis Metric

Paper title

A Differentially Private Text Perturbation Method Using a Regularized Mahalanobis Metric

Authors

Zekun Xu, Abhinav Aggarwal, Oluwaseyi Feyisetan, Nathanael Teissier

Abstract

Balancing the privacy-utility tradeoff is a crucial requirement of many practical machine learning systems that deal with sensitive customer data. A popular approach to privacy-preserving text analysis is noise injection: text data is first mapped into a continuous embedding space, perturbed with spherical noise sampled from an appropriate distribution, and then projected back to the discrete vocabulary space. While this lets the perturbation satisfy the required metric differential privacy, the utility of downstream tasks modeled on the perturbed data is often low, because spherical noise does not account for the variability in density around different words in the embedding space. In particular, words in a sparse region are likely to remain unchanged even when the noise scale is large. Using the global sensitivity of the mechanism can add too much noise to words in the dense regions of the embedding space, causing a high utility loss, whereas using local sensitivity can leak information through the scale of the noise added. In this paper, we propose a text perturbation mechanism based on a carefully designed regularized variant of the Mahalanobis metric to overcome this problem. For any given noise scale, this metric adds elliptical noise to account for the covariance structure in the embedding space. This heterogeneity in the noise scale along different directions helps ensure that words in sparse regions have a sufficient likelihood of replacement without sacrificing overall utility. We provide a text perturbation algorithm based on this metric and formally prove its privacy guarantees. Additionally, we empirically show that our mechanism improves the privacy statistics at the same level of utility compared to the state-of-the-art Laplace mechanism.
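
The pipeline described in the abstract (embed a word, add elliptical noise, project back to the nearest vocabulary word) can be illustrated with a minimal sketch. The abstract does not give the exact form of the regularized metric, so everything specific here is an assumption for illustration: the helper names `regularized_sqrt_cov` and `perturb_word`, the mixing parameter `lam`, the regularized matrix `lam * Sigma + (1 - lam) * I` with a trace-normalized covariance `Sigma`, and the Gamma-distributed noise radius. The toy embeddings are random numbers, not real word vectors.

```python
import numpy as np

def regularized_sqrt_cov(emb, lam):
    """Symmetric square root of lam * Sigma + (1 - lam) * I.

    Sigma is the sample covariance of the embeddings, rescaled so its
    trace equals the dimension. This regularized form is an assumption
    for illustration; the abstract does not spell it out.
    """
    m = emb.shape[1]
    sigma = np.cov(emb, rowvar=False)
    sigma = sigma * (m / np.trace(sigma))
    reg = lam * sigma + (1.0 - lam) * np.eye(m)
    vals, vecs = np.linalg.eigh(reg)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def perturb_word(idx, emb, sqrt_cov, epsilon, rng):
    """Add elliptical noise to word idx and return the nearest word index."""
    m = emb.shape[1]
    # Uniform direction on the unit sphere.
    direction = rng.standard_normal(m)
    direction /= np.linalg.norm(direction)
    # Noise radius; smaller epsilon gives larger noise (assumed Gamma form).
    radius = rng.gamma(shape=m, scale=1.0 / epsilon)
    # Stretch the spherical direction into an ellipse via the matrix root.
    noisy = emb[idx] + radius * (sqrt_cov @ direction)
    # Project back to the discrete vocabulary by nearest neighbor.
    dists = np.linalg.norm(emb - noisy, axis=1)
    return int(np.argmin(dists))

# Toy usage with random "embeddings" (hypothetical data).
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((50, 8))
root = regularized_sqrt_cov(embeddings, lam=0.5)
replacement = perturb_word(3, embeddings, root, epsilon=5.0, rng=rng)
print(replacement)
```

In this sketch, setting `lam=0` reduces the noise to the spherical case the abstract criticizes, while larger `lam` stretches the noise along the directions in which the embeddings vary most.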
