Risk of re-identification for shared clinical speech recordings

Title

Risk of re-identification for shared clinical speech recordings

Authors

Wiepert, Daniela A., Malin, Bradley A., Duffy, Joseph R., Utianski, Rene L., Stricker, John L., Jones, David T., Botha, Hugo

Abstract

Large, curated datasets are required to leverage speech-based tools in healthcare. These are costly to produce, resulting in increased interest in data sharing. As speech can potentially identify speakers (i.e., voiceprints), sharing recordings raises privacy concerns. We examine the re-identification risk for speech recordings, without reference to demographics or metadata, using a state-of-the-art speaker recognition system. We demonstrate that the risk is inversely related to the number of comparisons an adversary must consider, i.e., the search space. Risk is high for a small search space but drops as the search space grows (precision $> 0.85$ for $< 1 \times 10^{6}$ comparisons, precision $< 0.5$ for $> 3 \times 10^{6}$ comparisons). Next, we show that the nature of a speech recording influences re-identification risk, with non-connected speech (e.g., vowel prolongation) being harder to identify. Our findings suggest that speaker recognition systems can be used to re-identify participants in specific circumstances, but in practice, the re-identification risk appears low.
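One way to see why precision can fall as the search space grows is a simple expected-value sketch: if a matcher has a fixed false-positive rate per comparison, the expected number of false matches scales with the number of comparisons, while the number of true matches does not. The function name, the rates, and the match count below are hypothetical values chosen for illustration; they are not taken from the paper, and this is not the authors' evaluation method.

```python
def expected_precision(n_comparisons, n_true_matches, tpr, fpr):
    """Expected precision TP / (TP + FP) for a threshold-based matcher.

    n_comparisons  : total speaker pairs the adversary must consider
    n_true_matches : pairs that really are the same speaker
    tpr, fpr       : assumed per-comparison true/false positive rates
    """
    expected_tp = n_true_matches * tpr
    expected_fp = (n_comparisons - n_true_matches) * fpr
    total = expected_tp + expected_fp
    return expected_tp / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical operating point, for illustration only.
    for n in (10**4, 10**5, 10**6, 10**7):
        p = expected_precision(n, n_true_matches=100, tpr=0.9, fpr=1e-5)
        print(f"{n:>10,d} comparisons -> expected precision {p:.3f}")
```

Under these assumed rates, the true matches stay fixed while false matches accumulate with every additional comparison, so precision declines as the search space grows, which is the qualitative trend the abstract describes.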
