End-to-end speech recognition modeling from de-identified data

Paper title

End-to-end speech recognition modeling from de-identified data

Authors

Martin Flechl, Shou-Chun Yin, Junho Park, Peter Skala

Abstract

De-identification of data used for automatic speech recognition modeling is a critical component in protecting privacy, especially in the medical domain. However, simply removing all personally identifiable information (PII) from end-to-end model training data leads to significant performance degradation, in particular for the recognition of names, dates, locations, and words from similar categories. We propose and evaluate a two-step method for partially recovering this loss. First, PII is identified, and each occurrence is replaced with a random word sequence of the same category. Then, corresponding audio is produced via text-to-speech or by splicing together matching audio fragments extracted from the corpus. These artificial audio/label pairs, together with speaker turns from the original data without PII, are used to train models. We evaluate the performance of this method on in-house data of medical conversations and observe a recovery of almost the entire performance degradation in the general word error rate while still maintaining strong diarization performance. Our main focus is the improvement of recall and precision in the recognition of PII-related words. Depending on the PII category, between $50\%$ and $90\%$ of the performance degradation can be recovered using our proposed method.
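
The first step described in the abstract, replacing each identified PII occurrence with a random word sequence of the same category, can be illustrated with a minimal sketch. It assumes the PII has already been tagged per token; the `LEXICON` contents, the tag names, and the `replace_pii` helper are hypothetical and are not taken from the paper. The second step (producing matching audio via text-to-speech or splicing) is not covered here.

```python
import random

# Hypothetical replacement lexicons; the paper does not specify its lists.
LEXICON = {
    "NAME": ["Alice Moreno", "David Kim", "Priya Shah"],
    "DATE": ["March third", "the twelfth of June", "October ninth"],
    "LOCATION": ["Springfield", "Lakewood", "Riverside"],
}


def replace_pii(tokens, tags, rng):
    """Replace each maximal run of identically tagged PII tokens with a
    random word sequence drawn from the same category.

    `tags` is assumed to hold one label per token, with "O" for non-PII.
    """
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag == "O":
            # Non-PII tokens are copied through unchanged.
            out_tokens.append(tokens[i])
            out_tags.append("O")
            i += 1
            continue
        # Skip the whole run of tokens carrying the same PII tag.
        j = i
        while j < len(tokens) and tags[j] == tag:
            j += 1
        # Draw a surrogate phrase of the same category and keep its tag.
        surrogate = rng.choice(LEXICON[tag]).split()
        out_tokens.extend(surrogate)
        out_tags.extend([tag] * len(surrogate))
        i = j
    return out_tokens, out_tags


tokens = "I saw John Smith in Boston on May fifth".split()
tags = ["O", "O", "NAME", "NAME", "O", "LOCATION", "O", "DATE", "DATE"]
new_tokens, new_tags = replace_pii(tokens, tags, random.Random(0))
print(" ".join(new_tokens))
```

Keeping the category tag on the surrogate tokens is a design choice of this sketch: it would let a later stage know which spans need synthetic audio and which can reuse the original recording.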
