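Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis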

Paper title

Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis

Authors

Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey

Abstract

Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR.
