Paper Title

ASR-Aware End-to-end Neural Diarization

Authors

Aparna Khare, Eunjung Han, Yuguang Yang, Andreas Stolcke

Abstract

We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model. Two categories of features are explored: features derived directly from ASR output (phones, position-in-word and word boundaries) and features derived from a lexical speaker change detection model, trained by fine-tuning a pretrained BERT model on the ASR output. Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features. First, ASR features are concatenated with acoustic features. Second, we propose a new attention mechanism called contextualized self-attention that utilizes ASR features to build robust speaker representations. Finally, multi-task learning is used to train the model to minimize classification loss for the ASR features along with diarization loss. Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features, reducing the diarization error rate (DER) by 20% relative to the baseline.
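
To make two of the ideas in the abstract concrete, the following is a minimal NumPy sketch of frame-wise feature concatenation and of a multi-task objective that adds an auxiliary ASR-feature classification loss to a diarization loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `aux_weight` parameter and its default value, and the choice of a permutation-invariant binary cross-entropy as the diarization loss are all assumptions, and the abstract does not specify how the two losses are weighted.

```python
import itertools
import numpy as np

def concat_features(acoustic, asr):
    """Frame-wise concatenation: (T, Da) and (T, Db) -> (T, Da + Db)."""
    return np.concatenate([acoustic, asr], axis=-1)

def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over all frames and speakers."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(probs)
                          + (1.0 - labels) * np.log(1.0 - probs)))

def pit_diarization_loss(probs, labels):
    """Permutation-invariant BCE (assumed here as the diarization loss):
    take the minimum loss over all orderings of the speaker columns."""
    n_spk = labels.shape[1]
    return min(bce(probs[:, list(perm)], labels)
               for perm in itertools.permutations(range(n_spk)))

def cross_entropy(class_probs, targets, eps=1e-7):
    """Mean cross-entropy for per-frame class targets, e.g. hypothetical
    position-in-word labels. class_probs is (T, C); targets is (T,) ints."""
    picked = class_probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(np.clip(picked, eps, 1.0))))

def multitask_loss(spk_probs, spk_labels, aux_probs, aux_targets,
                   aux_weight=0.5):
    """Diarization loss plus a weighted auxiliary classification loss.
    aux_weight is a hypothetical hyperparameter, not taken from the paper."""
    return (pit_diarization_loss(spk_probs, spk_labels)
            + aux_weight * cross_entropy(aux_probs, aux_targets))
```

In this sketch the auxiliary targets are only needed to compute the training loss, whereas concatenation requires the ASR features as model input. Whether this matches the trade-offs in the paper's architecture is not stated in the abstract.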
