Online End-to-End Neural Diarization with Speaker-Tracing Buffer

Paper title

Online End-to-End Neural Diarization with Speaker-Tracing Buffer

Authors

Yawen Xue, Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu

Abstract

This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism, self-attentive end-to-end neural diarization (SA-EEND). Online diarization inherently raises a speaker permutation problem, because the same speaker may be assigned to different output positions in different chunks of a recording. To avoid this inconsistency, we propose a speaker-tracing buffer mechanism that selects several input frames representing the speaker permutation information from previous chunks and stores them in a buffer. These buffered frames are stacked with the input frames of the current chunk and fed into the self-attention network. Our method ensures consistent diarization outputs across the buffer and the current chunk by checking the correlation between their corresponding outputs. Additionally, we train SA-EEND with variable chunk sizes to mitigate the mismatch between training and inference introduced by the speaker-tracing buffer mechanism. In experiments, online SA-EEND trained with variable chunk sizes achieved diarization error rates (DERs) of 12.54% on CALLHOME and 20.77% on CSJ with an actual latency of 1.4 s.
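The correlation check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of Pearson correlation, and the exhaustive search over speaker permutations are all assumptions made for illustration. The idea shown is that the buffered frames have outputs stored from earlier chunks and fresh outputs from the current forward pass, and the speaker ordering of the fresh outputs is chosen so that they correlate best with the stored ones.

```python
from itertools import permutations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def column(frames, speaker):
    """Extract one speaker's activity over time from a frames-by-speakers list."""
    return [frame[speaker] for frame in frames]


def best_permutation(stored_buffer_out, new_buffer_out):
    """Hypothetical helper: find perm so that new column perm[s] matches stored column s.

    Both arguments are lists of frames for the SAME buffered inputs; each frame is a
    list of per-speaker activity probabilities.
    """
    n_speakers = len(stored_buffer_out[0])

    def score(perm):
        return sum(
            pearson(column(stored_buffer_out, s), column(new_buffer_out, perm[s]))
            for s in range(n_speakers)
        )

    # Exhaustive search is an assumption; it is only practical for few speakers.
    return max(permutations(range(n_speakers)), key=score)


def align_chunk(stored_buffer_out, new_buffer_out, new_chunk_out):
    """Reorder the current chunk's speaker columns to follow the stored ordering."""
    perm = best_permutation(stored_buffer_out, new_buffer_out)
    return [[frame[p] for p in perm] for frame in new_chunk_out]
```

In this sketch, if the network swaps the two speakers in the current pass, the buffered frames expose the swap and the chunk outputs are reordered before being emitted. How frames are selected for the buffer is not covered here.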
