Paper Title
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings
Paper Authors
论文摘要
Paper Abstract
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what" with low latency even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities, we propose an encoder-decoder based speaker embedding extractor that can estimate a speaker representation for each recognized token not only from non-overlapping speech but also from overlapping speech. The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription with low latency. We evaluate the proposed model for a joint task of ASR and SID/SD by using LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
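To make the token-synchronous idea concrete, below is a minimal illustrative sketch (not the authors' implementation, which uses an encoder-decoder speaker embedding extractor): each time the streaming ASR decoder emits a token, a speaker head produces a per-token embedding (a "t-vector") that can be matched against enrolled speaker profiles for SID. All module names, dimensions, and the toy decoding loop are assumptions for illustration only.

```python
# Illustrative sketch: per-token speaker embeddings emitted in sync with a
# streaming multi-talker ASR decoder. Names/dimensions are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenLevelSpeakerHead(nn.Module):
    """Maps the decoder state at each emitted token to a speaker embedding."""
    def __init__(self, dec_dim: int = 256, spk_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dec_dim, dec_dim), nn.ReLU(), nn.Linear(dec_dim, spk_dim)
        )

    def forward(self, dec_state: torch.Tensor) -> torch.Tensor:
        # L2-normalize so embeddings can be compared by cosine similarity.
        return F.normalize(self.proj(dec_state), dim=-1)

dec_dim, spk_dim, vocab = 256, 128, 1000
spk_head = TokenLevelSpeakerHead(dec_dim, spk_dim)
# Four enrolled speaker profiles (random stand-ins for real d-vector profiles).
profiles = F.normalize(torch.randn(4, spk_dim), dim=-1)

# Toy streaming loop: for every emitted token we also get a t-vector, so
# speaker identification runs jointly with transcription at low latency.
for step in range(5):
    dec_state = torch.randn(dec_dim)              # stand-in for the ASR decoder state
    token_id = torch.randint(vocab, (1,)).item()  # stand-in for the emitted token
    t_vector = spk_head(dec_state)
    speaker = torch.argmax(profiles @ t_vector).item()  # SID via cosine similarity
    print(f"step {step}: token {token_id} -> speaker {speaker}")
```

For speaker diarization rather than identification, the same per-token embeddings could instead be clustered across the utterance, without enrolled profiles.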