Paper Title
Linguistic-Acoustic Similarity Based Accent Shift for Accent Recognition
Paper Authors
Paper Abstract
General accent recognition (AR) models tend to extract low-level information directly from spectrograms, which often overfits significantly to speakers or channels. Considering that an accent can be regarded as a series of shifts relative to native pronunciation, distinguishing accents becomes an easier task when the accent shift is taken as input. However, estimating the accent shift is difficult due to the lack of native utterances to serve as anchors. In this paper, we propose linguistic-acoustic similarity based accent shift (LASAS) for AR tasks. For an accented speech utterance, after mapping the corresponding text vector to multiple accent-associated spaces as anchors, its accent shift can be estimated from the similarities between the acoustic embedding and those anchors. We then concatenate the accent shift with a dimension-reduced text vector to obtain a linguistic-acoustic bimodal representation. Compared with a pure acoustic embedding, the bimodal representation is richer and clearer because it takes full advantage of both linguistic and acoustic information, which effectively improves AR performance. Experiments on the Accented English Speech Recognition Challenge (AESRC) dataset show that our method achieves 77.42% accuracy on the test set, a 6.94% relative improvement over a competitive system in the challenge.
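The abstract describes a concrete computation: project the text vector into several accent-associated spaces to form anchors, measure the similarity between the acoustic embedding and each anchor to estimate the accent shift, then concatenate that shift with a reduced text vector. Below is a minimal sketch of this idea, assuming frame-level acoustic embeddings and time-aligned text vectors are available; the module names, dimensions, and the scaled-dot-product similarity are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of the LASAS block described in the abstract.
# Assumes aligned linguistic (text) vectors and acoustic embeddings;
# dimensions and the similarity measure are assumptions for illustration.
import torch
import torch.nn as nn


class LASASBlock(nn.Module):
    def __init__(self, text_dim=512, acoustic_dim=512, anchor_dim=256,
                 num_anchors=8, reduced_text_dim=64):
        super().__init__()
        # One linear map per accent-associated space: text vector -> anchor.
        self.anchor_maps = nn.ModuleList(
            nn.Linear(text_dim, anchor_dim) for _ in range(num_anchors)
        )
        # Project the acoustic embedding into the same space as the anchors.
        self.acoustic_proj = nn.Linear(acoustic_dim, anchor_dim)
        # Dimension reduction for the text vector before concatenation.
        self.text_reduce = nn.Linear(text_dim, reduced_text_dim)

    def forward(self, text_vec, acoustic_emb):
        # text_vec:     (batch, time, text_dim)     aligned linguistic vectors
        # acoustic_emb: (batch, time, acoustic_dim) frame-level acoustic features
        acoustic = self.acoustic_proj(acoustic_emb)  # (B, T, anchor_dim)
        scale = acoustic.shape[-1] ** 0.5
        # Similarity between the acoustic embedding and each anchor is taken
        # as the accent shift along that accent-associated direction.
        shifts = [
            torch.sum(acoustic * m(text_vec), dim=-1, keepdim=True) / scale
            for m in self.anchor_maps
        ]
        accent_shift = torch.cat(shifts, dim=-1)     # (B, T, num_anchors)
        # Concatenate the shift with the reduced text vector to form the
        # linguistic-acoustic bimodal representation.
        bimodal = torch.cat([accent_shift, self.text_reduce(text_vec)], dim=-1)
        return bimodal  # (B, T, num_anchors + reduced_text_dim)
```

In this reading, the bimodal output would feed a downstream accent classifier in place of the pure acoustic embedding; the per-anchor similarities act as the "shift" features that are, by construction, expressed relative to text-derived anchors rather than raw spectral content.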