Replacing Human Audio with Synthetic Audio for On-device Unspoken Punctuation Prediction

Paper title

Replacing Human Audio with Synthetic Audio for On-device Unspoken Punctuation Prediction

Authors

Daria Soboleva, Ondrej Skopek, Márius Šajgalík, Victor Cărbune, Felix Weissenberger, Julia Proskurnia, Bogdan Prisacari, Daniel Valcarce, Justin Lu, Rohit Prabhavalkar, Balint Miklos

Abstract

We present a novel multi-modal unspoken punctuation prediction system for the English language which combines acoustic and text features. We demonstrate for the first time that, by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings on the unspoken punctuation prediction problem. Our model architecture is well suited for on-device use. This is achieved by leveraging hash-based embeddings of automatic speech recognition text output in conjunction with acoustic features as input to a quasi-recurrent neural network, keeping the model size small and latency low.
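
To illustrate why hash-based embeddings keep the model small, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the hashing scheme, so everything here is an assumption made for illustration: the character trigram hashing, the MD5-based bucket function, the table size `NUM_BUCKETS`, the dimension `EMBED_DIM`, and the idea that acoustic features arrive as one vector per token. The point is only that a fixed-size table replaces a vocabulary-sized one, and that each token's text embedding is concatenated with its acoustic features before reaching the sequence model.

```python
import hashlib
import random

NUM_BUCKETS = 1024  # hypothetical table size, fixed regardless of vocabulary
EMBED_DIM = 8       # hypothetical embedding dimension


def ngram_buckets(token, n=3, num_buckets=NUM_BUCKETS):
    """Map a token's character n-grams to bucket indices using a stable hash."""
    padded = f"<{token.lower()}>"
    grams = [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]
    return [
        int.from_bytes(hashlib.md5(g.encode("utf-8")).digest()[:4], "little")
        % num_buckets
        for g in grams
    ]


def make_table(num_buckets=NUM_BUCKETS, dim=EMBED_DIM, seed=0):
    """Create a randomly initialised embedding table (it would be trained in practice)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_buckets)]


def embed_token(token, table):
    """Average the table rows selected by the token's hashed n-grams."""
    idxs = ngram_buckets(token, num_buckets=len(table))
    dim = len(table[0])
    return [sum(table[i][d] for i in idxs) / len(idxs) for d in range(dim)]


def model_input(tokens, acoustic, table):
    """Concatenate each token's hashed text embedding with its acoustic feature vector."""
    return [embed_token(t, table) + list(a) for t, a in zip(tokens, acoustic)]
```

In this sketch, the memory cost is `NUM_BUCKETS * EMBED_DIM` parameters no matter how many distinct words the speech recognizer emits, at the price of occasional hash collisions between unrelated n-grams.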
