Paper Title

Korean Tokenization for Beam Search Rescoring in Speech Recognition

Authors

Kyuhong Shim, Hyewon Bae, Wonyong Sung

Abstract

The performance of automatic speech recognition (ASR) models can be greatly improved by proper beam-search decoding with an external language model (LM). Interest in Korean speech recognition has been increasing, but few studies have focused on the decoding procedure. In this paper, we propose a Korean tokenization method for the neural network-based LM used in Korean ASR. Although the common approach is to use the same tokenization method for the external LM as for the ASR model, we show that this may not be the best choice for Korean. We propose a new tokenization method that inserts a special token, SkipTC, when a Korean syllable has no trailing consonant. With the proposed SkipTC token, the input sequence for the LM becomes very regularly patterned, so the LM can better learn the linguistic characteristics. Our experiments show that the proposed approach achieves a lower word error rate than the same LM without SkipTC. In addition, we are the first to report ASR performance on the recently introduced large-scale 7,600h Korean speech dataset.
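
The SkipTC idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a jamo-level tokenization in which each precomposed Hangul syllable is split into an initial consonant, a vowel, and a trailing consonant using the standard Unicode decomposition arithmetic; the abstract does not state the token inventory. The token spelling `<SkipTC>` and the function name `tokenize_with_skiptc` are hypothetical.

```python
# Hangul syllable block constants from the Unicode standard.
S_BASE = 0xAC00   # first precomposed syllable
S_LAST = 0xD7A3   # last precomposed syllable
N_MEDIAL = 21     # number of vowels
N_FINAL = 28      # 27 trailing consonants plus "no trailing consonant" at index 0

# Conjoining jamo code points for each position in a syllable.
INITIALS = [chr(0x1100 + i) for i in range(19)]
MEDIALS = [chr(0x1161 + i) for i in range(N_MEDIAL)]
FINALS = [None] + [chr(0x11A8 + i) for i in range(N_FINAL - 1)]

SKIP_TC = "<SkipTC>"  # hypothetical spelling of the special token


def tokenize_with_skiptc(text):
    """Split text into jamo tokens, emitting SKIP_TC for a missing trailing consonant.

    Assumption: every Hangul syllable becomes exactly three tokens.
    Characters outside the Hangul syllable block are passed through unchanged.
    """
    tokens = []
    for ch in text:
        code = ord(ch)
        if S_BASE <= code <= S_LAST:
            index = code - S_BASE
            initial, rest = divmod(index, N_MEDIAL * N_FINAL)
            medial, final = divmod(rest, N_FINAL)
            tokens.append(INITIALS[initial])
            tokens.append(MEDIALS[medial])
            # final == 0 means the syllable has no trailing consonant.
            tokens.append(FINALS[final] if final else SKIP_TC)
        else:
            tokens.append(ch)
    return tokens
```

Under these assumptions, a syllable without a trailing consonant such as "가" yields an initial, a vowel, and `<SkipTC>`, while "한" yields three jamo. Every syllable therefore occupies a fixed three-token slot, which is one way to read the "very regularly patterned" input sequence described in the abstract.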
