Paper title
Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need
Paper authors
Paper abstract
The research community has long studied computer-assisted pronunciation training (CAPT) methods for non-native speech. Researchers have focused on various model architectures, such as Bayesian networks and deep learning methods, and on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods cannot detect pronunciation errors with high accuracy (only 60% precision at 40%-80% recall). One of the key problems is the low availability of mispronounced speech, which is needed to reliably train pronunciation error detection models. If we had a generative model that could mimic non-native speech and produce any amount of training data, the task of detecting pronunciation errors would be much easier. We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion to generate correctly pronounced and mispronounced synthetic speech. We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state of the art in the field. Earlier studies have used simple speech generation techniques such as P2P conversion, but only as an additional mechanism to improve the accuracy of pronunciation error detection. We, on the other hand, treat speech generation as a first-class method for detecting pronunciation errors. The effectiveness of these techniques is assessed on the tasks of detecting pronunciation errors and lexical stress errors. Non-native English speech corpora of German, Italian, and Polish speakers are used in the evaluations. The best proposed S2S technique improves the accuracy of detecting pronunciation errors, measured by AUC, by 41% (from 0.528 to 0.749) compared to the state-of-the-art approach.
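To make the idea of phoneme-level data generation more concrete, the following is a minimal sketch of how a canonical phoneme sequence could be randomly perturbed to simulate mispronunciations and to obtain per-phoneme error labels. It is not the authors' implementation: the phoneme inventory `PHONEMES`, the function `perturb_phonemes`, the `error_rate` parameter, and the three edit operations are all assumptions made for illustration.

```python
import random

# Hypothetical phoneme inventory (ARPAbet-style symbols), for illustration only.
PHONEMES = ["AA", "AE", "AH", "IY", "IH", "UW", "B", "D", "K", "S", "T", "TH", "Z"]


def perturb_phonemes(canonical, error_rate=0.3, seed=None):
    """Return (perturbed, labels) for a canonical phoneme sequence.

    Each canonical phoneme is, with probability `error_rate`, replaced by a
    different phoneme, deleted, or followed by an inserted phoneme.
    labels[i] is 1 if canonical[i] was affected by an error and 0 otherwise.
    """
    rng = random.Random(seed)
    perturbed, labels = [], []
    for phoneme in canonical:
        if rng.random() >= error_rate:
            # Keep the phoneme unchanged and mark it as correctly pronounced.
            perturbed.append(phoneme)
            labels.append(0)
            continue
        labels.append(1)
        operation = rng.choice(["substitute", "delete", "insert"])
        if operation == "substitute":
            perturbed.append(rng.choice([p for p in PHONEMES if p != phoneme]))
        elif operation == "insert":
            perturbed.append(phoneme)
            perturbed.append(rng.choice(PHONEMES))
        # "delete": nothing is appended for this phoneme.
    return perturbed, labels


if __name__ == "__main__":
    canonical = ["TH", "IH", "S"]
    mispronounced, labels = perturb_phonemes(canonical, error_rate=0.5, seed=0)
    print(canonical, mispronounced, labels)
```

In a setup like this, the pairs of canonical and perturbed sequences, together with the labels, could serve as training examples for an error detection model. A realistic system would presumably draw substitutions from confusions typical of a learner's native language rather than uniformly at random.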