Paper Title

Code Switching Language Model Using Monolingual Training Data

Authors

Asad Ullah, Tauseef Ahmed

Abstract

Training a code-switching (CS) language model using only monolingual data is still an ongoing research problem. In this paper, a CS language model is trained using only monolingual training data. Since recurrent neural network (RNN) models are well suited to predicting sequential data, in this work an RNN language model is trained using alternating batches drawn only from monolingual English and Spanish data, and the perplexity of the language model is computed. The results show that training on alternating batches of monolingual data reduces the perplexity of a CS language model. The results were consistently improved by applying a mean square error (MSE) loss to the output embeddings of the RNN-based language model. By combining both methods, perplexity is reduced from 299.63 to 80.38. The proposed methods are comparable to a language model fine-tuned with code-switched training data.
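The alternating-batch scheme described in the abstract can be sketched as follows. This is a minimal illustration of the batch scheduling only, under assumptions not taken from the paper: the RNN model, the MSE term on the output embeddings, and all data shown here are hypothetical placeholders.

```python
def alternating_batches(english_batches, spanish_batches):
    # Interleave the two monolingual streams: one English-only batch,
    # then one Spanish-only batch, so each training batch stays
    # monolingual even though the model sees both languages overall.
    for en_batch, es_batch in zip(english_batches, spanish_batches):
        yield en_batch
        yield es_batch

# Toy monolingual "batches" (hypothetical data, not from the paper):
en = [["the", "cat", "sat"], ["a", "dog", "ran"]]
es = [["el", "gato", "duerme"], ["un", "perro", "corre"]]

# The language model would be trained on these batches in order:
# en[0], es[0], en[1], es[1]
schedule = list(alternating_batches(en, es))
```

In a full training loop each yielded batch would be fed to the RNN language model in turn; only the scheduling of monolingual batches is shown here.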
