Paper Title
Improving Speech Representation Learning via Speech-level and Phoneme-level Masking Approach
Paper Authors
Paper Abstract
Recovering masked speech frames is a widely used pre-training objective in speech representation learning. However, most of these models mask frames at random during pre-training. In this work, we propose two masking approaches: (1) speech-level masking, which makes the model mask more speech segments than silence segments, and (2) phoneme-level masking, which forces the model to mask all frames of a phoneme rather than fragments of it. We pre-trained models with these two approaches and evaluated them on two downstream tasks: phoneme classification and speaker recognition. The experiments demonstrate that the proposed masking approaches improve the performance of the learned speech representations.
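The abstract does not say how masked frames are chosen, so the following is only a minimal sketch of one possible reading. It assumes that frame-level speech/silence labels (e.g. from a voice activity detector) are available for speech-level masking, and that phoneme boundaries from a forced alignment are available for phoneme-level masking. The function names, the `speech_weight` parameter, and the use of Efraimidis-Spirakis weighted sampling without replacement are hypothetical illustrative choices, not the authors' method.

```python
import random
from typing import List, Sequence, Tuple


def speech_level_mask(is_speech: Sequence[bool], mask_ratio: float,
                      speech_weight: float, seed: int = 0) -> List[bool]:
    """Hypothetical sketch: pick frames to mask, favoring speech over silence.

    Each speech frame gets sampling weight `speech_weight` (assumed > 0) and each
    silence frame weight 1.0. Exactly round(mask_ratio * n_frames) frames are
    chosen by weighted sampling without replacement (Efraimidis-Spirakis keys).
    """
    if speech_weight <= 0:
        raise ValueError("speech_weight must be positive")
    rng = random.Random(seed)
    n_frames = len(is_speech)
    n_mask = min(n_frames, int(round(mask_ratio * n_frames)))
    weights = [speech_weight if s else 1.0 for s in is_speech]
    # Key u ** (1 / w) with u ~ U[0, 1); the largest keys form a weighted sample.
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    mask = [False] * n_frames
    for _, i in keys[:n_mask]:
        mask[i] = True
    return mask


def phoneme_level_mask(segments: Sequence[Tuple[int, int]], n_frames: int,
                       mask_ratio: float, seed: int = 0) -> List[bool]:
    """Hypothetical sketch: mask whole phonemes, never partial ones.

    `segments` are assumed non-overlapping (start, end) frame spans, end
    exclusive. Phonemes are masked in random order until at least
    round(mask_ratio * n_frames) frames are covered; the last phoneme may
    overshoot the budget because it is always masked in full.
    """
    rng = random.Random(seed)
    budget = int(round(mask_ratio * n_frames))
    order = list(range(len(segments)))
    rng.shuffle(order)
    mask = [False] * n_frames
    masked = 0
    for idx in order:
        if masked >= budget:
            break
        start, end = segments[idx]
        for t in range(start, end):
            mask[t] = True
        masked += end - start
    return mask


if __name__ == "__main__":
    labels = [True] * 50 + [False] * 50  # first half speech, second half silence
    m = speech_level_mask(labels, mask_ratio=0.2, speech_weight=10.0)
    print("masked speech frames:", sum(m[:50]), "masked silence frames:", sum(m[50:]))
    p = phoneme_level_mask([(0, 5), (5, 12), (12, 20)], n_frames=20, mask_ratio=0.3)
    print("".join("#" if x else "." for x in p))
```

Masking whole phonemes in the second function means the masked-frame count is only approximately `mask_ratio`; a fixed-budget variant would have to either drop or truncate the last phoneme, and truncating would reintroduce the phoneme fragments the approach is meant to avoid.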