Phoneme Segmentation Using Self-Supervised Speech Models

Paper title

Phoneme Segmentation Using Self-Supervised Speech Models

Authors

Luke Strgar, David Harwath

Abstract

We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task. Our model extends transformer-style encoders with strategically placed convolutions that manipulate features learned in pre-training. Using the TIMIT and Buckeye corpora we train and test the model in the supervised and unsupervised settings. The latter case is accomplished by furnishing a noisy label-set with the predictions of a separate model, it having been trained in an unsupervised fashion. Results indicate our model eclipses previous state-of-the-art performance in both settings and on both datasets. Finally, following observations during published code review and attempts to reproduce past segmentation results, we find a need to disambiguate the definition and implementation of widely-used evaluation metrics. We resolve this ambiguity by delineating two distinct evaluation schemes and describing their nuances.
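
The abstract does not name the evaluation metrics or spell out the two schemes, so the following is only an illustrative sketch. It assumes tolerance-based boundary precision, recall, and F1, which are common in segmentation work, and shows how two matching rules can score the same predictions differently. The function names, the millisecond units, and the "strict" and "lenient" labels are hypothetical and are not necessarily the schemes the paper delineates.

```python
from typing import List, Tuple


def _prf(hyp_hits: int, ref_hits: int, n_hyp: int, n_ref: int) -> Tuple[float, float, float]:
    """Turn hit counts into (precision, recall, F1), guarding against empty inputs."""
    precision = hyp_hits / n_hyp if n_hyp else 0.0
    recall = ref_hits / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def score_strict(ref: List[int], hyp: List[int], tol: int) -> Tuple[float, float, float]:
    """One-to-one matching: each reference boundary validates at most one prediction."""
    ref, hyp = sorted(ref), sorted(hyp)
    hits = i = j = 0
    while i < len(ref) and j < len(hyp):
        if abs(ref[i] - hyp[j]) <= tol:
            hits += 1          # pair them up and consume both boundaries
            i += 1
            j += 1
        elif hyp[j] < ref[i]:
            j += 1             # prediction is too early to match anything later
        else:
            i += 1             # reference is too early to be matched
    return _prf(hits, hits, len(hyp), len(ref))


def score_lenient(ref: List[int], hyp: List[int], tol: int) -> Tuple[float, float, float]:
    """Many-to-one matching: a boundary counts if any boundary on the other side is within tol."""
    hyp_hits = sum(1 for h in hyp if any(abs(h - r) <= tol for r in ref))
    ref_hits = sum(1 for r in ref if any(abs(h - r) <= tol for h in hyp))
    return _prf(hyp_hits, ref_hits, len(hyp), len(ref))


# Boundary times in milliseconds (made-up values for illustration only).
reference = [100, 300]
predicted = [90, 110, 300]   # two predictions crowd the first reference boundary

strict = score_strict(reference, predicted, tol=20)
lenient = score_lenient(reference, predicted, tol=20)
print("strict  P/R/F1:", strict)
print("lenient P/R/F1:", lenient)
```

In this toy case the lenient rule credits both predictions near 100 ms, while the strict rule credits only one of them, so the same output receives a lower precision and F1 under strict matching. Differences of this kind are one reason reported segmentation scores can be hard to compare unless the matching rule is stated.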
