Paper title
Unsupervised feature learning for speech using correspondence and Siamese networks
Paper authors
Paper abstract
In zero-resource settings where transcribed speech audio is unavailable, unsupervised feature learning is essential for downstream speech processing tasks. Here we compare two recent methods for frame-level acoustic feature learning. For both methods, unsupervised term discovery is used to find pairs of word examples of the same unknown type. Dynamic programming is then used to align the feature frames between each word pair, serving as weak top-down supervision for the two models. For the correspondence autoencoder (CAE), matching frames are presented as input-output pairs. The Triamese network uses a contrastive loss to reduce the distance between frames of the same predicted word type while increasing the distance between negative examples. For the first time, these feature extractors are compared on the same discrimination tasks using the same weak supervision pairs. We find that, on the two datasets considered here, the CAE outperforms the Triamese network. However, we show that a new hybrid correspondence-Triamese approach (CTriamese) consistently outperforms both the CAE and Triamese models in terms of average precision and ABX error rates on both English and Xitsonga evaluation data.
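The two ingredients named in the abstract, frame alignment by dynamic programming and a contrastive loss over aligned frames, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`dtw_align`, `triplet_hinge_loss`), the Euclidean frame distance, and the margin value are all assumptions chosen for illustration, since the abstract does not specify the distance measure or the exact form of the loss.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(seq_a, seq_b):
    """Align two frame sequences with dynamic time warping.

    Returns the (i, j) index pairs on a minimum-cost warping path.
    Assumption: Euclidean frame distance and the standard three-way
    step pattern (diagonal, up, left).
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return path


def triplet_hinge_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss on a frame triplet (illustrative form, assumed here).

    Zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, margin + gap)


# Toy one-dimensional "feature frames" for two examples of one word.
word_a = [[0.0], [1.0], [2.0]]
word_b = [[0.0], [0.0], [1.0], [2.0]]
path = dtw_align(word_a, word_b)
# Aligned frames, usable as input-output pairs or as positive pairs.
frame_pairs = [(word_a[i], word_b[j]) for i, j in path]
```

In this sketch the aligned `frame_pairs` play both roles described in the abstract: a correspondence autoencoder would take the first frame of each pair as input and the second as the reconstruction target, while a contrastive model would treat each pair as anchor and positive and sample the negative frame from a different discovered word.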