Paper Title


Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations

Authors

Wei Xia, John H. L. Hansen

Abstract


In this study, we propose global context guided channel and time-frequency transformations to model the long-range, non-local time-frequency dependencies and channel variances in speaker representations. We use global context information to enhance important channels and recalibrate salient time-frequency locations by computing the similarity between the global context and local features. The proposed modules, together with a popular ResNet-based model, are evaluated on the VoxCeleb1 dataset, a large-scale speaker verification corpus collected in the wild. This lightweight block can be easily incorporated into a CNN model with little additional computational cost, and it improves speaker verification performance over both the baseline ResNet-LDE model and the Squeeze-and-Excitation block by a large margin. Detailed ablation studies are also performed to analyze the factors that may impact the performance of the proposed modules. We find that by employing the proposed L2-tf-GTFC transformation block, the Equal Error Rate decreases from 4.56% to 3.07%, a relative 32.68% reduction, with a relative 27.28% improvement in the DCF score. The results indicate that our proposed global context guided transformation modules can efficiently improve the learned speaker representations by achieving time-frequency and channel-wise feature recalibration.
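The recalibration idea described above (pool a global context vector, score each local time-frequency position by its similarity to that context, then gate the features) can be sketched in a few lines of NumPy. This is a minimal illustration under assumed details, not the authors' exact L2-tf-GTFC block: the pooling, similarity, and gating choices here (mean pooling, dot-product similarity, sigmoid gate) are simplifying assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_context_recalibration(x):
    """Hypothetical sketch of global-context-guided time-frequency gating.

    x: feature map of shape (channels, freq, time).
    1. Global context = mean over the whole time-frequency plane (one vector per channel).
    2. Similarity = dot product between the context vector and the channel
       vector at each (freq, time) position.
    3. A sigmoid of the similarity gates (recalibrates) each position.
    """
    c, f, t = x.shape
    context = x.mean(axis=(1, 2))                  # (c,) global context vector
    sim = np.einsum('c,cft->ft', context, x) / c   # (f, t) similarity map
    gate = sigmoid(sim)                            # values in (0, 1)
    return x * gate[None, :, :]                    # gated feature map, same shape

# toy usage: 8 channels, 40 mel bins, 100 frames
x = np.random.randn(8, 40, 100)
y = global_context_recalibration(x)
print(y.shape)  # (8, 40, 100)
```

A channel-wise variant would instead pool over (freq, time) per channel and gate whole channels, analogous to Squeeze-and-Excitation; the abstract's modules combine both channel and time-frequency recalibration.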
