Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances

Paper Title

Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances

Authors

Aleksei Gusev, Vladimir Volokhov, Tseren Andzhukaev, Sergey Novoselov, Galina Lavrentyeva, Marina Volkova, Alice Gazizullina, Andrey Shulipa, Artem Gorlanov, Anastasia Avdeeva, Artem Ivanov, Alexander Kozlov, Timur Pekhovsky, Yuri Matveev

Abstract

Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions, according to the results obtained on early NIST SRE (Speaker Recognition Evaluation) datasets. From a practical point of view, given the increased interest in virtual assistants (such as Amazon Alexa, Google Home, Apple Siri, etc.), speaker verification on short utterances in uncontrolled, noisy environments is one of the most challenging and most in-demand tasks. This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the degradation of system quality on short utterances. For these purposes, we considered deep neural network architectures based on TDNN (Time Delay Neural Network) and ResNet (Residual Neural Network) blocks. We experimented with state-of-the-art embedding extractors and their training procedures. The results confirm that ResNet architectures outperform the standard x-vector approach in speaker verification quality for both long-duration and short-duration utterances. We also investigate the impact of the speech activity detector, different scoring models, and adaptation and score normalization techniques. Experimental results are presented for publicly available data and verification protocols for the VoxCeleb1, VoxCeleb2, and VOiCES datasets.
