Universal speaker recognition encoders for different speech segments duration

Paper title

Universal speaker recognition encoders for different speech segments duration

Authors

Novoselov, Sergey; Volokhov, Vladimir; Lavrentyeva, Galina

Abstract

Creating universal speaker encoders that are robust to different acoustic and speech duration conditions is a big challenge today. According to our observations, systems trained on short speech segments are optimal for short-phrase speaker verification, and systems trained on long segments are superior for long-segment verification. A system trained simultaneously on pooled short and long speech segments does not give optimal verification results and usually degrades for both short and long segments. This paper addresses the problem of creating universal speaker encoders for speech segments of different durations. We describe a simple recipe for training a universal speaker encoder for any type of selected neural network architecture. According to our evaluation of wav2vec-TDNN based systems on the NIST SRE and VoxCeleb1 benchmarks, the proposed universal encoder improves speaker verification when enrollment and test speech segments differ in duration. The key feature of the proposed encoder is that it has the same inference time as the selected neural network architecture.
