Paper Title
Learning from human perception to improve automatic speaker verification in style-mismatched conditions
Paper Authors
Paper Abstract
Our prior experiments show that humans and machines seem to employ different approaches to speaker discrimination, especially in the presence of speaking-style variability. The experiments examined read versus conversational speech. Listeners focused on speaker-specific idiosyncrasies while "telling speakers together", and on relative distances in a shared acoustic space when "telling speakers apart". However, automatic speaker verification (ASV) systems use the same loss function irrespective of target or non-target trials. To improve ASV performance in the presence of style variability, insights learnt from human perception are used to design a new training loss function that we refer to as "CllrCE loss". CllrCE loss uses both speaker-specific idiosyncrasies and relative acoustic distances between speakers to train the ASV system. On the UCLA speaker variability database, in the x-vector and conditioning setups, CllrCE loss yields significant relative improvements over the x-vector baseline of 1-66% in EER, and of 1-31% and 1-56% in minDCF, respectively. On the SITW evaluation tasks, which involve different conversational speech scenarios, the proposed loss combined with self-attention conditioning yields significant relative improvements over the baseline of 2-5% in EER and 6-12% in minDCF. In the SITW case, performance improvements were consistent only with conditioning.
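The abstract does not spell out the form of the CllrCE loss, but the name suggests a combination of the calibration-oriented Cllr cost, computed over target and non-target trial scores (capturing relative distances between speakers), with the usual speaker-classification cross-entropy (capturing speaker-specific idiosyncrasies). Below is a minimal PyTorch sketch under that assumption; the function names, the weighting term `alpha`, and the way in-batch trial scores are formed are all hypothetical, not the paper's definitive formulation:

```python
import torch
import torch.nn.functional as F

LOG2 = torch.log(torch.tensor(2.0))

def cllr(target_scores: torch.Tensor, nontarget_scores: torch.Tensor) -> torch.Tensor:
    """Cllr: the cost of the log-likelihood-ratio over trial scores.

    Cllr = 0.5 * [ mean log2(1 + e^{-s_tar}) + mean log2(1 + e^{s_non}) ].
    softplus(x) = log(1 + e^x), so dividing by log(2) converts to base 2.
    """
    c_tar = F.softplus(-target_scores).mean() / LOG2
    c_non = F.softplus(nontarget_scores).mean() / LOG2
    return 0.5 * (c_tar + c_non)

def cllrce_loss(logits, labels, embeddings, alpha=1.0):
    """Hypothetical CllrCE: speaker-ID cross-entropy plus Cllr on in-batch trials."""
    # Cross-entropy over speaker labels targets speaker-specific idiosyncrasies.
    ce = F.cross_entropy(logits, labels)

    # Cosine scores for all in-batch pairs serve as target/non-target trial
    # scores, modeling relative distances in a shared acoustic space.
    # Assumes each batch contains multiple utterances per speaker, so that
    # at least one target (same-speaker) pair exists.
    emb = F.normalize(embeddings, dim=1)
    scores = emb @ emb.t()
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    off_diag = ~torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    tar = scores[same & off_diag]
    non = scores[~same]
    return ce + alpha * cllr(tar, non)
```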