A Two-Stage Deep Representation Learning-Based Speech Enhancement Method Using Variational Autoencoder and Adversarial Training

Paper Title

A Two-Stage Deep Representation Learning-Based Speech Enhancement Method Using Variational Autoencoder and Adversarial Training

Authors

Yang Xiang, Jesper Lisby Højvang, Morten Højfeldt Rasmussen, Mads Græsbøll Christensen

Abstract

This paper focuses on leveraging deep representation learning (DRL) for speech enhancement (SE). In general, the performance of a deep neural network (DNN) depends heavily on how well it learns data representations. However, the importance of DRL is often ignored in DNN-based SE algorithms. To obtain higher-quality enhanced speech, we propose a two-stage DRL-based SE method that uses adversarial training. In the first stage, we disentangle different latent variables, because disentangled representations can help a DNN generate better enhanced speech. Specifically, we use the $\beta$-variational autoencoder (VAE) algorithm to obtain speech and noise posterior estimates and the related representations from the observed signal. However, since the posteriors and representations are intractable and can only be estimated under a conditional assumption, it is difficult to ensure that these estimates are always accurate, which may degrade the final accuracy of the signal estimation. To further improve the quality of the enhanced speech, in the second stage we introduce adversarial training to reduce the effect of an inaccurate posterior on signal reconstruction and to improve the signal estimation accuracy, making our algorithm more robust to potentially inaccurate posterior estimates. As a result, better SE performance can be achieved. The experimental results indicate that the proposed strategy can help similar DNN-based SE algorithms achieve higher short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and scale-invariant signal-to-distortion ratio (SI-SDR) scores. Moreover, the proposed algorithm can also outperform recent competitive SE algorithms.
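As a rough illustration of the $\beta$-VAE objective mentioned in the abstract, the sketch below computes a generic $\beta$-VAE loss for a single example: a reconstruction term plus a KL term weighted by $\beta$. It is not the paper's loss. The squared-error reconstruction term, the standard-normal prior, the default `beta=4.0`, and the function names `gaussian_kl` and `beta_vae_loss` are all assumptions made for illustration, and the abstract does not specify any of them.

```python
import math


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector.

    Assumes a diagonal Gaussian posterior and a standard-normal prior,
    which is the common VAE setup, not necessarily the paper's.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    """Generic beta-VAE loss: reconstruction error + beta * KL.

    The squared-error reconstruction term and the default beta are
    illustrative assumptions; beta > 1 puts more weight on the KL term,
    which is what encourages disentangled latent variables.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + beta * gaussian_kl(mu, log_var)
```

With `beta=1.0` this reduces to the usual VAE objective (up to the choice of reconstruction term); larger values trade reconstruction accuracy for a posterior that stays closer to the prior.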
