Paper Title
Denoising-and-Dereverberation Hierarchical Neural Vocoder for Robust Waveform Generation
Paper Authors
Paper Abstract
This paper presents a denoising and dereverberation hierarchical neural vocoder (DNR-HiNet) that converts noisy and reverberant acoustic features into a clean speech waveform. We implement it mainly by modifying the amplitude spectrum predictor (ASP) in the original HiNet vocoder. This modified denoising and dereverberation ASP (DNR-ASP) can predict clean log amplitude spectra (LAS) from degraded input acoustic features. To achieve this, the DNR-ASP first predicts the noisy and reverberant LAS, the noise LAS related to the noise information, and the room impulse response related to the reverberation information, and then performs initial denoising and dereverberation. The initially processed LAS are then enhanced by another neural network to produce the final clean LAS. To further improve the quality of the generated clean LAS, we also introduce a bandwidth extension model and a frequency resolution extension model in the DNR-ASP. The experimental results indicate that the DNR-HiNet vocoder was able to generate a denoised and dereverberated waveform given noisy and reverberant acoustic features and outperformed the original HiNet vocoder and a few other neural vocoders. We also applied the DNR-HiNet vocoder to speech enhancement tasks, and its performance was competitive with several advanced speech enhancement methods.
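The abstract does not say how the initial denoising and dereverberation step is computed from the three predicted quantities. The following is only a minimal sketch of one plausible reading, assuming power spectral subtraction for denoising and division by the magnitude response of the room impulse response (RIR) for dereverberation, applied to a single frame. The function names, the `floor` parameter, and the choice of natural-log amplitudes are hypothetical and are not taken from the paper. Dividing frame by frame is also a simplification that is only reasonable when the RIR is shorter than the analysis frame.

```python
import cmath
import math


def rir_magnitude(rir, n_fft):
    """Magnitude of the n_fft-point DFT of the RIR for bins 0..n_fft/2.

    Hypothetical helper: a direct DFT keeps the sketch dependency-free.
    """
    mags = []
    for k in range(n_fft // 2 + 1):
        acc = sum(
            h * cmath.exp(-2j * math.pi * k * n / n_fft)
            for n, h in enumerate(rir[:n_fft])
        )
        mags.append(abs(acc))
    return mags


def initial_denoise_dereverb(noisy_las, noise_las, rir, n_fft, floor=1e-5):
    """Return initially processed LAS for one frame (assumed procedure).

    noisy_las, noise_las: natural-log amplitudes, one value per frequency bin.
    rir: time-domain room impulse response samples.
    floor: small positive amplitude that avoids log(0) and division by zero.
    """
    h_mag = rir_magnitude(rir, n_fft)
    processed = []
    for y_log, n_log, h in zip(noisy_las, noise_las, h_mag):
        # Assumed denoising: subtract noise power from noisy power.
        y_pow = math.exp(2.0 * y_log)
        n_pow = math.exp(2.0 * n_log)
        s_amp = math.sqrt(max(y_pow - n_pow, floor ** 2))
        # Assumed dereverberation: undo the RIR magnitude response.
        s_amp = s_amp / max(h, floor)
        processed.append(math.log(max(s_amp, floor)))
    return processed


# Toy frame: noisy amplitude 5, noise amplitude 3, and a unit-impulse RIR,
# so each output bin is close to log(4).
frame = initial_denoise_dereverb(
    [math.log(5.0)] * 5, [math.log(3.0)] * 5, [1.0], n_fft=8
)
```

In the pipeline described above, the output of such a step would only be an intermediate estimate; the abstract states that a further neural network enhances it into the final clean LAS.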