Paper title
Frequency Gating: Improved Convolutional Neural Networks for Speech Enhancement in the Time-Frequency Domain
Authors
Abstract
One of the strengths of traditional convolutional neural networks (CNNs) is their inherent translational invariance. However, for the task of speech enhancement in the time-frequency domain, this property cannot be fully exploited due to a lack of invariance in the frequency direction. In this paper we propose to remedy this inefficiency by introducing a method, which we call Frequency Gating, to compute multiplicative weights for the kernels of the CNN in order to make them frequency dependent. Several mechanisms are explored: temporal gating, in which weights are dependent on prior time frames, local gating, whose weights are generated based on a single time frame and the ones adjacent to it, and frequency-wise gating, where each kernel is assigned a weight independent of the input data. Experiments with an autoencoder neural network with skip connections show that both local and frequency-wise gating outperform the baseline and are therefore viable ways to improve CNN-based speech enhancement neural networks. In addition, a loss function based on the extended short-time objective intelligibility score (ESTOI) is introduced, which we show to outperform the standard mean squared error (MSE) loss function.
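To make the idea of frequency-dependent kernels concrete, the following is a minimal NumPy sketch of the frequency-wise variant only. It is not the authors' implementation. It assumes one learnable gate per kernel and per frequency bin, squashed with a sigmoid, and applies the gate to the kernel's output feature map; because convolution is linear, this is equivalent to scaling the kernel itself at each frequency. The names `conv2d_same`, `frequency_wise_gated_conv`, and `gate_logits` are hypothetical, and the input is assumed to be a single-channel spectrogram of shape (frequency, time).

```python
import numpy as np


def conv2d_same(x, kernel):
    """Naive 2-D cross-correlation with zero padding.

    x has shape (freq, time); kernel has odd-sized shape (kf, kt).
    The output has the same shape as x.
    """
    kf, kt = kernel.shape
    pf, pt = kf // 2, kt // 2
    padded = np.pad(x, ((pf, pf), (pt, pt)))
    out = np.zeros(x.shape, dtype=float)
    for f in range(x.shape[0]):
        for t in range(x.shape[1]):
            out[f, t] = np.sum(padded[f:f + kf, t:t + kt] * kernel)
    return out


def frequency_wise_gated_conv(x, kernels, gate_logits):
    """Apply each kernel, then scale its output per frequency bin.

    x:           (freq, time) spectrogram
    kernels:     (n_kernels, kf, kt) shared convolution kernels
    gate_logits: (n_kernels, freq) input-independent parameters
                 (assumed layout; the sigmoid is also an assumption)
    Returns an array of shape (n_kernels, freq, time).
    """
    gates = 1.0 / (1.0 + np.exp(-gate_logits))
    feats = np.stack([conv2d_same(x, k) for k in kernels])
    # Broadcast each gate over the time axis so that the same kernel
    # contributes with a different strength at each frequency.
    return feats * gates[:, :, None]
```

Under the same assumptions, the temporal and local variants described in the abstract would differ only in where `gates` comes from: instead of being free parameters, they would be computed from prior time frames or from a frame and its neighbours.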