Paper Title
Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation
Paper Authors
Paper Abstract
Recently, frequency domain all-neural beamforming methods have achieved remarkable progress in multichannel speech separation. In parallel, the integration of time domain network structures with beamforming has also gained significant attention. This study proposes a novel all-neural beamforming method in the time domain and attempts to unify the all-neural beamforming pipelines for time domain and frequency domain multichannel speech separation. The proposed model consists of two modules: separation and beamforming. Both modules perform temporal-spectral-spatial modeling and are trained end-to-end with a joint loss function. The novelty of this study is twofold. First, a time domain directional feature conditioned on the direction of the target speaker is proposed; it can be jointly optimized within the time domain architecture to enhance target signal estimation. Second, an all-neural beamforming network in the time domain is designed to refine the pre-separated results. This module features parametric time-variant beamforming coefficient estimation, without explicitly following the derivation of optimal filters that may impose a performance upper bound. The proposed method is evaluated on simulated reverberant overlapped speech derived from the AISHELL-1 corpus. Experimental results demonstrate significant performance improvements over state-of-the-art frequency domain methods, ideal magnitude masks, and existing time domain neural beamforming methods.
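The abstract describes a time domain beamformer that applies network-predicted, time-variant coefficients rather than coefficients derived from a closed-form optimal filter. The minimal sketch below illustrates only the signal-level operation such a module would perform: per-sample filter-and-sum over a multichannel waveform. The function name, tensor shapes, and tap count are assumptions for illustration; in the paper the coefficients would come from the beamforming network, not be supplied by hand.

```python
import numpy as np

def time_variant_filter_and_sum(x, w):
    """Filter-and-sum beamforming with per-sample (time-variant) FIR taps.

    x : (C, T) multichannel waveform, C microphones, T samples
    w : (C, K, T) coefficient for channel c, tap k, output sample t
    Returns y : (T,) beamformed waveform, y[t] = sum_c sum_k w[c,k,t] * x[c, t-k].
    """
    C, T = x.shape
    _, K, _ = w.shape
    # Zero-pad the past so tap k can reach x[c, t - k] for t < k.
    x_pad = np.pad(x, ((0, 0), (K - 1, 0)))
    y = np.zeros(T)
    for k in range(K):
        # Tap k multiplies the signal delayed by k samples on every channel.
        y += np.sum(w[:, k, :] * x_pad[:, K - 1 - k : K - 1 - k + T], axis=0)
    return y
```

With a unit coefficient on a single channel at tap 0 (and zeros elsewhere), the output reduces to that channel's waveform, which is a quick sanity check that the delay indexing is correct. A fixed `w` over time recovers ordinary time-invariant filter-and-sum beamforming; letting `w` vary with `t` is what the abstract calls time-variant coefficient estimation.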