Utterance Weighted Multi-Dilation Temporal Convolutional Networks for Monaural Speech Dereverberation

Paper Title

Utterance Weighted Multi-Dilation Temporal Convolutional Networks for Monaural Speech Dereverberation

Authors

William Ravenscroft, Stefan Goetze, Thomas Hain

Abstract

Speech dereverberation is an important stage in many speech technology applications. Recent work in this area has been dominated by deep neural network models. Temporal convolutional networks (TCNs) are deep learning models that have been proposed for sequence modelling in the task of dereverberating speech. In this work, a weighted multi-dilation depthwise-separable convolution is proposed to replace standard depthwise-separable convolutions in TCN models. This proposed convolution enables the TCN to dynamically focus on more or less local information in its receptive field at each convolutional block in the network. It is shown that this weighted multi-dilation temporal convolutional network (WD-TCN) consistently outperforms the TCN across various model configurations, and that using the WD-TCN model is a more parameter-efficient way to improve model performance than increasing the number of convolutional blocks. The best performance improvement over the baseline TCN is 0.55 dB scale-invariant signal-to-distortion ratio (SISDR), and the best-performing WD-TCN model attains 12.26 dB SISDR on the WHAMR dataset.
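
The SISDR figures quoted in the abstract can be made more concrete with a small sketch of how such a metric is typically computed. The function below assumes the commonly used definition, in which the estimate is projected onto the reference to obtain a scaled target and the remainder is treated as distortion. The name `si_sdr` and the toy signals are illustrative and not taken from the paper, and some implementations also subtract the signal means first, which is omitted here.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch).

    Both arguments are equal-length sequences of real-valued samples.
    Assumption: no mean removal is applied before the projection.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0.0:
        raise ValueError("reference signal must not be all zeros")

    # Optimal scaling of the reference so that rescaling the estimate
    # does not change the score.
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    distortion = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)

    if distortion_energy == 0.0:
        return math.inf   # estimate is an exact scaled copy of the reference
    if target_energy == 0.0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10.0 * math.log10(target_energy / distortion_energy)


# Toy example: a clean reference and a slightly distorted estimate.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]
score = si_sdr(estimate, reference)
louder = si_sdr([2.0 * x for x in estimate], reference)
```

In this toy case `score` and `louder` agree up to floating-point rounding, which is the scale invariance the metric's name refers to. An improvement such as the 0.55 dB reported above would correspond to a difference between two such scores computed for different models on the same data.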

Paper Title