Utterance Clustering Using Stereo Audio Channels

Paper Title

Utterance Clustering Using Stereo Audio Channels

Authors

Yingjun Dong, Neil G. MacLaren, Yiding Cao, Francis J. Yammarino, Shelley D. Dionne, Michael D. Mumford, Shane Connelly, Hiroki Sayama, Gregory A. Ruark

Abstract

Utterance clustering is an actively researched topic in audio signal processing and machine learning. This study aims to improve the performance of utterance clustering by processing multichannel (stereo) audio signals. Processed audio signals were generated by combining the left- and right-channel audio signals in a few different ways, and embedded features (also called d-vectors) were then extracted from those processed signals. The study applied a Gaussian mixture model for supervised utterance clustering. In the training phase, a parameter-sharing Gaussian mixture model was used to train a model for each speaker. In the testing phase, the speaker whose model gave the maximum likelihood was selected as the detected speaker. Experiments with real audio recordings of multi-person discussion sessions showed that the proposed method, which uses multichannel audio signals, performed significantly better than a conventional method using mono audio signals under more complicated conditions.
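
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the specific channel combinations (sum and difference) are guesses, since the abstract only says "a few different ways"; d-vector extraction is skipped and replaced by toy two-dimensional feature vectors; and the parameter-sharing Gaussian mixture model is simplified to one Gaussian per speaker with a diagonal variance pooled across all speakers. All function names are hypothetical.

```python
import math
import random


def combine_channels(left, right):
    """Derive several mono signals from one stereo pair.

    The sum/difference combinations are assumptions for illustration;
    the abstract does not list the combinations actually used.
    """
    return {
        "left": list(left),
        "right": list(right),
        "sum": [l + r for l, r in zip(left, right)],
        "difference": [l - r for l, r in zip(left, right)],
    }


def fit_shared_variance_gaussians(features_by_speaker):
    """Fit one mean per speaker and one diagonal variance shared by all.

    Simplified stand-in for a parameter-sharing GMM: each speaker gets a
    single Gaussian, and the variance is pooled over every speaker.
    """
    means = {}
    for speaker, vectors in features_by_speaker.items():
        dim = len(vectors[0])
        means[speaker] = [
            sum(v[d] for v in vectors) / len(vectors) for d in range(dim)
        ]

    dim = len(next(iter(means.values())))
    total = sum(len(vectors) for vectors in features_by_speaker.values())
    variance = []
    for d in range(dim):
        squared_error = sum(
            (v[d] - means[speaker][d]) ** 2
            for speaker, vectors in features_by_speaker.items()
            for v in vectors
        )
        # Floor the variance so the log-likelihood stays finite.
        variance.append(max(squared_error / total, 1e-6))
    return means, variance


def log_likelihood(vector, mean, variance):
    """Log-density of a diagonal-covariance Gaussian at `vector`."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - m) ** 2 / var)
        for x, m, var in zip(vector, mean, variance)
    )


def predict_speaker(vector, means, variance):
    """Pick the speaker whose model gives the maximum likelihood."""
    return max(means, key=lambda s: log_likelihood(vector, means[s], variance))


# Toy training features standing in for d-vectors (hypothetical data).
rng = random.Random(0)
train = {
    "speaker_a": [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(20)],
    "speaker_b": [[rng.gauss(1.0, 0.1), rng.gauss(1.0, 0.1)] for _ in range(20)],
}
means, variance = fit_shared_variance_gaussians(train)
print(predict_speaker([0.05, -0.02], means, variance))
print(predict_speaker([0.95, 1.10], means, variance))
```

In a real system, each signal returned by combine_channels would be passed through a d-vector extractor, and the resulting embeddings would take the place of the toy vectors used here.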
