Paper Title

Audio-Visual Decision Fusion for WFST-based and seq2seq Models

Authors

Rohith Aralikatti, Sharad Roy, Abhinav Thanda, Dilip Kumar Margam, Pujitha Appan Kandala, Tanay Sharma, Shankar M. Venkatesan

Abstract

Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER). In such cases, information from the visual modality, comprising the speaker's lip movements, can help improve performance. In this work, we propose novel methods to fuse information from the audio and visual modalities at inference time. This enables us to train the acoustic and visual models independently. First, we train separate RNN-HMM based acoustic and visual models. A common WFST generated by taking a special union of the HMM components is used for decoding with a modified Viterbi algorithm. Second, we train separate seq2seq acoustic and visual models. The decoding step is performed simultaneously for both modalities using shallow fusion while maintaining a common hypothesis beam. We also present results for a novel seq2seq fusion without the weighting parameter. We present results at varying SNR and show that our methods give significant improvements over the acoustic-only WER.
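To make the seq2seq shallow-fusion step concrete, below is a minimal sketch of a beam search that keeps a single common hypothesis beam and scores each candidate token by interpolating the log-probabilities from the acoustic and visual decoders. This is an illustration under stated assumptions, not the paper's implementation: the function names (`fused_beam_search`, `audio_step`, `visual_step`), the greedy per-token interface, and the interpolation weight `lam` are all hypothetical choices made for the sketch.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a step function that, given a token prefix,
# returns log-probabilities over the vocabulary for the next token.
StepFn = Callable[[Sequence[int]], List[float]]

def fused_beam_search(audio_step: StepFn,
                      visual_step: StepFn,
                      vocab_size: int,
                      eos_id: int,
                      beam_size: int = 5,
                      max_len: int = 50,
                      lam: float = 0.6) -> List[int]:
    """Shallow-fusion beam search over two seq2seq decoders.

    A single hypothesis beam is maintained; at every step each hypothesis
    is extended and scored by a weighted sum of the audio and visual
    next-token log-probabilities (weight `lam` on audio, `1 - lam` on visual).
    """
    # Each beam entry: (cumulative fused score, token prefix, finished flag)
    beam: List[Tuple[float, List[int], bool]] = [(0.0, [], False)]

    for _ in range(max_len):
        candidates: List[Tuple[float, List[int], bool]] = []
        for score, prefix, done in beam:
            if done:
                candidates.append((score, prefix, True))
                continue
            logp_a = audio_step(prefix)   # acoustic model next-token log-probs
            logp_v = visual_step(prefix)  # visual (lip-reading) model log-probs
            for tok in range(vocab_size):
                fused = lam * logp_a[tok] + (1.0 - lam) * logp_v[tok]
                candidates.append((score + fused, prefix + [tok], tok == eos_id))
        # Keep only the top-`beam_size` hypotheses in the common beam.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(done for _, _, done in beam):
            break

    return max(beam, key=lambda c: c[0])[1]
```

The variant "without the weighting parameter" mentioned in the abstract would replace the `lam`-weighted interpolation with an unweighted combination; the abstract does not specify the exact scheme, so it is not reproduced here.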
