Paper Title
Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding
Paper Authors
Paper Abstract
End-to-end (E2E) spoken language understanding (SLU) systems can infer the semantics of a spoken utterance directly from an audio signal. However, training an E2E system remains a challenge, largely due to the scarcity of paired audio-semantics data. In this paper, we treat an E2E system as a multi-modal model, with audio and text functioning as its two modalities, and use a cross-modal latent space (CMLS) architecture, where a shared latent space is learned between the `acoustic' and `text' embeddings. We propose using different multi-modal losses to explicitly guide the acoustic embeddings to be closer to the text embeddings, obtained from a semantically powerful pre-trained BERT model. We train the CMLS model on two publicly available E2E datasets, across different cross-modal losses and show that our proposed triplet loss function achieves the best performance. It achieves a relative improvement of 1.4% and 4% respectively over an E2E model without a cross-modal space and a relative improvement of 0.7% and 1% over a previously published CMLS model using $L_2$ loss. The gains are higher for a smaller, more complicated E2E dataset, demonstrating the efficacy of using an efficient cross-modal loss function, especially when there is limited E2E training data available.
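To make the core idea concrete, below is a minimal sketch, assuming PyTorch, of a cross-modal triplet loss that pulls an utterance's acoustic embedding toward the BERT embedding of its own transcript (positive) and away from the embedding of a different transcript (negative). The in-batch negative sampling, margin value, and embedding dimensionality are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(acoustic_emb, text_emb, margin=1.0):
    """Illustrative cross-modal triplet loss (assumed in-batch negatives).

    acoustic_emb: (B, D) utterance-level embeddings from the speech encoder (anchors)
    text_emb:     (B, D) BERT embeddings of the matching transcripts (positives)
    Negatives are formed by rolling the batch by one position, an assumption of this sketch.
    """
    # Positive distance: acoustic and text embeddings of the same utterance.
    pos_dist = F.pairwise_distance(acoustic_emb, text_emb)
    # Negative distance: text embeddings of a different utterance in the batch.
    neg_dist = F.pairwise_distance(acoustic_emb, text_emb.roll(shifts=1, dims=0))
    # Hinge: positives should be closer than negatives by at least `margin`.
    return F.relu(margin + pos_dist - neg_dist).mean()

# Toy usage with random tensors standing in for real encoder outputs.
if __name__ == "__main__":
    acoustic = torch.randn(8, 768)  # hypothetical speech-encoder output
    text = torch.randn(8, 768)      # hypothetical pooled BERT transcript embedding
    print(cross_modal_triplet_loss(acoustic, text).item())
```

In the full model this cross-modal term would typically be added to the SLU classification loss; the negative-sampling strategy and weighting shown here are placeholders rather than the authors' exact configuration.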