Paper Title
Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding
Paper Authors
Paper Abstract
End-to-end (E2E) spoken language understanding (SLU) systems can infer the semantics of a spoken utterance directly from an audio signal. However, training an E2E system remains a challenge, largely due to the scarcity of paired audio-semantics data. In this paper, we treat an E2E system as a multi-modal model, with audio and text functioning as its two modalities, and use a cross-modal latent space (CMLS) architecture, where a shared latent space is learned between the `acoustic' and `text' embeddings. We propose using different multi-modal losses to explicitly guide the acoustic embeddings to be closer to the text embeddings, obtained from a semantically powerful pre-trained BERT model. We train the CMLS model on two publicly available E2E datasets, across different cross-modal losses and show that our proposed triplet loss function achieves the best performance. It achieves a relative improvement of 1.4% and 4% respectively over an E2E model without a cross-modal space and a relative improvement of 0.7% and 1% over a previously published CMLS model using $L_2$ loss. The gains are higher for a smaller, more complicated E2E dataset, demonstrating the efficacy of using an efficient cross-modal loss function, especially when there is limited E2E training data available.
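To make the core idea concrete, below is a minimal sketch, assuming PyTorch, of a cross-modal triplet loss that pulls an utterance's acoustic embedding toward the BERT embedding of its own transcript (positive) and away from the embedding of a different transcript (negative). The in-batch negative sampling, margin value, and embedding dimensionality are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(acoustic_emb, text_emb, margin=1.0):
    """Illustrative cross-modal triplet loss (assumed in-batch negatives).

    acoustic_emb: (B, D) utterance-level embeddings from the speech encoder (anchors)
    text_emb:     (B, D) BERT embeddings of the matching transcripts (positives)
    Negatives are formed by rolling the batch by one position, an assumption of this sketch.
    """
    # Positive distance: acoustic and text embeddings of the same utterance.
    pos_dist = F.pairwise_distance(acoustic_emb, text_emb)
    # Negative distance: text embeddings of a different utterance in the batch.
    neg_dist = F.pairwise_distance(acoustic_emb, text_emb.roll(shifts=1, dims=0))
    # Hinge: positives should be closer than negatives by at least `margin`.
    return F.relu(margin + pos_dist - neg_dist).mean()

# Toy usage with random tensors standing in for real encoder outputs.
if __name__ == "__main__":
    acoustic = torch.randn(8, 768)  # hypothetical speech-encoder output
    text = torch.randn(8, 768)      # hypothetical pooled BERT transcript embedding
    print(cross_modal_triplet_loss(acoustic, text).item())
```

In the full model this cross-modal term would typically be added to the SLU classification loss; the negative-sampling strategy and weighting shown here are placeholders rather than the authors' exact configuration.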