Paper Title

Contextualized Translation of Automatically Segmented Speech

Paper Authors

Marco Gaido, Mattia Antonino Di Gangi, Matteo Negri, Mauro Cettolo, Marco Turchi

Paper Abstract

Direct speech-to-text translation (ST) models are usually trained on corpora segmented at sentence level, but at inference time they are commonly fed with audio split by a voice activity detector (VAD). Since VAD segmentation is not syntax-informed, the resulting segments do not necessarily correspond to well-formed sentences uttered by the speaker but, most likely, to fragments of one or more sentences. This segmentation mismatch degrades considerably the quality of ST models' output. So far, researchers have focused on improving audio segmentation towards producing sentence-like splits. In this paper, instead, we address the issue in the model, making it more robust to a different, potentially sub-optimal segmentation. To this aim, we train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context. We show that our context-aware solution is more robust to VAD-segmented input, outperforming a strong base model and the fine-tuning on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
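
The abstract describes two ideas at a high level: training on randomly segmented data and adding the previous segment as context. As a rough illustration of the first idea only, the sketch below randomly re-splits a stream of sentence-level (duration, transcript) pairs so that segment boundaries no longer align with sentence boundaries, loosely mimicking VAD-like splits. The function random_resegment, its parameters, and the word-proportional duration split are illustrative assumptions, not the authors' actual procedure, which operates on the audio itself.

```python
import random

def random_resegment(sentences, min_frames=50, max_frames=600, seed=0):
    """Randomly re-segment a corpus of sentence-level pairs.

    sentences: list of (num_frames, transcript) pairs at sentence level.
    Returns a list of (num_frames, transcript) segments whose lengths are
    drawn uniformly from [min_frames, max_frames], so boundaries generally
    fall mid-sentence. All names and parameters here are hypothetical.
    """
    rng = random.Random(seed)
    segments = []
    buf_frames, buf_words = 0.0, []
    target = rng.randint(min_frames, max_frames)
    for n_frames, text in sentences:
        words = text.split()
        # Crude approximation: spread the sentence duration evenly over words.
        frames_per_word = n_frames / max(len(words), 1)
        for word in words:
            buf_frames += frames_per_word
            buf_words.append(word)
            if buf_frames >= target:
                segments.append((round(buf_frames), " ".join(buf_words)))
                buf_frames, buf_words = 0.0, []
                target = rng.randint(min_frames, max_frames)
    if buf_words:  # flush the remainder as a final segment
        segments.append((round(buf_frames), " ".join(buf_words)))
    return segments

if __name__ == "__main__":
    corpus = [
        (320, "hello world this is a test"),
        (280, "another sentence follows here"),
    ]
    for segment in random_resegment(corpus):
        print(segment)
```

The second idea, conditioning on the previous segment as context, would additionally prepend each segment's predecessor to the model input at training and inference time; the paper should be consulted for how this is actually implemented.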
