Paper Title
Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition
Paper Authors
Paper Abstract
Although deep learning-based end-to-end Automatic Speech Recognition (ASR) has shown remarkable performance in recent years, it suffers severe performance regression on test samples drawn from different data distributions. Test-time Adaptation (TTA), previously explored in computer vision, aims to adapt a model trained on source domains to yield better predictions for test samples, often out-of-domain, without accessing the source data. Here, we propose the Single-Utterance Test-time Adaptation (SUTA) framework for ASR, which, to the best of our knowledge, is the first TTA study on ASR. Single-utterance TTA is a more realistic setting that does not assume test data are sampled from an identical distribution and does not delay on-demand inference by pre-collecting a batch of adaptation data. SUTA consists of unsupervised objectives with an efficient adaptation strategy. Empirical results demonstrate that SUTA effectively improves the performance of the source ASR model evaluated on multiple out-of-domain target corpora and in-domain test samples.
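As a rough illustration of the setting the abstract describes, the sketch below adapts an ASR model on one test utterance by minimizing the entropy of its output distribution, one of the unsupervised objectives used in SUTA. The choice of a HuggingFace wav2vec 2.0 CTC checkpoint, the decision to update only LayerNorm parameters, and all hyperparameters are assumptions for illustration, not the authors' exact configuration.

```python
# Hedged sketch of single-utterance test-time adaptation via entropy
# minimization. Model choice, adapted parameter set, step count, and
# learning rate are illustrative assumptions, not the paper's settings.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")


def suta_adapt(model, input_values, steps=10, lr=2e-4):
    """Adapt the model on one utterance by minimizing output entropy."""
    # Update only LayerNorm affine parameters; freeze everything else.
    trainable = []
    for name, param in model.named_parameters():
        if "layer_norm" in name:
            param.requires_grad = True
            trainable.append(param)
        else:
            param.requires_grad = False
    optimizer = torch.optim.AdamW(trainable, lr=lr)

    model.eval()  # disable dropout/SpecAugment; gradients still flow
    for _ in range(steps):
        logits = model(input_values).logits  # (1, frames, vocab)
        probs = torch.softmax(logits, dim=-1)
        # Frame-averaged entropy of the per-frame CTC output distribution.
        entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
    return model


# Usage: adapt on a single 16 kHz test utterance, then decode it.
# `waveform` is a 1-D float array of raw audio (a hypothetical input).
# inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
# adapted = suta_adapt(model, inputs.input_values)
# pred_ids = adapted(inputs.input_values).logits.argmax(dim=-1)
# print(processor.batch_decode(pred_ids))
```

In the single-utterance setting described in the abstract, the model would be reset to the source weights after decoding each utterance, so adaptation on one test sample never carries over to the next and no batch of test data has to be collected before inference.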