Paper Title

Continual Learning for On-Device Speech Recognition using Disentangled Conformers

Paper Authors

Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed

Paper Abstract

Automatic speech recognition research focuses on training and evaluating on static datasets. Yet, as speech models are increasingly deployed on personal devices, such models encounter user-specific distributional shifts. To simulate this real-world scenario, we introduce LibriContinual, a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks, with data corresponding to 118 individual speakers and 6 train splits per speaker of different sizes. Additionally, current speech recognition models and continual learning algorithms are not optimized to be compute-efficient. We adapt a general-purpose training algorithm NetAug for ASR and create a novel Conformer variant called the DisConformer (Disentangled Conformer). This algorithm produces ASR models consisting of a frozen 'core' network for general-purpose use and several tunable 'augment' networks for speaker-specific tuning. Using such models, we propose a novel compute-efficient continual learning algorithm called DisentangledCL. Our experiments show that the DisConformer models significantly outperform baselines on general ASR i.e. LibriSpeech (15.58% rel. WER on test-other). On speaker-specific LibriContinual they significantly outperform trainable-parameter-matched baselines (by 20.65% rel. WER on test) and even match fully finetuned baselines in some settings.
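The abstract describes the DisConformer as a frozen "core" network for general-purpose ASR plus several small tunable "augment" networks for speaker-specific tuning, with only the augment parameters updated during continual learning. As a rough illustration only, the following minimal PyTorch sketch shows that core/augment idea; the class name DisentangledBlock, the layer choices, and the additive combination are assumptions made for this example and are not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class DisentangledBlock(nn.Module):
    """Illustrative sketch (not the paper's exact design): a frozen 'core'
    sub-layer shared across speakers plus a small trainable 'augment'
    sub-layer whose output is added for speaker-specific adaptation."""

    def __init__(self, dim: int, augment_dim: int):
        super().__init__()
        # General-purpose core: trained once on general data, then frozen on-device.
        self.core = nn.Linear(dim, dim)
        for p in self.core.parameters():
            p.requires_grad = False
        # Speaker-specific augment: the only trainable part during continual learning.
        self.augment = nn.Sequential(
            nn.Linear(dim, augment_dim),
            nn.ReLU(),
            nn.Linear(augment_dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen general-purpose path plus a small tunable correction.
        return self.core(x) + self.augment(x)


if __name__ == "__main__":
    block = DisentangledBlock(dim=256, augment_dim=32)
    x = torch.randn(4, 100, 256)  # (batch, frames, features)
    y = block(x)
    trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
    print(y.shape, trainable)  # only the augment parameters are trainable
```

In this toy setup, per-speaker adaptation would update only the augment parameters, which keeps the trainable footprint small and leaves the shared core intact, mirroring the compute-efficient, parameter-matched comparison described in the abstract.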
