Paper Title
Enhancing Speech Recognition Decoding via Layer Aggregation
Paper Authors
Paper Abstract
Recently proposed speech recognition systems are designed to predict using representations generated by their top layers, employing greedy decoding which isolates each timestep from the rest of the sequence. Aiming for improved performance, a beam search algorithm is frequently utilized and a language model is incorporated to assist in ranking the top candidates. In this work, we experiment with several speech recognition models and find that logits predicted using the top layers may hamper beam search from achieving optimal results. Specifically, we show that fine-tuned Wav2Vec 2.0 and HuBERT yield highly confident predictions, and hypothesize that these predictions are based on local information and may not take full advantage of the information encoded in intermediate layers. To this end, we perform a layer analysis to reveal and visualize how predictions evolve throughout the inference flow. We then propose a prediction method that aggregates the top M layers, potentially leveraging useful information encoded in intermediate layers and relaxing model confidence. We showcase the effectiveness of our approach via beam search decoding, conducting our experiments on the LibriSpeech test and dev sets and achieving WER and CER reductions of up to 10% and 22%, respectively.
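The aggregation idea can be sketched in a few lines. Below is a minimal illustration in Python, assuming a HuggingFace Wav2Vec2ForCTC checkpoint; the abstract does not specify the exact aggregation function, so averaging the per-layer logits and the choice M = 4 are illustrative assumptions, not the paper's confirmed implementation.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Number of top layers to aggregate (illustrative; the paper treats M
# as a tunable quantity).
M = 4

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

def aggregated_logits(input_values: torch.Tensor) -> torch.Tensor:
    """Return CTC logits averaged over the top M transformer layers,
    instead of the usual top-layer-only logits."""
    with torch.no_grad():
        outputs = model(input_values, output_hidden_states=True)
    # hidden_states is a tuple: the embedding output followed by one
    # tensor per transformer layer; keep only the last M layers.
    top_m = outputs.hidden_states[-M:]
    # Project each hidden state through the CTC vocabulary head and
    # average, softening the over-confident top-layer distribution.
    per_layer = [model.lm_head(h) for h in top_m]
    return torch.stack(per_layer).mean(dim=0)

# Usage: given a 16 kHz waveform `speech` (1-D float array),
#   inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
#   logits = aggregated_logits(inputs.input_values)
# The aggregated logits can then be handed to a CTC beam search decoder
# (e.g. pyctcdecode) in place of the top-layer logits.
```

Averaging projected logits rather than hidden states keeps every layer's contribution in the vocabulary space the decoder expects, which is one simple way to realize the "relaxed confidence" effect the abstract describes.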