论文标题
Cuside:块,模拟未来的上下文和用于流媒体的解码
CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR
论文作者
论文摘要
已知历史和未来的上下文信息对于准确的声学建模很重要。但是,获取未来的上下文会为流媒体流量带来延迟。在本文中,我们提出了一个新的框架 - 块,模拟未来的上下文和解码(Cuside)以进行流语音识别。引入了一个新的仿真模块,以递归地模拟未来的上下文帧,而无需等待未来的上下文。使用自我监督损失与ASR模型共同训练模拟模块; ASR模型通过通常的ASR损失(例如我们实验中使用的CTC-CRF)进行了优化。实验表明,与使用真实的未来框架作为正确的上下文相比,使用模拟的未来上下文可以大大降低延迟,同时保持识别精度。使用Cuside,我们在Aishell-1数据集中获得了新的最新流媒体ASR结果。
History and future contextual information are known to be important for accurate acoustic modeling. However, acquiring future context brings latency for streaming ASR. In this paper, we propose a new framework - Chunking, Simulating Future Context and Decoding (CUSIDE) for streaming speech recognition. A new simulation module is introduced to recursively simulate the future contextual frames, without waiting for future context. The simulation module is jointly trained with the ASR model using a self-supervised loss; the ASR model is optimized with the usual ASR loss, e.g., CTC-CRF as used in our experiments. Experiments show that, compared to using real future frames as right context, using simulated future context can drastically reduce latency while maintaining recognition accuracy. With CUSIDE, we obtain new state-of-the-art streaming ASR results on the AISHELL-1 dataset.