Title
Super-Human Performance in Online Low-latency Recognition of Conversational Speech
Authors
Abstract
Achieving super-human performance in recognizing human speech has been a goal for several decades, as researchers have worked on increasingly challenging tasks. In the 1990s it was discovered that conversational speech between two humans is considerably more difficult than read speech, as hesitations, disfluencies, false starts, and sloppy articulation complicate acoustic processing and require robust, joint handling of acoustic, lexical, and language context. Early attempts with statistical models could only reach error rates over 50%, far from human performance (a WER of around 5.5%). Neural hybrid models and recent attention-based encoder-decoder models have considerably improved performance, as such contexts can now be learned in an integral fashion. However, processing such contexts requires the presentation of an entire utterance and thus introduces unwanted delays before a recognition result can be output. In this paper, we address performance as well as latency. We present results for a system that achieves super-human performance (a WER of 5.0% on the Switchboard conversational benchmark) at a word-based latency of only 1 second behind a speaker's speech. The system uses multiple attention-based encoder-decoder networks integrated within a novel low-latency incremental inference approach.
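The abstract's quantitative claim rests on word error rate (WER). As a minimal sketch, WER is the word-level Levenshtein edit distance (substitutions, insertions, deletions) between a hypothesis and a reference transcript, divided by the reference length. The function below is illustrative only; official Switchboard results are scored with NIST tooling, not this code.

```python
# Hedged sketch of word error rate (WER), the metric behind the
# abstract's 5.0% (system) vs. ~5.5% (human) comparison.
# WER = (substitutions + insertions + deletions) / reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for word-level edit distance:
    # d[i][j] = edits to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,            # substitution (or match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of a six-word reference -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

In online low-latency decoding, hypotheses are additionally produced incrementally, so the scored hypothesis is whatever the system has committed to within its latency budget.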