Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds

Paper title

Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds

Authors

Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara

Abstract

Recent diarization technologies can be categorized into two approaches, clustering-based and end-to-end neural, which have different pros and cons. Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors. While they can be seen as the current state of the art, working on various challenging data with reasonable robustness and accuracy, they have a critical disadvantage: they cannot handle overlapped speech, which is inevitable in natural conversational data. In contrast, end-to-end neural diarization (EEND), which directly predicts diarization labels using a neural network, was devised to handle overlapped speech. EEND can easily incorporate emerging deep-learning technologies and has started outperforming the x-vector clustering approach on some realistic datasets, but it is difficult to make it work for "long" recordings (e.g., recordings longer than 10 minutes) because of, e.g., its huge memory consumption. Block-wise independent processing is also difficult because it poses an inter-block label permutation problem, i.e., an ambiguity of the speaker label assignments between blocks. In this paper, we propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers. It modifies the conventional EEND framework to simultaneously output global speaker embeddings so that speaker clustering can be performed across blocks to solve the permutation problem. With experiments based on simulated noisy reverberant 2-speaker meeting-like data, we show that the proposed framework works significantly better than the original EEND, especially when the input data is long.
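
To make the inter-block label permutation problem concrete, the sketch below aligns per-block speaker slots to global speaker identities using only the block-level speaker embeddings. It is an illustration under stated assumptions, not the paper's algorithm: the function name `align_blocks`, the input layout (one nonzero embedding per local speaker slot per block), and the greedy sequential matching against running centroids are all hypothetical stand-ins for the cross-block clustering step that the abstract describes.

```python
from itertools import permutations

import numpy as np


def align_blocks(embeddings):
    """Map local speaker slots in each block to global speaker indices.

    embeddings: array of shape (num_blocks, num_speakers, dim), holding one
    nonzero embedding per local speaker slot per block (assumed layout).
    Returns an int array of shape (num_blocks, num_speakers) where
    out[b, s] is the global speaker index assigned to slot s of block b.
    """
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)  # unit-length embeddings
    num_blocks, num_spk, _ = emb.shape

    # The first block defines the global speaker order (an arbitrary choice).
    centroid_sums = emb[0].copy()
    assignment = np.zeros((num_blocks, num_spk), dtype=int)
    assignment[0] = np.arange(num_spk)

    for b in range(1, num_blocks):
        centroids = centroid_sums / np.linalg.norm(centroid_sums, axis=-1, keepdims=True)
        sim = emb[b] @ centroids.T  # cosine similarity, shape (local slot, global speaker)
        # Exhaustive search over permutations: two slots in the same block
        # can never share a global speaker. Feasible only for few speakers.
        best = max(
            permutations(range(num_spk)),
            key=lambda p: sum(sim[s, p[s]] for s in range(num_spk)),
        )
        assignment[b] = best
        for s, g in enumerate(best):
            centroid_sums[g] += emb[b, s]  # update the running centroid
    return assignment
```

Given the returned mapping, the frame-level outputs of block `b` would be reordered so that local slot `s` becomes global speaker `out[b, s]`. The exhaustive permutation search grows factorially with the number of speakers, so this form only suits small speaker counts such as the 2-speaker setting mentioned in the abstract.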
