Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation

Paper title

Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation

Authors

Olga Majewska, Evgeniia Razumovskaia, Edoardo Maria Ponti, Ivan Vulić, Anna Korhonen

Abstract

Multilingual task-oriented dialogue (ToD) facilitates access to services and information for many (communities of) speakers. Nevertheless, the potential of this technology is not fully realised, as current datasets for multilingual ToD - both for modular and end-to-end modelling - suffer from severe limitations. 1) When created from scratch, they are usually small in scale and fail to cover many possible dialogue flows. 2) Translation-based ToD datasets might lack naturalness and cultural specificity in the target language. In this work, to tackle these limitations we propose a novel outline-based annotation process for multilingual ToD datasets, where domain-specific abstract schemata of dialogue are mapped into natural language outlines. These in turn guide the target language annotators in writing a dialogue by providing instructions about each turn's intents and slots. Through this process we annotate a new large-scale dataset for training and evaluation of multilingual and cross-lingual ToD systems. Our Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages: Arabic, Indonesian, Russian, and Kiswahili. Qualitative and quantitative analyses of COD versus an equivalent translation-based dataset demonstrate improvements in data quality, unlocked by the outline-based approach. Finally, we benchmark a series of state-of-the-art systems for cross-lingual ToD, setting reference scores for future work and demonstrating that COD prevents over-inflated performance, typically met with prior translation-based ToD datasets.
