Paper Title
An Empirical Study on the Overlapping Problem of Open-Domain Dialogue Datasets
Paper Authors
Paper Abstract
Open-domain dialogue systems aim to converse with humans through text, and dialogue research relies heavily on benchmark datasets. In this work, we observe an overlapping problem in DailyDialog and OpenSubtitles, two popular open-domain dialogue benchmark datasets. Our systematic analysis then shows that such overlapping can be exploited to obtain fake state-of-the-art performance. Finally, we address this issue by cleaning these datasets and setting up a proper data processing procedure for future research.
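
As a rough illustration of the kind of check the abstract describes, the sketch below measures how many test samples also occur in the training set and then filters them out. It is not the authors' cleaning procedure: the (context, response) pair representation, the exact-match criterion after lowercasing and whitespace normalization, and the function names `overlap_ratio` and `remove_overlap` are all assumptions made for this example.

```python
from typing import Iterable, List, Tuple

# Assumed sample format: one (context, response) pair per dialogue turn.
Pair = Tuple[str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def normalize_pair(pair: Pair) -> Pair:
    """Normalize both the context and the response of a pair."""
    return (normalize(pair[0]), normalize(pair[1]))


def overlap_ratio(train: Iterable[Pair], test: Iterable[Pair]) -> float:
    """Fraction of test pairs that also appear (after normalization) in train."""
    seen = {normalize_pair(p) for p in train}
    test_list = list(test)
    if not test_list:
        return 0.0
    hits = sum(1 for p in test_list if normalize_pair(p) in seen)
    return hits / len(test_list)


def remove_overlap(train: Iterable[Pair], test: Iterable[Pair]) -> List[Pair]:
    """Keep test pairs absent from train, also dropping repeats within test."""
    seen = {normalize_pair(p) for p in train}
    kept: List[Pair] = []
    for p in test:
        key = normalize_pair(p)
        if key not in seen:
            seen.add(key)  # later copies of this test pair are skipped too
            kept.append(p)
    return kept
```

Exact matching is the simplest possible criterion. Near-duplicate detection (for example, token-level similarity) would catch more overlap but needs a threshold that this sketch does not attempt to choose.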