Paper Title

Identifying Chinese Opinion Expressions with Extremely-Noisy Crowdsourcing Annotations

Authors

Xin Zhang, Guangwei Xu, Yueheng Sun, Meishan Zhang, Xiaobin Wang, Min Zhang

Abstract


Recent work on opinion expression identification (OEI) relies heavily on the quality and scale of the manually-constructed training corpus, which can be extremely difficult to satisfy. Crowdsourcing is one practical solution to this problem, aiming to create a large-scale but quality-unguaranteed corpus. In this work, we investigate Chinese OEI with extremely-noisy crowdsourcing annotations, constructing a dataset at a very low cost. Following Zhang et al. (2021), we train the annotator-adapter model by regarding all annotations as gold-standard in terms of crowd annotators, and test the model by using a synthetic expert, which is a mixture of all annotators. As this annotator-mixture for testing is never modeled explicitly in the training phase, we propose to generate synthetic training samples by a pertinent mixup strategy to make the training and testing highly consistent. The simulation experiments on our constructed dataset show that crowdsourcing is highly promising for OEI, and our proposed annotator-mixup can further enhance the crowdsourcing modeling.
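To make the annotator-adapter and annotator-mixup ideas in the abstract concrete, the following PyTorch snippet is a minimal illustrative sketch under our own assumptions (the class name `AnnotatorAdapter`, the dimensions, and the `alpha` mixup parameter are hypothetical, and the paper's actual model details may differ): each crowd annotator gets an embedding that conditions a token-level tagger, training can blend two annotators' embeddings with a mixup coefficient, and testing uses the mean of all annotator embeddings as the synthetic expert.

```python
# Illustrative sketch only (not the authors' released code): an annotator-embedding
# adapter over a sentence encoder's token states, with annotator-mixup for training
# and a synthetic "expert" (mean of all annotator embeddings) for testing.
import torch
import torch.nn as nn


class AnnotatorAdapter(nn.Module):
    def __init__(self, hidden_dim, num_annotators, annotator_dim=32, num_tags=5):
        super().__init__()
        # One embedding per crowd annotator.
        self.annotator_emb = nn.Embedding(num_annotators, annotator_dim)
        # Token-level classifier conditioned on the annotator representation.
        self.classifier = nn.Linear(hidden_dim + annotator_dim, num_tags)

    def forward(self, token_states, annotator_vec):
        # token_states: (batch, seq_len, hidden_dim); annotator_vec: (batch, annotator_dim)
        expanded = annotator_vec.unsqueeze(1).expand(-1, token_states.size(1), -1)
        return self.classifier(torch.cat([token_states, expanded], dim=-1))

    def mixup_annotators(self, ann_a, ann_b, alpha=0.5):
        # Blend two annotators' embeddings so training also sees "mixed" annotators,
        # keeping it consistent with the synthetic expert used at test time.
        lam = torch.distributions.Beta(alpha, alpha).sample()
        mixed = lam * self.annotator_emb(ann_a) + (1 - lam) * self.annotator_emb(ann_b)
        return mixed, lam  # lam can also weight the two annotators' label losses

    def synthetic_expert(self, batch_size):
        # Test-time "expert": the average of all annotator embeddings.
        mean_emb = self.annotator_emb.weight.mean(dim=0, keepdim=True)
        return mean_emb.expand(batch_size, -1)
```

In this sketch, training would call `mixup_annotators` on pairs of annotators who labeled the same sentence, while evaluation would feed `synthetic_expert` into `forward`, so the condition seen at test time is explicitly modeled during training.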
