Paper Title

Revisiting Grammatical Error Correction Evaluation and Beyond

Paper Authors

Peiyuan Gong, Xuebo Liu, Heyan Huang, Min Zhang

Paper Abstract

Pretraining-based (PT-based) automatic evaluation metrics (e.g., BERTScore and BARTScore) have been widely used in several sentence generation tasks (e.g., machine translation and text summarization) due to their better correlation with human judgments over traditional overlap-based methods. Although PT-based methods have become the de facto standard for training grammatical error correction (GEC) systems, GEC evaluation still does not benefit from pretrained knowledge. This paper takes the first step towards understanding and improving GEC evaluation with pretraining. We first find that arbitrarily applying PT-based metrics to GEC evaluation brings unsatisfactory correlation results because of excessive attention to inessential system outputs (e.g., unchanged parts). To alleviate this limitation, we propose a novel GEC evaluation metric, PT-M2, which achieves the best of both worlds by using PT-based metrics to score only the corrected parts. Experimental results on the CoNLL14 evaluation task show that PT-M2 significantly outperforms existing methods, achieving a new state-of-the-art result of 0.949 Pearson correlation. Further analysis reveals that PT-M2 is robust in evaluating competitive GEC systems. Source code and scripts are freely available at https://github.com/pygongnlp/PT-M2.
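The official implementation lives in the repository above. As a rough illustration of the core idea, scoring only the corrected spans with a PT-based metric rather than the whole output, here is a minimal, hypothetical Python sketch. The (start, end, replacement) edit representation, the BERTScore-based edit weighting, and the F0.5 combination below are simplifying assumptions for illustration, not the paper's exact PT-M2 formulation.

```python
# Hypothetical sketch of a PT-weighted, M2-style GEC score.
# Assumes the bert-score package (pip install bert-score); the edit
# format and weighting scheme are illustrative, not the official PT-M2.
from bert_score import score as bert_score


def apply_edit(tokens, edit):
    """Apply one (start, end, replacement) edit to a token list."""
    start, end, replacement = edit
    return tokens[:start] + replacement.split() + tokens[end:]


def pt_weighted_f05(source, hyp_edits, ref_edits, beta=0.5):
    """Toy PT-weighted F0.5: each hypothesis edit that matches a gold
    edit is weighted by how much it improves BERTScore F1 against the
    reference-corrected sentence, so unchanged parts of the output do
    not dominate the score."""
    if not hyp_edits or not ref_edits:
        return 0.0

    src_tokens = source.split()
    # Build the reference by applying gold edits right-to-left so that
    # earlier edits do not invalidate the token offsets of later ones.
    ref_tokens = src_tokens
    for edit in sorted(ref_edits, reverse=True):
        ref_tokens = apply_edit(ref_tokens, edit)
    reference = " ".join(ref_tokens)

    # BERTScore of the unedited source, used as a baseline.
    _, _, f_src = bert_score([source], [reference], lang="en")

    weight_sum = 0.0
    for edit in (e for e in hyp_edits if e in ref_edits):
        corrected = " ".join(apply_edit(src_tokens, edit))
        _, _, f_edit = bert_score([corrected], [reference], lang="en")
        # The edit's weight is its marginal BERTScore gain over the source.
        weight_sum += max(f_edit.item() - f_src.item(), 0.0)

    precision = weight_sum / len(hyp_edits)
    recall = weight_sum / len(ref_edits)
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


if __name__ == "__main__":
    src = "He go to school yesterday ."
    gold_edits = [(1, 2, "went")]   # "go" -> "went"
    hyp_edits = [(1, 2, "went")]    # system proposed the same fix
    print(pt_weighted_f05(src, hyp_edits, gold_edits))
```

Restricting the PT-based metric to singly-edited sentences is what keeps the unchanged parts from dominating the score, which is exactly the failure mode the abstract attributes to applying BERTScore or BARTScore directly to full system outputs.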
