Paper title
Just Rank: Rethinking Evaluation with Word and Sentence Similarities
Authors
Abstract
Word and sentence embeddings are useful feature representations in natural language processing. However, intrinsic evaluation of embeddings lags far behind, with no significant update in the past decade. Word and sentence similarity tasks have become the de facto evaluation method. This leads models to overfit to such evaluations, negatively impacting the development of embedding models. This paper first points out the problems with using semantic similarity as the gold standard for word and sentence embedding evaluations. Further, we propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks. Extensive experiments are conducted on 60+ models and popular datasets to support our claims. Finally, a practical evaluation toolkit is released for future benchmarking purposes.