Can Identifier Splitting Improve Open-Vocabulary Language Model of Code?

Paper Title

Can Identifier Splitting Improve Open-Vocabulary Language Model of Code?

Authors

Jieke Shi, Zhou Yang, Junda He, Bowen Xu, David Lo

Abstract

Statistical language models on source code have successfully assisted software engineering tasks. However, developers can create or pick arbitrary identifiers when writing source code. Freely chosen identifiers lead to the notorious out-of-vocabulary (OOV) problem that negatively affects model performance. Recently, Karampatsis et al. showed that using the Byte Pair Encoding (BPE) algorithm to address the OOV problem can improve the language models' predictive performance on source code. However, a drawback of BPE is that it cannot split the identifiers in a way that preserves the meaningful semantics. Prior researchers also show that splitting compound identifiers into sub-words that reflect the semantics can benefit software development tools. These two facts motivate us to explore whether identifier splitting techniques can be utilized to augment the BPE algorithm and boost the performance of open-vocabulary language models considered in Karampatsis et al.'s work. This paper proposes to split identifiers in both constructing vocabulary and processing model inputs procedures, thus exploiting three different settings of applying identifier splitting to language models for the code completion task. We contrast models' performance under these settings and find that simply inserting identifier splitting into the pipeline hurts the model performance, while a hybrid strategy combining identifier splitting and the BPE algorithm can outperform the original open-vocabulary models on predicting identifiers by 3.68% of recall and 6.32% of Mean Reciprocal Rank. The results also show that the hybrid strategy can improve the entropy of language models by 2.02%.
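
To make "splitting compound identifiers into sub-words that reflect the semantics" concrete, here is a minimal sketch of one common heuristic: split on underscores and on camelCase boundaries. The function name `split_identifier` and the regular expression are assumptions made for illustration; the abstract does not say which splitting technique the authors use, and this is not their implementation.

```python
import re
from typing import List

# Hypothetical heuristic splitter (an illustration, not the paper's tool).
# Alternatives, tried in order at each position:
#   [A-Z]+(?![a-z])  an acronym run such as "HTTP" (stops before "Response")
#   [A-Z]?[a-z]+     a word with an optional leading capital, e.g. "get", "User"
#   [0-9]+           a run of digits
_SUBWORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier: str) -> List[str]:
    """Split an identifier on underscores and camelCase boundaries."""
    subwords: List[str] = []
    for chunk in identifier.split("_"):
        # Empty chunks (from leading or doubled underscores) yield nothing.
        subwords.extend(_SUBWORD.findall(chunk))
    return [word.lower() for word in subwords]


if __name__ == "__main__":
    for name in ("getUserName", "parseHTTPResponse", "max_buffer_size2"):
        print(name, "->", split_identifier(name))
```

In the hybrid setting the abstract describes, sub-words like these would then be processed together with the BPE algorithm rather than replacing it; how the two are combined is not specified in the abstract, so that step is left out of the sketch.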
