CodeRetriever: Unimodal and Bimodal Contrastive Learning for Code Search

Paper title

CodeRetriever: Unimodal and Bimodal Contrastive Learning for Code Search

Authors

Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Qiu, Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Jiang, Weizhu Chen, Nan Duan

Abstract

In this paper, we propose the CodeRetriever model, which learns function-level code semantic representations through large-scale code-text contrastive pre-training. We adopt two contrastive learning schemes in CodeRetriever: unimodal contrastive learning and bimodal contrastive learning. For unimodal contrastive learning, we design an unsupervised learning approach to build semantically related code pairs based on the documentation and function name. For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build code-text pairs. Both contrastive objectives can fully leverage a large-scale code corpus for pre-training. Extensive experimental results show that CodeRetriever achieves a new state of the art, with significant improvement over existing code pre-trained models, on eleven domain/language-specific code search tasks covering six programming languages at different code granularities (function-level, snippet-level, and statement-level). These results demonstrate the effectiveness and robustness of CodeRetriever.
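
To make the two objectives concrete, the following is a minimal sketch of a contrastive loss with in-batch negatives, applied once to code-code pairs (unimodal) and once to text-code pairs (bimodal). The abstract does not state the exact loss, so the InfoNCE form, the temperature value, the unweighted sum of the two losses, and the random vectors standing in for encoder outputs are all assumptions made for illustration. The function names are hypothetical and do not come from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norm, eps)


def info_nce_loss(queries, keys, temperature=0.05):
    """InfoNCE with in-batch negatives (assumed loss form, not confirmed by the abstract).

    Row i of `queries` is the positive partner of row i of `keys`;
    every other row of `keys` acts as a negative for query i.
    """
    q = l2_normalize(np.asarray(queries, dtype=np.float64))
    k = l2_normalize(np.asarray(keys, dtype=np.float64))
    logits = q @ k.T / temperature
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability assigned to each true pair.
    return float(-np.mean(np.diag(log_probs)))


# Random stand-ins for encoder outputs (hypothetical data, batch of 4, dimension 8).
rng = np.random.default_rng(0)
code_emb = rng.normal(size=(4, 8))
paired_code_emb = code_emb + 0.01 * rng.normal(size=(4, 8))  # semantically related code
text_emb = code_emb + 0.01 * rng.normal(size=(4, 8))         # documentation or comments

unimodal_loss = info_nce_loss(code_emb, paired_code_emb)  # code-to-code objective
bimodal_loss = info_nce_loss(text_emb, code_emb)          # text-to-code objective
total_loss = unimodal_loss + bimodal_loss                 # unweighted sum is an assumption
```

In this sketch, both objectives share the same loss function and differ only in how the pairs are built, which mirrors the abstract's description of code-code pairs versus code-text pairs.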
