Paper Title
Preprocessing Source Code Comments for Linguistic Models
Paper Authors
Paper Abstract
Comments are an important part of source code and a primary source of documentation. This has driven interest in using large bodies of comments to train or evaluate tools that consume or produce them -- such as generating oracles, or even code, from comments, or automatically generating code summaries. Most of this work makes strong assumptions about the structure and quality of comments, such as assuming they consist mostly of proper English sentences. However, we know little about the actual quality of existing comments for these use cases. Comments often contain unique structures and elements that are not seen in other types of text, and filtering or extracting information from them requires extra care. This paper explores the contents and quality of Python comments drawn from the 840 most popular open-source projects on GitHub and 8,422 projects from the SriLab dataset, and the impact that naïve vs. in-depth filtering can have on using existing comments to train and evaluate systems that generate comments.
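To make the notion of "naïve" filtering concrete, the sketch below shows one way such a baseline pipeline might look for Python: pull out `#` comments and docstrings with the standard `tokenize` and `ast` modules, then keep only comments that superficially resemble English prose. The helper names (`extract_comments`, `naive_filter`) and the filtering thresholds are illustrative assumptions, not the paper's actual preprocessing criteria.

```python
import ast
import io
import tokenize


def extract_comments(source: str) -> list[str]:
    """Collect '#' comments and docstrings from a Python source string."""
    comments = []
    # '#' comments via the tokenize module
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments.append(tok.string.lstrip("#").strip())
    # Docstrings via the ast module
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.FunctionDef,
                             ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node)
            if doc:
                comments.append(doc.strip())
    return comments


def naive_filter(comment: str) -> bool:
    """Keep comments that superficially look like English prose.

    The heuristics here (word count, ASCII check, marker prefixes) are
    placeholder assumptions for illustration only.
    """
    words = comment.split()
    return (
        len(words) >= 3                                      # more than a stray token
        and comment.isascii()                                # crude English proxy
        and not comment.startswith(("TODO", "FIXME", "noqa"))  # drop task markers
    )


if __name__ == "__main__":
    sample = '''
def add(a, b):
    """Return the sum of a and b."""
    # TODO: handle overflow
    return a + b  # simple addition
'''
    kept = [c for c in extract_comments(sample) if naive_filter(c)]
    print(kept)  # only the docstring survives this naive filter
```

In-depth filtering, by contrast, would need to handle the comment-specific structures the abstract mentions (embedded code, markup, identifiers, non-sentence fragments), which simple heuristics like the ones above silently keep or discard.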