Paper title
Forking Without Clicking: on How to Identify Software Repository Forks
Authors
Abstract
The notion of a software "fork" has shifted over time from the (negative) phenomenon of community disagreements that result in separate development lines and, ultimately, separate software products, to the (positive) practice of using distributed version control system (VCS) repositories to collaboratively improve a single product without stepping on each other's toes. In both cases, the VCS repositories participating in a fork share parts of a common development history. Studies of software forks generally rely on metadata from hosting platforms, such as GitHub, as the source of truth for what constitutes a fork. These "forge forks", however, cover only repositories that were created as forks on the platform itself, e.g., by clicking a "fork" button in the platform user interface. The increased diversity of code hosting platforms (e.g., GitLab) and the habits of significant development communities (e.g., the Linux kernel, which is not primarily hosted on any single platform) call into question the reliability of trusting code hosting platforms to identify forks. Doing so might introduce selection and methodological biases in empirical studies. In this article we explore various definitions of "software forks", trying to capture forking workflows that exist in the real world. We quantify the differences in how many repositories would be identified as forks on GitHub according to the various definitions, confirming that a significant number could be overlooked by considering only forge forks. We study the structure and size of fork networks, observing how they are affected by the proposed definitions, and discuss the potential impact on empirical research.
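To make the idea of identifying forks from shared development history concrete, the following is a minimal sketch and not the method of the article. It assumes one possible definition: two repositories belong to the same fork network if they share at least one commit identifier, directly or through a chain of other repositories. The function name `fork_networks`, the input layout, and the repository names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def fork_networks(repo_commits):
    """Group repositories that share at least one commit ID, transitively.

    repo_commits maps a repository name to the set of commit IDs it contains.
    Assumption: sharing a commit ID is treated as evidence of a fork
    relationship; the abstract does not fix this definition.
    """
    # Union-find structure: every repository starts in its own group.
    parent = {repo: repo for repo in repo_commits}

    def find(repo):
        # Follow parent links to the group representative, compressing paths.
        while parent[repo] != repo:
            parent[repo] = parent[parent[repo]]
            repo = parent[repo]
        return repo

    # Remember the first repository in which each commit was seen and merge
    # any later repository containing the same commit into that group.
    first_seen = {}
    for repo, commits in repo_commits.items():
        for commit in commits:
            if commit in first_seen:
                parent[find(repo)] = find(first_seen[commit])
            else:
                first_seen[commit] = repo

    # Collect the members of each group.
    networks = defaultdict(set)
    for repo in repo_commits:
        networks[find(repo)].add(repo)
    return sorted(sorted(members) for members in networks.values())


# Hypothetical repositories on different hosting platforms.
example = {
    "github.com/alice/proj": {"c1", "c2"},
    "gitlab.com/bob/proj": {"c1", "c2", "c3"},
    "example.org/carol/other": {"x1"},
}
print(fork_networks(example))
```

In this sketch the two `proj` repositories end up in one network even though they live on different platforms and neither was created with a "fork" button, which is the kind of case that platform metadata alone would miss.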