Paper Title

Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI

Authors

Mahima Pushkarna, Andrew Zaldivar, Oddur Kjartansson

Abstract

As research and industry move towards large-scale models capable of numerous downstream tasks, the complexity of understanding multi-modal datasets that give nuance to models rapidly increases. A clear and thorough understanding of a dataset's origins, development, intent, ethical considerations and evolution becomes a necessary step for the responsible and informed deployment of models, especially those in people-facing contexts and high-risk domains. However, the burden of this understanding often falls on the intelligibility, conciseness, and comprehensiveness of the documentation. It requires consistency and comparability across the documentation of all datasets involved, and as such documentation must be treated as a user-centric product in and of itself. In this paper, we propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets within the practical contexts of industry and research. Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset's lifecycle for responsible AI development. These summaries provide explanations of processes and rationales that shape the data and consequently the models, such as upstream sources, data collection and annotation methods; training and evaluation methods, intended use; or decisions affecting model performance. We also present frameworks that ground Data Cards in real-world utility and human-centricity. Using two case studies, we report on desirable characteristics that support adoption across domains, organizational structures, and audience groups. Finally, we present lessons learned from deploying over 20 Data Cards.
