Paper Title

A Pitfall of Learning from User-generated Data: In-depth Analysis of Subjective Class Problem

Authors

Kei Nemoto, Shweta Jain

Abstract

Research on supervised learning algorithms implicitly assumes that training data is labeled by domain experts, or at least by semi-professional labelers accessible through crowdsourcing services such as Amazon Mechanical Turk. With the advent of the Internet, data has become abundant, and a large number of machine-learning-based systems are now trained on user-generated data, using categorical data as true labels. However, little work has been done on supervised learning with user-defined labels, where users are not necessarily experts and might be motivated to provide incorrect labels in order to improve their own utility from the system. In this article, we propose two types of classes in user-defined labels, subjective classes and objective classes, and show that objective classes are as reliable as if they were provided by domain experts, whereas subjective classes are subject to bias and manipulation by the user. We define this as the subjective class problem and provide a framework for detecting subjective labels in a dataset without querying an oracle. Using this framework, data mining practitioners can detect a subjective class at an early stage of their projects and avoid wasting time and resources by applying traditional machine learning techniques to a subjective class problem.
