Paper Title

Overcoming Noisy and Irrelevant Data in Federated Learning

Paper Authors

Tiffany Tuor, Shiqiang Wang, Bong Jun Ko, Changchang Liu, Kin K. Leung

Paper Abstract

Many image and vision applications require a large amount of data for model training. Collecting all such data at a central location can be challenging due to data privacy and communication bandwidth restrictions. Federated learning is an effective way of training a machine learning model in a distributed manner from local data collected by client devices, which does not require exchanging the raw data among clients. A challenge is that among the large variety of data collected at each client, it is likely that only a subset is relevant for a learning task while the rest of the data has a negative impact on model training. Therefore, before starting the learning process, it is important to select the subset of data that is relevant to the given federated learning task. In this paper, we propose a method for distributedly selecting relevant data, where we use a benchmark model trained on a small task-specific benchmark dataset to evaluate the relevance of individual data samples at each client and select the data with sufficiently high relevance. Then, each client only uses the selected subset of its data in the federated learning process. The effectiveness of our proposed approach is evaluated on multiple real-world image datasets in a simulated system with a large number of clients, showing up to $25\%$ improvement in model accuracy compared to training with all data.
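The client-side selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: it assumes the relevance score of a sample is its loss under the benchmark model (any callable returning a scalar here), and that samples scoring below a threshold are kept. The function name, the loss-as-relevance proxy, and the threshold rule are all assumptions for illustration.

```python
import numpy as np

def select_relevant_samples(benchmark_loss, samples, threshold):
    """Keep samples whose benchmark-model loss falls below a threshold.

    `benchmark_loss` stands in for the benchmark model trained on the
    small task-specific dataset; using per-sample loss as the relevance
    score is an illustrative assumption -- the paper only states that
    relevance is evaluated with a benchmark model.
    """
    losses = np.array([benchmark_loss(x) for x in samples])
    keep = losses < threshold  # boolean mask: True = sufficiently relevant
    return [s for s, k in zip(samples, keep) if k]

# Toy example: the "benchmark model" scores a sample by its magnitude,
# so samples far from zero are treated as irrelevant/noisy.
samples = [0.1, 0.5, 3.0, 0.2, 4.0]
selected = select_relevant_samples(abs, samples, threshold=1.0)
# selected -> [0.1, 0.5, 0.2]
```

Each client would run this filter locally before federated training begins, so only the selected subset contributes gradient updates and no raw data leaves the device.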
