数据流的公平和代表性子集选择

论文标题

数据流的公平和代表性子集选择

Fair and Representative Subset Selection from Data Streams

论文作者

Wang, Yanhao, Fabbri, Francesco, Mathioudakis, Michael

论文摘要

我们研究从大数据流中提取一小部分代表性项目的问题。在许多数据挖掘和机器学习应用程序（例如社交网络分析和推荐系统）中，可以将此问题提出为最大化单调的supdodular函数，但受基数约束$ k $。在这项工作中，我们考虑了流中数据项属于几个不相交组之一的设置，并使用额外的\ emph {fairness}约束调查优化问题，该设置将选择限制为每个组中给定数量的项目。然后，我们提出了流媒体下最大化问题的公平感知变体的有效算法。特别是，我们首先给出$（\ frac {1} {2} - \ varepsilon）$ - approximation算法，该算法需要$ o（\ frac {1} {\ varepsilon} \ varepsilon} \ log \ log \ frac \ frac {k} {k} {\ varepsilon {\ varepsilon} $ pass $ passes $ passe $ passes $ sreact in convers $ sreact in creampes $ cream for not for not for in Creampers。此外，当允许使用无限的缓冲尺寸和后处理时间时，我们给出了一种单频流算法，其近似值为$（\ frac {1} {2} {2} - \ varepsilon）$，并讨论如何将其调整到更实用的设置中，该设置在某些情况下是限制的。最后，我们证明了我们提出的算法对两个现实世界应用的效率和有效性，即\ emph {大图上的最大覆盖范围}和\ emph {个性化推荐}。

We study the problem of extracting a small subset of representative items from a large data stream. In many data mining and machine learning applications such as social network analysis and recommender systems, this problem can be formulated as maximizing a monotone submodular function subject to a cardinality constraint $k$. In this work, we consider the setting where data items in the stream belong to one of several disjoint groups and investigate the optimization problem with an additional \emph{fairness} constraint that limits selection to a given number of items from each group. We then propose efficient algorithms for the fairness-aware variant of the streaming submodular maximization problem. In particular, we first give a $ (\frac{1}{2}-\varepsilon) $-approximation algorithm that requires $ O(\frac{1}{\varepsilon} \log \frac{k}{\varepsilon}) $ passes over the stream for any constant $ \varepsilon>0 $. Moreover, we give a single-pass streaming algorithm that has the same approximation ratio of $(\frac{1}{2}-\varepsilon)$ when unlimited buffer sizes and post-processing time are permitted, and discuss how to adapt it to more practical settings where the buffer sizes are bounded. Finally, we demonstrate the efficiency and effectiveness of our proposed algorithms on two real-world applications, namely \emph{maximum coverage on large graphs} and \emph{personalized recommendation}.

下载PDF全文

下载文献需遵守相关版权规定

论文标题