FOLD-SE: An Efficient Rule-based Machine Learning Algorithm with Scalable Explainability

Paper Title

FOLD-SE: An Efficient Rule-based Machine Learning Algorithm with Scalable Explainability

Authors

Huaduo Wang, Gopal Gupta

Abstract

We present FOLD-SE, an efficient, explainable machine learning algorithm for classification tasks on tabular data containing numerical and categorical values. As its (explainable) trained model, FOLD-SE generates a set of default rules, essentially a stratified normal logic program. The explainability provided by FOLD-SE is scalable, meaning that regardless of the size of the dataset, the number of learned rules and learned literals stays quite small while good classification accuracy is maintained. A model with a smaller number of rules and literals is easier for human beings to understand. FOLD-SE is competitive with state-of-the-art machine learning algorithms such as XGBoost and Multi-Layer Perceptrons (MLP) with respect to prediction accuracy. However, unlike XGBoost and MLP, the FOLD-SE algorithm is explainable. The FOLD-SE algorithm builds upon our earlier work on developing the explainable FOLD-R++ machine learning algorithm for binary classification and inherits all of its positive features. Thus, pre-processing of the dataset with techniques such as one-hot encoding is not needed. Like FOLD-R++, FOLD-SE uses prefix sums to speed up computations, which makes FOLD-SE an order of magnitude faster than XGBoost and MLP in execution speed. The FOLD-SE algorithm outperforms FOLD-R++ as well as other rule-learning algorithms such as RIPPER in efficiency, performance and scalability, especially for large datasets. A major reason for the scalable explainability of FOLD-SE is the use of a literal selection heuristic based on Gini Impurity, as opposed to the Information Gain used in FOLD-R++. A multi-category classification version of FOLD-SE is also presented.
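The abstract names two ingredients, prefix sums and a Gini Impurity based literal selection heuristic, without giving details. The sketch below shows how they can fit together when choosing a numeric literal of the form `value <= threshold`. It is an illustration under stated assumptions, not the authors' implementation: the function names `gini` and `best_threshold` are hypothetical, and the score is the textbook weighted Gini impurity, which may differ from the exact heuristic used in FOLD-SE.

```python
from bisect import bisect_right
from itertools import accumulate


def gini(pos, neg):
    """Binary Gini impurity of a group with `pos` positive and `neg` negative examples."""
    total = pos + neg
    if total == 0:
        return 0.0
    p = pos / total
    return 2.0 * p * (1.0 - p)


def best_threshold(values, labels):
    """Return (threshold, score) for the literal `value <= threshold` with the
    lowest weighted Gini impurity. Hypothetical helper, for illustration only.

    values: list of numbers; labels: list of bools (True = positive example).
    """
    if not values:
        return (None, float("inf"))

    # Count positives and negatives at each distinct value, in sorted order.
    distinct = sorted(set(values))
    pos_at = [0] * len(distinct)
    neg_at = [0] * len(distinct)
    for v, y in zip(values, labels):
        i = bisect_right(distinct, v) - 1  # index of v in `distinct`
        if y:
            pos_at[i] += 1
        else:
            neg_at[i] += 1

    # Prefix sums: pos_le[i] is the number of positives with value <= distinct[i].
    # Each candidate threshold is then scored in O(1) instead of rescanning the data.
    pos_le = list(accumulate(pos_at))
    neg_le = list(accumulate(neg_at))
    total_pos, total_neg = pos_le[-1], neg_le[-1]
    n = total_pos + total_neg

    best = (None, float("inf"))
    for i, t in enumerate(distinct):
        left_p, left_n = pos_le[i], neg_le[i]
        right_p, right_n = total_pos - left_p, total_neg - left_n
        score = ((left_p + left_n) * gini(left_p, left_n)
                 + (right_p + right_n) * gini(right_p, right_n)) / n
        if score < best[1]:
            best = (t, score)
    return best
```

For example, `best_threshold([1, 2, 3, 4], [True, True, False, False])` selects the threshold 2, which separates the two classes perfectly. After sorting, every candidate threshold is evaluated from the two prefix-sum arrays alone, which is the kind of saving the abstract attributes to prefix sums.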
