Paper title
Feature Selection with Distance Correlation
Paper authors
Paper abstract
Choosing which properties of the data to use as input to multivariate decision algorithms -- a.k.a. feature selection -- is an important step in solving any problem with machine learning. While there is a clear trend towards training sophisticated deep networks on large numbers of relatively unprocessed inputs (so-called automated feature engineering), for many tasks in physics, sets of theoretically well-motivated and well-understood features already exist. Working with such features can bring many benefits, including greater interpretability, reduced training and run time, and enhanced stability and robustness. We develop a new feature selection method based on Distance Correlation (DisCo) and demonstrate its effectiveness on the tasks of boosted top- and $W$-tagging. Using our method to select features from a set of over 7,000 energy flow polynomials, we show that we can match the performance of much deeper architectures using only ten features and two orders of magnitude fewer model parameters.
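To make the central quantity concrete, the following is a minimal sketch of the standard sample distance correlation (the estimator built from double-centered pairwise distance matrices), together with a simple univariate ranking of features by their distance correlation with a target. The abstract does not spell out the selection procedure, so this is not the paper's method: the function names `distance_correlation` and `rank_features` are hypothetical, and the ranking step is only an assumed illustration of how such a measure could be used to score candidate features.

```python
import numpy as np


def _centered_distances(x):
    """Double-centered matrix of pairwise Euclidean distances."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of one variable
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return (d
            - d.mean(axis=0, keepdims=True)
            - d.mean(axis=1, keepdims=True)
            + d.mean())


def distance_correlation(x, y):
    """Sample distance correlation between x and y (value in [0, 1])."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()                   # squared distance covariance
    dvar2 = (a * a).mean() * (b * b).mean()  # product of squared distance variances
    if dvar2 <= 0.0:
        return 0.0                           # a constant input carries no dependence
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2)))


def rank_features(features, target):
    """Hypothetical helper: column indices sorted by decreasing distance
    correlation with the target. A plain univariate ranking, assumed here
    for illustration only."""
    features = np.asarray(features, dtype=float)
    scores = np.array([distance_correlation(features[:, j], target)
                       for j in range(features.shape[1])])
    order = np.argsort(-scores, kind="stable")
    return [int(j) for j in order], scores
```

Unlike the Pearson coefficient, distance correlation also responds to nonlinear dependence: for `x` symmetric about zero and `y = x**2`, the Pearson correlation vanishes while the distance correlation is clearly non-zero. The pairwise distance matrices need O(n²) memory, so a sketch like this is suited to modest sample sizes only.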