Paper title
Tag N' Train: A Technique to Train Improved Classifiers on Unlabeled Data
Paper authors
Paper abstract
There has been substantial progress in applying machine learning techniques to classification problems in collider and jet physics. But as these techniques grow in sophistication, they are becoming more sensitive to subtle features of jets that may not be well modeled in simulation. Relying on simulation for training will therefore lead to sub-optimal performance in data, yet the lack of true class labels makes it difficult to train on real data. To address this challenge, we introduce a new approach, called Tag N' Train (TNT), that can be applied to unlabeled data containing two distinct sub-objects. The technique uses a weak classifier for one of the objects to tag signal-rich and background-rich samples. These samples are then used to train a stronger classifier for the other object. We demonstrate the power of this method by applying it to a dijet resonance search. Starting with autoencoders trained directly on data as the weak classifiers, we use TNT to train substantially improved classifiers. We show that Tag N' Train can be a powerful tool in model-agnostic searches and discuss other potential applications.
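The tag-then-train loop described in the abstract can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the Gaussian toy events, the use of a single raw feature of the first object as the weak score (a stand-in for the autoencoder), the quantile cuts, and the small logistic regression used as the second-stage classifier. It also assumes that the two objects' features are independent once the event class is fixed. Truth labels appear only to evaluate the result, never in training.

```python
import numpy as np

def make_events(n_bkg, n_sig, rng):
    """Toy events with two sub-objects; signal shifts the features of both."""
    y = np.concatenate([np.zeros(n_bkg), np.ones(n_sig)]).astype(int)
    n = y.size
    # Object 1: one weakly discriminating feature (hypothetical).
    obj1 = rng.normal(0.0, 1.0, size=(n, 1)) + 1.0 * y[:, None]
    # Object 2: two features with a stronger signal shift (hypothetical).
    obj2 = rng.normal(0.0, 1.0, size=(n, 2)) + np.array([1.5, -1.5]) * y[:, None]
    return obj1, obj2, y

def tag(weak_scores, sig_quantile=0.9, bkg_quantile=0.5):
    """Split events into signal-rich and background-rich sets by weak score."""
    hi = np.quantile(weak_scores, sig_quantile)
    lo = np.quantile(weak_scores, bkg_quantile)
    return weak_scores >= hi, weak_scores <= lo

def train_logistic(x, labels, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression on the noisy tag labels."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - labels
        w -= lr * (x.T @ grad) / len(labels)
        b -= lr * grad.mean()
    return w, b

def auc(scores, y):
    """Rank-based area under the ROC curve (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def run_demo(seed=0):
    rng = np.random.default_rng(seed)
    obj1, obj2, y = make_events(n_bkg=9000, n_sig=1000, rng=rng)

    # Tag: the weak classifier looks only at object 1.
    weak_scores = obj1[:, 0]
    sig_rich, bkg_rich = tag(weak_scores)

    # Train: the new classifier sees only object 2 and the tag labels.
    used = sig_rich | bkg_rich
    w, b = train_logistic(obj2[used], sig_rich[used].astype(float))

    # Evaluate with truth labels, which exist only because this is a toy.
    return auc(weak_scores, y), auc(obj2 @ w + b, y)

if __name__ == "__main__":
    auc_weak, auc_tnt = run_demo()
    print(f"weak classifier AUC (object 1): {auc_weak:.3f}")
    print(f"TNT classifier AUC (object 2):  {auc_tnt:.3f}")
```

In this toy setup the tagged samples are far from pure, yet the second-stage classifier can still learn a useful direction because the label noise is independent of the second object's features. The roles of the two objects could presumably be swapped to train a classifier for the first object in the same way.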