Paper title
Harnessing label semantics to extract higher performance under noisy label for Company to Industry matching
Paper authors
Paper abstract
Assigning appropriate industry tag(s) to a company is a critical task in a financial institution, as it affects various pieces of financial machinery. Yet it remains a complex task. Typically, such industry tags are assigned by Subject Matter Experts (SMEs) after evaluating a company's business lines against the industry definitions. The task becomes even more challenging as companies continue to add new businesses and new industry definitions are formed. Given the periodicity of the task, it is reasonable to assume that an Artificial Intelligence (AI) agent could be developed to carry it out efficiently. While this is an exciting prospect, the challenge stems from the need for historical patterns of such tag assignments (or labeling). Labeling is often considered the most expensive task in Machine Learning (ML) due to its dependency on SMEs and manual effort. Therefore, in enterprise settings, an ML project often encounters noisy and dependent labels. Such labels create technical hindrances for ML models in producing robust tag assignments. We propose an ML pipeline that uses semantic similarity matching as an alternative to multi-label text classification, while making use of a Label Similarity Matrix and a minimum labeling strategy. We demonstrate that this pipeline achieves significant improvements in the presence of label noise and exhibits robust predictive capabilities.
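The abstract does not spell out how the pipeline is built, so the following is only a minimal sketch of the general idea of semantic similarity matching against industry definitions. Everything in it is an assumption made for illustration: the toy bag-of-words `embed` function stands in for whatever text encoder the authors use, the `INDUSTRY_DEFINITIONS` entries and the `threshold` value are hypothetical, and the pairwise matrix over label definitions is just one plausible reading of a "Label Similarity Matrix", not the paper's construction.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase token counts (assumed stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical industry definitions, invented for this example.
INDUSTRY_DEFINITIONS = {
    "Banking": "deposit taking lending and payment services",
    "Software": "development and licensing of software products",
    "Retail": "sale of consumer goods through stores or online",
}


def match_industries(description, definitions, threshold=0.2):
    """Return (label, score) pairs whose definition is similar enough to the description.

    Several labels may pass the threshold, which is how matching can play the
    role of multi-label classification without training on noisy labels.
    """
    doc = embed(description)
    scores = {label: cosine(doc, embed(text)) for label, text in definitions.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, score) for label, score in ranked if score >= threshold]


def label_similarity_matrix(definitions):
    """Pairwise similarity between label definitions (an assumed construction)."""
    labels = list(definitions)
    vectors = [embed(definitions[label]) for label in labels]
    return labels, [[cosine(u, v) for v in vectors] for u in vectors]


if __name__ == "__main__":
    company = "online lending and payment services for consumers"
    print(match_industries(company, INDUSTRY_DEFINITIONS))
    labels, matrix = label_similarity_matrix(INDUSTRY_DEFINITIONS)
    for label, row in zip(labels, matrix):
        print(label, [round(x, 2) for x in row])
```

In a sketch like this, the matrix could be used to spot labels whose definitions overlap heavily, which is one way dependent labels might be detected. How the paper actually uses its Label Similarity Matrix is not stated in the abstract.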