Paper title
Small Language Models for Tabular Data
Paper authors
Abstract
Supervised deep learning is most commonly applied to difficult problems defined on large and often extensively curated datasets. Here we demonstrate the ability of deep representation learning to address problems of classification and regression from small and poorly formed tabular datasets by encoding input information as abstracted sequences composed of a fixed number of characters per input field. We find that small models have sufficient capacity for approximation of various functions and achieve record classification benchmark accuracy. Such models are shown to form useful embeddings of various input features in their hidden layers, even if the learned task does not explicitly require knowledge of those features. These models are also amenable to input attribution, allowing for an estimation of the importance of each input element to the model output as well as of which input features are effectively embedded in the model. We present a proof-of-concept for the application of small language models to mixed tabular data without explicit feature engineering, cleaning, or preprocessing, relying on the model to perform these tasks as part of the representation learning process.
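The core encoding idea in the abstract can be illustrated with a minimal sketch: each tabular field, regardless of type or cleanliness, is rendered as a string and padded or truncated to a fixed number of characters, so every row becomes one constant-length character sequence. The field width, pad character, and function names below are illustrative assumptions, not the authors' exact scheme.

```python
# Hypothetical sketch of a fixed-width character encoding for tabular rows.
# Assumptions (not from the paper): FIELD_WIDTH of 8 characters per field
# and "_" as the padding character for short or missing values.

FIELD_WIDTH = 8
PAD_CHAR = "_"

def encode_field(value, width=FIELD_WIDTH):
    """Render one field (numeric, categorical, or missing) as exactly
    `width` characters, truncating long values and padding short ones."""
    text = "" if value is None else str(value)
    return text[:width].ljust(width, PAD_CHAR)

def encode_row(row):
    """Concatenate the fixed-width field encodings of a row into one
    abstracted character sequence of constant length."""
    return "".join(encode_field(v) for v in row)

# A messy row with mixed types and a missing value still maps to a
# constant-length sequence, with no feature engineering or cleaning.
row = [3.14159, "blue", None, 42]
seq = encode_row(row)  # length is 4 * FIELD_WIDTH regardless of content
```

Because every row yields a sequence of the same length, the result can be fed directly to a character-level language model, leaving normalization and handling of missing values to representation learning rather than to explicit preprocessing.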