Paper Title
Parsed Categoric Encodings with Automunge
Paper Authors
Abstract
The Automunge open source Python library for tabular data preprocessing automates feature engineering transformations, namely numerical encoding and missing data infill, applied to received tidy data. Transformations are fit to properties of columns in a designated train set so that they can be applied consistently and efficiently in subsequent data pipelines, such as for inference. Transformations may be applied to distinct columns in "family tree" sets with generations and branches of derivations. Included in the library of transformations are methods to extract structure from bounded categorical string sets by way of automated string parsing: entries in the set of unique values are compared to identify character subset overlaps, which may be encoded either by appended columns of boolean overlap detection activations or by replacing string entries with the identified overlap partitions. Further string parsing options, which may also be applied to unbounded categoric sets, include extraction of numeric substring partitions from entries and search functions that identify the presence of specified substring partitions. The aggregation of these methods into "family tree" sets of transformations is demonstrated for automatically extracting structure from categoric string compositions in relation to the set of entries in a column, such as may be applied to prepare categoric string set encodings for machine learning without human intervention.
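The three parsing options named in the abstract (overlap detection among unique values, numeric substring extraction, and substring search) can be illustrated with a minimal sketch in plain Python. This is not the Automunge API and does not reproduce its implementation; the function names, the `min_len` threshold, and the sample data are all assumptions chosen for illustration. The overlaps are derived from a train set only and then applied unchanged to other data, mirroring the fit-then-apply pattern the abstract describes.

```python
import re
from itertools import combinations


def shared_substrings(train_values, min_len=4):
    """Find maximal substrings (length >= min_len) shared by any two unique entries.

    Hypothetical helper: a brute-force pairwise comparison, suitable only
    for small bounded sets of unique values.
    """
    overlaps = set()
    for a, b in combinations(sorted(set(train_values)), 2):
        pair_found = []
        # Scan substrings of `a` from longest to shortest so that shorter
        # substrings already covered by a longer match are skipped.
        for length in range(len(a), min_len - 1, -1):
            for start in range(len(a) - length + 1):
                candidate = a[start:start + length]
                if candidate in b and not any(candidate in f for f in pair_found):
                    pair_found.append(candidate)
        overlaps.update(pair_found)
    return sorted(overlaps)


def overlap_activations(values, overlaps):
    """Encode each entry as boolean activations, one per overlap found on the train set."""
    return [{o: (o in v) for o in overlaps} for v in values]


def extract_number(entry):
    """Return the first numeric substring of an entry as a float, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", entry)
    return float(match.group()) if match else None


def search_activations(values, terms):
    """Flag the presence of each user-specified substring in each entry."""
    return [{t: (t in v) for t in terms} for v in values]


# Illustrative data (assumed, not from the paper).
train = ["apple_red", "apple_green", "pear_green"]
test = ["apple_green", "kiwi_green"]

overlaps = shared_substrings(train)           # fit on the train set only
train_encoded = overlap_activations(train, overlaps)
test_encoded = overlap_activations(test, overlaps)  # same basis reused

print(overlaps)                    # ['_green', 'apple_']
print(test_encoded[1])             # {'_green': True, 'apple_': False}
print(extract_number("unit 42b"))  # 42.0
print(search_activations(["pear_green"], ["pear"]))  # [{'pear': True}]
```

Overlap detection needs the full set of unique values and so suits bounded categoric sets, whereas `extract_number` and `search_activations` inspect one entry at a time and therefore also work when the set of possible entries is unbounded.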