Paper title
Race and ethnicity data for first, middle, and last names
Paper authors
Paper abstract
We provide the largest compiled publicly available dictionaries of first, middle, and last names for the purpose of imputing race and ethnicity using, for example, Bayesian Improved Surname Geocoding (BISG). The dictionaries are based on the voter files of six Southern states that collect self-reported racial data upon voter registration. Our data cover a much larger scope of names than any comparable dataset, containing roughly one million first names, 1.1 million middle names, and 1.4 million surnames. Individuals are categorized into five mutually exclusive racial and ethnic groups -- White, Black, Hispanic, Asian, and Other -- and racial/ethnic counts by name are provided for every name in each dictionary. Counts can then be normalized row-wise or column-wise to obtain conditional probabilities of race given name or name given race. These conditional probabilities can then be deployed for imputation in a data analytic task for which ground truth racial and ethnic data is not available.
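The normalization step described in the abstract can be illustrated with a minimal sketch. The names, counts, and function names below are hypothetical toy values chosen for illustration; they are not taken from the dictionaries and do not reflect their actual file layout. Row-wise normalization divides each name's counts by that name's total, giving an estimate of P(race | name); column-wise normalization divides by each group's total across all names, giving P(name | race).

```python
RACES = ["white", "black", "hispanic", "asian", "other"]

# Hypothetical toy counts per name, ordered as in RACES.
# These numbers are illustrative only, not values from the dictionaries.
counts = {
    "SMITH": [700, 250, 20, 10, 20],
    "GARCIA": [50, 10, 920, 10, 10],
    "NGUYEN": [10, 5, 5, 970, 10],
}


def race_given_name(counts):
    """Row-wise normalization: divide each name's counts by that name's total."""
    probs = {}
    for name, row in counts.items():
        total = sum(row)
        probs[name] = [c / total for c in row] if total else [0.0] * len(row)
    return probs


def name_given_race(counts):
    """Column-wise normalization: divide each count by its group's total over all names."""
    n = len(RACES)
    col_totals = [sum(row[j] for row in counts.values()) for j in range(n)]
    return {
        name: [row[j] / col_totals[j] if col_totals[j] else 0.0 for j in range(n)]
        for name, row in counts.items()
    }


p_race_name = race_given_name(counts)   # each row sums to 1
p_name_race = name_given_race(counts)   # each column sums to 1

print(dict(zip(RACES, p_race_name["SMITH"])))
```

In a method such as BISG, conditional probabilities of this kind would typically be combined with geographic information via Bayes' rule; that step is not shown here because the abstract does not specify it.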