Procode: the Swiss Multilingual Solution for Automatic Coding and Recoding of Occupations and Economic Activities

Paper title

Procode: the Swiss Multilingual Solution for Automatic Coding and Recoding of Occupations and Economic Activities

Authors

Nenad Savic, Nicolas Bovio, Fabian Gilbert, Irina Guseva Canu

Abstract

Objective. Epidemiological studies require data that align with the classifications established for occupations or economic activities. These classifications usually include hundreds of codes and titles. Manual coding of raw data is time-consuming and may result in misclassification. The goal was to develop and test a web tool, named Procode, for coding free-text entries against classifications and for recoding between different classifications. Methods. Three text classifiers, namely Complement Naive Bayes (CNB), Support Vector Machine (SVM) and Random Forest Classifier (RFC), were investigated using k-fold cross-validation. A total of 30 000 free-text entries with manually assigned codes from the French classification of occupations (PCS) and the French classification of activities (NAF) were available. For recoding, Procode integrated a workflow that converts codes of one classification to another according to existing crosswalks. Since this is a straightforward operation, only the recoding time was measured. Results. Among the three investigated text classifiers, CNB performed best, accurately predicting 57-81% and 63-83% of the classification codes for PCS and NAF, respectively. SVM led to somewhat lower results (by 1-2%), while RFC accurately coded up to 30% of the data. The coding operation required one minute per 10 000 records, while recoding was faster, taking 5-10 seconds. Conclusion. The algorithm integrated in Procode showed satisfactory performance, given that the tool had to assign the right code by choosing among 500-700 possible codes. Based on the results, the authors decided to implement CNB in Procode. In the future, if another classifier shows superior performance, an update will include the required modifications.
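
The two operations described in the abstract, coding free text with Complement Naive Bayes under k-fold cross-validation and recoding through a crosswalk, can be illustrated with a minimal pure-Python sketch. This is not Procode's implementation. The abstract does not state which text features or tooling were used, so the whitespace tokenizer, the raw word counts, the smoothing value `alpha`, the toy job titles, the codes `A1`/`B2`/`C3`, and the one-to-one `CROSSWALK` dictionary are all assumptions made for illustration (real crosswalks may map one code to several).

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: invented job titles and invented codes (not PCS/NAF).
TEXTS = [
    "nurse in hospital", "hospital nurse night shift", "registered nurse",
    "truck driver", "delivery truck driver", "long haul truck driver",
    "primary school teacher", "school teacher mathematics",
    "teacher in secondary school",
]
CODES = ["A1"] * 3 + ["B2"] * 3 + ["C3"] * 3

# Hypothetical one-to-one crosswalk between two made-up classifications.
CROSSWALK = {"A1": "X10", "B2": "X20", "C3": "X30"}


def tokenize(text):
    """Assumed preprocessing: lowercase and split on whitespace."""
    return text.lower().split()


def train_cnb(texts, codes, alpha=1.0):
    """Estimate Complement Naive Bayes weights.

    For each class, the weight of a word is the log of its smoothed
    frequency in all *other* classes (the complement of that class).
    """
    per_class = defaultdict(Counter)
    total = Counter()
    for text, code in zip(texts, codes):
        tokens = tokenize(text)
        per_class[code].update(tokens)
        total.update(tokens)
    vocab = set(total)
    total_n = sum(total.values())
    weights = {}
    for code, counts in per_class.items():
        denom = total_n - sum(counts.values()) + alpha * len(vocab)
        weights[code] = {
            w: math.log((total[w] - counts[w] + alpha) / denom) for w in vocab
        }
    return weights


def predict_cnb(weights, text):
    """Pick the class whose complement explains the text worst."""
    tokens = tokenize(text)

    def complement_score(code):
        w = weights[code]
        return sum(w[t] for t in tokens if t in w)  # unseen words are skipped

    return min(weights, key=complement_score)


def k_fold_accuracy(texts, codes, k=3):
    """Plain k-fold cross-validation: every record is tested exactly once."""
    n = len(texts)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train = [(texts[i], codes[i]) for i in range(n) if i not in test_idx]
        weights = train_cnb(*zip(*train))
        correct += sum(predict_cnb(weights, texts[i]) == codes[i] for i in test_idx)
    return correct / n


def recode(source_codes, crosswalk):
    """Translate codes through a crosswalk; unmapped codes become None."""
    return [crosswalk.get(code) for code in source_codes]


if __name__ == "__main__":
    model = train_cnb(TEXTS, CODES)
    print(predict_cnb(model, "truck driver"))
    print(k_fold_accuracy(TEXTS, CODES, k=3))
    print(recode(["A1", "Z9"], CROSSWALK))
```

Recoding reduces to a dictionary lookup per record, which is consistent with the abstract's remark that it is a straightforward operation and faster than coding.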
