Paper title

Boosting offline handwritten text recognition in historical documents with few labeled lines

Authors

Aradillas, José Carlos; Murillo-Fuentes, Juan José; Olmos, Pablo M.

Abstract

In this paper, we address offline handwritten text recognition (HTR) in historical documents when few labeled samples are available and some of the training labels contain errors. We make three main contributions. First, we analyze how to perform transfer learning (TL) from a massive database to a smaller historical database, identifying which layers of the model need fine-tuning. Second, we analyze methods to efficiently combine TL and data augmentation (DA). Finally, we propose an algorithm to mitigate the effects of incorrect labels in the training set. The methods are evaluated on the ICFHR 2018 competition database, Washington, and Parzival. Combining all these techniques, we demonstrate a remarkable reduction in character error rate (CER), up to 6% in some cases, on the test set with little complexity overhead.
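
The abstract reports its results as CER, the character error rate. As a rough illustration of that metric, the sketch below computes CER as the total Levenshtein edit distance between reference and recognized lines, divided by the total number of reference characters. The function names and the corpus-level normalization are assumptions made for this example; the abstract does not describe the paper's own evaluation protocol, which may differ (for instance in text normalization or per-line averaging).

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r with h (or match)
            ))
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER (assumed convention): total edits / total reference chars."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

For example, `cer(["abcd"], ["abed"])` counts one substitution over four reference characters. The abstract does not say whether the reported reduction of up to 6% is absolute or relative, so this sketch takes no position on that.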
