Self-paced learning to improve text row detection in historical documents with missing labels

Paper title

Self-paced learning to improve text row detection in historical documents with missing labels

Authors

Mihaela Gaman, Lida Ghadamiyan, Radu Tudor Ionescu, Marius Popescu

Abstract

An important preliminary step of optical character recognition systems is the detection of text rows. To address this task in the context of historical data with missing labels, we propose a self-paced learning algorithm capable of improving the row detection performance. We conjecture that pages with more ground-truth bounding boxes are less likely to have missing annotations. Based on this hypothesis, we sort the training examples in descending order with respect to the number of ground-truth bounding boxes, and organize them into k batches. Using our self-paced learning method, we train a row detector over k iterations, progressively adding batches with fewer ground-truth annotations. At each iteration, we combine the ground-truth bounding boxes with pseudo-bounding boxes (bounding boxes predicted by the model itself) using non-maximum suppression, and we include the resulting annotations at the next training iteration. We demonstrate that our self-paced learning strategy brings significant performance gains on two data sets of historical documents, improving the average precision of YOLOv4 by more than 12% on one data set and 39% on the other.
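
The procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (make_batches, merge_annotations, self_paced_train), the IoU threshold of 0.5, and the score of 1.0 assigned to ground-truth boxes are all hypothetical. So is the choice to pseudo-label only the newly added batch. The train and predict callables stand in for a real detector such as YOLOv4.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Scored = Tuple[Box, float]               # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Scored], iou_thr: float = 0.5) -> List[Box]:
    """Greedy non-maximum suppression: visit boxes by descending score and
    keep a box only if it does not overlap an already kept box too much."""
    kept: List[Box] = []
    for box, _ in sorted(boxes, key=lambda s: s[1], reverse=True):
        if all(iou(box, k) <= iou_thr for k in kept):
            kept.append(box)
    return kept


def merge_annotations(gt: List[Box], pseudo: List[Scored],
                      iou_thr: float = 0.5) -> List[Box]:
    """Combine ground truth with pseudo-boxes via NMS.
    Assumption: ground truth gets score 1.0 and is listed first, so it
    wins against any overlapping prediction (the sort is stable)."""
    return nms([(b, 1.0) for b in gt] + pseudo, iou_thr)


def make_batches(pages: Dict[str, List[Box]], k: int) -> List[List[str]]:
    """Sort page ids by descending ground-truth box count, split into k batches."""
    ordered = sorted(pages, key=lambda p: len(pages[p]), reverse=True)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def self_paced_train(pages: Dict[str, List[Box]], k: int,
                     train: Callable, predict: Callable,
                     iou_thr: float = 0.5):
    """Train over k iterations, adding one batch per iteration."""
    batches = make_batches(pages, k)
    # Iteration 1: the most densely annotated pages, ground truth only.
    labels = {p: list(pages[p]) for p in batches[0]}
    model = train(labels)
    for batch in batches[1:]:
        # Assumption: the current model pseudo-labels the next batch, and the
        # merged annotations are used in the next training iteration.
        for p in batch:
            labels[p] = merge_annotations(pages[p], predict(model, p), iou_thr)
        model = train(labels)
    return model, labels
```

Here pages maps a page identifier to its ground-truth boxes, train takes the current label dictionary and returns a model, and predict returns scored boxes for one page. Giving ground truth the top score is one simple way to ensure that human annotations are never displaced by predictions; the abstract does not state how the paper handles this.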
