Paper Title
DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting
Paper Authors
Paper Abstract
End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Dealing with the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate heuristic post-processing, they still suffer from poor synergy between the sub-tasks and low training efficiency. In this paper, we present DeepSolo, a simple DETR-like baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing through a single decoder, the point queries have encoded the requisite text semantics and locations, and can thus be further decoded into the center line, boundary, script, and confidence of the text via very simple prediction heads in parallel. In addition, we introduce a text-matching criterion to deliver more accurate supervisory signals, enabling more efficient training. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. Moreover, DeepSolo is also compatible with line annotations, which require much lower annotation cost than polygons. The code is available at https://github.com/ViTAE-Transformer/DeepSolo.
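The core idea above can be sketched as follows: decoded point-query embeddings for one text instance are fed to simple parallel linear heads that each read off a different quantity. This is only a minimal NumPy sketch of the data flow, not the paper's implementation; all sizes (`N`, `D`, `K`), the boundary parameterization, and the confidence pooling are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: one text instance is modeled as N ordered point queries,
# each a D-dim embedding output by the single decoder.
N, D = 25, 256   # points per instance, embedding dim (illustrative choices)
K = 37           # character/script classes incl. background (assumed)

queries = rng.standard_normal((N, D))  # stand-in for decoder output

# "Very simple prediction heads in parallel": one linear map per quantity.
W_center = rng.standard_normal((D, 2))  # (x, y) point on the center line
W_bound = rng.standard_normal((D, 4))   # top/bottom boundary offsets (assumed layout)
W_char = rng.standard_normal((D, K))    # per-point character classification
W_conf = rng.standard_normal((D, 1))    # confidence logit per point

center = queries @ W_center              # (N, 2) ordered center-line points
boundary = queries @ W_bound             # (N, 4) boundary offsets
chars = (queries @ W_char).argmax(-1)    # (N,) one class per ordered point
conf = float((queries @ W_conf).mean())  # pooled instance confidence (assumed pooling)

print(center.shape, boundary.shape, chars.shape)
```

Because every head reads the same point queries, detection (center line, boundary) and recognition (per-point classes) share one representation, which is the synergy the abstract emphasizes.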