Paper Title
Teacher-Critical Training Strategies for Image Captioning
Authors
Abstract
Existing image captioning models are usually trained with cross-entropy (XE) loss and reinforcement learning (RL), both of which set ground-truth words as hard targets and force the captioning model to learn from them. However, these widely adopted training strategies suffer from misalignment in XE training and from inappropriate reward assignment in RL training. To tackle these problems, we introduce a teacher model that serves as a bridge between the ground-truth captions and the captioning model by generating easier-to-learn word proposals as soft targets. The teacher model is constructed by incorporating the ground-truth image attributes into the baseline captioning model. To learn effectively from the teacher model, we propose Teacher-Critical Training Strategies (TCTS) for both XE and RL training, which facilitate better learning processes for the captioning model. Experimental evaluations of several widely adopted captioning models on the benchmark MSCOCO dataset show that the proposed TCTS comprehensively improves most evaluation metrics, especially the BLEU and ROUGE-L scores, in both training stages. TCTS achieves the best published single-model BLEU-4 and ROUGE-L scores to date, 40.2% and 59.4%, on the MSCOCO Karpathy test split. Our code and pre-trained models will be open-sourced.
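To make the soft-target idea concrete, below is a minimal PyTorch sketch of one plausible way to blend hard ground-truth targets with soft targets produced by a teacher model during XE training. The abstract does not specify the actual loss, so every detail here (the function name, the KL formulation, and the weight alpha) is an illustrative assumption, not the authors' implementation.

    # Minimal sketch (assumed, not the paper's actual TCTS loss): combine
    # hard-target cross-entropy on ground-truth words with a soft-target
    # KL term toward the teacher model's word proposals.
    import torch
    import torch.nn.functional as F

    def teacher_guided_xe_loss(student_logits, teacher_logits, gt_tokens, alpha=0.5):
        # student_logits: (batch, seq_len, vocab) scores from the caption model
        # teacher_logits: (batch, seq_len, vocab) scores from the teacher model
        # gt_tokens:      (batch, seq_len) ground-truth word indices
        # alpha:          hard/soft trade-off weight (hypothetical parameter)
        vocab = student_logits.size(-1)
        # Standard XE against the ground-truth words (hard targets).
        hard = F.cross_entropy(student_logits.reshape(-1, vocab),
                               gt_tokens.reshape(-1))
        # KL divergence toward the teacher's word-proposal distribution
        # (soft targets); F.kl_div expects log-probabilities as input.
        soft = F.kl_div(F.log_softmax(student_logits, dim=-1),
                        F.softmax(teacher_logits, dim=-1),
                        reduction="batchmean")
        return alpha * hard + (1.0 - alpha) * soft

In the abstract's terms, the KL term lets the captioning model learn from the teacher's easier word proposals while the XE term keeps it anchored to the ground-truth caption; the paper's actual TCTS formulation for XE and RL training may differ from this sketch.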