Paper Title

VU-BERT: A Unified Framework for Visual Dialog

Paper Authors

Tong Ye, Shijing Si, Jianzong Wang, Rui Wang, Ning Cheng, Jing Xiao

Paper Abstract

The visual dialog task trains an agent to answer multi-turn questions about a given image, which requires a deep understanding of the interactions between the image and the dialog history. Existing studies tend to employ modality-specific modules to model these interactions, which can be cumbersome. To fill this gap, we propose a unified framework for image-text joint embedding, named VU-BERT, and apply patch projection to obtain vision embeddings, the first such use in visual dialog tasks, which simplifies the model. The model is trained on two tasks: masked language modeling and next utterance retrieval. These tasks help it learn visual concepts, utterance dependencies, and the relationships between the two modalities. Finally, our VU-BERT achieves competitive performance (0.7287 NDCG score) on the VisDial v1.0 dataset.
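As a rough illustration of the patch-projection step the abstract mentions, the PyTorch sketch below splits an image into fixed-size patches and linearly projects each one into a token embedding, so image tokens and text tokens can share a single transformer. The hyperparameters (patch size 16, embedding dimension 768, 224x224 input) are illustrative assumptions in the style of ViT, not values taken from the paper.

```python
# Minimal sketch of ViT-style patch projection; hyperparameters are
# illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn

class PatchProjection(nn.Module):
    """Project non-overlapping image patches to token embeddings."""

    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        # A conv with kernel == stride == patch_size is equivalent to
        # flattening each patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):
        # images: (batch, 3, H, W) -> (batch, num_patches, embed_dim)
        x = self.proj(images)                # (batch, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)

# Usage: a 224x224 image yields 14x14 = 196 patch tokens of dimension 768,
# ready to be concatenated with text token embeddings.
patches = PatchProjection()(torch.randn(2, 3, 224, 224))
print(patches.shape)  # torch.Size([2, 196, 768])
```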
