Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering

Paper Title

Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering

Authors

Wang, Jiong; Zhao, Zhou; Jin, Weike

Abstract

Multi-modal video question answering aims to predict the correct answer and localize the temporal boundary relevant to the question. Temporal annotations of questions improve the QA performance and interpretability of recent works, but they are usually empirical and costly. To avoid temporal annotations, we devise a weakly supervised question grounding (WSQG) setting, where only QA annotations are used and the relevant temporal boundaries are generated according to the temporal attention scores. To substitute for temporal annotations, we transform the correspondence between frames and subtitles into Frame-Subtitle (FS) self-supervision, which helps to optimize the temporal attention scores and hence improves video-language understanding in VideoQA models. Extensive experiments on the TVQA and TVQA+ datasets demonstrate that the proposed WSQG strategy achieves comparable performance on question grounding, and that FS self-supervision helps improve question answering and grounding performance in both the QA-supervision-only and full-supervision settings.
