Paper title
Learning video retrieval models with relevance-aware online mining
Paper authors
Paper abstract
Due to the amount of videos and related captions uploaded every hour, deep learning-based solutions for cross-modal video retrieval are attracting more and more attention. A typical approach consists in learning a joint text-video embedding space, where the similarity of a video and its associated caption is maximized, whereas a lower similarity is enforced with all the other captions, called negatives. This approach assumes that only the video and caption pairs in the dataset are valid, but other captions - positives - may also describe a video's visual content, hence some of them may be wrongly penalized. To address this shortcoming, we propose Relevance-Aware Negatives and Positives mining (RANP) which, based on the semantics of the negatives, improves their selection while also increasing the similarity of other valid positives. We explore the influence of these techniques on two video-text datasets: EPIC-Kitchens-100 and MSR-VTT. By using the proposed techniques, we achieve considerable improvements in terms of nDCG and mAP, leading to state-of-the-art results, e.g. +5.3% nDCG and +3.0% mAP on EPIC-Kitchens-100. We share code and pretrained models at \url{https://github.com/aranciokov/ranp}.
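The idea in the abstract - mine hard negatives only among semantically unrelated captions, and treat semantically related ones as extra positives - can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); the function name, the margin-based loss, and the relevance threshold below are all illustrative assumptions.

```python
import numpy as np

def relevance_aware_triplet_loss(sim, rel, margin=0.2, pos_thresh=0.75):
    """Sketch of a relevance-aware ranking loss (illustrative, not the paper's code).

    sim: (B, B) array; sim[i, j] is the similarity of video i and caption j.
    rel: (B, B) array of relevance scores in [0, 1] (e.g., derived from caption
         semantics); rel[i, i] == 1 for each annotated video-caption pair.
    """
    batch_size = sim.shape[0]
    losses = []
    for i in range(batch_size):
        anchor_pos = sim[i, i]
        # Captions semantically close to the annotated one are treated as
        # extra positives instead of being pushed away as false negatives.
        extra_pos = [j for j in range(batch_size)
                     if j != i and rel[i, j] >= pos_thresh]
        # Negatives are mined only among low-relevance captions, so valid
        # positives are never wrongly penalized.
        negs = [sim[i, j] for j in range(batch_size)
                if j != i and rel[i, j] < pos_thresh]
        if not negs:
            continue
        hardest_neg = max(negs)  # online hard-negative mining
        loss = max(0.0, margin + hardest_neg - anchor_pos)
        # Also pull the extra positives above the hardest negative.
        for j in extra_pos:
            loss += max(0.0, margin + hardest_neg - sim[i, j])
        losses.append(loss)
    return float(np.mean(losses)) if losses else 0.0
```

A plain triplet loss would instead pick the hardest negative over all non-matching captions, so a caption that validly describes the video could end up as the hardest "negative" and be pushed away; gating the negative pool by relevance is what the sketch adds.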