Towards Micro-video Thumbnail Selection via a Multi-label Visual-semantic Embedding Model

Paper Title

Towards Micro-video Thumbnail Selection via a Multi-label Visual-semantic Embedding Model

Paper Author

Bo, Liu

Abstract

The thumbnail, as the first sight of a micro-video, plays a pivotal role in attracting users to click and watch. In real scenarios, the better a thumbnail satisfies users, the more likely the micro-video is to be clicked. In this paper, we aim to select the thumbnail of a given micro-video that meets most users' interests. Towards this end, we present a multi-label visual-semantic embedding model to estimate the similarity between each frame and the popular topics that users are interested in. In this model, the visual and textual information is embedded into a shared semantic space, whereby the similarity can be measured directly, even for unseen words. Moreover, to compare a frame to all words from the popular topics, we devise an attention embedding space associated with the semantic-attention projection. With the help of these two embedding spaces, we obtain the popularity score of a frame, defined as the sum of similarity scores over the corresponding visual information and popular topic pairs. Ultimately, we fuse the visual representation score and the popularity score of each frame to select an attractive thumbnail for the given micro-video. Extensive experiments conducted on a real-world dataset verify that our model significantly outperforms several state-of-the-art baselines.
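
The scoring pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes frames and topic words have already been embedded into the shared space as plain vectors, uses cosine similarity as a stand-in for the learned similarity, omits the attention embedding space entirely, and fuses the two scores with a hypothetical weight `alpha`. The function names `cosine`, `popularity_score`, and `select_thumbnail` are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def popularity_score(frame_vec, topic_vecs):
    """Sum of similarities between one frame embedding and every popular-topic embedding."""
    return sum(cosine(frame_vec, t) for t in topic_vecs)


def select_thumbnail(frame_vecs, visual_scores, topic_vecs, alpha=0.5):
    """Fuse each frame's visual score with its popularity score and pick the best frame.

    alpha is a hypothetical fusion weight; the abstract does not state how the
    two scores are combined, so a weighted sum is assumed here.
    """
    fused = [
        alpha * v + (1.0 - alpha) * popularity_score(f, topic_vecs)
        for f, v in zip(frame_vecs, visual_scores)
    ]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused
```

Calling `select_thumbnail(frames, visual_scores, topics)` returns the index of the highest-scoring frame together with the fused scores for all frames. Setting `alpha=1.0` ignores the popularity term and ranks frames by visual score alone.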
