Paper Title
Self-Supervised Video Object Segmentation via Cutout Prediction and Tagging
Paper Authors
Paper Abstract
We propose a novel self-supervised Video Object Segmentation (VOS) approach that strives to achieve better object-background discriminability for accurate object segmentation. Distinct from previous self-supervised VOS methods, our approach is based on a discriminative learning loss formulation that takes into account both object and background information to ensure object-background discriminability, rather than using only object appearance. The discriminative learning loss comprises a cutout-based reconstruction term (a cutout region is a part of a frame whose pixels are replaced with a constant value) and a tag prediction loss term. The cutout-based reconstruction term utilizes a simple cutout scheme to learn the pixel-wise correspondence between the current and previous frames, in order to reconstruct the original current frame from its version with the cutout region added. The introduced cutout patch guides the model to focus as much on the significant features of the object of interest as on the less significant ones, thereby implicitly equipping the model to handle occlusion-based scenarios. Next, the tag prediction term promotes object-background separability by encouraging the tags of all pixels in the cutout region to be similar to each other, while separating them from the tags of the remaining reconstructed-frame pixels. Additionally, we introduce a zoom-in scheme that addresses the problem of small-object segmentation by capturing fine structural information at multiple scales. Our proposed approach, termed CT-VOS, achieves state-of-the-art results on two challenging benchmarks: DAVIS-2017 and YouTube-VOS. A detailed ablation study showcases the importance of the proposed loss formulation in effectively capturing object-background discriminability, and the impact of our zoom-in scheme on accurately segmenting small-sized objects.
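The cutout operation the abstract describes is simple to reproduce. Below is a minimal NumPy sketch of one plausible implementation; the function name `apply_cutout`, the patch geometry, and the zero fill value are illustrative assumptions, since the abstract only states that a rectangular region's pixels are replaced with a constant value.

```python
import numpy as np

def apply_cutout(frame, top, left, height, width, fill_value=0.0):
    """Replace a rectangular region of `frame` with a constant value.

    frame: (H, W, C) float array. Returns the cutout-augmented frame and a
    boolean mask marking the replaced region (used later by the tag loss).
    """
    cut = frame.copy()
    cut[top:top + height, left:left + width, :] = fill_value
    mask = np.zeros(frame.shape[:2], dtype=bool)
    mask[top:top + height, left:left + width] = True
    return cut, mask

# Example: replace a 32x32 patch in a random 128x128 RGB frame.
frame = np.random.rand(128, 128, 3).astype(np.float32)
cut_frame, cutout_mask = apply_cutout(frame, top=48, left=48, height=32, width=32)
```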
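The tag prediction term can be sketched as a pull-push objective over per-pixel tag embeddings, in the spirit of associative-embedding losses: tags inside the cutout region are pulled toward their mean, while tags elsewhere are pushed away from it. The PyTorch sketch below is a hypothetical rendering under that assumption; the paper's exact loss form, the embedding dimensionality, and the `margin` hyperparameter are not specified in the abstract.

```python
import torch

def tag_grouping_loss(tags, cutout_mask, margin=1.0):
    """Pull-push loss over per-pixel tag embeddings (hypothetical form).

    tags: (H, W, D) tag vectors predicted for the reconstructed frame.
    cutout_mask: (H, W) boolean tensor, True inside the cutout region.
    """
    inside = tags[cutout_mask]        # (N_in, D) tags of cutout pixels
    outside = tags[~cutout_mask]      # (N_out, D) tags of remaining pixels
    center = inside.mean(dim=0)       # reference tag for the cutout region

    # Pull: cutout-pixel tags should be similar to each other.
    pull = ((inside - center) ** 2).sum(dim=1).mean()
    # Push: remaining tags should sit at least `margin` away from the center.
    dist = torch.norm(outside - center, dim=1)
    push = torch.clamp(margin - dist, min=0.0).pow(2).mean()
    return pull + push

# Example with random tags and a square cutout mask.
tags = torch.randn(64, 64, 8, requires_grad=True)
mask = torch.zeros(64, 64, dtype=torch.bool)
mask[16:32, 16:32] = True
loss = tag_grouping_loss(tags, mask)
loss.backward()
```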
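The abstract describes the zoom-in scheme only at a high level. One plausible reading is multi-scale cropping around the object region, so that small objects retain fine structure after resizing; the sketch below assumes that reading, and the crop scales, windowing logic, and output size are illustrative choices rather than the paper's actual design.

```python
import torch
import torch.nn.functional as F

def zoomed_views(frame, box, scales=(1.0, 1.5, 2.0), out_size=(128, 128)):
    """Crop progressively tighter windows around `box` and resize each one
    to a common resolution, preserving fine detail of small objects.

    frame: (1, C, H, W) tensor; box: (top, left, height, width).
    """
    _, _, H, W = frame.shape
    top, left, h, w = box
    cy, cx = top + h / 2.0, left + w / 2.0   # box center
    views = []
    for s in scales:
        ch, cw = int(H / s), int(W / s)      # larger scale -> tighter crop
        t = int(min(max(cy - ch / 2.0, 0), H - ch))
        l = int(min(max(cx - cw / 2.0, 0), W - cw))
        crop = frame[:, :, t:t + ch, l:l + cw]
        views.append(F.interpolate(crop, size=out_size, mode="bilinear",
                                   align_corners=False))
    return views

# Example: three zoom levels around a small 20x20 object in a 256x256 frame.
frame = torch.randn(1, 3, 256, 256)
views = zoomed_views(frame, box=(100, 100, 20, 20))
```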