Paper Title
SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos
Paper Authors
Paper Abstract
The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions. Discovering this compositional structure in dynamic visual scenes has proven challenging for end-to-end computer vision approaches unless explicit instance-level supervision is provided. Slot-based models leveraging motion cues have recently shown great promise in learning to represent, segment, and track objects without direct supervision, but they still fail to scale to complex real-world multi-object videos. In an effort to bridge this gap, we take inspiration from human development and hypothesize that information about scene geometry in the form of depth signals can facilitate object-centric learning. We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation. By further leveraging best practices for model scaling, we are able to train SAVi++ to segment complex dynamic scenes recorded with moving cameras, containing both static and moving objects of diverse appearance on naturalistic backgrounds, without the need for segmentation supervision. Finally, we demonstrate that by using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset.
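The abstract's two central ideas, predicting depth from a slot-based representation and training on sparse LiDAR depth, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`mix_slots`, `sparse_depth_loss`), the softmax mixing of per-slot predictions, and the plain squared-error loss are all assumptions made for illustration. The point it shows is that pixels without a depth measurement are simply left out of the loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_slots(slot_depths, slot_logits):
    """Combine per-slot depth predictions for one pixel.

    Assumption: each slot emits a depth value and a mask logit, and the
    final prediction is the softmax-weighted sum across slots.
    """
    weights = softmax(slot_logits)
    return sum(w * d for w, d in zip(weights, slot_depths))


def sparse_depth_loss(pred, target):
    """Mean squared error over pixels that have a depth measurement.

    Entries of `target` equal to None mark pixels with no LiDAR return;
    they contribute nothing to the loss.
    """
    errors = [(p - t) ** 2 for p, t in zip(pred, target) if t is not None]
    return sum(errors) / len(errors) if errors else 0.0


# Toy data (hypothetical): 3 pixels, 2 slots predicting depths 2.0 and 10.0.
slot_depths = [[2.0, 10.0], [2.0, 10.0], [2.0, 10.0]]
slot_logits = [[5.0, -5.0], [-5.0, 5.0], [0.0, 0.0]]
pred = [mix_slots(d, a) for d, a in zip(slot_depths, slot_logits)]

# The middle pixel has no LiDAR return and is ignored by the loss.
target = [2.0, None, 6.0]
loss = sparse_depth_loss(pred, target)
print(pred, loss)
```

In this sketch the per-pixel softmax weights play the role of soft segmentation masks: a slot that dominates a pixel's weight is the slot that "owns" that pixel, which is how a depth objective alone could, under these assumptions, give rise to segmentation without segmentation labels.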