Paper Title
SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models
Paper Authors
Paper Abstract
Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively remains a challenge. We address this problem by introducing SlotFormer -- a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predict accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, SlotFormer's unsupervised dynamics model can be used to improve performance on supervised downstream tasks, such as Visual Question Answering (VQA) and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics while retaining high-quality visual generation. In addition, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.
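
To make the described architecture concrete, below is a minimal sketch, assuming a PyTorch setting, of a Transformer-based autoregressive model over per-object slot features. It is not the authors' released implementation; all module names, dimensions, the shared temporal position embedding, and the rollout scheme are illustrative assumptions.

```python
# A minimal sketch (not the paper's code) of the idea in the abstract:
# a Transformer reads a history of per-object "slot" features and
# autoregressively predicts the slots of future frames.
import torch
import torch.nn as nn


class SlotDynamicsSketch(nn.Module):
    def __init__(self, slot_dim=128, num_slots=7, history=6, depth=4, heads=8):
        super().__init__()
        self.num_slots, self.history = num_slots, history
        # Learned temporal position embedding; slots within a frame are
        # treated as an unordered set, so only time is position-encoded.
        self.time_emb = nn.Parameter(torch.zeros(history, 1, slot_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=slot_dim, nhead=heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(slot_dim, slot_dim)  # read out next-step slots

    def step(self, slots):
        # slots: (B, history, num_slots, slot_dim)
        # returns predicted next-frame slots: (B, num_slots, slot_dim)
        B, T, N, D = slots.shape
        tokens = (slots + self.time_emb).reshape(B, T * N, D)
        out = self.transformer(tokens).reshape(B, T, N, D)
        return self.head(out[:, -1])  # predict from the last time step

    @torch.no_grad()
    def rollout(self, slots, steps):
        # Feed predictions back in to simulate the future autoregressively.
        buf = list(slots.unbind(1))
        preds = []
        for _ in range(steps):
            hist = torch.stack(buf[-self.history:], dim=1)
            nxt = self.step(hist)
            preds.append(nxt)
            buf.append(nxt)
        return torch.stack(preds, dim=1)  # (B, steps, num_slots, slot_dim)
```

In a setup like this, training would plausibly minimize a regression loss between predicted slots and the slots produced by a pretrained object-centric encoder on future frames, and video prediction would decode the rolled-out slots back to pixels with that model's frozen decoder; both details are assumptions for illustration.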