Paper Title

Object Scene Representation Transformer

Paper Authors

Sajjadi, Mehdi S. M., Duckworth, Daniel, Mahendran, Aravindh, van Steenkiste, Sjoerd, Pavetić, Filip, Lučić, Mario, Guibas, Leonidas J., Greff, Klaus, Kipf, Thomas

Paper Abstract

A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder. We believe this work will not only accelerate future architecture exploration and scaling efforts, but it will also serve as a useful tool for both object-centric as well as neural scene representation learning communities.
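The speed-up the abstract attributes to the light field parametrization and the Slot Mixer decoder comes from mixing the object slots into a single feature per query ray and running one small render MLP per ray, rather than decoding every slot separately and compositing afterwards. Below is a minimal, illustrative NumPy sketch of that idea; the function names, shapes, and the toy attention/MLP are assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a Slot-Mixer-style decoder:
# a query ray attends over object slots, the slots are mixed into one
# feature by the attention weights, and a single small MLP maps the mixed
# feature to a color. All names and sizes are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_mixer_decode(slots, ray_query, w_q, w_k, mlp_w, mlp_b):
    """slots: (num_slots, d); ray_query: (d,) encoding of camera origin and direction."""
    q = ray_query @ w_q                      # project the ray query        (d,)
    k = slots @ w_k                          # project the slots            (num_slots, d)
    attn = softmax(k @ q / np.sqrt(q.size))  # per-slot mixing weights      (num_slots,)
    mixed = attn @ slots                     # single mixed feature         (d,)
    rgb = np.tanh(mixed @ mlp_w + mlp_b)     # one render-MLP pass per ray, not per slot
    return rgb, attn                         # attn doubles as a soft object segmentation

# Toy usage: 4 slots, feature size 8, one query ray.
rng = np.random.default_rng(0)
d, n = 8, 4
rgb, attn = slot_mixer_decode(
    rng.normal(size=(n, d)), rng.normal(size=d),
    rng.normal(size=(d, d)), rng.normal(size=(d, d)),
    rng.normal(size=(d, 3)), np.zeros(3))
print(rgb.shape, attn.shape)  # (3,) (4,)
```

Because the per-ray cost is independent of the number of slots after mixing, compositional rendering stays cheap as scenes gain more objects, which is consistent with the efficiency claim in the abstract.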
