Paper Title
RUST: Latent Neural Scene Representations from Unposed Imagery
Paper Authors
Paper Abstract
Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model that provides latent representations that generalize effectively beyond a single scene. The Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding, which is used by the decoder for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves similar quality to methods with access to perfect camera poses, thereby unlocking the potential for large-scale training of amortized neural scene representations.
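The abstract's core mechanism (a Pose Encoder that peeks at the target view and emits a latent pose that conditions the decoder) can be summarized in code. Below is a minimal PyTorch sketch under our own assumptions: all module names (`SceneEncoder`, `PoseEncoder`, `Decoder`), dimensions, and attention layouts are illustrative, not the authors' released architecture.

```python
# Minimal sketch of the RUST setup described in the abstract (PyTorch).
# All shapes, names, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    """Maps patches from unposed input views to a set-latent scene representation."""
    def __init__(self, patch_dim=768, latent_dim=256, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(patch_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, patches):  # patches: [B, num_patches, patch_dim]
        return self.encoder(self.proj(patches))  # [B, num_patches, latent_dim]

class PoseEncoder(nn.Module):
    """Peeks at (part of) the target image and infers a low-dimensional latent pose."""
    def __init__(self, patch_dim=768, latent_dim=256, pose_dim=8):
        super().__init__()
        self.proj = nn.Linear(patch_dim, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=8, batch_first=True)
        self.to_pose = nn.Linear(latent_dim, pose_dim)

    def forward(self, target_patches, scene_latent):
        q = self.proj(target_patches)  # queries come from the "peek" at the target
        attended, _ = self.attn(q, scene_latent, scene_latent)
        return self.to_pose(attended.mean(dim=1))  # [B, pose_dim]

class Decoder(nn.Module):
    """Renders pixels conditioned on the latent pose and the scene representation."""
    def __init__(self, latent_dim=256, pose_dim=8, pos_dim=2):
        super().__init__()
        self.query = nn.Linear(pose_dim + pos_dim, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=8, batch_first=True)
        self.to_rgb = nn.Linear(latent_dim, 3)

    def forward(self, latent_pose, pixel_pos, scene_latent):
        # pixel_pos: [B, num_pixels, 2] normalized target-image coordinates.
        pose = latent_pose.unsqueeze(1).expand(-1, pixel_pos.shape[1], -1)
        q = self.query(torch.cat([pose, pixel_pos], dim=-1))
        attended, _ = self.attn(q, scene_latent, scene_latent)
        return self.to_rgb(attended)  # [B, num_pixels, 3]
```

Training in this sketch would use a plain photometric reconstruction loss (e.g. MSE) between the decoded pixels and the held-out target pixels, so no ground-truth camera pose enters the pipeline at any point; the latent pose is learned purely because the decoder needs it to reconstruct the target view.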