Paper Title
RUST: Latent Neural Scene Representations from Unposed Imagery
Paper Authors
Paper Abstract
Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model that provides latent representations that generalize effectively beyond a single scene. The Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding, which is used by the decoder for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves similar quality to methods with access to perfect camera poses, thereby unlocking the potential for large-scale training of amortized neural scene representations.
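The abstract's core mechanism (a Pose Encoder that peeks at the target view and emits a latent pose that conditions the decoder) can be summarized in code. Below is a minimal PyTorch sketch under our own assumptions: all module names (`SceneEncoder`, `PoseEncoder`, `Decoder`), dimensions, and attention layouts are illustrative, not the authors' released architecture.

```python
# Minimal sketch of the RUST setup described in the abstract (PyTorch).
# All shapes, names, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    """Maps patches from unposed input views to a set-latent scene representation."""
    def __init__(self, patch_dim=768, latent_dim=256, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(patch_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, patches):  # patches: [B, num_patches, patch_dim]
        return self.encoder(self.proj(patches))  # [B, num_patches, latent_dim]

class PoseEncoder(nn.Module):
    """Peeks at (part of) the target image and infers a low-dimensional latent pose."""
    def __init__(self, patch_dim=768, latent_dim=256, pose_dim=8):
        super().__init__()
        self.proj = nn.Linear(patch_dim, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=8, batch_first=True)
        self.to_pose = nn.Linear(latent_dim, pose_dim)

    def forward(self, target_patches, scene_latent):
        q = self.proj(target_patches)  # queries come from the "peek" at the target
        attended, _ = self.attn(q, scene_latent, scene_latent)
        return self.to_pose(attended.mean(dim=1))  # [B, pose_dim]

class Decoder(nn.Module):
    """Renders pixels conditioned on the latent pose and the scene representation."""
    def __init__(self, latent_dim=256, pose_dim=8, pos_dim=2):
        super().__init__()
        self.query = nn.Linear(pose_dim + pos_dim, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=8, batch_first=True)
        self.to_rgb = nn.Linear(latent_dim, 3)

    def forward(self, latent_pose, pixel_pos, scene_latent):
        # pixel_pos: [B, num_pixels, 2] normalized target-image coordinates.
        pose = latent_pose.unsqueeze(1).expand(-1, pixel_pos.shape[1], -1)
        q = self.query(torch.cat([pose, pixel_pos], dim=-1))
        attended, _ = self.attn(q, scene_latent, scene_latent)
        return self.to_rgb(attended)  # [B, num_pixels, 3]
```

Training in this sketch would use a plain photometric reconstruction loss (e.g. MSE) between the decoded pixels and the held-out target pixels, so no ground-truth camera pose enters the pipeline at any point; the latent pose is learned purely because the decoder needs it to reconstruct the target view.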