Paper Title

Audio-Visual Floorplan Reconstruction

Authors

Senthil Purushwalkam, Sebastian Vicenc Amengual Gari, Vamsi Krishna Ithapu, Carl Schissler, Philip Robinson, Abhinav Gupta, Kristen Grauman

Abstract

Given only a few glimpses of an environment, how much can we infer about its entire floorplan? Existing methods can map only what is visible or immediately apparent from context, and thus require substantial movements through a space to fully map it. We explore how both audio and visual sensing together can provide rapid floorplan reconstruction from limited viewpoints. Audio not only helps sense geometry outside the camera's field of view, but it also reveals the existence of distant freespace (e.g., a dog barking in another room) and suggests the presence of rooms not visible to the camera (e.g., a dishwasher humming in what must be the kitchen to the left). We introduce AV-Map, a novel multi-modal encoder-decoder framework that reasons jointly about audio and vision to reconstruct a floorplan from a short input video sequence. We train our model to predict both the interior structure of the environment and the associated rooms' semantic labels. Our results on 85 large real-world environments show the impact: with just a few glimpses spanning 26% of an area, we can estimate the whole area with 66% accuracy, substantially better than the state-of-the-art approach for extrapolating visual maps.
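The abstract describes a multi-modal encoder-decoder: per-frame audio and visual features are fused, pooled over the short input sequence, and decoded into an interior-structure map plus per-cell room labels. The toy sketch below illustrates only that data flow with fixed random projections standing in for learned networks; all dimensions, feature sizes, and label counts are invented for illustration and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustrative only, not from the paper.
T = 4                  # number of input frames (glimpses)
D_VIS, D_AUD = 128, 64 # per-frame visual / audio feature sizes
H, W = 16, 16          # output floorplan grid
N_ROOMS = 5            # number of semantic room labels

# Toy "encoders": fixed random projections standing in for learned networks.
W_vis = rng.normal(size=(D_VIS, 32))
W_aud = rng.normal(size=(D_AUD, 32))

def encode(frames_vis, frames_aud):
    """Fuse per-frame visual and audio features, then pool over time."""
    fused = np.concatenate([frames_vis @ W_vis, frames_aud @ W_aud], axis=1)  # (T, 64)
    return fused.mean(axis=0)  # temporal pooling -> (64,)

# Toy "decoder": linear maps to interior-occupancy and room-label outputs.
W_occ = rng.normal(size=(64, H * W))
W_room = rng.normal(size=(64, H * W * N_ROOMS))

def decode(z):
    """Predict P(interior) per cell and per-cell room-label logits."""
    occupancy = (1.0 / (1.0 + np.exp(-(z @ W_occ)))).reshape(H, W)
    room_logits = (z @ W_room).reshape(H, W, N_ROOMS)
    return occupancy, room_logits

# Run on random stand-in features for a 4-frame clip.
vis = rng.normal(size=(T, D_VIS))
aud = rng.normal(size=(T, D_AUD))
occ, rooms = decode(encode(vis, aud))
print(occ.shape, rooms.shape)  # (16, 16) (16, 16, 5)
```

The key structural point the sketch mirrors is that audio and vision are fused per frame before pooling, so acoustic cues (the barking dog, the humming dishwasher) can influence predictions for grid cells the camera never saw.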
