Paper Title

Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers

Paper Authors

Moritz Einfalt, Katja Ludwig, Rainer Lienhart

Paper Abstract

The state-of-the-art for monocular 3D human pose estimation in videos is dominated by the paradigm of 2D-to-3D pose uplifting. While the uplifting methods themselves are rather efficient, the true computational complexity depends on the per-frame 2D pose estimation. In this paper, we present a Transformer-based pose uplifting scheme that can operate on temporally sparse 2D pose sequences but still produce temporally dense 3D pose estimates. We show how masked token modeling can be utilized for temporal upsampling within Transformer blocks. This allows us to decouple the sampling rate of input 2D poses from the target frame rate of the video and drastically decreases the total computational complexity. Additionally, we explore the option of pre-training on large motion capture archives, which has been largely neglected so far. We evaluate our method on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP. With an MPJPE of 45.0 mm and 46.9 mm, respectively, our proposed method can compete with the state-of-the-art while reducing inference time by a factor of 12. This enables real-time throughput on variable consumer hardware in stationary and mobile applications. We release our code and models at https://github.com/goldbricklemon/uplift-upsample-3dhpe.
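
The official implementation is available in the linked repository. As a rough illustration of the core idea only, the sketch below shows how masked token modeling can turn a temporally sparse 2D input into a temporally dense 3D output: 2D poses observed at every `stride`-th frame are embedded into tokens, a learnable mask token stands in for every unobserved frame, and a Transformer regresses a 3D pose for all frames. This is a minimal PyTorch sketch; the class name `SparseUpliftUpsample`, all dimensions, and the plain `nn.TransformerEncoder` are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (assumed names/dims), NOT the authors' implementation:
# sparse 2D pose tokens + learnable mask tokens -> Transformer -> dense 3D poses.
import torch
import torch.nn as nn


class SparseUpliftUpsample(nn.Module):
    def __init__(self, num_joints=17, dim=256, depth=4, heads=8, seq_len=81, stride=4):
        super().__init__()
        self.seq_len = seq_len
        self.stride = stride
        # Embed each observed 2D pose (J x 2) into a token.
        self.pose_embed = nn.Linear(num_joints * 2, dim)
        # Learnable token that replaces frames without a 2D pose estimate.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Regress a 3D pose (J x 3) for every frame, observed or masked.
        self.head = nn.Linear(dim, num_joints * 3)

    def forward(self, sparse_poses_2d):
        # sparse_poses_2d: (B, seq_len // stride + 1, J, 2),
        # i.e. 2D poses sampled at every `stride`-th video frame.
        b, n, j, _ = sparse_poses_2d.shape
        tokens = self.pose_embed(sparse_poses_2d.reshape(b, n, j * 2))
        # Start from mask tokens for all frames, then scatter the observed ones.
        dense = self.mask_token.expand(b, self.seq_len, -1).clone()
        dense[:, :: self.stride] = tokens
        dense = self.encoder(dense + self.pos_embed)
        return self.head(dense).reshape(b, self.seq_len, j, 3)


# Usage: 2D poses at every 4th frame of an 81-frame window -> dense 3D output.
model = SparseUpliftUpsample()
sparse_2d = torch.randn(2, 21, 17, 2)  # 21 observed frames (stride 4 over 81)
dense_3d = model(sparse_2d)            # (2, 81, 17, 3)
print(dense_3d.shape)
```

The efficiency argument from the abstract maps directly onto this sketch: the expensive per-frame 2D pose estimator only needs to run on every `stride`-th frame, while the comparatively cheap uplifting Transformer fills in 3D poses for all frames of the window.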
