Paper Title


Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

Paper Authors

Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang, Xiaogang Wang

Abstract


Vision transformers have achieved great success in many computer vision tasks. Most methods generate vision tokens by splitting an image into a regular, fixed grid and treating each cell as a token. However, not all regions are equally important in human-centric vision tasks, e.g., the human body needs a fine representation with many tokens, while the image background can be modeled by a few tokens. To address this problem, we propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer), which merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes. The tokens in TCFormer can not only focus on important areas but also adjust their shapes to fit the semantic concepts and adopt a fine resolution for regions containing critical details, which is beneficial to capturing detailed information. Extensive experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets, including whole-body pose estimation on COCO-WholeBody and 3D human mesh reconstruction on 3DPW. Code is available at https://github.com/zengwang430521/TCFormer.git.
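The core idea of merging tokens by clustering their features, so that many background tokens collapse into a few merged tokens while distinctive regions keep their own, can be illustrated with a toy NumPy sketch. Note this is only an illustration of the general idea, not the paper's method: TCFormer performs clustering (a DPC-kNN variant) inside the network on learned features, whereas the sketch below uses plain k-means on raw feature vectors, and the function name `merge_tokens` is ours.

```python
import numpy as np

def merge_tokens(tokens, k, iters=10, seed=0):
    """Merge N token features of shape (N, C) into k merged tokens (k, C).

    Toy stand-in for clustering-based token merging: tokens whose
    features land in the same cluster are averaged into one token.
    Uses plain k-means, not the DPC-kNN clustering used by TCFormer.
    """
    rng = np.random.default_rng(seed)
    # Initialize cluster centers from k randomly chosen tokens.
    centers = tokens[rng.choice(len(tokens), size=k, replace=False)]
    for _ in range(iters):
        # Assign each token to its nearest center (Euclidean distance).
        dists = np.linalg.norm(tokens[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned tokens,
        # i.e., the merged token for that cluster.
        for j in range(k):
            members = tokens[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, assign

# Six similar "background" tokens and two distinct "body" tokens
# (1-D features for readability) collapse into just two merged tokens.
toks = np.array([[0.1], [0.0], [0.2], [0.1], [0.0], [0.2], [5.0], [5.1]])
merged, assign = merge_tokens(toks, k=2)
```

Because merged tokens come from cluster assignments rather than a fixed grid, a region can be covered by one large, irregularly shaped token (the background cluster) while detailed regions retain finer tokens, which is the flexibility the abstract describes.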
