Paper Title
MotionCLIP: Exposing Human Motion Generation to CLIP Space
Paper Authors
Paper Abstract
We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports highly semantic textual descriptions. MotionCLIP gains its unique power by aligning its latent space with that of the Contrastive Language-Image Pre-training (CLIP) model. Aligning the human motion manifold to CLIP space implicitly infuses the extremely rich semantic knowledge of CLIP into the manifold. In particular, it helps continuity by placing semantically similar motions close to one another, and disentanglement, which is inherited from the CLIP-space structure. MotionCLIP comprises a transformer-based motion auto-encoder, trained to reconstruct motion while being aligned to its text label's position in CLIP-space. We further leverage CLIP's unique visual understanding and inject an even stronger signal through aligning motion to rendered frames in a self-supervised manner. We show that although CLIP has never seen the motion domain, MotionCLIP offers unprecedented text-to-motion abilities, allowing out-of-domain actions, disentangled editing, and abstract language specification. For example, the text prompt "couch" is decoded into a sitting down motion, due to lingual similarity, and the prompt "Spiderman" results in a web-swinging-like solution that is far from seen during training. In addition, we show how the introduced latent space can be leveraged for motion interpolation, editing and recognition.
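To make the training scheme described above concrete, here is a minimal sketch of the MotionCLIP objective: a transformer auto-encoder reconstruction loss combined with cosine-similarity alignment of the pooled motion latent to the CLIP embedding of the motion's text label, and optionally to CLIP embeddings of rendered frames. The module structure, pooling choice, dimensions, and loss weights are illustrative assumptions, not the authors' reference implementation.

```python
# Sketch of the MotionCLIP training objective (assumed details, not official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionAutoEncoder(nn.Module):
    """Transformer-based motion auto-encoder whose latent lives in CLIP space."""
    def __init__(self, pose_dim=150, latent_dim=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Linear(pose_dim, latent_dim)
        enc_layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        dec_layer = nn.TransformerDecoderLayer(latent_dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.unembed = nn.Linear(latent_dim, pose_dim)

    def encode(self, motion):
        # motion: (batch, frames, pose_dim) -> one latent vector per sequence.
        # Positional encodings are omitted here for brevity.
        hidden = self.encoder(self.embed(motion))
        return hidden.mean(dim=1)  # pooled sequence latent, aligned to CLIP space

    def decode(self, latent, n_frames):
        # Query tokens cross-attend to the latent to regenerate the pose sequence.
        queries = torch.zeros(latent.size(0), n_frames, latent.size(1),
                              device=latent.device)
        hidden = self.decoder(queries, latent.unsqueeze(1))
        return self.unembed(hidden)

def motionclip_loss(model, motion, clip_text_emb, clip_image_emb=None,
                    w_text=1.0, w_image=1.0):
    """Reconstruction + CLIP-space alignment; the weights are assumptions."""
    latent = model.encode(motion)
    recon = model.decode(latent, motion.size(1))
    loss = F.mse_loss(recon, motion)
    # Align the motion latent with the CLIP embedding of its text label.
    loss = loss + w_text * (1 - F.cosine_similarity(latent, clip_text_emb)).mean()
    if clip_image_emb is not None:
        # Self-supervised signal: also align to CLIP embeddings of rendered frames.
        loss = loss + w_image * (1 - F.cosine_similarity(latent, clip_image_emb)).mean()
    return loss
```

Under this reading, text-to-motion at inference time amounts to encoding a prompt with CLIP's text encoder and feeding the resulting vector directly to the motion decoder; because the two spaces are aligned, prompts such as "couch" or "Spiderman" can be decoded even though no matching motion appeared in training.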