Unimotion: Unifying 3D Human Motion Synthesis and Understanding

¹University of Tübingen, ²Tübingen AI Center,
³Max Planck Institute for Informatics, Saarland Informatics Campus

Unimotion: Our model can generate motion from compositional sequence- and frame-level text (Hierarchical Text to Motion), generate detailed per-frame motion descriptions (Motion to Text), and generate motion with accurate frame-level text descriptions from noise (Unconditional Joint Generation), among other use cases outlined in our experiments section. Tasks can be combined for controllable generation: users can generate motion from a coarse sentence, and our model additionally generates detailed text descriptions, which can be edited and used for regeneration, producing the desired edited motion (Motion Generation and Editing).


Abstract

TL;DR: We introduce Unimotion, the first unified multi-task human motion model capable of both flexible motion control and frame-level motion understanding.

While existing works control avatar motion either with global text conditioning or with fine-grained per-frame scripts, none can do both at once. In addition, no existing work can output frame-level text paired with the generated poses. In contrast, Unimotion allows users to control motion with global text, with local frame-level text, or with both at once, providing more flexible control. Importantly, Unimotion is the first model that, by design, outputs local text paired with the generated poses, letting users know what motion happens and when, which is necessary for a wide range of applications. We show that Unimotion opens up new applications: 1. hierarchical control, allowing users to specify motion at different levels of detail; 2. obtaining motion text descriptions for existing MoCap data or YouTube videos; 3. editability, generating motion from text and editing the motion via text edits. Moreover, Unimotion attains state-of-the-art results for the frame-level text-to-motion task on the established HumanML3D dataset.


Unimotion can be conditioned on a) human motion, b) CLIP-embedded frame-level text, or c) sequence-level text (Input), on any subset thereof, or on none at all, in which case the corresponding input is replaced with noise. At its core, the model diffuses motion and text individually, implemented via separate denoising timesteps t_x and t_y. After training with frame-level text losses and motion losses (Loss), Unimotion outputs clean, noise-free motion together with frame-level text descriptions explaining the generated motion (Output).
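To make the separate-timestep design concrete, below is a minimal PyTorch sketch of such a denoiser. All class names, dimensions, and the token layout are illustrative assumptions, not the authors' implementation; positional encodings are omitted for brevity.

# Minimal sketch of a Unimotion-style denoiser (hypothetical names and sizes).
# Motion frames and CLIP-embedded frame-level texts are diffused with separate
# timesteps t_x and t_y, so either stream can be fully noised (and thus
# generated) or kept clean (and thus act as conditioning).
import torch
import torch.nn as nn

class UnimotionSketch(nn.Module):
    def __init__(self, motion_dim=263, text_dim=512, hidden=512, layers=8, steps=1000):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, hidden)
        self.text_in = nn.Linear(text_dim, hidden)
        self.seq_in = nn.Linear(text_dim, hidden)      # sequence-level CLIP embedding
        self.t_embed = nn.Embedding(steps, hidden)     # one embedding per diffusion step
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=layers)
        self.motion_out = nn.Linear(hidden, motion_dim)
        self.text_out = nn.Linear(hidden, text_dim)

    def forward(self, x_t, y_t, t_x, t_y, seq_text=None):
        # x_t: (B, T, motion_dim) noisy poses; y_t: (B, T, text_dim) noisy frame-text embeddings.
        # (Positional encodings omitted for brevity.)
        tokens = torch.cat([self.motion_in(x_t) + self.t_embed(t_x)[:, None],
                            self.text_in(y_t) + self.t_embed(t_y)[:, None]], dim=1)
        if seq_text is not None:                       # optional sequence-level text token
            tokens = torch.cat([self.seq_in(seq_text)[:, None], tokens], dim=1)
        h = self.backbone(tokens)
        T = x_t.shape[1]
        h = h[:, -2 * T:]                              # keep only the per-frame tokens
        return self.motion_out(h[:, :T]), self.text_out(h[:, T:])

# Training step (sketch): sample independent timesteps per modality, noise both
# streams, and supervise the clean motion and clean frame-text embeddings.
def training_loss(model, x0, y0, seq_text, alphas_cumprod):
    B = x0.shape[0]
    t_x = torch.randint(0, len(alphas_cumprod), (B,))
    t_y = torch.randint(0, len(alphas_cumprod), (B,))
    a_x = alphas_cumprod[t_x].view(B, 1, 1)
    a_y = alphas_cumprod[t_y].view(B, 1, 1)
    x_t = a_x.sqrt() * x0 + (1 - a_x).sqrt() * torch.randn_like(x0)
    y_t = a_y.sqrt() * y0 + (1 - a_y).sqrt() * torch.randn_like(y0)
    x_hat, y_hat = model(x_t, y_t, t_x, t_y, seq_text)
    return nn.functional.mse_loss(x_hat, x0) + nn.functional.mse_loss(y_hat, y0)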

Results

Motion Generation and Editing (Combined Task)

Users can generate motion from a coarse sentence, and our model additionally generates detailed text descriptions,
which can be edited and used for regeneration, producing the desired edited motion.
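A usage sketch of this generate-edit-regenerate loop is shown below. The generate(...) helper is a hypothetical wrapper around the diffusion sampler, introduced only for illustration; it is not an actual API of the released code.

# 1) Generate from a coarse sequence-level sentence; the model additionally
#    returns one short text description per frame ("what happens when").
motion, frame_texts = generate(seq_text="a person walks forward and sits down")

# 2) Inspect and edit the frame-level script, e.g. swap the final action.
frame_texts = [t.replace("sits down", "jumps") for t in frame_texts]

# 3) Regenerate, now conditioning on the edited frame-level text, to obtain
#    the desired edited motion.
edited_motion, _ = generate(seq_text="a person walks forward and jumps",
                            frame_texts=frame_texts)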

Unconditional Joint Generation

From pure noise, our model can jointly generate motion and accurate frame-level text descriptions.

Motion-to-Text

MoCap

We take MoCap data from AMASS as input motion and run our method to obtain frame-level text annotations.

YouTube

We use 4DHuman to estimate 3D human poses from the video and then run our method on these estimates to obtain frame-level text annotations.
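A sketch of this video pipeline is shown below. It assumes the per-frame pose estimates have already been exported to disk and relies on two hypothetical helpers, poses_to_motion_features(...) and annotate_motion(...), which stand in for converting poses into the model's motion representation and running the motion-to-text mode; both names and the file path are illustrative only.

import numpy as np

# Per-frame pose estimates exported from the video (file name is illustrative).
poses = np.load("video_poses.npz")

# Convert the estimated poses into the model's motion representation
# (e.g. the HumanML3D feature format).
motion = poses_to_motion_features(poses)

# Motion-to-text: keep the motion clean and denoise only the frame-level text
# stream, yielding one short description per frame.
frame_texts = annotate_motion(motion)
for frame_idx, text in enumerate(frame_texts):
    print(frame_idx, text)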

Text-to-Motion Generation

Our model flexibly generates motion from compositional sequence- and frame-level text, either combined or individually.
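As a usage sketch (again with the hypothetical generate(...) wrapper from above, and illustrative prompts), the three conditioning modes look like this:

seq = "a person warms up before exercising"                      # sequence-level text
frames = ["raises both arms"] * 60 + ["bends the knees"] * 60    # frame-level text, one per frame

m1, _ = generate(seq_text=seq)                      # coarse, sequence-level control only
m2, _ = generate(frame_texts=frames)                # fine-grained, per-frame control only
m3, _ = generate(seq_text=seq, frame_texts=frames)  # hierarchical: both levels at once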

Acknowledgement

Special thanks to Xiaohan Zhang for helping with the related work, and to other RVH and AVG members for their help and discussions. Thanks to Mathis Petrovich, Léore Bensabath, and Prof. Gül Varol for the discussion and helpful information on TMR++. Prof. Gerard Pons-Moll and Prof. Andreas Geiger are members of the Machine Learning Cluster of Excellence, EXC number 2064/1 - Project number 390727645. Gerard Pons-Moll is endowed by the Carl Zeiss Foundation. Andreas Geiger was supported by the ERC Starting Grant LEGO-3D (850533). Julian Chibane is a fellow of the Meta Research PhD Fellowship Program - area: AR/VR Human Understanding.




BibTeX

@article{li2024unimotion,
      title={Unimotion: Unifying 3D Human Motion Synthesis and Understanding},
      author={Li, Chuqiao and Chibane, Julian and He, Yannan and Pearl, Naama and Geiger, Andreas and Pons-Moll, Gerard},
      journal={arXiv preprint arXiv:2409.15904},
      year={2024}
    }