Users can generate motion from a coarse sentence; our model additionally produces detailed frame-level text descriptions, which can be edited and fed back for regeneration to obtain the desired edited motion.
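This edit-and-regenerate loop can be pictured with the following sketch. The Unimotion class, its load_pretrained loader, and the generate arguments are hypothetical names used purely for illustration, not the released interface.

from unimotion import Unimotion  # hypothetical package name

model = Unimotion.load_pretrained()  # hypothetical loader

# 1) A coarse sequence-level sentence yields a motion plus per-frame descriptions.
motion, frame_texts = model.generate(sequence_text="A person waves hand above head.")

# 2) The user edits some of the generated frame-level descriptions.
frame_texts[30:60] = ["raise the left hand above the head"] * 30

# 3) Regenerating with the edited frame-level text produces the edited motion.
edited_motion, _ = model.generate(
    sequence_text="A person waves hand above head.",
    frame_texts=frame_texts,
)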
Our model can generate motion with accurate frame-level text descriptions directly from noise.
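As a rough illustration of this unconditional mode, again with hypothetical API names:

from unimotion import Unimotion  # hypothetical package name

model = Unimotion.load_pretrained()  # hypothetical loader

# With no text conditioning, the model denoises a motion from Gaussian noise
# and emits one text description aligned with each generated frame.
motion, frame_texts = model.generate()
assert len(frame_texts) == motion.shape[0]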
We take MoCap data from AMASS as the input motion and run our method to obtain the frame-level text annotations.
We use 4DHuman to obtain 3D human pose estimates and then run our method on these estimates to obtain the frame-level text annotations.
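In both cases the annotation step can be sketched as below; the caption method and its arguments are hypothetical, and the pose array is only a placeholder for real AMASS MoCap or 4DHuman estimates.

import numpy as np
from unimotion import Unimotion  # hypothetical package name

model = Unimotion.load_pretrained()  # hypothetical loader

# Input motion: AMASS MoCap or per-frame poses estimated by 4DHuman.
# A dummy 120-frame SMPL pose sequence stands in for real data here.
poses = np.zeros((120, 72), dtype=np.float32)

# The model returns one short text description per input frame.
frame_texts = model.caption(motion=poses)
assert len(frame_texts) == poses.shape[0]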
Thanks to the compositional properties of sequence- and frame-level text, our model flexibly generates motion from either level individually or from both combined.
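The conditioning modes shown in the comparisons below can be sketched as follows, again with hypothetical API names:

from unimotion import Unimotion  # hypothetical package name

model = Unimotion.load_pretrained()  # hypothetical loader

# Frame-level text only: per-frame descriptions drive the motion directly.
motion_f, _ = model.generate(frame_texts=["pace forward"] * 60 + ["pace backward"] * 60)

# Sequence-level text only: a single coarse sentence describes the whole clip.
motion_s, frame_texts_s = model.generate(sequence_text="The man is pacing back and forth.")

# Hierarchical: both levels combined for the most controllable generation.
motion_h, _ = model.generate(
    sequence_text="The man is pacing back and forth.",
    frame_texts=frame_texts_s,
)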
Sequence-level text input: The man is pacing back and forth.
MDM
Unimotion
Sequence-level text input: A person sprinting ahead and then slowing down.
MDM
Unimotion
Frame-level Text-to-Motion
Sequence-level text input:
A person waves hand above head.
Hierarchical (Frame-level+Sequence-level) Text-to-Motion
Frame-level Text-to-Motion
Sequence-level text input:
Running forward taking a left turn and continue running.
Hierarchical (Frame-level+Sequence-level) Text-to-Motion
Special thanks to Xiaohan Zhang for helping with the related work, and to the other RVH and AVG members for their help and discussions. Thanks to Mathis Petrovich, Léore Bensabath, and Prof. Gül Varol for the discussion and helpful information on TMR++. Prof. Gerard Pons-Moll and Prof. Andreas Geiger are members of the Machine Learning Cluster of Excellence, EXC number 2064/1 - Project number 390727645. Gerard Pons-Moll is endowed by the Carl Zeiss Foundation. Andreas Geiger was supported by the ERC Starting Grant LEGO-3D (850533). Julian Chibane is a fellow of the Meta Research PhD Fellowship Program - area: AR/VR Human Understanding.
@article{li2024unimotion,
  title={Unimotion: Unifying 3D Human Motion Synthesis and Understanding},
  author={Li, Chuqiao and Chibane, Julian and He, Yannan and Pearl, Naama and Geiger, Andreas and Pons-Moll, Gerard},
  journal={arXiv preprint arXiv:2409.15904},
  year={2024}
}