Real-time Demo: ActionPlan enables high-quality real-time motion streaming, showcased inside our interactive interface.
Our streaming motion generation transfers directly to a Unitree G1 humanoid robot using SONIC as the low-level controller, demonstrating real-world deployment.
We present ActionPlan, a unified motion diffusion framework that bridges real-time streaming with high-quality offline generation within a single model. The core idea is to introduce a per-frame action plan: the model predicts frame-level text latents that act as dense semantic anchors throughout denoising, and uses them to denoise the full motion sequence with combined semantic and motion cues.
To support this structured workflow, we design latent-specific diffusion steps, allowing each motion latent to be denoised independently and sampled in flexible orders at inference. As a result, ActionPlan can run in a history-conditioned, future-aware mode for real-time streaming, while also supporting high-quality offline generation.
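As a minimal sketch of what latent-specific diffusion steps make possible, a sampling order can be written as a table of per-frame step counters. The helper name `staggered_schedule`, the `delay` parameter, and the linear countdown are illustrative assumptions, not the schedule used by ActionPlan. The sketch only shows that a lockstep order and a left-to-right streaming order are two settings of the same mechanism once every latent carries its own step.

```python
def staggered_schedule(num_frames, num_steps, delay):
    """Return rows of per-frame remaining-step counters (hypothetical helper).

    Row r holds, for every frame f, how many denoising steps frame f still
    needs after r rounds. Frame f starts `delay * f` rounds later than
    frame 0. delay=0 denoises all frames in lockstep; delay>0 gives a
    left-to-right order in which earlier frames finish first.
    """
    total_rounds = num_steps + delay * (num_frames - 1)
    rows = []
    for r in range(total_rounds + 1):
        row = []
        for f in range(num_frames):
            # Steps already applied to frame f, clamped to [0, num_steps].
            done = min(max(r - delay * f, 0), num_steps)
            row.append(num_steps - done)
        rows.append(row)
    return rows


for row in staggered_schedule(num_frames=4, num_steps=3, delay=1):
    print(row)
```

With `delay=1`, the round in which frame 0 reaches zero remaining steps still leaves later frames partially denoised. That is one plausible reading of how a streaming order can emit a finished frame while keeping a rough view of what comes next.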
The same mechanism further enables zero-shot motion editing and in-betweening without additional models. Experiments demonstrate that our real-time streaming runs 5.25× faster while improving motion quality by 18% in terms of FID over the best previous method.
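For readers who want to reproduce this kind of summary from raw measurements, the sketch below shows the arithmetic, assuming the percentages are relative reductions of a lower-is-better metric (the usual convention for FID) and the speed factor is a ratio of runtimes. The function names and all input numbers are placeholders chosen for easy arithmetic; they are not measurements from the paper.

```python
def relative_improvement(baseline, ours):
    """Fractional reduction of a lower-is-better metric such as FID."""
    return (baseline - ours) / baseline


def speedup(baseline_seconds, ours_seconds):
    """How many times faster `ours` is than `baseline`."""
    return baseline_seconds / ours_seconds


# Placeholder inputs, NOT results from the paper.
print(relative_improvement(baseline=2.0, ours=1.5))       # 0.25, i.e. 25% lower
print(speedup(baseline_seconds=10.0, ours_seconds=2.0))   # 5.0, i.e. 5x faster
```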
Overview of ActionPlan. (a) During training, motion latents are noised with per-frame heterogeneous timesteps, while frame-level text latents share a single global timestep. A Transformer Denoiser is trained to jointly reconstruct both. During inference, the model operates in two modes: in offline mode (b), the action plan is fully generated first and then motion latents are denoised in random pyramid order; in streaming mode (c), the action plan is denoised alongside the first motion frame, followed by raster progressive denoising of the remaining latents.
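The training-time noise assignment in (a) can be illustrated with a short sketch. Uniform sampling, the function name `sample_training_timesteps`, and the range `1..max_t` are assumptions made for illustration; the source only states that motion latents receive per-frame timesteps while the frame-level text latents share one global timestep.

```python
import random


def sample_training_timesteps(num_frames, num_plan_latents, max_t, rng):
    """Draw diffusion timesteps for one training sequence (illustrative only)."""
    # Each motion latent gets its own independently drawn timestep.
    motion_t = [rng.randint(1, max_t) for _ in range(num_frames)]
    # All frame-level text latents share a single global timestep.
    plan_t_global = rng.randint(1, max_t)
    plan_t = [plan_t_global] * num_plan_latents
    return motion_t, plan_t


rng = random.Random(0)  # fixed seed so the sketch is repeatable
motion_t, plan_t = sample_training_timesteps(
    num_frames=8, num_plan_latents=8, max_t=1000, rng=rng
)
print("motion timesteps:", motion_t)
print("plan timesteps:  ", plan_t)
```

Training on mixed noise levels across frames is what would let a denoiser later accept any per-frame configuration at inference, including a clean history next to fully noisy future frames.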
ActionPlan improves FID by 53% over MotionStreamer and by 21.6% over MARDM. Its R-Precision Top-3 is 0.892, compared with 0.860 for MARDM and 0.859 for MotionStreamer.
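As a reminder of what R-Precision Top-k measures, the sketch below computes it from a motion-to-text distance matrix: a motion counts as a hit when its own description ranks among the k nearest texts. The toy matrix is made up for illustration, and the embedding model and candidate pool size used in the actual evaluation are not specified in this text.

```python
def r_precision_top_k(dist, k):
    """Fraction of motions whose matching text is among the k nearest texts.

    dist[i][j] is the distance between motion i and text j in a shared
    embedding space; text i is the ground-truth description of motion i.
    """
    hits = 0
    for i, row in enumerate(dist):
        # Rank = number of texts strictly closer than the ground-truth text.
        rank = sum(1 for d in row if d < row[i])
        if rank < k:
            hits += 1
    return hits / len(dist)


# Toy 4x4 distance matrix (made-up values, not evaluation data).
toy = [
    [0.1, 0.9, 0.8, 0.7],  # ground truth is closest      -> rank 0
    [0.2, 0.5, 0.3, 0.9],  # two texts are closer         -> rank 2
    [0.1, 0.2, 0.9, 0.3],  # three texts are closer       -> rank 3
    [0.9, 0.8, 0.7, 0.1],  # ground truth is closest      -> rank 0
]
print(r_precision_top_k(toy, k=1))  # 0.5
print(r_precision_top_k(toy, k=3))  # 0.75
```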
🏆 ActionPlan (Ours)
MARDM: No waddling motion, steps sideways
MotionStreamer: Sits down instead of waddling, no walking back
🏆 ActionPlan (Ours)
MARDM: More on the back than the head
MotionStreamer: No adjusting motion
🏆 ActionPlan (Ours)
MARDM: First kicks left, no knee raising
MotionStreamer: Kicks right and left, no knee raising
🏆 ActionPlan (Ours)
MARDM: Big motion, not really cooking
MotionStreamer: No cooking motion
🏆 ActionPlan (Ours)
MARDM: No tip-toeing motion
MotionStreamer: No tip-toeing motion
🏆 ActionPlan (Ours)
MARDM: Handstand is unrealistic
MotionStreamer: Steps back first, does cartwheel
🏆 ActionPlan (Ours)
MARDM: No karate motion, very small forward and backward motion
MotionStreamer: Fighting but not karate-specific, very small motion
🏆 ActionPlan (Ours)
MARDM: Walks in downward slope
MotionStreamer: Walks in downward slope
🏆 ActionPlan (Ours)
MARDM: Doesn't swing right leg, no holding onto something
MotionStreamer: Doesn't hold onto something
🏆 ActionPlan (Ours)
MARDM: Jumps up instead of sideways
MotionStreamer: Jumps up instead of sideways
🏆 ActionPlan (Ours)
MARDM: Walks in a circle, no karate motion
MotionStreamer: Karate motion but doesn't move diagonally
🏆 ActionPlan (Ours)
MARDM: No wrist checking motion
MotionStreamer: No hands on hips, no wrist checking motion
🏆 ActionPlan (Ours)
MARDM: No jump forward
MotionStreamer: No jump forward
🏆 ActionPlan (Ours)
MARDM: No dribbling motion
MotionStreamer: No dribbling motion, no shooting motion
🏆 ActionPlan (Ours)
MARDM: Walks left instead of back
MotionStreamer: No clear picking up
ActionPlan vs MotionStreamer on long-horizon motion generation with multiple chained prompts from HumanML3D. ActionPlan remains future-aware while running up to 9× faster during continuous streaming.
No arm movement in segment 2, no exercise motion in segment 3, doesn't cross legs in segment 4
No walking in segment 3, no walking while holding the head in segment 4
Doesn't sit down in segment 4
Lacks varied arm movement in segment 2; fails to stay still in segment 3; no jogging in segment 4; no walking forward in segment 5
Doesn't go to his knee in segment 3
Doesn't put up their knee in segment 2
No basketball signals, unnatural vertical movement in segment 2
No floor sweeping in segment 3
No foot-dragging in segment 4, no stumbling backwards in segment 5
No ballet dance in segment 3, doesn't raise hands in segment 4
No forward walking while jumping in segment 4
Walks sideways instead of limping in segment 2, throws with left arm in segment 3
ActionPlan supports diverse downstream applications in a zero-shot manner, without fine-tuning.
✏️ Motion Editing: Replace selected segments with new text-conditioned motion, keeping the rest unchanged.
↔️ In-Betweening: Fill in motion between fixed start/end poses guided by a text prompt.
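Both applications can be pictured as a choice of per-frame starting noise levels. The sketch below is an assumption about how such a mask could be expressed, not the authors' implementation: frames to keep start at timestep 0 (clean, acting as fixed context), and frames to regenerate start at the maximum timestep. The helper name `initial_timesteps` and the value of `max_t` are hypothetical.

```python
def initial_timesteps(num_frames, regenerate, max_t):
    """Per-frame starting timesteps for editing or in-betweening (hypothetical).

    Frames listed in `regenerate` start from full noise (max_t); all other
    frames are treated as already clean (timestep 0) and stay unchanged.
    """
    regenerate = set(regenerate)
    return [max_t if f in regenerate else 0 for f in range(num_frames)]


# Motion editing: replace frames 3..5 of a 10-frame clip.
print(initial_timesteps(10, range(3, 6), max_t=1000))
# In-betweening: keep the first and last frame, generate everything between.
print(initial_timesteps(10, range(1, 9), max_t=1000))
```

Under this view the two tasks differ only in which frames are marked for regeneration, which is consistent with both being available without additional models.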
White = original | Green = edited / generated.
Original (left) | Edited (right)
@article{nazarenus2026actionplan,
title = {{ActionPlan}: Future-Aware Streaming Motion Synthesis via Frame-Level Action Planning},
author = {Nazarenus, Eric and Li, Chuqiao and He, Yannan and Xie, Xianghui and Lenssen, Jan Eric and Pons-Moll, Gerard},
journal = {arXiv preprint},
year = {2026}
}