ActionPlan: Future-Aware Streaming Motion Synthesis via Frame-Level Action Planning

arXiv 2026
1Tübingen AI Center, University of Tübingen, Germany
2Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
*Equal contribution   Corresponding author

Real-time Demo: ActionPlan enables high-quality real-time motion streaming — showcased inside our interactive interface.

🤖 ActionPlan on Unitree G1

Our streaming motion generation transfers directly to a Unitree G1 humanoid robot using SONIC as the low-level controller, demonstrating real-world deployment.

Abstract

TL;DR: A unified motion diffusion model that bridges real-time streaming and high-quality offline generation via per-frame action planning — 5.25× faster streaming with 18% better FID.

We present ActionPlan, a unified motion diffusion framework that bridges real-time streaming with high-quality offline generation within a single model. The core idea is to introduce a per-frame action plan: the model predicts frame-level text latents that act as dense semantic anchors throughout denoising, and uses them to denoise the full motion sequence with combined semantic and motion cues.

To support this structured workflow, we design latent-specific diffusion steps, allowing each motion latent to be denoised independently and sampled in flexible orders at inference. As a result, ActionPlan can run in a history-conditioned, future-aware mode for real-time streaming, while also supporting high-quality offline generation.
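The flexible sampling orders enabled by latent-specific diffusion steps can be sketched as scheduling logic. This is a hypothetical illustration of the two orders described on this page (raster for streaming, a randomized per-step frame order standing in for the offline "pyramid" order); function names are ours, not from the ActionPlan codebase:

```python
import random

def raster_schedule(num_frames, num_steps):
    """Streaming order: fully denoise frame 0, then frame 1, and so on.
    Yields (frame_index, timestep) pairs from noisy (num_steps-1) to clean (0)."""
    for f in range(num_frames):
        for t in reversed(range(num_steps)):
            yield f, t

def shuffled_schedule(num_frames, num_steps, seed=0):
    """Offline order: advance every frame one denoising step per global step,
    visiting frames in a random permutation each time (an illustrative
    stand-in for the paper's random pyramid order)."""
    rng = random.Random(seed)
    for t in reversed(range(num_steps)):
        order = list(range(num_frames))
        rng.shuffle(order)
        for f in order:
            yield f, t
```

Because each latent carries its own timestep, either generator can drive the same denoiser; the streaming schedule simply finishes early frames first so they can be emitted immediately.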

The same mechanism further enables zero-shot motion editing and in-betweening without additional models. Experiments demonstrate that our real-time streaming is 5.25× faster than the best previous method while improving motion quality by 18% in terms of FID.

Method Overview


Overview of ActionPlan. (a) During training, motion latents are noised with per-frame heterogeneous timesteps, while frame-level text latents share a single global timestep; a Transformer Denoiser is trained to jointly reconstruct both. At inference, the model operates in two modes: in offline mode (b), the action plan is fully generated first and the motion latents are then denoised in random pyramid order; in streaming mode (c), the action plan is denoised alongside the first motion frame, followed by raster progressive denoising of the remaining latents.
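The training-time noising in the caption — independent per-frame timesteps for motion latents, one shared timestep for the frame-level text latents — can be sketched in a few lines of NumPy. This is a minimal illustration under standard DDPM-style forward-process assumptions; all names are ours, not the authors':

```python
import numpy as np

def sample_timesteps(num_frames, num_steps, rng):
    """Heterogeneous timesteps per motion frame; one global timestep for
    the frame-level text latents."""
    motion_t = rng.integers(0, num_steps, size=num_frames)
    text_t = int(rng.integers(0, num_steps))
    return motion_t, text_t

def add_noise(latents, timesteps, alphas_cumprod, rng):
    """DDPM-style forward process applied per frame: each row of `latents`
    is noised according to its own timestep."""
    a = alphas_cumprod[timesteps][:, None]          # (num_frames, 1)
    noise = rng.standard_normal(latents.shape)
    return np.sqrt(a) * latents + np.sqrt(1.0 - a) * noise
```

The denoiser then sees a sequence whose frames sit at different noise levels, which is what later allows each latent to be sampled independently and in flexible orders.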

👑 Text-to-Motion Comparison

ActionPlan improves FID by 53% over MotionStreamer and 21.6% over MARDM. R-Precision Top-3: 0.892 vs MARDM (0.860) and MotionStreamer (0.859).

📹 Streaming Motion Comparison

ActionPlan vs MotionStreamer on long-horizon motion generation with multiple chained prompts from HumanML3D. ActionPlan remains future-aware while running up to 9× faster during continuous streaming.

🎯 Applications

ActionPlan supports diverse downstream applications in a zero-shot manner, without fine-tuning.

✏️ Motion Editing: Replace selected segments with new text-conditioned motion, keeping the rest unchanged.

↔️ In-Betweening: Fill in motion between fixed start/end poses guided by a text prompt.

White = original  |  Green = edited / generated.
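Both applications can be realized with the same mask-and-clamp idea: frames marked as fixed are restored to their original latents after every denoising step, while the remaining frames are resampled. The sketch below is an illustrative assumption about how such zero-shot editing could work, not code from the ActionPlan release:

```python
import numpy as np

def edit_step(latents, original, keep_mask, denoise_fn):
    """One denoising step for mask-based editing/in-betweening.

    latents:   (num_frames, dim) current noisy motion latents
    original:  (num_frames, dim) clean latents of the source motion
    keep_mask: (num_frames,) bool, True for frames that must stay unchanged
    denoise_fn: one step of the (hypothetical) denoiser
    """
    out = denoise_fn(latents)
    # Clamp kept frames back to the source motion; only masked frames evolve.
    return np.where(keep_mask[:, None], original, out)
```

For editing, `keep_mask` marks everything outside the selected segment; for in-betweening, it marks only the fixed start/end poses, and the text prompt conditions `denoise_fn` for the frames in between.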

✏️ Motion Editing

↔️ In-Betweening

BibTeX

@article{nazarenus2026actionplan,
  title   = {{ActionPlan}: Future-Aware Streaming Motion Synthesis via Frame-Level Action Planning},
  author  = {Nazarenus, Eric and Li, Chuqiao and He, Yannan and Xie, Xianghui and Lenssen, Jan Eric and Pons-Moll, Gerard},
  journal = {arXiv preprint},
  year    = {2026}
}