ActionPlan: Future-Aware Streaming Motion Synthesis via Frame-Level Action Planning

arXiv 2026
1Tübingen AI Center, University of Tübingen, Germany
2Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
*Equal contribution   Corresponding author

Real-time Demo: ActionPlan enables high-quality real-time motion streaming — showcased inside our interactive interface.

🤖 ActionPlan on Unitree G1

Our streaming motion generation transfers directly to a Unitree G1 humanoid robot using SONIC as the low-level controller, demonstrating real-world deployment.

Abstract

TL;DR: A unified motion diffusion model that bridges real-time streaming and high-quality offline generation via per-frame action planning — 5.25× faster streaming with 18% better FID.

We present ActionPlan, a unified motion diffusion framework that bridges real-time streaming with high-quality offline generation within a single model. The core idea is to introduce a per-frame action plan: the model predicts frame-level text latents that act as dense semantic anchors throughout denoising, and uses them to denoise the full motion sequence with combined semantic and motion cues.

To support this structured workflow, we design latent-specific diffusion steps, allowing each motion latent to be denoised independently and sampled in flexible orders at inference. As a result, ActionPlan can run in a history-conditioned, future-aware mode for real-time streaming, while also supporting high-quality offline generation.
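The flexible sampling orders enabled by latent-specific diffusion steps can be sketched as scheduling logic. This is a hypothetical illustration of the two orders described on this page (raster for streaming, a randomized per-step frame order standing in for the offline "pyramid" order); function names are ours, not from the ActionPlan codebase:

```python
import random

def raster_schedule(num_frames, num_steps):
    """Streaming order: fully denoise frame 0, then frame 1, and so on.
    Yields (frame_index, timestep) pairs from noisy (num_steps-1) to clean (0)."""
    for f in range(num_frames):
        for t in reversed(range(num_steps)):
            yield f, t

def shuffled_schedule(num_frames, num_steps, seed=0):
    """Offline order: advance every frame one denoising step per global step,
    visiting frames in a random permutation each time (an illustrative
    stand-in for the paper's random pyramid order)."""
    rng = random.Random(seed)
    for t in reversed(range(num_steps)):
        order = list(range(num_frames))
        rng.shuffle(order)
        for f in order:
            yield f, t
```

Because each latent carries its own timestep, either generator can drive the same denoiser; the streaming schedule simply finishes early frames first so they can be emitted immediately.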

The same mechanism further enables zero-shot motion editing and in-betweening without additional models. Experiments demonstrate that our real-time streaming is 5.25× faster than the best previous method while improving motion quality by 18% in terms of FID.

Method Overview


Overview of ActionPlan. (a) During training, motion latents are noised with per-frame heterogeneous timesteps, while frame-level text latents share a single global timestep; a Transformer Denoiser is trained to jointly reconstruct both. At inference, the model operates in two modes: in offline mode (b), the action plan is fully generated first and the motion latents are then denoised in random pyramid order; in streaming mode (c), the action plan is denoised alongside the first motion frame, followed by raster progressive denoising of the remaining latents.
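The training-time noising in the caption — independent per-frame timesteps for motion latents, one shared timestep for the frame-level text latents — can be sketched in a few lines of NumPy. This is a minimal illustration under standard DDPM-style forward-process assumptions; all names are ours, not the authors':

```python
import numpy as np

def sample_timesteps(num_frames, num_steps, rng):
    """Heterogeneous timesteps per motion frame; one global timestep for
    the frame-level text latents."""
    motion_t = rng.integers(0, num_steps, size=num_frames)
    text_t = int(rng.integers(0, num_steps))
    return motion_t, text_t

def add_noise(latents, timesteps, alphas_cumprod, rng):
    """DDPM-style forward process applied per frame: each row of `latents`
    is noised according to its own timestep."""
    a = alphas_cumprod[timesteps][:, None]          # (num_frames, 1)
    noise = rng.standard_normal(latents.shape)
    return np.sqrt(a) * latents + np.sqrt(1.0 - a) * noise
```

The denoiser then sees a sequence whose frames sit at different noise levels, which is what later allows each latent to be sampled independently and in flexible orders.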

👑 Text-to-Motion Comparison

ActionPlan improves FID by 53% over MotionStreamer and 21.6% over MARDM. R-Precision Top-3: 0.892 vs MARDM (0.860) and MotionStreamer (0.859).

📹 Streaming Motion Comparison

ActionPlan vs MotionStreamer on long-horizon motion generation with multiple chained prompts from HumanML3D. ActionPlan remains future-aware while running up to 9× faster during continuous streaming.

🎯 Applications

ActionPlan supports diverse downstream applications in a zero-shot manner, without fine-tuning.

✏️ Motion Editing: Replace selected segments with new text-conditioned motion, keeping the rest unchanged.

↔️ In-Betweening: Fill in motion between fixed start/end poses guided by a text prompt.

White = original  |  Green = edited / generated.
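Both applications can be realized with the same mask-and-clamp idea: frames marked as fixed are restored to their original latents after every denoising step, while the remaining frames are resampled. The sketch below is an illustrative assumption about how such zero-shot editing could work, not code from the ActionPlan release:

```python
import numpy as np

def edit_step(latents, original, keep_mask, denoise_fn):
    """One denoising step for mask-based editing/in-betweening.

    latents:   (num_frames, dim) current noisy motion latents
    original:  (num_frames, dim) clean latents of the source motion
    keep_mask: (num_frames,) bool, True for frames that must stay unchanged
    denoise_fn: one step of the (hypothetical) denoiser
    """
    out = denoise_fn(latents)
    # Clamp kept frames back to the source motion; only masked frames evolve.
    return np.where(keep_mask[:, None], original, out)
```

For editing, `keep_mask` marks everything outside the selected segment; for in-betweening, it marks only the fixed start/end poses, and the text prompt conditions `denoise_fn` for the frames in between.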

✏️ Motion Editing

↔️ In-Betweening

BibTeX

@article{nazarenus2026actionplan,
  title   = {{ActionPlan}: Future-Aware Streaming Motion Synthesis via Frame-Level Action Planning},
  author  = {Nazarenus, Eric and Li, Chuqiao and He, Yannan and Xie, Xianghui and Lenssen, Jan Eric and Pons-Moll, Gerard},
  journal = {arXiv preprint},
  year    = {2026}
}