Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation

Runqi Ouyang1,2,*, Haoyun Li1,2,*, Zhenyuan Zhang1,3,*, Xiaofeng Wang1,
Zheng Zhu1,†, Guan Huang1, Xingang Wang1,†
1GigaAI     2CASIA     3HKUST
*Equal Contribution
†Corresponding Authors: zhengzhu@ieee.org, xingang.wang@ia.ac.cn

Abstract


Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structure and logical reasoning. As a result, the generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model's ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm designed for large models, which leverages motion-quality feedback to jointly optimize reasoning chains and motion synthesis. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, models, and data will be made publicly available.
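To make the group-relative idea behind GRPO concrete, below is a minimal Python sketch of its advantage computation, assuming a scalar motion-quality reward per sampled output; the group size and reward values are hypothetical placeholders, not numbers from the paper.

import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantage: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# For one prompt, sample a group of G candidate reasoning chains and motions,
# score each with a motion-quality reward, and weight each sample's
# policy-gradient term by its group-relative advantage.
group_rewards = np.array([0.9, 0.4, 0.7, 0.2])  # hypothetical reward scores
print(grpo_advantages(group_rewards))

Samples scored above their group mean receive positive advantages, so the policy is pushed toward reasoning chains that yield better motions without needing a learned value function.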


Approach Overview


In-Distribution Prompts

Out-of-Distribution Prompts


We showcase Motion-R1's capability to generate diverse and high-quality motions for out-of-distribution prompts.


Complex Text

"After hearing a loud noise, a person turned around , stepped back cautiously with hands raised defensively and then slowly approached."

"A person takes a few steps forward, jumps forward with both feet , and immediately turns right upon landing"


"A person raises arms, arches back slightly, then shifts weight onto the right leg while extending the left leg backward in a poised arabesque position."

"A person jumped up happily, raised hand and spsun excitedly."


Abstract Text

"A person is serving in badminton ."

"A person is skipping rope."

"A person is dancing ballroom dance."

"The person walks as if balancing on a tightrope."

"The person mimics swimming in mid-air, as if performing a freestyle stroke without water."

"The person walks through strong wind, leans forward and braces against resistance."

MotionCoT Data Example


Given the prompt "A person does Tai chi.", the LLM generates a step-by-step CoT reasoning trace (<think>) and a structured action plan (<output>), covering stance, arm movement, weight transfer, and hand positioning.
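As a minimal sketch of how such a record can be consumed, the Python snippet below extracts the reasoning and the action plan from the tagged trace; the record text and the helper parse_cot are illustrative assumptions, and only the <think>/<output> tag convention comes from the data description above.

import re

# Hypothetical MotionCoT-style record; the wording is illustrative only.
record = (
    "<think>The prompt asks for Tai chi: slow, continuous movement. "
    "Plan the stance, arm motion, weight transfer, and hand position.</think>"
    "<output>1. Stand with feet shoulder-width apart, knees soft. "
    "2. Raise both arms slowly to chest height. "
    "3. Shift weight onto the right leg while the arms circle inward. "
    "4. Settle the hands in front of the torso, palms facing down.</output>"
)

def parse_cot(text: str) -> dict:
    """Split a CoT trace into its <think> reasoning and <output> action plan."""
    parts = {}
    for tag in ("think", "output"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.S)
        parts[tag] = m.group(1).strip() if m else ""
    return parts

print(parse_cot(record)["output"])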


Comparisons


We compare Motion-R1 against baselines such as MoMask and MotionLLM. As shown on the left of the figure, Motion-R1 produces smooth, well-structured sequences for both simple and multi-step instructions. To evaluate generalization beyond the training distribution, we also present qualitative comparisons under two types of out-of-distribution captions, shown in the middle and on the right of the figure.


Related Motion Generation Works 🚀🚀


Text2Motion: Diverse Text-Driven Motion Generation Using Temporal Variational Autoencoder
TM2T: Learning Text2Motion and Motion2Text Reciprocally through Discrete Token and Language Model
MoMask: Generative Masked Modeling of 3D Human Motions

BibTeX

@misc{ouyang2025motionr1chainofthoughtreasoningreinforcement,
  title={Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation}, 
  author={Runqi Ouyang and Haoyun Li and Zhenyuan Zhang and Xiaofeng Wang and Zheng Zhu and Guan Huang and Xingang Wang},
  year={2025},
  eprint={2506.10353},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.10353}, 
}