PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis

¹GSAI, Renmin University of China  ²Kuaishou

Each method generates 30 motions from the same text; similar motions are grouped, and the number of occurrences per group is reported.

Abstract

Recent advances in large language models (LLMs) have enabled breakthroughs in many multimodal generation tasks, but a significant performance gap still exists in text-to-motion generation, where LLM-based methods lag far behind non-LLM methods. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokenization induces local dependency issues, where LLMs overemphasize short-term coherence at the expense of global semantic alignment, while coarse-grained tokenization sacrifices motion details. To resolve this issue, we propose PlanMoGPT, an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. First, our progressive planning mechanism leverages LLMs' autoregressive capabilities to hierarchically generate motion tokens, starting from sparse global plans and iteratively refining them into full sequences. Second, our flow-enhanced tokenizer doubles the downsampling resolution and expands the codebook size by eight times, minimizing detail loss during discretization, while a flow-enhanced decoder recovers motion nuances. Extensive experiments on text-to-motion benchmarks demonstrate that PlanMoGPT achieves state-of-the-art performance, improving FID scores by 63.8% (from 0.380 to 0.141) on long-sequence generation while enhancing motion diversity by 49.9% compared to existing methods. The proposed framework successfully resolves the diversity-quality trade-off that plagues current non-LLM approaches, establishing new standards for text-to-motion generation.
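To make the sparse-to-dense idea concrete, below is a minimal, hypothetical sketch of a progressive planning loop. The `generate_tokens` callable, the stride schedule, and all names are illustrative assumptions standing in for the actual LLM call, not the released implementation.

```python
from typing import Callable, Dict, List, Sequence

def progressive_plan_generate(
    generate_tokens: Callable[[List[int], Dict[int, int], List[int]], List[int]],
    text_tokens: List[int],
    target_len: int,
    strides: Sequence[int] = (8, 4, 2, 1),
) -> List[int]:
    """Sparse-to-dense motion token generation (illustrative only).

    `generate_tokens(text, plan_so_far, positions)` wraps an LLM call:
    it returns one motion token for each requested position, conditioned
    on the text and on the coarser plan produced so far.
    """
    plan: Dict[int, int] = {}  # position -> motion token id
    for stride in strides:
        # Only fill positions not yet covered by a coarser plan.
        positions = [p for p in range(0, target_len, stride) if p not in plan]
        new_tokens = generate_tokens(text_tokens, dict(plan), positions)
        plan.update(zip(positions, new_tokens))  # earlier plan tokens stay fixed
    return [plan[p] for p in range(target_len)]

# Toy stand-in for the LLM so the sketch runs end to end.
dummy_llm = lambda text, plan, positions: [0] * len(positions)
tokens = progressive_plan_generate(dummy_llm, text_tokens=[1, 2, 3], target_len=16)
assert len(tokens) == 16
```

Each pass conditions on all tokens placed by coarser passes, so the global plan constrains local decoding rather than the other way around.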

Approach Overview

PlanMoGPT consists of two components: (a) a flow-enhanced motion tokenizer that converts motion into fine-grained tokens with minimal loss; (b) an LLM integrated with progressive planning, which generates motion tokens progressively, from a widely spaced plan down to the full motion token sequence.
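The following is a minimal sketch of what a fine-grained motion tokenizer of this kind could look like, assuming 2× temporal downsampling, a large codebook, and a small convolutional refinement head standing in for the flow-enhanced decoder. The dimensions (`motion_dim=263`, `codebook_size=8192`) and module choices are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FlowEnhancedTokenizer(nn.Module):
    """Illustrative sketch of a fine-grained VQ motion tokenizer.

    The encoder downsamples the motion sequence by 2x (a finer granularity
    than the 4x common in prior VQ tokenizers) and quantizes each frame
    feature against a large codebook. The decoder reconstructs coarse
    motion, and `refine` stands in for the flow-enhanced decoder that
    recovers motion nuances.
    """
    def __init__(self, motion_dim=263, hidden=512, codebook_size=8192):
        super().__init__()
        self.encoder = nn.Conv1d(motion_dim, hidden, kernel_size=4, stride=2, padding=1)
        self.codebook = nn.Embedding(codebook_size, hidden)
        self.decoder = nn.ConvTranspose1d(hidden, motion_dim, kernel_size=4, stride=2, padding=1)
        self.refine = nn.Sequential(  # stand-in for flow-based refinement
            nn.Conv1d(motion_dim, hidden, 3, padding=1), nn.GELU(),
            nn.Conv1d(hidden, motion_dim, 3, padding=1))

    def encode(self, motion):                              # motion: (B, T, motion_dim)
        z = self.encoder(motion.transpose(1, 2))           # (B, hidden, T//2)
        d = torch.cdist(z.transpose(1, 2), self.codebook.weight.unsqueeze(0))
        return d.argmin(-1)                                # token ids: (B, T//2)

    def decode(self, ids):                                 # ids: (B, T//2)
        z = self.codebook(ids).transpose(1, 2)             # (B, hidden, T//2)
        coarse = self.decoder(z)                           # (B, motion_dim, T)
        return (coarse + self.refine(coarse)).transpose(1, 2)

tok = FlowEnhancedTokenizer()
motion = torch.randn(2, 64, 263)   # batch of 64-frame motions
ids = tok.encode(motion)           # (2, 32) discrete token ids
recon = tok.decode(ids)            # (2, 64, 263) reconstructed motion
```

Halving the downsampling rate and enlarging the codebook both reduce quantization loss; the refinement stage then restores the high-frequency detail that discretization still removes.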

Gallery of Generation

Comparison with Baselines


Text: The person walks around the floor, then runs forwards. After stopping, they stretch their elbows. Following this, the person, who is a man, raises his right hand over his head, then lets it back down next to his hip.


by PlanMoGPT

by MoMask

by T2M-GPT



Text: The figure walks in a counterclockwise circle, stopping at their starting point. Then, they stand up from being on one knee. Finally, the person is looking at an accident.


by PlanMoGPT

by MoMask

by T2M-GPT



Text: The sequence of motions is as follows: First, a person hops on their right foot. Then, a man walks slowly forward with his hands out to his sides, using an object on either side for balance. After that, the person walks in a curved path to the right. Finally, they slowly move their left hand from the right side of their body to the left.


by PlanMoGPT

by MoMask

by T2M-GPT



Text: The person starts by walking in a circle. After completing the circle, they begin to walk forwards playfully, hopping slightly with each step. Midway through their playful walk, they respectfully take a knee for Black Lives Matter. As they rise from the kneeling position, they accidentally hurt their back.


by PlanMoGPT

by MoMask

by T2M-GPT



Text: The person first bounces a ball to their right-hand side. Then, they swing a bat to their left twice. Following that, they wave widely with their right arm. Next, they dance side to side, moving their arms wide in and out. Finally, they grab something in front of them, swing it around to the side, and throw it overhead.


by PlanMoGPT

by MoMask

by T2M-GPT

Conclusion

In this paper, we investigate what limits the ability of LLMs in text-to-motion generation tasks. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokens lead to severe local dependencies, while coarse-grained motion tokens lose motion details. To address this issue, we propose PlanMoGPT, an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. Extensive experiments demonstrate that PlanMoGPT not only generates precise and diverse human motions but also outperforms the state-of-the-art method MoMask in both human and automated evaluations on short- and long-sequence datasets. Moreover, PlanMoGPT resolves the diversity-quality dilemma of existing non-LLM approaches, which further verifies the value of exploiting the potential of LLMs for text-to-motion tasks. In the future, we will explore more flexible planning, such as manually selected keyframes. In addition, we will explore how to extend PlanMoGPT to generate motion with facial expressions and hand movements.