PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis

¹GSAI, Renmin University of China  ²Kuaishou

Each method generates 30 motions from the same text; similar motions are grouped, and the number of occurrences per group is reported.

Abstract

Recent advances in large language models (LLMs) have enabled breakthroughs in many multimodal generation tasks, but a significant performance gap still exists in text-to-motion generation, where LLM-based methods lag far behind non-LLM methods. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokenization induces local dependency issues, where LLMs overemphasize short-term coherence at the expense of global semantic alignment, while coarse-grained tokenization sacrifices motion details. To resolve this issue, we propose PlanMoGPT, an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. First, our progressive planning mechanism leverages LLMs' autoregressive capabilities to hierarchically generate motion tokens, starting from sparse global plans and iteratively refining them into full sequences. Second, our flow-enhanced tokenizer doubles the downsampling resolution and expands the codebook size by eight times, minimizing detail loss during discretization, while a flow-enhanced decoder recovers motion nuances. Extensive experiments on text-to-motion benchmarks demonstrate that PlanMoGPT achieves state-of-the-art performance, improving FID scores by 63.8% (from 0.380 to 0.141) on long-sequence generation while enhancing motion diversity by 49.9% compared to existing methods. The proposed framework successfully resolves the diversity-quality trade-off that plagues current non-LLM approaches, establishing new standards for text-to-motion generation.
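To make the sparse-to-dense idea concrete, below is a minimal, hypothetical sketch of a progressive planning loop. The `generate_tokens` callable, the stride schedule, and all names are illustrative assumptions standing in for the actual LLM call, not the released implementation.

```python
from typing import Callable, Dict, List, Sequence

def progressive_plan_generate(
    generate_tokens: Callable[[List[int], Dict[int, int], List[int]], List[int]],
    text_tokens: List[int],
    target_len: int,
    strides: Sequence[int] = (8, 4, 2, 1),
) -> List[int]:
    """Sparse-to-dense motion token generation (illustrative only).

    `generate_tokens(text, plan_so_far, positions)` wraps an LLM call:
    it returns one motion token for each requested position, conditioned
    on the text and on the coarser plan produced so far.
    """
    plan: Dict[int, int] = {}  # position -> motion token id
    for stride in strides:
        # Only fill positions not yet covered by a coarser plan.
        positions = [p for p in range(0, target_len, stride) if p not in plan]
        new_tokens = generate_tokens(text_tokens, dict(plan), positions)
        plan.update(zip(positions, new_tokens))  # earlier plan tokens stay fixed
    return [plan[p] for p in range(target_len)]

# Toy stand-in for the LLM so the sketch runs end to end.
dummy_llm = lambda text, plan, positions: [0] * len(positions)
tokens = progressive_plan_generate(dummy_llm, text_tokens=[1, 2, 3], target_len=16)
assert len(tokens) == 16
```

Each pass conditions on all tokens placed by coarser passes, so the global plan constrains local decoding rather than the other way around.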

Approach Overview

PlanMoGPT consists of two components: (a) a flow-enhanced motion tokenizer that converts motion into fine-grained tokens with minimal loss; (b) an LLM integrated with progressive planning, which generates motion tokens progressively, from a widely spaced plan down to the full motion token sequence.
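The following is a minimal sketch of what a fine-grained motion tokenizer of this kind could look like, assuming 2× temporal downsampling, a large codebook, and a small convolutional refinement head standing in for the flow-enhanced decoder. The dimensions (`motion_dim=263`, `codebook_size=8192`) and module choices are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FlowEnhancedTokenizer(nn.Module):
    """Illustrative sketch of a fine-grained VQ motion tokenizer.

    The encoder downsamples the motion sequence by 2x (a finer granularity
    than the 4x common in prior VQ tokenizers) and quantizes each frame
    feature against a large codebook. The decoder reconstructs coarse
    motion, and `refine` stands in for the flow-enhanced decoder that
    recovers motion nuances.
    """
    def __init__(self, motion_dim=263, hidden=512, codebook_size=8192):
        super().__init__()
        self.encoder = nn.Conv1d(motion_dim, hidden, kernel_size=4, stride=2, padding=1)
        self.codebook = nn.Embedding(codebook_size, hidden)
        self.decoder = nn.ConvTranspose1d(hidden, motion_dim, kernel_size=4, stride=2, padding=1)
        self.refine = nn.Sequential(  # stand-in for flow-based refinement
            nn.Conv1d(motion_dim, hidden, 3, padding=1), nn.GELU(),
            nn.Conv1d(hidden, motion_dim, 3, padding=1))

    def encode(self, motion):                              # motion: (B, T, motion_dim)
        z = self.encoder(motion.transpose(1, 2))           # (B, hidden, T//2)
        d = torch.cdist(z.transpose(1, 2), self.codebook.weight.unsqueeze(0))
        return d.argmin(-1)                                # token ids: (B, T//2)

    def decode(self, ids):                                 # ids: (B, T//2)
        z = self.codebook(ids).transpose(1, 2)             # (B, hidden, T//2)
        coarse = self.decoder(z)                           # (B, motion_dim, T)
        return (coarse + self.refine(coarse)).transpose(1, 2)

tok = FlowEnhancedTokenizer()
motion = torch.randn(2, 64, 263)   # batch of 64-frame motions
ids = tok.encode(motion)           # (2, 32) discrete token ids
recon = tok.decode(ids)            # (2, 64, 263) reconstructed motion
```

Halving the downsampling rate and enlarging the codebook both reduce quantization loss; the refinement stage then restores the high-frequency detail that discretization still removes.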

Gallery of Generation

Comparison with Baselines


Text: The person walks around the floor, then runs forwards. After stopping, they stretch their elbows. Following this, the person, who is a man, raises his right hand over his head, then lets it back down next to his hip.


by PlanMoGPT

by MoMask

by T2M-GPT



Text: The figure walks in a counterclockwise circle, stopping at their starting point. Then, they stand up from being on one knee. Finally, the person is looking at an accident.


by PlanMoGPT

by MoMask

by T2M-GPT



Text: The sequence of motions is as follows: First, a person hops on their right foot. Then, a man walks slowly forward with his hands out to his sides, using an object on either side for balance. After that, the person walks in a curved path to the right. Finally, they slowly move their left hand from the right side of their body to the left.


by PlanMoGPT

by MoMask

by T2M-GPT



Text: The person starts by walking in a circle. After completing the circle, they begin to walk forwards playfully, hopping slightly with each step. Midway through their playful walk, they respectfully take a knee for Black Lives Matter. As they rise from the kneeling position, they accidentally hurt their back.


by PlanMoGPT

by MoMask

by T2M-GPT



Text: The person first bounces a ball to their right-hand side. Then, they swing a bat to their left twice. Following that, they wave widely with their right arm. Next, they dance side to side, moving their arms wide in and out. Finally, they grab something in front of them, swing it around to the side, and throw it overhead.


by PlanMoGPT

by MoMask

by T2M-GPT

Conclusion

In this paper, we investigate what limits the ability of LLMs in text-to-motion generation tasks. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokens lead to severe local dependencies, while coarse-grained motion tokens lose motion details. To address this issue, we propose PlanMoGPT, an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. Extensive experiments demonstrate that PlanMoGPT not only generates precise and diverse human motions but also outperforms the state-of-the-art method MoMask in both human and automated evaluations on short- and long-sequence datasets. Moreover, PlanMoGPT resolves the diversity-quality dilemma of existing non-LLM approaches, which further verifies the value of exploiting the potential of LLMs for text-to-motion tasks. In the future, we will explore more flexible planning, such as manually selected keyframes. In addition, we will explore how to extend PlanMoGPT to generate motion with facial expressions and hand movements.