UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

1BUPT, 2BAAI
*Equal contribution, Corresponding author, Project leader
Pipeline

UDM-GRPO is the first method to achieve stable and efficient group relative policy optimization for uniform discrete diffusion models.

Abstract

The Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple text-to-image (T2I) tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy improves from 8% to 57%, further validating the effectiveness and generalization ability of our method.
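The "group relative" part of GRPO can be illustrated with a minimal sketch: rewards for a group of samples drawn from the same prompt are normalized within the group to form advantages. The function below is an illustrative reconstruction of that normalization step, not the paper's exact implementation; the `eps` stabilizer is our assumption.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantage estimation (GRPO-style): normalize
    each sample's reward by the mean and std of its own group, so
    samples are compared only against siblings from the same prompt.
    Illustrative sketch; eps guards against zero-variance groups."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four generations for one prompt, scored by a reward model.
adv = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

By construction the advantages are zero-mean within each group, so above-average samples are reinforced and below-average ones are suppressed without any learned value baseline.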

Problem Analysis

Problem analysis figure

We conduct a detailed analysis of the instability caused by the conventional use of intermediate samples and backward trajectories. Based on this analysis, we motivate a unified design that instead relies on the final clean sample and the forward trajectory, leading to more stable and consistent optimization.
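To make the forward-trajectory idea concrete, the sketch below re-noises a final clean sample with a uniform forward kernel: each token is kept with probability alpha_t and otherwise resampled uniformly from the vocabulary. This is a generic uniform-diffusion forward step under our own assumptions (the function name, the scalar `alpha_t` schedule, and the NumPy setting are illustrative), not the paper's exact procedure.

```python
import numpy as np

def forward_renoise(x0, alpha_t, vocab_size, rng=None):
    """Reconstruct a noised state x_t from the clean sample x0 via the
    uniform forward process: keep each token with prob alpha_t, else
    replace it with a token drawn uniformly from the vocabulary.
    Illustrative sketch of a uniform-diffusion forward kernel."""
    rng = np.random.default_rng() if rng is None else rng
    x0 = np.asarray(x0)
    keep = rng.random(x0.shape) < alpha_t          # per-token keep mask
    noise = rng.integers(0, vocab_size, size=x0.shape)
    return np.where(keep, x0, noise)

# Example: re-noise a short token sequence at a mid-trajectory step.
x_t = forward_renoise([3, 1, 4, 1, 5], alpha_t=0.7, vocab_size=10)
```

Because the re-noised states are drawn from the same forward kernel used during pretraining, the resulting probability paths stay aligned with the pretraining distribution, which is the stability argument made above.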

Visual Results

GenEval qualitative comparison results

Qualitative Comparison. We evaluate our model against SD3.5-L, Flux.1 Dev, and URSA.

Training process visualization

We visualize the generated samples across successive training iterations during the optimization.

Quantitative Results

Quantitative Comparison. We evaluate our model against SD3.5-L, Flux.1 Dev, and URSA using prompts from GenEval and PickScore, respectively.

Comparison results on GenEval.

Comparison results on GenEval, PickScore, and OCR.

BibTeX

@article{wang2026udmgrpo,
  title={UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models},
  author={Wang, Jiaqi and Deng, Haoge and Pan, Ting and Liu, Yang and Wang, Chengyuan and Zhang, Fan and Qi, Yonggang and Wang, Xinlong},
  journal={arXiv preprint arXiv:2604.18518},
  year={2026}
}