UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

1BUPT, 2BAAI
*Equal contribution, Corresponding author, Project leader
Pipeline

UDM-GRPO is the first method to achieve stable and efficient group relative policy optimization for uniform discrete diffusion models.

Abstract

The Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple text-to-image (T2I) tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy improves from 8% to 57%, further validating the effectiveness and generalization ability of our method.
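The "group relative" part of GRPO can be illustrated with a minimal sketch: rewards for a group of samples drawn from the same prompt are normalized within the group to form advantages. The function below is an illustrative reconstruction of that normalization step, not the paper's exact implementation; the `eps` stabilizer is our assumption.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantage estimation (GRPO-style): normalize
    each sample's reward by the mean and std of its own group, so
    samples are compared only against siblings from the same prompt.
    Illustrative sketch; eps guards against zero-variance groups."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four generations for one prompt, scored by a reward model.
adv = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

By construction the advantages are zero-mean within each group, so above-average samples are reinforced and below-average ones are suppressed without any learned value baseline.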

Problem Analysis

Problem analysis figure

We conduct a detailed analysis of the instability caused by the conventional use of intermediate samples and backward trajectories. Based on this analysis, we motivate a unified design that instead relies on the final clean sample and the forward trajectory, leading to more stable and consistent optimization.
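To make the forward-trajectory idea concrete, the sketch below re-noises a final clean sample with a uniform forward kernel: each token is kept with probability alpha_t and otherwise resampled uniformly from the vocabulary. This is a generic uniform-diffusion forward step under our own assumptions (the function name, the scalar `alpha_t` schedule, and the NumPy setting are illustrative), not the paper's exact procedure.

```python
import numpy as np

def forward_renoise(x0, alpha_t, vocab_size, rng=None):
    """Reconstruct a noised state x_t from the clean sample x0 via the
    uniform forward process: keep each token with prob alpha_t, else
    replace it with a token drawn uniformly from the vocabulary.
    Illustrative sketch of a uniform-diffusion forward kernel."""
    rng = np.random.default_rng() if rng is None else rng
    x0 = np.asarray(x0)
    keep = rng.random(x0.shape) < alpha_t          # per-token keep mask
    noise = rng.integers(0, vocab_size, size=x0.shape)
    return np.where(keep, x0, noise)

# Example: re-noise a short token sequence at a mid-trajectory step.
x_t = forward_renoise([3, 1, 4, 1, 5], alpha_t=0.7, vocab_size=10)
```

Because the re-noised states are drawn from the same forward kernel used during pretraining, the resulting probability paths stay aligned with the pretraining distribution, which is the stability argument made above.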

Visual Results

GenEval qualitative comparison results

Qualitative Comparison. We evaluate our model against SD3.5-L, Flux.1 Dev, and URSA.

Training process visualization

We visualize the generated samples across successive training iterations during the optimization.

Quantitative Results

Quantitative Comparison. We evaluate our model against SD3.5-L, Flux.1 Dev, and URSA using prompts from GenEval and PickScore, respectively.

Comparison results on GenEval.

Comparison results on GenEval, PickScore, and OCR.

BibTeX

@article{wang2026udmgrpo,
  title={UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models},
  author={Wang, Jiaqi and Deng, Haoge and Pan, Ting and Liu, Yang and Wang, Chengyuan and Zhang, Fan and Qi, Yonggang and Wang, Xinlong},
  journal={arXiv preprint arXiv:2604.18518},
  year={2026}
}