MixGRPO: Unlocking Flow-Based GRPO Efficiency with Mixed ODE-SDE

¹Hunyuan, Tencent ²School of Computer Science, Peking University ³Computer Center, Peking University

Pre-experiments

Figure 1: Performance comparison for different numbers of optimized denoising steps. DanceGRPO's performance gains depend on optimizing more steps, whereas MixGRPO achieves optimal performance while optimizing only 4 steps.

Figure 2: t-SNE visualization of images sampled with different strategies. Employing SDE sampling in the early stages of the denoising process results in a more discrete data distribution.

Abstract

Although GRPO substantially enhances flow matching models in human preference alignment for image generation, methods such as FlowGRPO remain inefficient because they must sample and optimize over every denoising step specified by the Markov Decision Process (MDP). In this paper, we propose MixGRPO, a novel framework that exploits the flexibility of mixed sampling strategies by integrating stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP, improving efficiency and boosting performance. Specifically, MixGRPO introduces a sliding window mechanism that uses SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside it. This design confines sampling randomness to the time-steps within the window, which reduces the optimization overhead and allows more focused gradient updates that accelerate convergence. Additionally, because time-steps outside the sliding window are not involved in optimization, higher-order solvers can be used to sample them. We therefore present a faster variant, MixGRPO-Flash, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%. Code and models are available at MixGRPO.
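The sliding window described above can be illustrated with a small scheduling sketch. This is not the released implementation: the class name `SlidingWindow`, the parameters `shift_interval` and `stride`, the total of 8 denoising steps, and the choice to start the window at the first denoising step and move it toward later steps are all assumptions made for illustration. Only the window size of 4 optimized steps is taken from the text. The sketch simply labels each denoising step as "sde" (stochastic, optimized by GRPO) or "ode" (deterministic, not optimized).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SlidingWindow:
    """Hypothetical helper: decides which denoising steps are stochastic."""
    num_steps: int        # total denoising steps (assumed value below)
    window_size: int      # steps sampled with the SDE and optimized
    shift_interval: int   # training iterations between window moves (assumed)
    stride: int           # steps the window advances per move (assumed)

    def start(self, iteration: int) -> int:
        # Advance the window every `shift_interval` iterations, then clamp
        # so it never runs past the final denoising step.
        raw = (iteration // self.shift_interval) * self.stride
        return min(raw, self.num_steps - self.window_size)

    def plan(self, iteration: int) -> List[str]:
        # "sde" inside the window (randomness + GRPO update), "ode" elsewhere.
        left = self.start(iteration)
        right = left + self.window_size
        return ["sde" if left <= t < right else "ode"
                for t in range(self.num_steps)]


window = SlidingWindow(num_steps=8, window_size=4, shift_interval=10, stride=1)
for it in (0, 10, 1000):
    print(it, window.plan(it))
```

Whatever the iteration, exactly `window_size` steps are marked "sde", so the number of steps that need gradient updates stays fixed while the remaining steps are free to use any deterministic solver.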

Experiments

Table 1: Comparison of overhead and performance. MixGRPO achieves the best performance across multiple metrics. MixGRPO-Flash significantly reduces sampling time while outperforming DanceGRPO. Bold marks rank 1; underline marks rank 2. *The Frozen strategy means that optimization is applied only at the initial denoising steps.

Table 2: Comparison of in-domain and out-of-domain reward metrics. MixGRPO achieves the best performance on both in-domain and out-of-domain rewards, whether trained with a single reward model or with multiple reward models.

Table 3: Comparison with Flow-DPO and Flow-GRPO. MixGRPO outperforms both in training efficiency and performance.

Table 4: Ablation experiments on important parameters of the sliding window.

Table 5: Comparison of performance across high-order solvers.

Figure 1: Qualitative comparison. MixGRPO achieves superior performance in both semantics and aesthetics.

Figure 2: Qualitative comparison with different training-time sampling steps. The performance of MixGRPO does not significantly decrease as overhead is reduced. *The Frozen strategy means that optimization is applied only at the initial denoising steps.

Figure 3: Comparison of the visualization results of FLUX, DanceGRPO, and MixGRPO under HPS-v2.1 as the reward model.

Figure 4: Comparison of the visualization results of FLUX, DanceGRPO, and MixGRPO under HPS-v2.1 and CLIP Score as the reward models.

Figure 5: Comparison of the visualization results of SD3.5-M, offline DPO, online DPO, Flow-GRPO and MixGRPO under HPS-v2.1, Pick Score and ImageReward as multi-reward models.

BibTeX

@misc{li2025mixgrpounlockingflowbasedgrpo,
  title={MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE},
  author={Junzhe Li and Yutao Cui and Tao Huang and Yinping Ma and Chun Fan and Miles Yang and Zhao Zhong},
  year={2025},
  eprint={2507.21802},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2507.21802},
}

Method

MixGRPO Algorithm

MixGRPO-Flash Algorithm
