AI & ML Papers
🔥 MARBLE: Multi-Aspect Reward Balance for Diffusion RL
📅 Published on May 7
🔗 Links:
• arXiv: https://arxiv.org/abs/2605.06507
• PDF: https://arxiv.org/pdf/2605.06507
• Project Page: https://aim-uofa.github.io/MARBLE/
• GitHub: https://github.com/aim-uofa/MARBLE ⭐ 24
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#MultiRewardReinforcementLearning #DiffusionModels #GradientSpaceOptimization #MultiAspectRewardBalance #ReinforcementLearningFineTuning
💡 The paper introduces MARBLE, a gradient-space optimization framework for multi-reward reinforcement learning (RL) fine-tuning of diffusion models. Existing methods handle multiple rewards either by training a separate model for each reward or by collapsing the rewards into a weighted sum. The weighted sum can perform poorly because of sample-level mismatch: most rollouts are highly informative for some reward dimensions but irrelevant for others, so summing the rewards dilutes their supervision.
To address this issue, MARBLE maintains an independent advantage estimator for each reward and computes per-reward policy gradients. It then harmonizes these gradients into a single update direction by solving a quadratic programming (QP) problem, with no manual reward weighting. The result is one unified model trained jointly on all rewards, without heavy manual tuning or sequential training.
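The post does not spell out MARBLE's exact QP, so the following is only a minimal sketch of one classic way to harmonize gradients: the min-norm problem over the simplex (as used in MGDA-style multi-objective optimization), solved here with Frank-Wolfe and an exact line search. The function names, the two toy gradients, and the choice of solver are all assumptions for illustration, not the authors' implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def min_norm_weights(grads, iters=200, tol=1e-12):
    """Approximately solve min_w ||sum_k w_k g_k||^2 with w on the simplex.

    Assumed stand-in for the paper's QP; uses Frank-Wolfe with exact line search.
    """
    k = len(grads)
    gram = [[dot(gi, gj) for gj in grads] for gi in grads]  # G[i][j] = <g_i, g_j>
    w = [1.0 / k] * k  # start from uniform weights
    for _ in range(iters):
        gw = [dot(row, w) for row in gram]       # G w (half the objective's gradient)
        t = min(range(k), key=lambda i: gw[i])   # simplex vertex with steepest descent
        w_gw = dot(w, gw)                        # w^T G w
        slope = gw[t] - w_gw                     # w^T G d, where d = e_t - w
        curv = gram[t][t] - 2.0 * gw[t] + w_gw   # d^T G d
        if slope > -tol or curv <= tol:
            break                                # no further descent possible
        gamma = min(1.0, -slope / curv)          # exact minimizer along the segment
        w = [(1.0 - gamma) * wi for wi in w]
        w[t] += gamma
    return w


def combine(grads, w):
    """Weighted sum of the per-reward gradients."""
    return [sum(wk * g[i] for wk, g in zip(w, grads)) for i in range(len(grads[0]))]


# Two hypothetical, conflicting per-reward gradients.
g_a = [1.0, 0.0]
g_b = [-3.0, 1.0]
w = min_norm_weights([g_a, g_b])
direction = combine([g_a, g_b], w)
naive = combine([g_a, g_b], [0.5, 0.5])
print(w, dot(direction, g_a), dot(direction, g_b), dot(naive, g_a))
```

In this toy case the equal-weight sum has a negative inner product with `g_a`, while the min-norm direction has a positive inner product with both gradients. For more than two rewards Frank-Wolfe converges slowly, so a dedicated QP solver would likely be the better choice.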
The authors also propose an amortized formulation that reduces MARBLE's computational cost, and they apply exponential moving average (EMA) smoothing to the balancing coefficients to stabilize updates against transient fluctuations.
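EMA smoothing of the balancing coefficients could look roughly like the sketch below. The decay value `beta=0.9`, the function name, and the renormalization step are assumptions made for illustration; the post does not give the actual hyperparameters or update rule.

```python
def ema_coefficients(prev, new, beta=0.9):
    """Blend the previous balancing coefficients with freshly solved ones.

    beta is a hypothetical decay: higher values react more slowly to new solutions.
    """
    mixed = [beta * p + (1.0 - beta) * n for p, n in zip(prev, new)]
    total = sum(mixed)
    # A convex mix of simplex vectors already sums to 1; renormalizing only
    # guards against floating-point drift over many steps.
    return [m / total for m in mixed]


smoothed = ema_coefficients([0.5, 0.5], [1.0, 0.0])
print(smoothed)
```

With these inputs a single noisy solution that puts all weight on the first reward moves the coefficients only slightly, from (0.5, 0.5) toward (0.55, 0.45).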
The results show that MARBLE improves all five reward dimensions simultaneously when fine-tuning the SD3.5 Medium model, outperforming the baseline. MARBLE also turns the worst-aligned reward's gradient cosine from negative to consistently positive, meaning the update no longer moves against that reward. It trains at nearly the same speed as the baseline, with only a 3% slowdown.
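The worst-aligned gradient cosine can be read as a simple diagnostic: for each reward, take the cosine between its gradient and the update direction, then report the minimum. The sketch below assumes that reading; the toy gradients and both candidate directions are hypothetical and hard-coded (the "balanced" direction is the min-norm combination of the two toy gradients).

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den > 0 else 0.0


def worst_cosine(grads, direction):
    """Smallest cosine between any per-reward gradient and the update direction."""
    return min(cosine(g, direction) for g in grads)


grads = [[1.0, 0.0], [-3.0, 1.0]]   # hypothetical per-reward gradients
averaged = [-1.0, 0.5]              # equal-weight average of the two
balanced = [1.0 / 17.0, 4.0 / 17.0] # min-norm combination of the two

print(worst_cosine(grads, averaged), worst_cosine(grads, balanced))
```

A negative worst-case cosine means that, to first order, the step decreases at least one reward; in this toy example the averaged direction does so while the balanced one does not.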