The paper on Soft Adaptive Policy Optimization (SAPO) was published on arXiv, and the Qwen team then introduced this reinforcement learning training method for large language and multimodal models on its official blog. The authors point out that with existing policy optimization based on hard clipping, the gradient either vanishes or explodes when importance ratios fluctuate sharply. The problem is most pronounced in Mixture-of-Experts (MoE) architectures, where it is more likely to cause unstable training and poor sample efficiency.
SAPO replaces the traditional hard boundary with a continuous, adjustable temperature-controlled gate that adaptively scales token-level importance while maintaining trust-region-like constraints at the sequence level. It suppresses only the parts that deviate severely from the policy and retains the useful gradient from tokens close to the policy distribution. The algorithm also allows asymmetric temperature settings to improve robustness in high-variance MoE models. According to the paper and its experiments, under comparable training budgets SAPO sustains stable RL training for longer and significantly improves key metrics such as Pass@1 on math, coding, and multimodal tasks for the Qwen3-VL series models, providing a more scalable and reusable foundation for RL tuning of large models.
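To make the contrast between hard clipping and a temperature-controlled gate concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the text above does not give a formula, so the sigmoid-shaped gate, the function names `soft_gate_weight` and `hard_clip_weight`, and the temperature and clipping values are all assumptions chosen for demonstration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; adequate for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_clip_weight(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Gradient weight under PPO-style hard clipping (eps is illustrative).

    The weight is 1 inside the clipping range and drops abruptly to 0
    once the importance ratio leaves it on the side the advantage pushes toward.
    """
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0
    return 1.0


def soft_gate_weight(ratio: float, advantage: float,
                     tau_pos: float = 1.0, tau_neg: float = 2.0) -> float:
    """Gradient weight under an ASSUMED sigmoid-shaped soft gate.

    The weight equals 1 when ratio == 1 (on-policy) and decays smoothly
    as the ratio moves away from 1. A separate temperature is used for
    positive and negative advantages (asymmetric temperatures); the
    values here are hypothetical.
    """
    tau = tau_pos if advantage > 0 else tau_neg
    s = sigmoid(tau * (ratio - 1.0))
    return 4.0 * s * (1.0 - s)


if __name__ == "__main__":
    for r in (0.5, 0.9, 1.0, 1.1, 1.3, 2.0, 4.0):
        print(f"ratio={r:<4} hard={hard_clip_weight(r, 1.0):.2f} "
              f"soft={soft_gate_weight(r, 1.0):.3f}")
```

In this sketch a token with a ratio slightly outside the clipping range still contributes a gradient under the soft gate, while a token with an extreme ratio is smoothly down-weighted instead of being cut off. Using a larger temperature for negative advantages makes the gate narrower on that side, which is one way to realize the asymmetric setting described above.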
FAQs
Q: What is SAPO?
A: SAPO stands for Soft Adaptive Policy Optimization. It is a policy optimization method for RL tuning of large models that emphasizes smooth gating and adaptive updates.
Q: What are its core improvements over traditional hard clipping?
A: SAPO replaces hard thresholds with continuous, temperature-controlled gates, avoiding the sudden vanishing or explosion of gradients caused by all-or-nothing clipping.
Q: Why is there a special emphasis on its advantages for MoE models?
A: The MoE architecture itself amplifies variance, and SAPO's asymmetric temperatures and fine-grained token-level adjustment can mitigate the damage that extreme samples do to training stability.
Q: What are the specific benefits of SAPO in terms of performance?
A: The reported experiments show that it sustains a longer and more stable RL training process, and delivers higher Pass@1 and improved performance across multiple tasks on the Qwen3-VL models.
Q: How open are SAPO's research and implementation?
A: The details of the algorithm and the experimental results have been published in the paper and on the official blog, so researchers and engineering teams can implement and evaluate the method further.