**Jingcheng Hu, Qi Han, Yinmin Zhang,** Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum

[Still Work in Progress]

Started writing on 10 Feb 2025; released on 18 Feb 2025.

Other versions: [📚PDF][👨‍💻‍ Github][🤗 HF]

<aside> 🌊

We introduce Open-Reasoner-Zero, the first open-source implementation of large-scale reasoning-oriented RL training focused on scalability, simplicity, and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1, \gamma=1$) and a straightforward rule-based reward function, without any KL regularization, is sufficient to scale up both response length and benchmark performance on reasoning tasks, similar to the phenomenon observed in DeepSeek-R1-Zero. Notably, our implementation outperforms DeepSeek-R1-Zero-Qwen-32B on the GPQA Diamond benchmark, while requiring only 1/30 of the training steps. In the spirit of open source, we release our source code, parameter settings, training data, and model weights.

</aside>
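To make the $\lambda=1, \gamma=1$ setting concrete, the sketch below (a self-contained illustration, not our released training code; the reward and value numbers are made up) computes GAE advantages for one finished response and shows that, with both coefficients set to 1, the estimate collapses to the undiscounted return minus the critic's value baseline.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Generalized Advantage Estimation over a single finished response.

    `rewards[t]` is the reward at token t (typically 0 everywhere except the
    final token, which carries the rule-based reward), and `values[t]` is the
    critic's value estimate at token t.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    next_value = 0.0  # terminal state has zero value
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


# With gamma = lam = 1, the TD errors telescope, so the advantage at token t
# is simply (sum of remaining rewards) - values[t]: the undiscounted return
# minus the value baseline.
rewards = [0.0, 0.0, 0.0, 1.0]          # sparse rule-based reward at the end
values = [0.4, 0.5, 0.7, 0.9]           # illustrative critic outputs
print(gae_advantages(rewards, values))  # [0.6, 0.5, 0.3, 0.1]
```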

Table of Contents

A. Qualitative Examples of Model Output

B. Visualization of Credit Assignment in Bad Samples

We provide a representative visualization of the critic model’s output at each token for a bad response (one containing many repetitive patterns). The greener the text, the larger the value; the more orange the text, the smaller the value. We identify several interesting observations:

We believe this is a sign that the critic model has already learnt the credit assignment task very well: