**Jingcheng Hu, Qi Han, Yinmin Zhang,** Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum

[Still Work in Progress]

Started writing on 10 Feb 2025; released on 18 Feb 2025.

Other versions: [📚PDF][👨‍💻‍ Github][🤗 HF]

<aside> 🌊

We introduce Open-Reasoner-Zero, the first open-source implementation of large-scale reasoning-oriented RL training focused on scalability, simplicity, and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1, \gamma=1$) and a straightforward rule-based reward function, without any KL regularization, is sufficient to scale up both response length and benchmark performance on reasoning tasks, similar to the phenomenon observed in DeepSeek-R1-Zero. Notably, our implementation outperforms DeepSeek-R1-Zero-Qwen-32B on the GPQA Diamond benchmark, while requiring only 1/30 of the training steps. In the spirit of open source, we release our source code, parameter settings, training data, and model weights.

</aside>
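To make the $\lambda=1, \gamma=1$ setting concrete, the sketch below (a self-contained illustration, not our released training code; the reward and value numbers are made up) computes GAE advantages for one finished response and shows that, with both coefficients set to 1, the estimate collapses to the undiscounted return minus the critic's value baseline.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Generalized Advantage Estimation over a single finished response.

    `rewards[t]` is the reward at token t (typically 0 everywhere except the
    final token, which carries the rule-based reward), and `values[t]` is the
    critic's value estimate at token t.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    next_value = 0.0  # terminal state has zero value
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


# With gamma = lam = 1, the TD errors telescope, so the advantage at token t
# is simply (sum of remaining rewards) - values[t]: the undiscounted return
# minus the value baseline.
rewards = [0.0, 0.0, 0.0, 1.0]          # sparse rule-based reward at the end
values = [0.4, 0.5, 0.7, 0.9]           # illustrative critic outputs
print(gae_advantages(rewards, values))  # [0.6, 0.5, 0.3, 0.1]
```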

Table of Contents

A. Qualitative Examples of Model Output

B. Visualization of Credit Assignment in Bad Samples

We provide a representative visualization of the critic model’s output at each token for a bad response (one containing many repetitive patterns). The greener the text, the larger the value; the more orange the text, the smaller the value. We identify several interesting observations:

We believe this is a sign that the critic model has already learnt the credit assignment task very well: