| input_ids | ["123+456=?", "The", "answer", "is", "579", "."] | |
| logprobs | [-0.12, -0.41, -0.35, -0.09, -0.22] | |
| loss_mask | [0, 0, 0, 1, 1, 1, 1, 1] | |
| rewards | 1.0 | |
| advantages | [0.12, 0.34, 0.56, 0.78, 0.91] | |
| versions | ["v2", "v2", "v2", "v2", "v2"] |
| 42.0 | - | |
| 1 | 41.8 | 42.1 |
| 42.2 | ||
| 35.7 | 41.0 | |
| ∞ | 34.0 | 36.9 |
| Ref Model | Old Model | Current Model | |
|---|---|---|---|
| ✅ | ✅ | ❌ | |
| ❌ | ❌ | ✅ | |
ref_logprobs |
old_logprobs |
current_logprobs |
|
| — |
ratio |
||