Overview
LLaVA-OneVision-1.5-RL presents an RL post-training stage that uses 67K curated examples, selected by a discrepancy-based strategy, to elicit explicit chain-of-thought reasoning, achieving significant gains on STEM, coding, and reasoning benchmarks while preserving visual understanding capabilities.
Our contributions are threefold:
Discrepancy-Driven Data Curation. We identify tasks where a performance gap exists between Pass@N and Pass@1, targeting "latent capability" (abilities the model already has but does not reliably deploy) rather than knowledge injection.
Rule-Based Reward System. We employ domain-specific verification rules rather than learned preference models, enabling precise feedback across STEM, grounding, spatial reasoning, counting, coding, OCR, and diagram tasks.
Two-Stage Curriculum Training. We design a training curriculum that first stabilizes concise task performance with answer-only RL, then unlocks deeper reasoning through chain-of-thought RL.
RL Training Data Composition
Figure 3. Distribution of task categories in the RL training data. (a) Total RL corpus (67K instances). (b) Stage 1: Answer-only training. (c) Stage 2: Chain-of-thought training.
RL Data Strategy
Discrepancy-Driven Selection
If a model can solve a task given enough attempts (high Pass@N) but rarely gets it right on the first try (low Pass@1), it already has the latent capability — it just needs to learn to use it reliably. We select tasks with this gap for RL training, filtering out tasks that are too easy (high Pass@1, nothing to learn) or too hard (low Pass@N, beyond current capability).
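As a concrete illustration, this selection rule can be sketched as follows. The thresholds (`min_gap`, `max_pass1`) and the per-task outcome format are illustrative assumptions, not the paper's actual values:

```python
def select_for_rl(task_outcomes: dict, min_gap: float = 0.3, max_pass1: float = 0.7) -> list:
    """task_outcomes maps task_id -> list of bool, one entry per sampled attempt (N total).
    Keep tasks with high Pass@N but low Pass@1 -- the 'latent capability' band."""
    selected = []
    for task_id, outcomes in task_outcomes.items():
        pass_1 = sum(outcomes) / len(outcomes)   # expected first-try accuracy
        pass_n = 1.0 if any(outcomes) else 0.0   # solvable within N attempts?
        if pass_n == 0.0:
            continue                             # too hard: beyond current capability
        if pass_1 > max_pass1:
            continue                             # too easy: nothing to learn
        if pass_n - pass_1 >= min_gap:
            selected.append(task_id)             # the Pass@N vs Pass@1 gap we target
    return selected
```

With 8 attempts per task, a task solved once in eight tries (Pass@1 = 0.125, Pass@N = 1.0) is kept, while an always-solved or never-solved task is filtered out.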
Reward-Based Sampling
Multiple candidate responses are filtered by average reward scores to exclude trivial and unsolvable cases, focusing on medium-difficulty instances that provide optimal learning signals.
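A minimal sketch of this filter, assuming each instance carries the reward scores of its candidate responses; the band boundaries `low`/`high` are placeholder values:

```python
def filter_by_mean_reward(instances, low: float = 0.1, high: float = 0.9) -> list:
    """instances: list of (instance_id, [reward per candidate response]).
    Drop trivial (mean reward near 1) and unsolvable (mean reward near 0) cases,
    keeping the medium-difficulty band that yields informative GRPO groups."""
    kept = []
    for instance_id, rewards in instances:
        mean_r = sum(rewards) / len(rewards)
        if low <= mean_r <= high:
            kept.append(instance_id)
    return kept
```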
Reward System Architecture
Our RL setup employs a rule-based reward paradigm, where rewards are derived directly from task outcomes rather than learned preference models. Since different answer types require distinct verification strategies, we design answer-type-specific scoring rules via the reward/ module.
| Category | Source | Reward Design Details |
|---|---|---|
| STEM | ViRL39K | Choice accuracy & math expression equivalence |
| Grounding | Ref-L4, VigoRL-SA | IoU between predicted/reference boxes; choice accuracy. IoU = Intersection / (Area₁ + Area₂ − Intersection) |
| Spatial | VigoRL-SAT | Choice accuracy |
| Counting | PixmoCount | Numeric token equivalence |
| Coding | WebCode2M, UniSVG | Token/tag overlap; SVG rendering similarity in [0, 1]. HTML: 0.6 × TokenJaccard + 0.4 × TagJaccard; SVG: 0.5 × SSIM + 0.25 × (Token + Tag) |
| OCR | InfoVQA | Text similarity: 1 − Levenshtein / max(len₁, len₂), clipped at 0.5 |
| Diagram | AI2D | Choice accuracy |
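The IoU and text-similarity formulas in the table can be sketched directly. Reading "clipped at 0.5" as zeroing similarities below 0.5 is an assumption on my part, as are the function names:

```python
def iou_reward(box_a, box_b) -> float:
    """IoU = Intersection / (Area1 + Area2 - Intersection); boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ocr_reward(pred: str, ref: str, floor: float = 0.5) -> float:
    """Similarity = 1 - Levenshtein / max(len); scores below the floor go to 0 (assumed reading)."""
    if not pred and not ref:
        return 1.0
    sim = 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))
    return sim if sim >= floor else 0.0
```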
Format reward (shared across all tasks): each response must contain exactly one <think> block and at least one \boxed{}, with the boxed content no more than 20% of the total response length.
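A minimal checker for this cross-task format reward might look like the following; the regexes (e.g. non-nested braces inside \boxed{}) are simplifying assumptions:

```python
import re

def format_reward(text: str) -> bool:
    """Format check as described: exactly one <think> block, at least one
    \\boxed{...}, and total boxed content <= 20% of the response length."""
    thinks = re.findall(r"<think>.*?</think>", text, flags=re.DOTALL)
    if len(thinks) != 1:
        return False
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)  # simple, non-nested braces
    if not boxed:
        return False
    return sum(len(b) for b in boxed) <= 0.2 * len(text)
```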
Two-Stage Training Procedure
Training Pipeline: We use Group Relative Policy Optimization (GRPO) within the asynchronous AReaL framework; both stages share a single global hyperparameter configuration.
Answer-only RL
Stabilizes task performance with concise answers (19.9K samples, ./data/stage1-normal).
Chain-of-Thought RL ✨
Unlocks deeper reasoning via explicit thinking prompts (49.2K samples, ./data/stage2-long).
Global Hyperparameters (Both Stages)
Shared GRPO Configuration
Extended Capability Analysis
Figure 4. Performance comparison of LLaVA-OV-1.5 and the corresponding RL version on Spatial Reasoning & Grounding and Coding tasks.
Spatial & Grounding: RL "fast" mode significantly enhances fine-grained perception on the SAT and Ref-L4 benchmarks.
Coding: "Thinking" mode achieves the highest scores on Design2Code and UniSVG, demonstrating the benefit of chain-of-thought for structural code generation.
Performance Results
| Task | Benchmark | LLaVA-OV-1.5 8B | LLaVA-OV-1.5 RL 8B (thinking) | LLaVA-OV-1.5 RL 8B (fast) |
|---|---|---|---|---|
| General VQA | MMStar | 67.7 | 68.2 ↑0.5 | 68.3 ↑0.6 |
| | MMBench (en) | 84.1 | 85.7 ↑1.6 | 85.7 ↑1.6 |
| | MMBench (cn) | 81.0 | 84.2 ↑3.2 | 81.5 ↑0.5 |
| | MME-RealWorld (en) | 61.7 | 63.4 ↑1.7 | 63.3 ↑1.6 |
| | MME-RealWorld (cn) | 56.1 | 56.1 ↑0.0 | 56.3 ↑0.2 |
| | SeedBench (image) | 77.3 | 76.7 | 77.6 ↑0.3 |
| | CV-Bench | 80.7 | 82.9 ↑2.2 | 81.1 ↑0.4 |
| | SEED-Bench-2-Plus | 69.2 | 69.5 ↑0.3 | 69.2 ↑0.0 |
| | RealWorldQA | 68.1 | 68.4 ↑0.3 | 70.6 ↑2.5 |
| | Avg. | 71.8 | 72.8 ↑1.0 | 72.6 ↑0.8 |
| Reasoning | MathVista (mini) | 69.6 | 72.3 ↑2.7 | 71.8 ↑2.2 |
| | WeMath | 61.5 | 69.4 ↑7.9 | 60.8 |
| | MathVision | 25.6 | 34.4 ↑8.8 | 26.2 ↑0.6 |
| | MMMU (val) | 55.4 | 58.8 ↑3.4 | 54.9 |
| | MMMU-Pro (standard) | 37.4 | 39.9 ↑2.5 | 38.0 ↑0.6 |
| | MMMU-Pro (vision) | 25.2 | 35.7 ↑10.5 | 29.0 ↑3.8 |
| | Avg. | 45.8 | 51.8 ↑6.0 | 46.8 ↑1.0 |
| OCR & Chart | ChartQA | 86.5 | 87.4 ↑0.9 | 87.0 ↑0.5 |
| | CharXiv (DQ) | 70.9 | 68.4 | 71.2 ↑0.3 |
| | DocVQA | 95.0 | 91.9 | 95.0 ↑0.0 |
| | OCRBench | 82.9 | 81.7 | 82.3 |
| | AI2D (w/ M) | 84.2 | 83.7 | 84.3 ↑0.1 |
| | AI2D (w/o M) | 94.1 | 93.7 | 93.9 |
| | InfoVQA | 78.4 | 76.6 | 78.7 ↑0.3 |
| | Avg. | 84.6 | 83.3 | 84.6 ↑0.0 |
| Others | PixmoCount | 62.2 | 65.7 ↑3.5 | 71.1 ↑8.9 |
| | CountBench | 88.2 | 86.8 | 88.6 ↑0.4 |
| | VL-RewardBench | 47.7 | 44.0 | 49.7 ↑2.0 |
| | V* | 78.0 | 79.1 ↑1.1 | 78.0 ↑0.0 |
| | Avg. | 69.0 | 66.0 | 71.6 ↑2.6 |
GRPO Algorithm & AReaL Async Framework
GRPO
GRPO eliminates the critic by sampling G = 16 completions per prompt and using group-normalized rewards as the baseline.
GRPO Loss Implementation (utils/functional.py)
# utils/functional.py — ppo_actor_loss_fn
ratio = torch.exp(logprobs - proximal_logprobs)
clipped_ratio = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip_higher)
pg_loss = torch.max(-advantages * ratio, -advantages * clipped_ratio)
# Behavior importance weight (off-policy correction)
behav_kl = proximal_logprobs - old_logprobs
behav_imp_weight = torch.clamp(behav_kl.exp(), max=behav_imp_weight_cap)
pg_loss = pg_loss * behav_imp_weight

AReaL Async Training
AReaL decouples rollout from gradient computation — 2.77× throughput by eliminating GPU idle time.
max_head_offpolicyness η = 4. Rollout samples are restricted to at most 4 gradient steps behind the current policy.

# trains/grpo.py — main training loop
for global_step in range(start_step, max_steps):
batch = rollout.prepare_batch(train_dataloader, workflow=workflow)
batch["prox_logp"] = actor.compute_logp(batch)
actor.compute_advantages(batch)
actor.ppo_update(batch)
rollout.pause()
actor.update_weights(weight_update_meta)
rollout.set_version(global_step + 1)
    rollout.resume()

rollout.prepare_batch(...)

In this configuration: batch_size = 32 prompts × n_samples = 16 rollouts per prompt = 512 trajectories per gradient step.

- Draw a prompt from the data loader.
- Concurrently launch 16 independent `agenerate` calls (`gconfig.n_samples`) via `asyncio.gather`; each samples autoregressively at temperature = 1.0 (the raw logit distribution, no sharpening or flattening), producing 16 diverse responses for the same prompt. These 16 form one GRPO group.
- Score each response with a task reward — one scalar r ∈ [0, 1] per response, not per token: MULTIPLE_CHOICE: option letter match → 1/0; NUMBER/MATH: math equivalence check → 1/0; BBOX: IoU with the ground-truth box; OCR_TEXT: character-level similarity.
- Push all 16 results into the shared ring buffer as a group.
step 1: actor=θ₁ → sync → rollout worker now uses θ_behave(1)
step 2: actor=θ₂ → sync → rollout worker now uses θ_behave(2)
step 3: actor=θ₃ → sync → rollout worker now uses θ_behave(3)
# but the ring buffer still holds samples generated at step 1
step 5: prepare_batch() pulls a sample from θ_behave(1) → version gap = 5 − 1 = 4 ≤ η → accepted
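The staleness rule in this walkthrough reduces to a one-line guard (the function name is illustrative):

```python
MAX_HEAD_OFFPOLICYNESS = 4  # η in the text

def accept_sample(actor_version: int, sample_version: int) -> bool:
    """prepare_batch-style staleness check: a rollout sample may trail the
    current policy by at most η gradient steps."""
    return actor_version - sample_version <= MAX_HEAD_OFFPOLICYNESS

assert accept_sample(5, 1)      # gap 4 ≤ η → accepted (the example above)
assert not accept_sample(6, 1)  # gap 5 > η → rejected
```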
actor.compute_logp(batch) → batch["prox_logp"]

The total drift π_behave → π_θ passes through an intermediate anchor π_prox. Each leg has a distinct character and gets a different corrective tool:

w = clamp(exp(logπ_prox − logπ_behave), max 5.0)  # Leg 1 — cross-step IS weight
ratio = exp(logπ_θ − logπ_prox)                   # Leg 2 — within-step ratio
L = w · max(−A·ratio, −A·clip(ratio, 0.8, 1.28))  # combined
actor.compute_advantages(batch)

For each prompt, G = 16 responses were sampled. compute_advantages groups the responses by prompt and whitens the scalar rewards within each group. Here is a concrete example:
| Resp | Answer | Raw reward rᵢ |
|---|---|---|
| 1 | 96 | 1.00 |
| 2 | 96 | 1.00 |
| 3 | 96 | 1.00 |
| 4 | 95 | 0.50 |
| 5 | 98 | 0.00 |

Group statistics: mean = 0.70, std = 0.41, ε = 1e-6.

Better than the group average → positive signal; worse than the group average → negative signal.

Broadcast to tokens: all tokens in response #1 get A = (1.00 − 0.70) / 0.41 ≈ +0.73, and all tokens in response #5 get A = (0.00 − 0.70) / 0.41 ≈ −1.71. If response #1 is the four tokens ["The", "answer", "is", "96"], each of those four tokens receives the same +0.73 in the loss. This is GRPO's key approximation: the episode-level reward proxies per-token contribution, so GRPO never judges which individual token was right or wrong, only whether the whole response was better or worse than its group siblings.
PPO needs a baseline V(s) to form the advantage A = Q(s,a) − V(s). The classic solution is a critic network, but this breaks down for LLMs/VLMs on three fronts.

GRPO's fix — a Monte Carlo group baseline: b(s) = (1/G) Σᵢ rᵢ, the mean reward of the G responses sampled for the same prompt.

This is an unbiased estimator of V_π(s) by definition — no extra parameters, no warmup, self-consistent from step 1.

Standard-deviation normalization then equalizes gradient magnitude across tasks: Aᵢ = (rᵢ − mean(r)) / (std(r) + ε).
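The whitening step can be sketched as follows; a small 5-response group is used for readability (the document's groups have G = 16), and using the population standard deviation is my assumption:

```python
import math

def group_advantages(rewards, eps: float = 1e-6):
    """GRPO advantage: whiten episode-level rewards within one group.
    The group mean is the Monte Carlo baseline b(s); dividing by the group
    std equalizes gradient magnitude across tasks with different spreads."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)  # population std (assumed)
    return [(r - mean) / (std + eps) for r in rewards]

# Three correct responses (1.0), one partial (0.5), one wrong (0.0):
adv = group_advantages([1.0, 1.0, 1.0, 0.5, 0.0])
# above-average responses get positive advantage, below-average negative;
# every token of response i is then assigned the same scalar adv[i]
```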
actor.ppo_update(batch)

L_PPO = −min(ratio · A, clip(ratio, 1−ε, 1+ε′) · A)
- A > 0 (good token): policy is rewarded for raising ratio, capped at 1+ε′
- A < 0 (bad token): policy is penalized once ratio exceeds 1−ε
The total drift from sample generation to the current update is π_behave → π_θ. AReaL splits this into two segments with different properties and handles each separately:

π_behave ─────────────────→ π_prox ──→ π_θ
↑ Segment 1: cross-step drift ↑    ↑ Segment 2 ↑
1–4 global_steps, variable size    one gradient step
handled by w (IS weight, cap 5.0)  handled by PPO clip [0.8, 1.28]
L = w · LPPO # Segment 2 handled inside L_PPO via clip
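A scalar sketch of this combined loss for a single token, mirroring the tensor code above; `decoupled_loss` is an illustrative name, with ε = 0.2, ε′ = 0.28 and the 5.0 cap taken from the text:

```python
import math

def decoupled_loss(logp_theta: float, logp_prox: float, logp_behave: float,
                   advantage: float, eps_low: float = 0.2, eps_high: float = 0.28,
                   w_cap: float = 5.0) -> float:
    """Two-segment correction: segment 1 (behave -> prox) is an importance
    weight w capped at 5.0; segment 2 (prox -> theta) is the PPO ratio,
    clipped to [0.8, 1.28]."""
    w = min(math.exp(logp_prox - logp_behave), w_cap)   # cross-step IS weight
    ratio = math.exp(logp_theta - logp_prox)            # within-step ratio
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    pg = max(-advantage * ratio, -advantage * clipped)  # pessimistic of the two
    return w * pg

# On-policy case: all three log-probs equal -> w = 1, ratio = 1, loss = -A
assert abs(decoupled_loss(0.0, 0.0, 0.0, 1.0) + 1.0) < 1e-9
```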
rollout.pause() / actor.update_weights(...) / rollout.set_version(...) / rollout.resume()

This sequence runs once per global_step, unconditionally. Four ordered sub-steps transfer the freshly updated actor weights to the rollout GPU partition:

- rollout.pause(): self.paused.set() on a threading.Event. The rollout thread stops accepting new requests; any in-flight generation that hits the paused check enters an await asyncio.sleep(0.5) loop until the flag is cleared. pause_grace_period defaults to 0.0 s in this config.
- actor.update_weights(weight_update_meta): copies the updated actor weights to the rollout workers.
- rollout.set_version(global_step + 1): self._version = version under a lock, after the copy completes. All samples generated henceforth are tagged with this new version; the staleness guard in prepare_batch rejects samples where actor_version − sample_version > 4 (max_head_offpolicyness).
- rollout.resume(): self.paused.clear(). Workers exit the sleep loop and immediately resume generating with the new weights.
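A minimal single-process sketch of this pause → copy → bump version → resume protocol, using `threading.Event` as described; the class and its internals are illustrative, not the framework's real implementation:

```python
import threading

class RolloutController:
    """Illustrative pause/version/resume state machine (the real framework
    coordinates this across GPU partitions and async workers)."""
    def __init__(self):
        self.paused = threading.Event()
        self._lock = threading.Lock()
        self._version = 0

    def pause(self):
        self.paused.set()            # workers stop taking new requests

    def set_version(self, version: int):
        with self._lock:             # called after the weight copy completes
            self._version = version  # new samples are tagged with this version

    def resume(self):
        self.paused.clear()          # workers exit their sleep loop

ctl = RolloutController()
ctl.pause()
ctl.set_version(1)                   # weight copy would happen between these calls
ctl.resume()
assert ctl._version == 1 and not ctl.paused.is_set()
```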
Training Configuration
Acknowledgements
We thank the following projects and frameworks:
- AReaL: Lightning-Fast RL for LLM Reasoning and Agents
- sglang: Fast serving framework for LLMs and vision language models
- lmms-eval: Standardized evaluation framework
- LLaVA: Large Language-and-Vision Assistant
- LLaVA-NeXT: Next-generation multi-modal assistant
Open-Source Resources
Complete LLaVA-OneVision-1.5-RL resources for the community.
Code & Paper
Model Checkpoints
Citation
@inproceedings{LLaVA-OneVision-1.5,
title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
booktitle={arXiv},
year={2025}
}
@inproceedings{xie2025region,
title={Region-based Cluster Discrimination for Visual Representation Learning},
author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
booktitle={ICCV},
year={2025}
}
@article{lillava,
title={LLaVA-OneVision: Easy Visual Task Transfer},
author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
journal={Transactions on Machine Learning Research},
year={2024}
}