Overview
LLaVA-OneVision-1.5-RL presents an RL post-training stage that uses 67K curated examples, selected by a discrepancy-based strategy, to elicit explicit chain-of-thought reasoning, achieving significant gains on STEM, coding, and reasoning benchmarks while preserving visual understanding capabilities.
Our contributions are threefold:
Discrepancy-Driven Data Curation. We identify tasks where a performance gap exists between Pass@N and Pass@1, targeting "latent capability" (abilities the model already has but does not reliably deploy) rather than knowledge injection.
Rule-Based Reward System. We employ domain-specific verification rules rather than learned preference models, enabling precise feedback across STEM, grounding, spatial reasoning, counting, coding, OCR, and diagram tasks.
Two-Stage Curriculum Training. We design a training curriculum that first stabilizes concise task performance with answer-only RL, then unlocks deeper reasoning through chain-of-thought RL.
RL Training Data Composition
Figure 3. Distribution of task categories in the RL training data. (a) Total RL corpus (67K instances). (b) Stage 1: Answer-only training. (c) Stage 2: Chain-of-thought training.
RL Data Strategy
Discrepancy-Driven Selection
If a model can solve a task given enough attempts (high Pass@N) but rarely gets it right on the first try (low Pass@1), it already has the latent capability — it just needs to learn to use it reliably. We select tasks with this gap for RL training, filtering out tasks that are too easy (high Pass@1, nothing to learn) or too hard (low Pass@N, beyond current capability).
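As a concrete illustration, this selection rule can be sketched as follows. The thresholds (`min_gap`, `max_pass1`) and the per-task outcome format are illustrative assumptions, not the paper's actual values:

```python
def select_for_rl(task_outcomes: dict, min_gap: float = 0.3, max_pass1: float = 0.7) -> list:
    """task_outcomes maps task_id -> list of bool, one entry per sampled attempt (N total).
    Keep tasks with high Pass@N but low Pass@1 -- the 'latent capability' band."""
    selected = []
    for task_id, outcomes in task_outcomes.items():
        pass_1 = sum(outcomes) / len(outcomes)   # expected first-try accuracy
        pass_n = 1.0 if any(outcomes) else 0.0   # solvable within N attempts?
        if pass_n == 0.0:
            continue                             # too hard: beyond current capability
        if pass_1 > max_pass1:
            continue                             # too easy: nothing to learn
        if pass_n - pass_1 >= min_gap:
            selected.append(task_id)             # the Pass@N vs Pass@1 gap we target
    return selected
```

With 8 attempts per task, a task solved once in eight tries (Pass@1 = 0.125, Pass@N = 1.0) is kept, while an always-solved or never-solved task is filtered out.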
Reward-Based Sampling
Multiple candidate responses are filtered by average reward scores to exclude trivial and unsolvable cases, focusing on medium-difficulty instances that provide optimal learning signals.
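A minimal sketch of this filter, assuming each instance carries the reward scores of its candidate responses; the band boundaries `low`/`high` are placeholder values:

```python
def filter_by_mean_reward(instances, low: float = 0.1, high: float = 0.9) -> list:
    """instances: list of (instance_id, [reward per candidate response]).
    Drop trivial (mean reward near 1) and unsolvable (mean reward near 0) cases,
    keeping the medium-difficulty band that yields informative GRPO groups."""
    kept = []
    for instance_id, rewards in instances:
        mean_r = sum(rewards) / len(rewards)
        if low <= mean_r <= high:
            kept.append(instance_id)
    return kept
```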
Reward System Architecture
Our RL setup employs a rule-based reward paradigm, where rewards are derived directly from task outcomes rather than learned preference models. Since different answer types require distinct verification strategies, we design answer-type-specific scoring rules via the reward/ module.
| Category | Source | Reward Design Details |
|---|---|---|
| STEM | ViRL39K | Choice accuracy & math expression equivalence |
| Grounding | Ref-L4, VigoRL-SA | IoU between predicted/reference boxes; choice accuracy. IoU = Intersection / (Area₁ + Area₂ − Intersection) |
| Spatial | VigoRL-SAT | Choice accuracy |
| Counting | PixmoCount | Numeric token equivalence |
| Coding | WebCode2M, UniSVG | Token/tag overlap; SVG rendering similarity in [0, 1]. HTML: 0.6 × TokenJaccard + 0.4 × TagJaccard; SVG: 0.5 × SSIM + 0.25 × (Token + Tag) |
| OCR | InfoVQA | Text similarity: 1 − Levenshtein / max(len₁, len₂), clipped at 0.5 |
| Diagram | AI2D | Choice accuracy |
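The IoU and text-similarity formulas in the table can be sketched directly. Reading "clipped at 0.5" as zeroing similarities below 0.5 is an assumption on my part, as are the function names:

```python
def iou_reward(box_a, box_b) -> float:
    """IoU = Intersection / (Area1 + Area2 - Intersection); boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ocr_reward(pred: str, ref: str, floor: float = 0.5) -> float:
    """Similarity = 1 - Levenshtein / max(len); scores below the floor go to 0 (assumed reading)."""
    if not pred and not ref:
        return 1.0
    sim = 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))
    return sim if sim >= floor else 0.0
```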
Format reward (shared across all tasks): each response must contain exactly one <think> block and at least one \boxed{}, with the boxed content no more than 20% of the total response length.
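A minimal checker for this cross-task format reward might look like the following; the regexes (e.g. non-nested braces inside \boxed{}) are simplifying assumptions:

```python
import re

def format_reward(text: str) -> bool:
    """Format check as described: exactly one <think> block, at least one
    \\boxed{...}, and total boxed content <= 20% of the response length."""
    thinks = re.findall(r"<think>.*?</think>", text, flags=re.DOTALL)
    if len(thinks) != 1:
        return False
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)  # simple, non-nested braces
    if not boxed:
        return False
    return sum(len(b) for b in boxed) <= 0.2 * len(text)
```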
Two-Stage Training Procedure
Training Pipeline: We use Group Relative Policy Optimization (GRPO) within the asynchronous AReaL framework; both stages share a single global hyperparameter configuration.
Answer-only RL
Stabilizes task performance with concise answers (19.9K samples, ./data/stage1-normal).
Chain-of-Thought RL ✨
Unlocks deeper reasoning via explicit thinking prompts (49.2K samples, ./data/stage2-long).
Global Hyperparameters (Both Stages)
Shared GRPO Configuration
Extended Capability Analysis
Figure 4. Performance comparison of LLaVA-OV-1.5 and the corresponding RL version on Spatial Reasoning & Grounding and Coding tasks.
Spatial & Grounding: RL "fast" mode significantly enhances fine-grained perception on the SAT and Ref-L4 benchmarks.
Coding: "Thinking" mode achieves the highest scores on Design2Code and UniSVG, demonstrating the benefit of chain-of-thought for structural code generation.
Performance Results
| Task | Benchmark | LLaVA-OV-1.5 8B | LLaVA-OV-1.5 RL 8B (thinking) | LLaVA-OV-1.5 RL 8B (fast) |
|---|---|---|---|---|
| General VQA | MMStar | 67.7 | 68.2 ↑0.5 | 68.3 ↑0.6 |
| | MMBench (en) | 84.1 | 85.7 ↑1.6 | 85.7 ↑1.6 |
| | MMBench (cn) | 81.0 | 84.2 ↑3.2 | 81.5 ↑0.5 |
| | MME-RealWorld (en) | 61.7 | 63.4 ↑1.7 | 63.3 ↑1.6 |
| | MME-RealWorld (cn) | 56.1 | 56.1 ↑0.0 | 56.3 ↑0.2 |
| | SeedBench (image) | 77.3 | 76.7 | 77.6 ↑0.3 |
| | CV-Bench | 80.7 | 82.9 ↑2.2 | 81.1 ↑0.4 |
| | SEED-Bench-2-Plus | 69.2 | 69.5 ↑0.3 | 69.2 ↑0.0 |
| | RealWorldQA | 68.1 | 68.4 ↑0.3 | 70.6 ↑2.5 |
| | Avg. | 71.8 | 72.8 ↑1.0 | 72.6 ↑0.8 |
| Reasoning | MathVista (mini) | 69.6 | 72.3 ↑2.7 | 71.8 ↑2.2 |
| | WeMath | 61.5 | 69.4 ↑7.9 | 60.8 |
| | MathVision | 25.6 | 34.4 ↑8.8 | 26.2 ↑0.6 |
| | MMMU (val) | 55.4 | 58.8 ↑3.4 | 54.9 |
| | MMMU-Pro (standard) | 37.4 | 39.9 ↑2.5 | 38.0 ↑0.6 |
| | MMMU-Pro (vision) | 25.2 | 35.7 ↑10.5 | 29.0 ↑3.8 |
| | Avg. | 45.8 | 51.8 ↑6.0 | 46.8 ↑1.0 |
| OCR & Chart | ChartQA | 86.5 | 87.4 ↑0.9 | 87.0 ↑0.5 |
| | CharXiv (DQ) | 70.9 | 68.4 | 71.2 ↑0.3 |
| | DocVQA | 95.0 | 91.9 | 95.0 ↑0.0 |
| | OCRBench | 82.9 | 81.7 | 82.3 |
| | AI2D (w/ M) | 84.2 | 83.7 | 84.3 ↑0.1 |
| | AI2D (w/o M) | 94.1 | 93.7 | 93.9 |
| | InfoVQA | 78.4 | 76.6 | 78.7 ↑0.3 |
| | Avg. | 84.6 | 83.3 | 84.6 ↑0.0 |
| Others | PixmoCount | 62.2 | 65.7 ↑3.5 | 71.1 ↑8.9 |
| | CountBench | 88.2 | 86.8 | 88.6 ↑0.4 |
| | VL-RewardBench | 47.7 | 44.0 | 49.7 ↑2.0 |
| | V* | 78.0 | 79.1 ↑1.1 | 78.0 ↑0.0 |
| | Avg. | 69.0 | 66.0 | 71.6 ↑2.6 |
GRPO Algorithm & AReaL Async Framework
GRPO
GRPO eliminates the critic by sampling G = 16 completions per prompt and using group-normalized rewards as the baseline.
GRPO Loss Implementation (utils/functional.py)
# utils/functional.py — ppo_actor_loss_fn
ratio = torch.exp(logprobs - proximal_logprobs)
clipped_ratio = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip_higher)
pg_loss = torch.max(-advantages * ratio, -advantages * clipped_ratio)
# Behavior importance weight (off-policy correction)
behav_kl = proximal_logprobs - old_logprobs
behav_imp_weight = torch.clamp(behav_kl.exp(), max=behav_imp_weight_cap)
pg_loss = pg_loss * behav_imp_weight

AReaL Async Training
AReaL decouples rollout from gradient computation — 2.77× throughput by eliminating GPU idle time.
max_head_offpolicyness η = 4. Rollout samples are restricted to at most 4 gradient steps behind the current policy.

# trains/grpo.py — main training loop
for global_step in range(start_step, max_steps):
batch = rollout.prepare_batch(train_dataloader, workflow=workflow)
batch["prox_logp"] = actor.compute_logp(batch)
actor.compute_advantages(batch)
actor.ppo_update(batch)
rollout.pause()
actor.update_weights(weight_update_meta)
rollout.set_version(global_step + 1)
    rollout.resume()

rollout.prepare_batch(...)

In this configuration: batch_size = 32 prompts × n_samples = 16 rollouts per prompt = 512 trajectories per gradient step.

- Draw a prompt from the data loader.
- Concurrently launch 16 independent `agenerate` calls (`gconfig.n_samples`) via `asyncio.gather`; each samples autoregressively at temperature = 1.0 (the raw logit distribution, no sharpening or flattening), producing 16 diverse responses for the same prompt. These 16 form one GRPO group.
- Score each response with a task reward — one scalar r ∈ [0, 1] per response, not per token: MULTIPLE_CHOICE: option letter match → 1/0; NUMBER/MATH: math equivalence check → 1/0; BBOX: IoU with the ground-truth box; OCR_TEXT: character-level similarity.
- Push all 16 results into the shared ring buffer as a group.
step 1: actor=θ₁ → sync → rollout worker now uses θ_behave(1)
step 2: actor=θ₂ → sync → rollout worker now uses θ_behave(2)
step 3: actor=θ₃ → sync → rollout worker now uses θ_behave(3)
# but the ring buffer still holds samples generated at step 1
step 5: prepare_batch() pulls a sample from θ_behave(1) → version gap = 5 − 1 = 4 ≤ η → accepted
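The staleness rule in this walkthrough reduces to a one-line guard (the function name is illustrative):

```python
MAX_HEAD_OFFPOLICYNESS = 4  # η in the text

def accept_sample(actor_version: int, sample_version: int) -> bool:
    """prepare_batch-style staleness check: a rollout sample may trail the
    current policy by at most η gradient steps."""
    return actor_version - sample_version <= MAX_HEAD_OFFPOLICYNESS

assert accept_sample(5, 1)      # gap 4 ≤ η → accepted (the example above)
assert not accept_sample(6, 1)  # gap 5 > η → rejected
```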
actor.compute_logp(batch) → batch["prox_logp"]

The total drift π_behave → π_θ passes through an intermediate anchor π_prox. Each leg has a distinct character and gets a different corrective tool:

w = clamp(exp(logπ_prox − logπ_behave), max 5.0)  # Leg 1 — cross-step IS weight
ratio = exp(logπ_θ − logπ_prox)                   # Leg 2 — within-step ratio
L = w · max(−A·ratio, −A·clip(ratio, 0.8, 1.28))  # combined
actor.compute_advantages(batch)

For each prompt, G = 16 responses were sampled. compute_advantages groups the responses by prompt and whitens the scalar rewards within each group. Here is a concrete example:
| Resp | Answer | Raw reward rᵢ |
|---|---|---|
| 1 | 96 | 1.00 |
| 2 | 96 | 1.00 |
| 3 | 96 | 1.00 |
| 4 | 95 | 0.50 |
| 5 | 98 | 0.00 |

Group statistics: mean = 0.70, std = 0.41, ε = 1e-6.

Better than the group average → positive signal; worse than the group average → negative signal.

Broadcast to tokens: all tokens in response #1 get A = (1.00 − 0.70) / 0.41 ≈ +0.73, and all tokens in response #5 get A = (0.00 − 0.70) / 0.41 ≈ −1.71. If response #1 is the four tokens ["The", "answer", "is", "96"], each of those four tokens receives the same +0.73 in the loss. This is GRPO's key approximation: the episode-level reward proxies per-token contribution, so GRPO never judges which individual token was right or wrong, only whether the whole response was better or worse than its group siblings.
PPO needs a baseline V(s) to form the advantage A = Q(s,a) − V(s). The classic solution is a critic network, but this breaks down for LLMs/VLMs on three fronts.

GRPO's fix — a Monte Carlo group baseline: b(s) = (1/G) Σᵢ rᵢ, the mean reward of the G responses sampled for the same prompt.

This is an unbiased estimator of V_π(s) by definition — no extra parameters, no warmup, self-consistent from step 1.

Standard-deviation normalization then equalizes gradient magnitude across tasks: Aᵢ = (rᵢ − mean(r)) / (std(r) + ε).
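The whitening step can be sketched as follows; a small 5-response group is used for readability (the document's groups have G = 16), and using the population standard deviation is my assumption:

```python
import math

def group_advantages(rewards, eps: float = 1e-6):
    """GRPO advantage: whiten episode-level rewards within one group.
    The group mean is the Monte Carlo baseline b(s); dividing by the group
    std equalizes gradient magnitude across tasks with different spreads."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)  # population std (assumed)
    return [(r - mean) / (std + eps) for r in rewards]

# Three correct responses (1.0), one partial (0.5), one wrong (0.0):
adv = group_advantages([1.0, 1.0, 1.0, 0.5, 0.0])
# above-average responses get positive advantage, below-average negative;
# every token of response i is then assigned the same scalar adv[i]
```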
actor.ppo_update(batch)

L_PPO = −min(ratio · A, clip(ratio, 1−ε, 1+ε′) · A)
- A > 0 (good token): policy is rewarded for raising ratio, capped at 1+ε′
- A < 0 (bad token): policy is penalized once ratio exceeds 1−ε
The total drift from sample generation to the current update is π_behave → π_θ. AReaL splits this into two segments with different properties and handles each separately:

π_behave ─────────────────→ π_prox ──→ π_θ
↑ Segment 1: cross-step drift ↑    ↑ Segment 2 ↑
1–4 global_steps, variable size    one gradient step
handled by w (IS weight, cap 5.0)  handled by PPO clip [0.8, 1.28]
L = w · LPPO # Segment 2 handled inside L_PPO via clip
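A scalar sketch of this combined loss for a single token, mirroring the tensor code above; `decoupled_loss` is an illustrative name, with ε = 0.2, ε′ = 0.28 and the 5.0 cap taken from the text:

```python
import math

def decoupled_loss(logp_theta: float, logp_prox: float, logp_behave: float,
                   advantage: float, eps_low: float = 0.2, eps_high: float = 0.28,
                   w_cap: float = 5.0) -> float:
    """Two-segment correction: segment 1 (behave -> prox) is an importance
    weight w capped at 5.0; segment 2 (prox -> theta) is the PPO ratio,
    clipped to [0.8, 1.28]."""
    w = min(math.exp(logp_prox - logp_behave), w_cap)   # cross-step IS weight
    ratio = math.exp(logp_theta - logp_prox)            # within-step ratio
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    pg = max(-advantage * ratio, -advantage * clipped)  # pessimistic of the two
    return w * pg

# On-policy case: all three log-probs equal -> w = 1, ratio = 1, loss = -A
assert abs(decoupled_loss(0.0, 0.0, 0.0, 1.0) + 1.0) < 1e-9
```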
rollout.pause() / actor.update_weights(...) / rollout.set_version(...) / rollout.resume()

This sequence runs once per global_step, unconditionally. Four ordered sub-steps transfer the freshly updated actor weights to the rollout GPU partition:

- rollout.pause(): self.paused.set() on a threading.Event. The rollout thread stops accepting new requests; any in-flight generation that hits the paused check enters an await asyncio.sleep(0.5) loop until the flag is cleared. pause_grace_period defaults to 0.0 s in this config.
- actor.update_weights(weight_update_meta): copies the updated actor weights to the rollout workers.
- rollout.set_version(global_step + 1): self._version = version under a lock, after the copy completes. All samples generated henceforth are tagged with this new version; the staleness guard in prepare_batch rejects samples where actor_version − sample_version > 4 (max_head_offpolicyness).
- rollout.resume(): self.paused.clear(). Workers exit the sleep loop and immediately resume generating with the new weights.
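A minimal single-process sketch of this pause → copy → bump version → resume protocol, using `threading.Event` as described; the class and its internals are illustrative, not the framework's real implementation:

```python
import threading

class RolloutController:
    """Illustrative pause/version/resume state machine (the real framework
    coordinates this across GPU partitions and async workers)."""
    def __init__(self):
        self.paused = threading.Event()
        self._lock = threading.Lock()
        self._version = 0

    def pause(self):
        self.paused.set()            # workers stop taking new requests

    def set_version(self, version: int):
        with self._lock:             # called after the weight copy completes
            self._version = version  # new samples are tagged with this version

    def resume(self):
        self.paused.clear()          # workers exit their sleep loop

ctl = RolloutController()
ctl.pause()
ctl.set_version(1)                   # weight copy would happen between these calls
ctl.resume()
assert ctl._version == 1 and not ctl.paused.is_set()
```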
Training Configuration
Acknowledgements
We thank the following projects and frameworks:
- AReaL: Lightning-Fast RL for LLM Reasoning and Agents
- sglang: Fast serving framework for LLMs and vision language models
- lmms-eval: Standardized evaluation framework
- LLaVA: Large Language-and-Vision Assistant
- LLaVA-NeXT: Next-generation multi-modal assistant
Open-Source Resources
Complete LLaVA-OneVision-1.5-RL resources for the community.
Code & Paper
Model Checkpoints
Citation
@inproceedings{LLaVA-OneVision-1.5,
title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
booktitle={arXiv},
year={2025}
}
@inproceedings{xie2025region,
title={Region-based Cluster Discrimination for Visual Representation Learning},
author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
booktitle={ICCV},
year={2025}
}
@article{lillava,
title={LLaVA-OneVision: Easy Visual Task Transfer},
author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
journal={Transactions on Machine Learning Research},
year={2024}
}