Open Multimodal Training

LLaVA-OneVision-1.5

Sep 30, 2025 models
LLaVA-OneVision Contributors

The LLaVA Journey

LLaVA (2023) efficiently connects open-source vision encoders with large language models through low-cost alignment, bringing multimodal capabilities to the open ecosystem and significantly narrowing the gap with top-tier closed models.

The series evolved steadily: LLaVA-1.5 strengthened comprehension with cleaner data and high-resolution inputs; LLaVA-NeXT expanded into OCR and mathematical reasoning; LLaVA-NeXT-Video added temporal understanding; LLaVA-NeXT-Interleave enabled multi-image joint reasoning. These converged in LLaVA-OneVision, a unified interface covering images, documents, charts, multi-image, and video.

The Reproducibility Gap

While models like Qwen2.5-VL and InternVL3.5 set strong baselines, full data recipes, mixing ratios, and training schedules are often only partially disclosed — making end-to-end reproduction difficult. The primary gap today lies in reproducibility of recipes and engineering details, not model architecture.

What LLaVA-OneVision-1.5 Delivers

LMMs-Lab releases a fully open, concept-balanced 85M pretraining dataset and a curated 22M instruction dataset, with a compact three-stage pipeline and offline parallel data packing (up to ~11× compression). Stage-1.5 pretraining of an 8B model completes on 128 A800 GPUs in about four days. LLaVA-OneVision-1.5 adds RICE-ViT for native-resolution fine-grained modeling, strengthens chart/document understanding, and delivers truly end-to-end transparent openness — data, toolchains, configs, logs, and reproducible evaluation commands — enabling low-cost reproduction by the community. It achieves competitive or superior performance to Qwen2.5-VL on multiple public benchmarks.

Table 1. Performance comparison across vision-language models on various benchmarks grouped by task type. All scores are reported as accuracy percentages unless otherwise specified. Bold values indicate the best score in a row, including ties.
| Benchmark | LLaVA-OV-1.5 8B | Qwen2.5-VL 7B | LLaVA-OV-1.5 4B | LLaVA-OV-1.5 3B | Qwen2.5-VL 3B |
|---|---|---|---|---|---|
| MMStar | **67.7** | 62.5 | 64.9 | 59.1 | 55.9 |
| MMBench (EN) | 84.1 | 83.4 | **84.2** | 81.0 | 78.0 |
| MMBench (CN) | 81.0 | **81.6** | 76.9 | 73.0 | 74.6 |
| MME-RealWorld (EN) | **62.3** | 57.3 | 49.6 | 57.9 | 51.6 |
| MME-RealWorld (CN) | 56.1 | 51.5 | **61.6** | 23.4 | 45.4 |
| SeedBench (image) | 77.3 | **77.5** | 76.6 | 71.3 | 74.8 |
| CV-Bench | **80.8** | 80.0 | 77.2 | 73.8 | 71.5 |
| ScienceQA | **95.0** | 88.8 | 93.6 | 91.2 | 83.3 |
| SEED-Bench-2-Plus | 69.2 | **70.9** | 68.9 | 67.6 | 68.6 |
| RealWorldQA | 68.1 | **68.5** | 67.8 | 66.8 | 60.0 |
| Avg | **74.2** | 72.2 | 72.1 | 66.5 | 66.4 |
| MathVista (mini) | **69.6** | 68.6 | 67.9 | 64.7 | 60.2 |
| WeMath | **33.6** | 33.3 | 24.9 | 22.6 | 18.4 |
| MathVision | **25.6** | 22.4 | 24.2 | 19.9 | 21.3 |
| MMMU (val) | **55.4** | 51.3 | 52.7 | 45.5 | 46.4 |
| MMMU-Pro (standard) | **37.4** | 36.3 | 35.3 | 29.5 | 31.1 |
| MMMU-Pro (vision) | 25.2 | **32.8** | 25.4 | 20.3 | 21.3 |
| Avg | **41.1** | 40.8 | 38.4 | 33.7 | 33.1 |
| ChartQA | 86.5 | 84.1 | **87.1** | 84.4 | 83.4 |
| CharXiv (DQ) | **74.1** | 69.8 | 63.8 | 61.8 | 58.2 |
| DocVQA | **95.0** | 94.9 | 94.4 | 93.4 | 92.7 |
| OCRBench | 82.9 | **84.2** | 80.0 | 80.5 | 79.2 |
| AI2D (w/ mask) | **84.2** | 82.6 | 83.6 | 82.3 | 78.6 |
| AI2D (w/o mask) | **94.1** | 93.4 | 93.3 | 91.9 | 90.7 |
| InfoVQA | 78.4 | **81.7** | 76.1 | 71.2 | 75.6 |
| Avg | **85.0** | 84.4 | 82.6 | 81.0 | 79.8 |
| PixmoCount | 62.2 | **63.3** | 52.2 | 57.0 | 50.9 |
| CountBench | **88.2** | 86.4 | 79.8 | 49.5 | 72.5 |
| VL-RewardBench | 46.7 | **49.7** | 48.2 | 42.5 | 42.1 |
| V* | **78.0** | 77.0 | 74.9 | 67.5 | 69.6 |
| Avg | 68.8 | **69.1** | 63.8 | 54.1 | 58.8 |

Pretraining Dataset (85M) and Concept Balancing

A general-purpose vision–language pretraining dataset (85M) and an instruction-tuning dataset (22M). The 85M pretraining corpus fuses eight heterogeneous sources—COYO-700M, Obelics, DataComp-1B, LAION-CN, ImageNet-21K, SAM-1B, MINT, and Zero250M—yielding roughly 20 million Chinese and 65 million English image–text pairs. To tackle long-tail concept sparsity and noise/missing issues in raw captions, we move beyond raw term frequencies and adopt a feature-driven “concept balancing” strategy: using a MetaCLIP encoder, we embed all images and a 500K-scale concept vocabulary into a shared vector space, retrieve the Top-K most similar concepts for each image, tally concept frequencies, and then apply inverse-frequency weighted resampling. This suppresses high-frequency background classes and boosts rare fine-grained entities, attributes, and scenes, substantially flattening the long-tail distribution. We then use a high-quality captioner to generate aligned bilingual (Chinese/English) augmented descriptions. Systematic experiments show that, under the same or lower token budget, scaling high-quality data combined with concept-balanced sampling delivers significant and reproducible gains in multimodal understanding, long-tail recognition, and instruction generalization.

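To make the resampling step concrete, here is a minimal sketch of inverse-frequency concept balancing under the setup described above (MetaCLIP embeddings, a ~500K concept vocabulary, Top-K retrieval); the helper below and its exact weighting are illustrative assumptions, not the released pipeline.

python
import numpy as np

# Illustrative sketch of concept-balanced resampling (not the released pipeline).
# `image_embs` are MetaCLIP image embeddings and `concept_embs` are embeddings of the
# ~500K-concept vocabulary; both are assumed L2-normalized, so the dot product below
# is cosine similarity.
def concept_balanced_weights(image_embs, concept_embs, top_k=10):
    sims = image_embs @ concept_embs.T
    topk = np.argpartition(-sims, top_k, axis=1)[:, :top_k]   # Top-K concepts per image

    # Tally how often each concept appears across the whole corpus.
    concept_freq = np.bincount(topk.ravel(), minlength=concept_embs.shape[0])

    # Inverse-frequency weighting: rare fine-grained concepts pull an image's weight up,
    # frequent background concepts pull it down.
    inv_freq = 1.0 / np.maximum(concept_freq, 1)
    image_weights = inv_freq[topk].mean(axis=1)
    return image_weights / image_weights.sum()                # sampling distribution

# Resampling then draws corpus indices from this flattened distribution, e.g.
# np.random.choice(len(weights), size=num_samples, p=weights, replace=True).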

Balanced vs. Random Sampling

Score improvement = balanced − random

Instruction Dataset (22M)

The 22M instruction dataset covers eight categories: Caption, Chart & Table, Code & Math, Domain-specific, General VQA, Grounding & Counting, OCR, and Science. Through multi-source aggregation, format standardization, instruction rewriting, bilingual conversion, template diversification (to reduce homogeneity), and safety filtering, we maintain balanced distributions across categories and difficulty levels. Moreover, augmenting our instruction data with the FineVision dataset yields further performance gains.


Figure 1. (a) Vocabulary coverage in the LLaVA-OneVision-1.5-Mid-Training dataset before and after concept balancing. (b) Distribution of data sources within the LLaVA-OneVision-1.5-Mid-Training dataset. (c) Distribution of instruction data categories within LLaVA-OneVision-1.5-Instruct.

Visual Encoder Pretraining

To raise the floor for OCR, tables/documents, region‑level understanding, and downstream instruction reasoning, LLaVA‑OneVision‑1.5 adopts our in‑house MVT v1.5 (RICE‑ViT) as the vision backbone. Compared to CLIP/SigLIP‑style contrastive models that rely on global alignment only, RICE‑ViT addresses the structural bottleneck of representing an instance with a single global vector by introducing a unified Region Cluster Discrimination mechanism:


  • trained on 450M images and 2.4B candidate regions
  • explicitly models local entities/text blocks and their context via region-cluster discrimination plus region-aware attention
  • uses 2D rotary position encoding (2D RoPE) for native multi-resolution support

Unlike SigLIP2, which relies on multiple specialized losses (SILC, TIPS, LocCa, etc.), we use a single clustering‑discrimination paradigm to simultaneously strengthen general semantics, OCR recognition, and localization, yielding a simpler, more maintainable training/inference pipeline. During multimodal fusion, a lightweight projection followed by full‑parameter joint training seamlessly plugs this fine‑grained semantic foundation into the language model, reducing redundant adapters and improving cross‑task transfer efficiency.

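As a rough illustration of that fusion step, a projector of the kind described above could look like the following minimal sketch; the two-layer MLP shape and the dimensions are assumptions for illustration, not the released module.

python
import torch.nn as nn

# Minimal sketch of a vision-to-LLM projector (illustrative; dimensions are assumed).
class VisionProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens):
        # vision_tokens: [batch, num_patches, vision_dim] from the vision backbone;
        # output:        [batch, num_patches, llm_dim], concatenated with text token
        #                embeddings in the LLM input sequence during joint training.
        return self.proj(vision_tokens)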

(Diagram: the input image (H × W × 3) passes through the Vision Transformer in a single forward pass to produce a token grid; regions sampled by mask go through ROI Align and region-aware attention in a Region Transformer, yielding fixed-length object and OCR region embeddings that are supervised by a single-label object cluster-discrimination loss and a multi-label OCR cluster-discrimination loss, respectively.)
Figure 2. RICE architecture processes diverse semantic regions within images using region-based cluster discrimination. The model jointly captures general visual semantics and OCR semantics in a single forward pass.
Table 2. Comparison of RICE-ViT with other vision encoders on varied multimodal benchmarks. RICE-ViT models are highlighted in bold.
Columns InfoVQA through OCR Avg cover OCR & document understanding; columns AI2D through Other Avg cover general vision understanding.

| Method | Vision Tower | InfoVQA | DocVQA | ChartQA | TextVQA | OCRBench | OCRBenchV2 | LiveXivVQA | OCR Avg | AI2D | MMB (EN) | MME (Cog) | MME (Per) | POPE | RealWorldQA | MMStar | Other Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-L-14-336px | 38.9 | 75.2 | 66.5 | 62.5 | 52.5 | 23.0 | 47.4 | 52.3 | 73.2 | 74.6 | 48.0 | 75.6 | 88.8 | 63.7 | 49.0 | 67.6 |
| MLCD | ViT-L-14-336px | 43.5 | 76.5 | 67.8 | 61.7 | 53.1 | 24.0 | 48.4 | 53.6 | 77.0 | 76.4 | 54.1 | 79.9 | 88.7 | 61.1 | 51.0 | 69.7 |
| AIMv2 | ViT-L-14-336px | 35.4 | 77.2 | 72.7 | 65.9 | 57.2 | 23.9 | 47.3 | 54.2 | 75.4 | 78.6 | 48.3 | 75.0 | 88.4 | 62.2 | 50.2 | 68.3 |
| **RICE-ViT** | ViT-L-14-336px | 45.2 | 79.2 | 72.3 | 65.9 | 57.5 | 24.1 | 48.9 | 56.2 | 77.9 | 76.6 | 54.6 | 80.7 | 88.5 | 63.1 | 51.8 | 70.5 |
| DFN5B | ViT-H-14-378px | 38.6 | 70.9 | 64.4 | 59.4 | 47.3 | 21.9 | 46.2 | 49.8 | 73.5 | 73.4 | 45.8 | 76.9 | 88.6 | 59.9 | 49.1 | 66.7 |
| SigLIP | ViT-SO400M-14-384px | 41.4 | 76.7 | 69.3 | 64.7 | 55.4 | 24.0 | 48.4 | 54.3 | 76.2 | 77.0 | 46.1 | 79.9 | 88.8 | 63.7 | 47.3 | 68.4 |
| SigLIPv2 | ViT-SO400M-14-384px | 43.7 | 79.1 | 70.2 | 66.2 | 58.7 | 25.4 | 48.6 | 56.0 | 77.0 | 77.1 | 46.6 | 80.4 | 89.3 | 63.4 | 52.8 | 69.5 |
| **RICE-ViT** | ViT-L-14-378px | 48.1 | 82.6 | 75.1 | 66.2 | 58.8 | 25.8 | 49.5 | 58.0 | 76.5 | 77.6 | 54.1 | 79.0 | 89.1 | 62.9 | 51.2 | 70.1 |
| SigLIPv2 | ViT-SO400M-16-560px | 50.2 | 86.2 | 77.4 | 70.2 | 62.7 | 26.5 | 52.9 | 60.9 | 77.0 | 76.5 | 53.5 | 79.9 | 89.3 | 68.2 | 53.1 | 71.1 |
| **RICE-ViT** | ViT-L-14-560px | 53.2 | 87.4 | 78.1 | 69.0 | 60.7 | 26.1 | 53.0 | 61.1 | 76.9 | 78.6 | 56.3 | 79.3 | 88.9 | 65.1 | 50.5 | 70.8 |
| Qwen-ViT (from Qwen2.5-VL 7B) | ViT-H-14-560px | 55.9 | 85.8 | 78.8 | 73.7 | 66.2 | 26.8 | 53.4 | 62.9 | 78.8 | 78.4 | 62.0 | 80.8 | 88.6 | 64.2 | 55.0 | 72.5 |
| **RICE-ViT** (from OV-1.5 3B) | ViT-L-14-560px | 53.7 | 87.1 | 81.9 | 73.8 | 73.3 | 30.4 | 53.6 | 64.8 | 80.3 | 79.6 | 58.6 | 82.2 | 89.0 | 67.3 | 56.6 | 73.4 |

Three-Stage Pipeline

Stage 1

Language–image alignment

Train the projector on LLaVA–1.5 558K to map visual encoder outputs into the LLM token embedding space, with controlled parameter updates for fast, stable convergence.


LLaVA–1.5 558K
Stage 1.5

Mid-stage pretraining

Use full-parameter training on the concept-balanced 85M pretraining set to inject broad visual semantics and world knowledge, emphasizing data quality and coverage rather than blindly expanding token counts.


Concept-balanced 85M
Stage 2

Visual instruction alignment

Continue full-parameter training on the 22M instruction set plus multi-source visual instruction corpora such as FineVision to improve task generalization, reasoning organization, and response-format control.


22M + FineVision

Offline Packing

To reduce padding waste from multimodal sequence-length variance and improve effective token utilization, we adopt offline parallel packing:


  • hash-bucket clustering by sample length or length ranges to cut global sorting/scanning costs
  • multithreaded concatenation of multiple short samples into fixed-length sequences close to the target length during data prep

This one‑pass, corpus‑wide pipeline is deterministic and reproducible, avoiding the runtime instability and extra CPU overhead of online dynamic packing. On the 85M pretraining set, it achieves up to ~11× effective padding compression (defined as original total padding tokens / post‑packing total padding tokens) compared to the baseline.


Offline packing: from uneven raw samples to dense fixed-length packed sequences

Instead of packing inside the dataloader with a small rolling buffer, we first compute token lengths for the whole shard, store the packing state offline, and then run staged bin-packing over global statistics, so every sequence is filled as close as possible to the target budget before training even begins; a minimal sketch of this pass follows the comparison below.

Why online packing under-fills
  • the dataloader only sees a small rolling buffer, so it cannot search globally for better complements
  • small buffers leave many near-fit candidates invisible, which turns into leftover slack in each max-length box
  • growing the online buffer improves fit quality, but raises host-RAM pressure and preprocessing latency
  • because packing happens inside training, every step still pays search, merge, and scheduling overhead
Why offline packing gets near-full boxes
  • the whole shard is indexed by token length first, so exact and near-exact complements are easy to retrieve
  • long samples are treated as seeds, then short and medium samples are packed around them to close the remaining gap
  • the algorithm can switch between diversity-first filling and greedy completion when a bucket starts to deadlock
  • once the best arrangement is found, it is serialized once and reused directly during training with no extra packing cost
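A minimal sketch of such a length-indexed, seed-and-fill packing pass, under simplified assumptions (single shard, one bucket per exact token length, greedy completion only; not the released packing tool):

python
from collections import defaultdict

# Illustrative offline packing sketch (simplified assumptions; not the released tool).
# `lengths` maps sample_id -> token length; `max_len` is the packed-sequence budget.
def pack_offline(lengths, max_len):
    # Hash-bucket samples by exact length so near-exact complements are cheap to find.
    buckets = defaultdict(list)
    for sid, n in lengths.items():
        buckets[n].append(sid)

    # Longest samples act as seeds; shorter ones greedily fill the remaining gap.
    packs, used = [], set()
    for seed in sorted(lengths, key=lengths.get, reverse=True):
        if seed in used:
            continue
        pack, remaining = [seed], max_len - lengths[seed]
        used.add(seed)
        for n in sorted(buckets, reverse=True):
            while n <= remaining and buckets[n]:
                sid = buckets[n].pop()
                if sid in used:
                    continue
                pack.append(sid)
                used.add(sid)
                remaining -= n
        packs.append(pack)
    return packs  # serialized once, then reused directly during training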

Hybrid Parallelism

On the training side, we use hybrid parallelism and long‑context optimizations—tensor parallelism (TP) + pipeline parallelism (PP) + sequence/context parallelism with a distributed optimizer—to improve compute utilization and memory efficiency at cluster scale. We also adopt a native‑resolution strategy to preserve structural details in charts, documents, and dense text regions, avoiding information loss from uniform resizing. On a 128×A800 cluster, Stage‑1.5 for an 8B model (85M samples, native resolution) completes in about 3.7 days, balancing throughput and cost.


From zero to SOTA in ~168 hours

Training loss over time for the Mid-Training and SFT phases (logarithmic y-axis).
Mid-Training: 89.39 h · SFT: 79.05 h · Total: 168.44 h · Hardware: 128 GPUs

Open-Source Resources

Models, datasets, training code, and demos—organized as a single reproducible release surface.


Code Demos

Quick Start with HuggingFace
python
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"

# default: Load the model on the available device(s)

model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# default processor

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output

generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Model Evaluation
bash
# pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
 --model=llava_onevision1_5 \
 --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
 --tasks=mmmu_val,mmmu_pro_standard,mmbench_en_test,mmerealworld,mmerealworld_cn,ai2d,ai2d_no_mask,vstar_bench,chartqa,charxiv,docvqa_test,mathvista_testmini,mmstar,scienceqa \
 --batch_size=1

Citation

citation.bib
bibtex
@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  booktitle={arXiv},
  year={2025}
}

@inproceedings{xie2025region,
  title={Region-based Cluster Discrimination for Visual Representation Learning},
  author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle={ICCV},
  year={2025}
}

@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}
Multimodal Reinforcement Learning

LLaVA-OneVision-1.5-RL

Sep 30, 2025 rl

Project led by Changrui Chen and Jiankang Deng

Overview

LLaVA-OneVision-1.5-RL presents an RL post-training stage utilizing 67K curated examples with discrepancy-based selection to generate explicit chain-of-thought reasoning, achieving significant performance gains on STEM, coding, and reasoning benchmarks while maintaining visual understanding capabilities.


Our contributions are threefold:


1. Discrepancy-Driven Data Curation. We select tasks where a performance gap exists between Pass@N and Pass@1, targeting "latent capability" rather than knowledge injection.

2. Rule-Based Reward System. We employ domain-specific verification rules rather than learned preference models, enabling precise feedback across STEM, grounding, spatial reasoning, counting, coding, OCR, and diagram tasks.

3. Two-Stage Curriculum Training. We design a curriculum that first stabilizes concise task performance with answer-only RL, then unlocks deeper reasoning through chain-of-thought RL.

RL Training Data Composition

Panels (a)–(c) share the same task categories: Math / Science, Grounding, Coding, Spatial, OCR, Diagram, and Counting.

Figure 3. Distribution of task categories in the RL training data. (a) Total RL corpus (67K instances). (b) Stage 1: answer-only training (19.9K). (c) Stage 2: chain-of-thought training (49.2K).

How the RL data is obtained
This RL corpus is not hand-labeled preference data. Instead, it is built from public benchmarks such as ViRL39K, Ref-L4, VigoRL-SA / SAT, PixmoCount, WebCode2M, UniSVG, InfoVQA, and AI2D. The pipeline is: start from benchmark samples → let the base LLaVA-OneVision-1.5-Instruct model generate multiple rollouts → score them with task-specific rule rewards → keep medium-difficulty examples. Samples that are already trivial (reward too high) or still unsolved (reward too low) are filtered out, leaving the 67K examples that provide the strongest RL learning signal. The released training set is then organized into the two local splits used by the training code: ./data/stage1-normal and ./data/stage2-long.

RL Data Strategy

Discrepancy-Driven Selection

If a model can solve a task given enough attempts (high Pass@N) but rarely gets it right on the first try (low Pass@1), it already has the latent capability — it just needs to learn to use it reliably. We select tasks with this gap for RL training, filtering out tasks that are too easy (high Pass@1, nothing to learn) or too hard (low Pass@N, beyond current capability).


Task selection by capability gap (illustrative Pass@1 vs. Pass@N values): Selected: 32% vs. 78% (large latent-capability gap); Too Easy: 85% vs. 91%; Too Hard: 5% vs. 8%.
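A minimal sketch of this gap-based filter is shown below; the threshold and the rollout helper are illustrative assumptions, not the released selection script.

python
# Illustrative sketch of discrepancy-driven task selection (threshold is assumed).
def select_by_capability_gap(tasks, rollout_fn, n=16, easy_thresh=0.8):
    selected = []
    for task in tasks:
        results = [rollout_fn(task) for _ in range(n)]  # 1/0 correctness per attempt
        pass_1 = sum(results) / n                       # average single-attempt accuracy
        pass_n = float(any(results))                    # solvable within n attempts
        if pass_n == 0.0:            # too hard: beyond current capability
            continue
        if pass_1 >= easy_thresh:    # too easy: already reliable, nothing to learn
            continue
        selected.append(task)        # latent-capability gap: keep for RL
    return selected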

Reward-Based Sampling

Multiple candidate responses are filtered by average reward scores to exclude trivial and unsolvable cases, focusing on medium-difficulty instances that provide optimal learning signals.


Reward-based sampling filter: trivial cases (average reward ≈ 1.0) and unsolvable cases (average reward ≈ 0.0) are excluded; medium-difficulty cases in between provide the optimal learning signal.

Reward System Architecture

Our RL setup employs a rule-based reward paradigm, where rewards are derived directly from task outcomes rather than learned preference models. Since different answer types require distinct verification strategies, we design answer-type-specific scoring rules via the reward/ module.


Category · Source · Reward design details
STEM · ViRL39K · Choice accuracy & math expression equivalence
reward/math.py
python
def math_reward_fn(completions, answer):
    model_answer = extract_answer(completions, format_strict=True)
    answer_parsed = parse(model_answer, extraction_config=[
        StringExtractionConfig(), LatexExtractionConfig(), ExprExtractionConfig(),
    ])
    gold_parsed = parse(answer, extraction_config=[
        StringExtractionConfig(), LatexExtractionConfig(), ExprExtractionConfig(),
    ])
    correct = verify(answer_parsed, gold_parsed)
    if not correct:
        correct = is_equal(answer, model_answer)  # SymPy fallback
    return 1.0 if correct else 0.0
Grounding · Ref-L4, VigoRL-SA · IoU between predicted and reference boxes; choice accuracy
reward/bbox.py
python
def bbox_reward_fn(completions, answer):
    # predicted / gt are [x1, y1, x2, y2] boxes parsed from the completion and the reference
    xA = max(predicted[0], gt[0]); yA = max(predicted[1], gt[1])
    xB = min(predicted[2], gt[2]); yB = min(predicted[3], gt[3])
    inter = max(0, xB - xA) * max(0, yB - yA)
    areaA = (predicted[2] - predicted[0]) * (predicted[3] - predicted[1])
    areaB = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / float(areaA + areaB - inter)
    return iou  # ∈ [0, 1]

IoU = Intersection / (Area₁ + Area₂ − Intersection)

Spatial · VigoRL-SAT · Choice accuracy
reward/multiple_choice.py
python
def multiplechoice_reward_fn(completions, answer):
    predicted = extract_boxed_content(completions)[-1]
    predicted = predicted.strip().strip('.()')
    predicted = predicted[0].upper() if predicted else ""
    return 1 if predicted == answer.upper() else 0
Counting · PixmoCount · Numeric token equivalence
reward/number.py
python
def number_reward_fn(completions, answer):
    answer_str = extract_boxed_content(completions)[-1]
    match = re.findall(r"([0-9\\.]+)", answer_str)
    count = match[-1] if match else ""
    return float(count.strip() == answer.strip())
Coding · WebCode2M, UniSVG · Token/tag overlap; SVG rendering similarity in [0, 1]
reward/htmlcode.py
python
def html_reward_fn(completions, answer):
    token_score = calculate_token_overlap(gen, ref)
    structure_score = calculate_tag_structure_similarity(gen, ref)
    reward = 0.6 * token_score + 0.4 * structure_score
    return max(0.0, min(1.0, reward))
reward/svgcode.py
python
def svg_reward_fn(completions, answer):
    token_score = calculate_token_overlap(gen, ref)
    structure_score = calculate_structure_similarity(gen, ref)
    image_score = calculate_image_similarity(gen_png, ref_png)  # SSIM
    reward = 0.5 * image_score + 0.25 * (token_score + structure_score)
    return reward  # ∈ [0, 1]

HTML: 0.6 × TokenJaccard + 0.4 × TagJaccard  |  SVG: 0.5 × SSIM + 0.25 × (Token + Tag)

OCR · InfoVQA · Text similarity
reward/ocr.py
python
def ocr_reward_fn(completions, answer):
    det = extract_boxed_content(completions)[-1]      # predicted text
    dist = levenshtein_distance(answer, det)
    length = max(len(answer), len(det))
    reward = 1 - dist / length
    return reward if reward >= 0.5 else 0  # threshold

Similarity = 1 − (Levenshtein / max(len₁, len₂)), clipped at 0.5

Diagram · AI2D · Choice accuracy
Format Reward (cross-cutting): requires exactly one <think> block, at least one \boxed{}, and boxed content ≤ 20% of total length.
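A minimal sketch of such a format check (assumed logic; not the exact reward/format.py implementation):

python
import re

# Sketch of the cross-cutting format check described above
# (assumed logic; not the exact reward/format.py implementation).
def format_reward_fn(completion: str) -> float:
    think_blocks = re.findall(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if len(think_blocks) != 1 or not boxed:
        return 0.0
    # The boxed answer must stay short relative to the whole response (<= 20%).
    boxed_len = sum(len(b) for b in boxed)
    return 1.0 if boxed_len <= 0.2 * len(completion) else 0.0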

Two-Stage Training Procedure

Training pipeline: we use Group Relative Policy Optimization (GRPO) within the asynchronous AReaL framework. The shared hyperparameters below give the global training configuration.


Stage 1

Answer-only RL

Stabilizes task performance with concise answers (19.9K samples, ./data/stage1-normal).


Model Base: LLaVA-OneVision-1.5-8B-Instruct
Data: ./data/stage1-normal (19.9K)
Prompt Template: Put ONLY your final answer within <answer></answer>.
Stage 2

Chain-of-Thought RL

Unlocks deeper reasoning via explicit thinking prompts (49.2K samples, ./data/stage2-long).


Model Init: Stage 1 checkpoint
Data: ./data/stage2-long (49.2K)
Prompt Template: Think and solve the following question step by step. Please put your thinking and analysis procedure within <think></think>. Put ONLY your final answer within <answer></answer>.
Shared GRPO Configuration

Algorithm: GRPO (AReaL framework)
Infrastructure: 8× GPUs, FSDP, SGLang
Optimizer: Adam (lr=2e-6, wd=0.01)
LR Schedule: constant with 0.1% warmup
Batch: BS=32, 16 samples/group
Max Tokens: 4096 (temp=1.0)
PPO Clip: ε=0.2, ε_high=0.28
Reward: scale=10.0, bias=−0.5
KL Penalty: 0.0 (no constraint)
Epochs / Dtype: 30 / bfloat16
Reward Modules: bbox.py, bool.py, critic.py, format.py, generalcode.py, htmlcode.py, math.py, multiple_choice.py, number.py, ocr.py, svgcode.py, string_matching.py

Extended Capability Analysis

Bar charts compare Score / Accuracy (%) for LLaVA-OV-1.5 8B, LLaVA-OV-1.5 RL (Thinking), LLaVA-OV-1.5 RL (Fast), and Qwen2.5-VL on Spatial Reasoning & Grounding and Coding benchmarks.

Figure 4. Performance comparison of LLaVA-OV-1.5 and the corresponding RL version on Spatial Reasoning & Grounding and Coding tasks.

Spatial & Grounding: RL “fast mode” significantly enhances fine-grained perception on SAT and Ref-L4 benchmarks.


Coding: “Thinking” mode achieves highest scores on Design2Code and UniSVG, demonstrating chain-of-thought benefits for structural code generation.


Performance Results

Table 3. Performance comparison across vision-language models on various benchmarks grouped by task type. All scores are reported as accuracy percentages unless otherwise specified.
| Task | Benchmark | LLaVA-OV-1.5 8B | LLaVA-OV-1.5 RL 8B (thinking) | LLaVA-OV-1.5 RL 8B (fast) |
|---|---|---|---|---|
| General VQA | MMStar | 67.7 | 68.2 (↑0.5) | 68.3 (↑0.6) |
| | MMBench (EN) | 84.1 | 85.7 (↑1.6) | 85.7 (↑1.6) |
| | MMBench (CN) | 81.0 | 84.2 (↑3.2) | 81.5 (↑0.5) |
| | MME-RealWorld (EN) | 61.7 | 63.4 (↑1.7) | 63.3 (↑1.6) |
| | MME-RealWorld (CN) | 56.1 | 56.1 (↑0.0) | 56.3 (↑0.2) |
| | SeedBench (image) | 77.3 | 76.7 | 77.6 (↑0.3) |
| | CV-Bench | 80.7 | 82.9 (↑2.2) | 81.1 (↑0.4) |
| | SEED-Bench-2-Plus | 69.2 | 69.5 (↑0.3) | 69.2 (↑0.0) |
| | RealWorldQA | 68.1 | 68.4 (↑0.3) | 70.6 (↑2.5) |
| | Avg. | 71.8 | 72.8 (↑1.0) | 72.6 (↑0.8) |
| Reasoning | MathVista (mini) | 69.6 | 72.3 (↑2.7) | 71.8 (↑2.2) |
| | WeMath | 61.5 | 69.4 (↑7.9) | 60.8 |
| | MathVision | 25.6 | 34.4 (↑8.8) | 26.2 (↑0.6) |
| | MMMU (val) | 55.4 | 58.8 (↑3.4) | 54.9 |
| | MMMU-Pro (standard) | 37.4 | 39.9 (↑2.5) | 38.0 (↑0.6) |
| | MMMU-Pro (vision) | 25.2 | 35.7 (↑10.5) | 29.0 (↑3.8) |
| | Avg. | 45.8 | 51.8 (↑6.0) | 46.8 (↑1.0) |
| OCR & Chart | ChartQA | 86.5 | 87.4 (↑0.9) | 87.0 (↑0.5) |
| | CharXiv (DQ) | 70.9 | 68.4 | 71.2 (↑0.3) |
| | DocVQA | 95.0 | 91.9 | 95.0 (↑0.0) |
| | OCRBench | 82.9 | 81.7 | 82.3 |
| | AI2D (w/ mask) | 84.2 | 83.7 | 84.3 (↑0.1) |
| | AI2D (w/o mask) | 94.1 | 93.7 | 93.9 |
| | InfoVQA | 78.4 | 76.6 | 78.7 (↑0.3) |
| | Avg. | 84.6 | 83.3 | 84.6 (↑0.0) |
| Others | PixmoCount | 62.2 | 65.7 (↑3.5) | 71.1 (↑8.9) |
| | CountBench | 88.2 | 86.8 | 88.6 (↑0.4) |
| | VL-RewardBench | 47.7 | 44.0 | 49.7 (↑2.0) |
| | V* | 78.0 | 79.1 (↑1.1) | 78.0 (↑0.0) |
| | Avg. | 69.0 | 66.0 | 71.6 (↑2.6) |

GRPO Algorithm & AReaL Async Framework

GRPO

GRPO eliminates the critic by sampling G = 16 completions per prompt and using group-normalized rewards as the baseline.


Objective: J(θ) = 𝔼[ min( r_t · Â_t , clip(r_t, 1−ε, 1+ε′) · Â_t ) · w_t ]
Ratio: r_t = π_θ(y_t | y_<t, x) / π_prox(y_t | y_<t, x)
Behavior weight: w_t = exp( log π_prox − log π_behave ), capped at 5.0
Advantage: Â(x, y_i) = ( r_i − μ_group ) / ( σ_group + ε )
Reward: r′ = (r_task − 0.5) × 10.0
GRPO loss implementation (utils/functional.py)
ppo_actor_loss_fn
python
# utils/functional.py — ppo_actor_loss_fn
ratio = torch.exp(logprobs - proximal_logprobs)
clipped_ratio = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip_higher)
pg_loss = torch.max(-advantages * ratio, -advantages * clipped_ratio)

# Behavior importance weight (off-policy correction)
behav_kl = proximal_logprobs - old_logprobs
behav_imp_weight = torch.clamp(behav_kl.exp(), max=behav_imp_weight_cap)
pg_loss = pg_loss * behav_imp_weight

AReaL Async Training

AReaL decouples rollout from gradient computation, achieving 2.77× throughput by eliminating GPU idle time.


Diagram: a rollout engine (SGLang / vLLM, 4× GPU) streams batches with log-probs to the FSDP actor (4× GPU); rollout generation (R1, R2, R3) overlaps with actor training steps on the timeline, yielding the 2.77× speedup.
Decoupled PPO
Three policies: π_behave (rollout), π_prox (recomputed), π_θ (current). The ratio uses π_θ / π_prox.
Staleness Control
max_head_offpolicyness η = 4. Rollout samples are restricted to ≤ 4 gradient steps behind the current policy.
Behavior Weight
w = exp(log π_prox − log π_behave), capped at 5.0 to prevent gradient explosion.
AReaL Training Loop (trains/grpo.py)
async_training_loop
python
# trains/grpo.py — main training loop
for global_step in range(start_step, max_steps):
    batch = rollout.prepare_batch(train_dataloader, workflow=workflow)
    batch["prox_logp"] = actor.compute_logp(batch)
    actor.compute_advantages(batch)
    actor.ppo_update(batch)

    rollout.pause()
    actor.update_weights(weight_update_meta)
    rollout.set_version(global_step + 1)
    rollout.resume()
rollout.prepare_batch(...)
What this step does
Dequeue a batch of already-generated, already-scored samples from the async ring buffer. This call does not trigger any LLM inference — it is a pure memory read.

In this configuration: batch_size = 32 prompts × n_samples = 16 rollouts per prompt = 512 trajectories per gradient step.
What rollout workers do concurrently
While the actor runs gradient steps, rollout workers run continuously on a dedicated GPU partition. Each worker loops:
  1. Draw a prompt from the data loader
  2. Concurrently launch 16 independent agenerate calls (gconfig.n_samples) via asyncio.gather — each samples autoregressively at temperature = 1.0 (raw logit distribution, no sharpening or flattening), producing 16 diverse responses for the same prompt. These 16 form one GRPO group.
  3. Score each response with a task reward — one scalar r ∈ [0, 1] per response, not per token:
    MULTIPLE_CHOICE
    option letter match → 1 / 0
    NUMBER / MATH
    math equivalence check → 1 / 0
    BBOX
    IoU with ground-truth box
    OCR_TEXT
    character-level similarity
  4. Push all 16 results into the shared ring buffer as a group
The per-token log-probs behav_logp[t] are recorded for free at step 2 — they are the exact logits already computed when sampling, so no extra forward pass is needed.
What each sample carries
prompt_ids: tokenized input (image tokens + text)
response_ids: the sampled output token sequence
r: scalar reward from the workflow callable
behav_logp: shape [seq_len], the log-prob of each response token under θ_behave at generation time; saved for IS correction in ppo_update
version: integer tag recording which actor checkpoint generated this sample
Staleness guard (η = 4)
Before returning, every sample’s version is compared against the current actor version. Samples more than 4 gradient steps old are discarded and replaced. Beyond that threshold the IS weight w would hit its 5.0 cap so frequently that the gradient signal degrades — the guard prevents this at the source.
Why can samples be up to 4 steps stale?
The key distinction
Rollout worker weights are synced every step — but the samples already sitting in the ring buffer were generated by an older version of those weights. The sync only affects what the worker generates from now on.
Timeline
# actor version advances each step
step 1: actor=θ1 → sync → rollout worker now uses θbehave(1)
step 2: actor=θ2 → sync → rollout worker now uses θbehave(2)
step 3: actor=θ3 → sync → rollout worker now uses θbehave(3)
# but the ring buffer still holds samples generated at step 1
step 5: prepare_batch() pulls a sample from θbehave(1) → version gap = 5−1 = 4 ≤ η → accepted
Why does the gap accumulate?
During a single gradient step, the actor finishes its ppo_update in roughly 1.5–3 s. Meanwhile rollout workers are continuously pushing new samples and old samples from earlier versions are still queued. By the time prepare_batch() drains the buffer, some of those samples may have been sitting there for up to max_head_offpolicyness = 4 gradient steps. Samples older than that are discarded.
This is why πprox exists
At the moment compute_logp runs, the actor is at version N. A sample in the batch might have been generated at version N−k (k ≤ 4). The gap θ_behave → θ can span multiple gradient steps and is unknown in magnitude. PPO clip cannot handle this directly — so we snapshot π_prox first, then split the correction into two legs: w for the cross-step gap, ratio for the within-step update.
actor.compute_logp(batch) → batch["prox_logp"]
What this step does
One teacher-forced forward pass over the already-generated response, using the current actor weights θ. The result is a frozen log-prob tensor batch["prox_logp"] of shape [512, seq_len] (32 prompts × 16 samples), representing πprox — the proximal policy snapshot at the start of this gradient step.
How teacher forcing works
The actor receives (prompt_ids, response_ids) from the buffer and constructs:
input = [prompt_ids | response_ids[:-1]] # response right-shifted by 1
The full concatenated sequence is fed into the LLM in a single forward pass. Causal masking makes position t’s output depend only on positions 0…t−1 — mathematically identical to step-by-step autoregressive decoding, but with O(seq_len) parallelism. The log-probs of the actual response tokens are read from the output logits and written to batch["prox_logp"].
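A minimal sketch of that teacher-forced log-prob pass, assuming a standard Hugging Face causal-LM interface and ignoring image tokens and padding for brevity:

python
import torch
import torch.nn.functional as F

# Sketch of the teacher-forced log-prob pass: one forward call over prompt + response
# (simplified: no padding mask, no image tokens; shapes are [batch, seq_len]).
@torch.no_grad()
def compute_prox_logp(model, prompt_ids, response_ids):
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    logits = model(input_ids).logits                    # [B, P + R, vocab]

    # Logits at position t - 1 predict token t, so shift by one.
    resp_start = prompt_ids.shape[1]
    pred_logits = logits[:, resp_start - 1 : -1, :]     # predictions for response tokens
    log_probs = F.log_softmax(pred_logits, dim=-1)
    return log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)  # [B, R]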
How many weight copies exist?
Exactly two physical copies:
θbehavethe behavior policy — rollout worker weights on the rollout GPU partition. “Behave” means it is the policy that acts in the environment: its generated token sequences determine what training data we collect. It never receives gradients — it only “performs” for the reward function. Replaced only during the weight-sync window.
θactor parameter tensors — on the actor GPU partition, updated by every optimizer.step()
πprox is not a third weight copy. It is the [batch, seq_len] log-prob tensor from one forward pass of θ, immediately .detach()-ed. Zero extra GPU memory beyond the tensor itself.
Why not use πbehave directly?
Standard synchronous PPO has two policies and its clip window ε = 0.2 is calibrated for one step of policy change. In AReaL, a sample can be 1–4 steps old (max_head_offpolicyness = 4), so the total shift from θ_behave to θ is variable and can be much larger than one step. Clipping this full shift with a fixed ε would make the window either too tight (cuts valid gradients) or too loose (no trust-region protection).
The fix: decompose the shift into two legs

The total drift πbehave → πθ passes through an intermediate anchor πprox. Each leg has a distinct character and gets a different corrective tool:

πbehave  ──────────────────  πprox  ─────────  πθ
     ←———— Leg 1: cross-step ————→←—— Leg 2 ——→
   variable size (0–4 global_steps)         exactly one gradient step
   handled by IS weight w  (hard cap 5.0)     handled by PPO clip [0.8, 1.28]
Leg 1 — cross-step drift
πbehave → πprox spans 1–4 gradient steps. Its size is unpredictable — applying a fixed ε clip here would either kill valid gradients (window too tight) or remove all trust-region protection (window too wide).
Fix: IS weight w = clip(πproxbehave, 0, 5.0) — soft re-weighting with a hard explosion guard.
Leg 2 — within-step drift
πprox → πθ is exactly one gradient update. Size is predictable and bounded — this is precisely what PPO clip was designed for.
Fix: asymmetric clip [0.8, 1.28] — conservative on contraction (ε = 0.2), more room for improvement (ε′ = 0.28).
w = clip(exp(logπprox − logπbehave), 0, 5.0)  # Leg 1 — behav_imp_weight_cap
ratio = exp(logπθ − logπprox)                    # Leg 2 — within-step ratio
L = w · max(−A·ratio, −A·clip(ratio, 0.8, 1.28))  # combined
actor.compute_advantages(batch)
What this step does
Compute GRPO group-normalized advantages — the scalar signal that tells each token whether it was part of a “good” or “bad” response relative to its siblings.
Group whitening formula — step-by-step

For each prompt, G = 16 responses were sampled. compute_advantages groups them and whitens rewards within the group. Here's a concrete example:

Prompt: "What is 12 × 8?"
Resp Answer Raw reward ri
1961.00
2961.00
3961.00
4950.50
5980.00
mean = 0.70, std = 0.41, ε = 1e-6
Ai = (ri0.70) / (0.41 + 1e-6)
✓ Correct (r = 1.00)
A = (1.00 − 0.70) / 0.41 = +0.73
Better than group average → positive signal
✗ Wrong (r = 0.00)
A = (0.00 − 0.70) / 0.41 = −1.71
Worse than group average → negative signal

Broadcast to tokens: All tokens in response #1 get A = +0.73. All tokens in response #5 get A = −1.71. This is GRPO's key approximation — the episode-level reward proxies per-token contribution.

Note on scale: Without std normalization, a binary task (0/1) would have A ∈ [−0.5, +0.5] while a math task (0–1 continuous) might have A ∈ [−2, +2]. Division by std equalizes gradient magnitudes across task types.
Why group normalization instead of a critic?

PPO needs a baseline V(s) to form the advantage A = Q(s,a) − V(s). The classic solution is a critic network — but this breaks down for LLMs/VLMs on three fronts:

① Memory cost
Critic must match actor scale (8B params). Running both doubles VRAM and optimizer state.
② Underfitting on VLM inputs
The critic must predict V from image + question + partial response. Early in training its estimates are pure noise — adding variance instead of reducing it.
③ Multi-task scale mismatch
Binary format rewards (0/1) and continuous math scores have different variance. A single critic cannot fit all reward scales simultaneously.

GRPO’s fix — Monte Carlo group baseline:

V_π(s) = E_{a∼π}[R(s,a)] ≈ mean(r_1 … r_G)  // G = 16 siblings drawn from the same prompt

This is an unbiased estimator of Vπ(s) by definition — no extra parameters, no warmup, self-consistent from step 1.

Standard-deviation normalization then equalizes gradient magnitude across tasks:

Ai = (ri − mean) / (std + ε)  // gradient magnitude normalized across all task types
Critic (PPO)
Extra 8B network • biased early • multi-task scale problem • needs warmup
Group baseline (GRPO) ✓
Zero extra params • unbiased MC estimate • per-task std normalization • works from step 1
Edge case: all responses tie
If all G responses receive identical rewards (std ≈ 0), all advantages collapse to zero and the gradient vanishes for that prompt — correct behavior, since there is no learning signal when all outputs are equally (un)rewarded. The small ε in the denominator prevents division-by-zero.
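The group whitening above amounts to only a few lines; here is a sketch under the same assumptions (G responses per prompt, one episode-level scalar reward each, later broadcast to every token of the response):

python
import torch

# Sketch of GRPO group-normalized advantages (G responses per prompt, scalar rewards).
def group_advantages(rewards, group_size=16, eps=1e-6):
    # rewards: flat tensor of episode-level scalars, grouped by prompt.
    r = rewards.view(-1, group_size)
    mean = r.mean(dim=1, keepdim=True)
    std = r.std(dim=1, keepdim=True)
    adv = (r - mean) / (std + eps)   # ties (std ~ 0) collapse to near-zero advantage
    return adv.view(-1)              # later broadcast to every token of each response

# e.g. rewards = torch.tensor([1.0, 1.0, 1.0, 0.5, 0.0]) with group_size=5 reproduces the
# worked example above: correct responses get a positive advantage, wrong ones negative.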
actor.ppo_update(batch)
What this step does
Compute the full AReaL-GRPO loss from two multiplicative components, then backpropagate and update actor weights.
Component 1 — PPO clipped surrogate (trust-region term)
ratio = exp(logπθ − logπprox)
LPPO = −min(ratio · A,  clip(ratio, 1−ε, 1+ε′) · A)
Asymmetric clipping (ε = 0.2, ε′ = 0.28) lets probability mass rise faster than it falls — accelerating reward-seeking while capping runaway updates. The min is a pessimistic bound:
  • A > 0 (good token): policy is rewarded for raising ratio, capped at 1+ε′
  • A < 0 (bad token): policy is penalized once ratio exceeds 1−ε
Component 2 — off-policy IS weight (AReaL correction)

The total drift from sample generation to the current update is πbehave → πθ. AReaL splits this into two segments with different properties and handles each separately:

πbehave  ────────────────  πprox  ────────  πθ
        ↑ Segment 1: cross-step drift ↑       ↑ Segment 2 ↑
          1–4 global_steps, variable size      one gradient step
          handled by w (IS weight, cap 5.0)     handled by PPO clip [0.8, 1.28]
Segment 1 — cross-step drift
Size is unpredictable (0–4 steps). PPO’s fixed ε can’t cover it: too tight clips valid gradients, too loose loses trust-region. Handled by the IS weight w: soft correction with a hard cap at 5.0 to prevent gradient explosion on stale samples.
Segment 2 — within-step drift
Exactly one gradient step wide — predictable, calibrated. PPO clip [0.8, 1.28] applies here with full trust-region guarantees restored.
w = clip(exp(logπprox − logπbehave), 0, 5.0)  # Segment 1 correction
L = w · LPPO                  # Segment 2 handled inside L_PPO via clip
No KL penalty (kl_ctl=0)
Standard RLHF adds a KL term to keep the policy near a reference model. AReaL omits it: PPO clip already provides a per-step trust region, and the staleness guard (η = 4) limits global drift. Adding KL on top would double-penalize deviation and slow convergence on vision tasks where the base model is already well-calibrated.
Backward & optimizer
Loss is averaged over all non-padding token positions and over the batch. loss.backward() computes per-parameter gradients; optimizer.step() (AdamW, lr = 2e-6) updates actor weights in-place.
rollout.pause() / actor.update_weights(...) / rollout.set_version(...) / rollout.resume()
What this step does
The weight synchronization window — the only moment actor and rollout workers are not fully parallel. Happens every single global_step, unconditionally. Four ordered sub-steps transfer the freshly-updated actor weights to the rollout GPU partition.
The four sub-steps
1rollout.pause() — calls self.paused.set() on a threading.Event. The rollout thread stops accepting new requests. Any in-flight generation that hits the paused check enters await asyncio.sleep(0.5) loop until the flag is cleared. pause_grace_period defaults to 0.0 s in this config.
2actor.update_weights(weight_update_meta) — transfers full parameter tensors (not deltas) to the rollout partition. Cross-node: NCCL/XCCL distributed. Single-node: disk-based load (fallback). weight_update_meta carries per-shard routing info (tensor-parallel rank → target device IDs).
3rollout.set_version(global_step + 1) — calls self._version = version under a lock, after the copy completes. All samples generated henceforth are tagged with this new version. The staleness guard in prepare_batch rejects samples where actor_version − sample_version > 4 (max_head_offpolicyness).
4rollout.resume() — calls self.paused.clear(). Workers exit the sleep loop and immediately resume generating with the new weights.
Why is this window so short?
NVLink / NVSwitch transfers an 8B model in roughly 30–50 ms. A typical gradient step on 4×A100s takes 1.5–3 s — the sync pause is <2% of wall-clock step time. AReaL’s 2.77× throughput gain comes precisely from keeping this window tiny relative to rollout generation time — which in a synchronous setup would block the actor entirely.

Training Configuration

eps_clip: 0.2 / 0.28 (asymmetric)
kl_ctl: 0.0 (disabled)
reward_scaling: 10.0
reward_bias: −0.5
group_size: 16
max_new_tokens: 4096
temperature: 1.0
learning_rate: 2e-6
epochs: 30
batch_size: 32
offpolicyness (η): 4
behav_weight_cap: 5.0
dtype: bfloat16
allocation: d4p1t1 + d4p1t1

Acknowledgements

We thank the following projects and frameworks:


  • AReaL: Lightning-Fast RL for LLM Reasoning and Agents
  • sglang: Fast serving framework for LLMs and vision language models
  • lmms-eval: Standardized evaluation framework
  • LLaVA: Large Language-and-Vision Assistant
  • LLaVA-NeXT: Next-generation multi-modal assistant

Citation

citation.bib
bibtex
@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  booktitle={arXiv},
  year={2025}
}

@inproceedings{xie2025region,
  title={Region-based Cluster Discrimination for Visual Representation Learning},
  author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle={ICCV},
  year={2025}
}

@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}