Use this quickstart to load the released RICE-ViT checkpoint, prepare an input image, and extract visual features in a few lines.
```python
from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
import torch

# Load the released checkpoint and its matching processor.
model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

# Fetch a sample COCO image and preprocess it for the encoder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Run a forward pass without tracking gradients and take the token features.
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state[0]
print(features.shape)
```

The script prints the shape of the extracted feature tensor, e.g. `torch.Size([...])`, so you can quickly confirm the encoder is working.
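Beyond checking the shape, you will often want to separate the global token from the spatial patch features. Below is a minimal sketch, assuming the encoder prepends a single class token and the remaining tokens form a square patch grid (e.g. (560 / 14)² = 1600 patches for the 560px checkpoint); pass it the batched `outputs.last_hidden_state`:

```python
import torch

def split_tokens(last_hidden_state: torch.Tensor):
    """Split encoder output into a global token and a spatial feature grid."""
    global_token = last_hidden_state[:, 0]    # (batch, dim) class/global token
    patch_tokens = last_hidden_state[:, 1:]   # (batch, n_patches, dim)
    side = int(patch_tokens.shape[1] ** 0.5)  # e.g. 40 for a 560px input
    grid = patch_tokens.reshape(patch_tokens.shape[0], side, side, -1)
    return global_token, grid
```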
RICE-ViT is used as the vision encoder in the following MLLMs:
| Model | Organization | Downloads / month |
|---|---|---|
| LLaVA-OneVision-1.5-8B-Instruct | LMMs-Lab | 16,562 |
| VAETKI-VL-7B-A1B | NC-AI Consortium | 949 |
| Innovator-VL-8B-Instruct | InnovatorLab | 64 |
The Margin-based Vision Transformer (MVT) series represents a family of state-of-the-art vision encoders designed for universal visual representation learning. The latest version, RICE (Region-based Cluster Discrimination), advances visual understanding by processing diverse semantic regions within images using a single forward pass.
RICE introduces a novel approach to visual representation learning that jointly captures general visual semantics (objects, scenes), OCR semantics (text within images), and unified representations that seamlessly integrate both modalities. This enables superior performance across multiple vision tasks including image retrieval, visual question answering, and multimodal understanding.
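To make the region idea concrete, the sketch below shows one illustrative way to pool a region-level embedding from a patch-feature grid such as the one returned by `split_tokens` above. The pixel-space `box` argument and the 14px patch size are assumptions matching the patch14 checkpoints; RICE's actual learned region heads are not reproduced here.

```python
import torch

def region_feature(grid: torch.Tensor, box, patch_size: int = 14) -> torch.Tensor:
    """Average-pool patch features inside a pixel-space box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = (int(v // patch_size) for v in box)    # pixels -> patch coords
    return grid[0, y0:y1 + 1, x0:x1 + 1].mean(dim=(0, 1))   # (dim,) region embedding
```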
Comparison of RICE-ViT with other vision encoders using the LLaVA-NeXT framework. All models are evaluated using identical configurations: Qwen2.5-7B as the language model, LLaVA-NeXT training data, and the same training pipeline. We adopt LLaVA-NeXT's tiling strategy (up to 2×2+1 tiles) for handling high-resolution images.
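For reference, this kind of tiling can be sketched as follows. It is a rough illustration of LLaVA-NeXT-style tiling (a 2×2 grid of local crops plus one global view, each at the encoder's native resolution), not the framework's exact preprocessing code:

```python
from PIL import Image

def tile_image(image: Image.Image, tile_size: int = 560, grid: int = 2):
    """Split a high-resolution image into grid x grid local tiles plus one global view."""
    resized = image.resize((tile_size * grid, tile_size * grid))
    tiles = [
        resized.crop((x * tile_size, y * tile_size,
                      (x + 1) * tile_size, (y + 1) * tile_size))
        for y in range(grid) for x in range(grid)
    ]
    tiles.append(image.resize((tile_size, tile_size)))  # the "+1" global view
    return tiles  # 2x2 + 1 = 5 crops, each encoded independently
```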
In the table below, columns InfoVQA through OCR Avg report OCR & document understanding benchmarks, and columns AI2D through Other Avg report general vision understanding benchmarks.

| Method | Vision Tower | InfoVQA | DocVQA | ChartQA | TextVQA | OCRBench | OCRBenchV2 | LiveXivQA | OCR Avg | AI2D | MMB-EN | MME-Cog | MME-Per | POPE | RealworldQA | MMStar | Other Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-L-14-336px | 38.9 | 75.2 | 66.5 | 62.5 | 52.5 | 23.0 | 47.4 | 52.3 | 73.2 | 74.6 | 48.0 | 75.6 | 88.8 | 63.7 | 49.0 | 67.6 |
| MLCD | ViT-L-14-336px | 43.5 | 76.5 | 67.8 | 61.7 | 53.1 | 24.0 | 48.4 | 53.6 | 77.0 | 76.4 | 54.1 | 79.9 | 88.7 | 61.1 | 51.0 | 69.7 |
| AIMv2 | ViT-L-14-336px | 35.4 | 77.2 | 72.7 | 65.9 | 57.2 | 23.9 | 47.3 | 54.2 | 75.4 | 78.6 | 48.3 | 75.0 | 88.4 | 62.2 | 50.2 | 68.3 |
| RICE-ViT | ViT-L-14-336px | 45.2 | 79.2 | 72.3 | 65.9 | 57.5 | 24.1 | 48.9 | 56.2 | 77.9 | 76.6 | 54.6 | 80.7 | 88.5 | 63.1 | 51.8 | 70.5 |
| DFN5B | ViT-H-14-378px | 38.6 | 70.9 | 64.4 | 59.4 | 47.3 | 21.9 | 46.2 | 49.8 | 73.5 | 73.4 | 45.8 | 76.9 | 88.6 | 59.9 | 49.1 | 66.7 |
| SigLIP | ViT-SO400M-14-384px | 41.4 | 76.7 | 69.3 | 64.7 | 55.4 | 24.0 | 48.4 | 54.3 | 76.2 | 77.0 | 46.1 | 79.9 | 88.8 | 63.7 | 47.3 | 68.4 |
| SigLIPv2 | ViT-SO400M-14-384px | 43.7 | 79.1 | 70.2 | 66.2 | 58.7 | 25.4 | 48.6 | 56.0 | 77.0 | 77.1 | 46.6 | 80.4 | 89.3 | 63.4 | 52.8 | 69.5 |
| RICE-ViT | ViT-L-14-378px | 48.1 | 82.6 | 75.1 | 66.2 | 58.8 | 25.8 | 49.5 | 58.0 | 76.5 | 77.6 | 54.1 | 79.0 | 89.1 | 62.9 | 51.2 | 70.1 |
| SigLIPv2 | ViT-SO400M-16-560px | 50.2 | 86.2 | 77.4 | 70.2 | 62.7 | 26.5 | 52.9 | 60.9 | 77.0 | 76.5 | 53.5 | 79.9 | 89.3 | 68.2 | 53.1 | 71.1 |
| RICE-ViT | ViT-L-14-560px | 53.2 | 87.4 | 78.1 | 69.0 | 60.7 | 26.1 | 53.0 | 61.1 | 76.9 | 78.6 | 56.3 | 79.3 | 88.9 | 65.1 | 50.5 | 70.8 |
| Qwen-ViT from Qwen2.5-VL 7B | ViT-H-14-560px | 55.9 | 85.8 | 78.8 | 73.7 | 66.2 | 26.8 | 53.4 | 62.9 | 78.8 | 78.4 | 62.0 | 80.8 | 88.6 | 64.2 | 55.0 | 72.5 |
| RICE-ViT from OV-1.5 3B | ViT-L-14-560px | 53.7 | 87.1 | 81.9 | 73.8 | 73.3 | 30.4 | 53.6 | 64.8 | 80.3 | 79.6 | 58.6 | 82.2 | 89.0 | 67.3 | 56.6 | 73.4 |
Table 2. Performance comparison of referring image segmentation across vision-language models. Results are reported as IoU scores (%) on the refCOCO, refCOCO+, and refCOCOg benchmarks. Our RICE vision encoder consistently outperforms all competing approaches, achieving state-of-the-art results across all benchmarks when integrated with Qwen2.5-7B in the LLaVA-NeXT framework.
| Vision Tower | LLM | Method | refCOCO val | refCOCO testA | refCOCO testB | refCOCO+ val | refCOCO+ testA | refCOCO+ testB | refCOCOg val | refCOCOg test |
|---|---|---|---|---|---|---|---|---|---|---|
| Previous Methods | ||||||||||
| CLIP | Vicuna-7B | GLaMM | 79.5 | 83.2 | 76.9 | 72.6 | 78.7 | 64.6 | 74.2 | 74.9 |
| CLIP | Vicuna-7B | VisionLLMv2 | 79.2 | 82.3 | 77.0 | 68.9 | 75.8 | 61.8 | 73.3 | 74.8 |
| CLIP | Vicuna-7B | LLaVA-G-7B | 77.1 | – | – | 68.8 | – | – | 71.5 | – |
| CLIP | LLaMA2-13B | GSVA | 79.2 | 81.7 | 77.1 | 70.3 | 73.8 | 63.6 | 75.7 | 77.0 |
| CLIP | LLaMA2-13B | PixelLM-7B | 73.0 | – | – | 66.3 | – | – | 69.3 | – |
| ConvNext-L | InternLM2-7B | OMG-LLaVA | 77.2 | 79.8 | 74.1 | 68.7 | 73.0 | 61.6 | 71.7 | 71.9 |
| InternViT2.5-300M | InternLM2.5-7B | Sa2VA | 81.6 | – | – | 76.2 | – | – | 78.7 | – |
| InternViT2.5-6B | InternLM2.5-20B | Sa2VA | 82.5 | – | – | 78.8 | – | – | 79.7 | – |
| LLaVA-1.5 Framework | ||||||||||
| CLIP | Vicuna-7B | LISA | 74.9 | 79.1 | 72.3 | 65.1 | 70.8 | 58.1 | 67.9 | 70.6 |
| RICE | Vicuna-7B | LISA | 76.3 | 80.3 | 75.1 | 67.4 | 72.7 | 60.6 | 69.0 | 73.4 |
| Avg: +2.00 ↑ Improvement over CLIP | | | +1.4 | +1.2 | +2.8 | +2.3 | +1.9 | +2.5 | +1.1 | +2.8 |
| LLaVA-NeXT Framework | ||||||||||
| CLIP | Qwen2.5-7B | LISA | 81.8 | 84.0 | 79.1 | 76.6 | 80.5 | 70.9 | 77.3 | 78.5 |
| MLCD | Qwen2.5-7B | LISA | 82.8 | 84.6 | 80.2 | 77.4 | 81.6 | 73.1 | 78.5 | 79.7 |
| RICE | Qwen2.5-7B | LISA | 83.5 | 85.3 | 81.7 | 79.4 | 82.8 | 75.4 | 79.8 | 80.4 |
| Avg: +2.45 ↑ Improvement over CLIP | | | +1.7 | +1.3 | +2.6 | +2.8 | +2.3 | +4.5 | +2.5 | +1.9 |
| Avg: +1.30 ↑ Improvement over MLCD | | | +0.7 | +0.7 | +1.5 | +2.0 | +1.2 | +2.3 | +1.3 | +0.7 |
We evaluate RICE against several state-of-the-art pretrained vision encoders across multiple benchmarks. All evaluations are conducted using the Cascade Mask R-CNN framework implemented in Detectron2. Experiments are performed on the COCO and LVIS validation sets, and on the Roboflow100 benchmarks. RICE achieves superior performance across all evaluation metrics, highlighting its strong representational quality for both natural images and specialized domains.
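For orientation, a hedged sketch of what such an evaluation setup looks like in Detectron2 is shown below. The config file and dataset names are standard Detectron2 entries, while the custom ViT backbone adapter the experiments would need is only indicated in a comment, since the authors' exact integration code is not shown here:

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
# Start from Detectron2's stock Cascade Mask R-CNN configuration.
cfg.merge_from_file(
    model_zoo.get_config_file("Misc/cascade_mask_rcnn_R_50_FPN_3x.yaml")
)
cfg.DATASETS.TRAIN = ("coco_2017_train",)
cfg.DATASETS.TEST = ("coco_2017_val",)
# Swapping the ResNet backbone for a pretrained ViT requires registering a
# custom Backbone subclass via detectron2's BACKBONE_REGISTRY (omitted here).
```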
In the table below, columns Aerial through Real World report the seven Roboflow100 domain groups, and Avg is their average.

| Method | Arch | Res | COCO Det AP | COCO Seg AP | LVIS Det AP | LVIS Seg AP | Aerial | Video Games | Microscopic | Underwater | Documents | Electromagnetic | Real World | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv2 | ViT-B/14 | 518 | 31.6 | 24.3 | 18.7 | 14.1 | 2.3 | 14.3 | 10.6 | 19.9 | 18.8 | 15.3 | 26.8 | 15.4 |
| SigLIP | ViT-B/16 | 512 | 35.0 | 28.1 | 21.8 | 17.3 | 9.4 | 29.5 | 20.0 | 29.4 | 23.4 | 18.6 | 38.0 | 24.1 |
| MLCD | ViT-B/16 | 512 | 35.6 | 28.6 | 22.1 | 17.8 | 11.4 | 19.9 | 14.9 | 21.0 | 13.3 | 15.8 | 25.0 | 17.3 |
| RICE | ViT-B/16 | 512 | 38.9 | 31.5 | 26.5 | 21.4 | 14.9 | 31.7 | 23.4 | 30.7 | 27.0 | 18.7 | 39.1 | 26.5 |
We evaluate the temporal matching capability of local features within the general video object tracking framework OSTrack, adopting an attention probing approach to compare the four pre-trained models. Two standard vision transformer blocks are inserted between the frozen backbone and the prediction head to enhance information exchange between the template and search images. As shown in Table 4, RICE achieves the best performance across all metrics on LaSOT, TrackingNet, GOT-10k, and TNL2k.
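The probing setup can be sketched roughly as below; all module names and sizes are illustrative assumptions (a frozen token-producing backbone, two trainable transformer blocks, and a toy box head), not OSTrack's actual implementation:

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Two trainable transformer blocks between a frozen backbone and a box head."""

    def __init__(self, backbone: nn.Module, dim: int = 1024):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False  # probe only: the pretrained encoder stays frozen
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=16, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 4)  # toy (x, y, w, h) regression head

    def forward(self, template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = self.backbone(template)  # template-image tokens
            x = self.backbone(search)    # search-region tokens
        fused = self.fusion(torch.cat([z, x], dim=1))  # joint attention over both
        return self.head(fused.mean(dim=1))            # pooled features -> box
```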
| Method | LaSOT Suc. | LaSOT Pre. | LaSOT Norm. Pre. | TrackingNet Suc. | TrackingNet Pre. | TrackingNet Norm. Pre. | GOT-10k AO | GOT-10k SR-0.5 | GOT-10k SR-0.75 | TNL2k Suc. | TNL2k Pre. | TNL2k Norm. Pre. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv2 | 55.11 | 54.99 | 65.52 | 71.20 | 64.70 | 77.70 | 53.60 | 64.90 | 35.50 | 41.95 | 36.03 | 57.40 |
| SigLIP | 55.52 | 56.16 | 65.33 | 72.60 | 66.70 | 78.70 | 53.50 | 63.10 | 35.40 | 43.90 | 39.06 | 59.03 |
| MLCD | 58.05 | 60.75 | 68.31 | 75.30 | 70.20 | 80.80 | 53.80 | 62.80 | 39.70 | 45.22 | 40.62 | 60.64 |
| RICE | 60.24 | 63.16 | 69.66 | 76.30 | 71.80 | 81.30 | 55.40 | 63.50 | 41.60 | 45.70 | 41.95 | 61.18 |
If you find this work useful, please cite our paper:
```bibtex
@inproceedings{yinxie_2025_rice,
  title     = {Region-based Cluster Discrimination for Visual Representation Learning},
  author    = {Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle = {ICCV},
  year      = {2025}
}
```