RICE

ICCV 2025 · Highlight
Region-based Cluster Discrimination for Visual Representation Learning
Yin Xie*1 Kaicheng Yang*1 Xiang An*1 Kun Wu1 Yongle Zhao1 Weimo Deng1 Zimin Ran1 Yumeng Wang1 Ziyong Feng1 Roy Miles2 Ismail Elezi2 Jiankang Deng2
1 GlintLab    2 Imperial College London    * Equal contribution

How to use it

Quickstart

Use this quickstart to load the released RICE-ViT checkpoint, prepare an input image, and extract visual features in a few lines.

01
Load dependencies
Import Transformers, PIL, requests, and torch so the model and image pipeline are ready.
02
Load model + processor
Pull the released Hugging Face checkpoint once, then reuse the same processor for inference.
03
Encode an image
Fetch an image, run a forward pass, and read out the feature tensor from the last hidden state.
Python Quickstart
Minimal example for loading the official checkpoint and printing the feature shape.
Imports
Model loading
Inference
from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
import torch

# Load the released checkpoint together with its matching preprocessor
model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

# Fetch a sample COCO image and convert it to model inputs
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Forward pass without gradient tracking (inference only)
with torch.no_grad():
    outputs = model(**inputs)

# Token features for the first (and only) image in the batch
features = outputs.last_hidden_state[0]
print(features.shape)
Output
This script prints the extracted feature tensor shape, e.g. torch.Size([...]), so you can quickly confirm the encoder is working.
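If you want a single vector per image rather than the full token grid, a common next step is to pool the patch tokens. The sketch below assumes the usual ViT layout (one leading class token, then patch tokens) and simple mean pooling; check the model card for the pooling scheme RICE actually recommends. The tensor is a dummy standing in for `outputs.last_hidden_state`.

```python
import torch
import torch.nn.functional as F

def pool_features(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Mean-pool the patch tokens (dropping the leading class token)
    and L2-normalize, giving one embedding per image."""
    patch_tokens = last_hidden_state[:, 1:, :]  # (B, 1 + N, D) -> (B, N, D)
    pooled = patch_tokens.mean(dim=1)           # (B, D)
    return F.normalize(pooled, dim=-1)

# Dummy batch standing in for `outputs.last_hidden_state`
dummy = torch.randn(2, 1601, 1024)  # 40x40 patches + 1 class token, dim 1024
emb = pool_features(dummy)
print(emb.shape)  # torch.Size([2, 1024])
```

Because the embeddings are normalized, pairwise cosine similarity is just a dot product, e.g. `emb @ emb.T`.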

Adopted By

RICE-ViT is used as the vision encoder in the following MLLMs:

Model   Organization   Downloads / month
LLaVA-OneVision-1.5-8B-Instruct LMMs-Lab 16,562
VAETKI-VL-7B-A1B NC-AI Consortium 949
Innovator-VL-8B-Instruct InnovatorLab 64

The Margin-based Vision Transformer (MVT) series represents a family of state-of-the-art vision encoders designed for universal visual representation learning. The latest version, RICE (Region-based Cluster Discrimination), advances visual understanding by processing diverse semantic regions within images using a single forward pass.

RICE introduces a novel approach to visual representation learning that jointly captures general visual semantics (objects, scenes), OCR semantics (text within images), and unified representations that seamlessly integrate both modalities. This enables superior performance across multiple vision tasks including image retrieval, visual question answering, and multimodal understanding.

Highlights

  • Region-based Processing — Processes diverse semantic regions within images — including objects, scenes, and text — using a single efficient forward pass through the vision transformer.
  • Joint OCR & Visual Semantics — Uniquely captures both general visual semantics and OCR semantics in a unified representation, enabling richer multimodal understanding of image content.
  • State-of-the-Art Performance — Achieves superior results across multiple vision benchmarks including image retrieval, VQA, and multimodal understanding tasks using the LLaVA-NeXT framework.

Architecture

Input Image
H×W×3
Vision Transformer
Single Forward Pass
Token Grid
H×W + 1 (Class Token)
Region Transformer
Sample Regions by Mask (ROI Align)
Region Attention
Region-specific Visibility Mask
Fixed-length Region Embeddings
Object Tokens · OCR Tokens
Unified Supervision
Object Region Loss
Single-label Cluster Discrimination
Semantic Cluster Centers
OCR Region Loss
Multi-label Cluster Discrimination
Vocabularies / Token Embeddings
Figure 1. RICE architecture efficiently processes diverse semantic regions within images using region-based cluster discrimination. The model jointly captures general visual semantics, OCR semantics, and unified representations in a single forward pass.
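One way to picture the region-specific visibility mask in Figure 1: each token carries a region id, and attention is allowed only between tokens that share a region. A toy sketch of that idea — the id-based layout is illustrative, not the released implementation:

```python
import numpy as np

def region_visibility_mask(region_ids: np.ndarray) -> np.ndarray:
    """Boolean (num_tokens, num_tokens) mask that lets token i attend to
    token j only when both belong to the same region."""
    return region_ids[:, None] == region_ids[None, :]

# Six tokens split across two regions (say, an object region and an OCR region)
ids = np.array([0, 0, 0, 1, 1, 1])
mask = region_visibility_mask(ids)
print(mask.astype(int))  # block-diagonal: each region attends only to itself
```

Because every region only sees its own tokens, a single forward pass can pool many regions at once without them leaking into each other.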

LLaVA Experimental Results

Comparison of RICE-ViT with other vision encoders using the LLaVA-NeXT framework. All models are evaluated using identical configurations: Qwen2.5-7B as the language model, LLaVA-NeXT training data, and the same training pipeline. We adopt LLaVA-NeXT's tiling strategy (up to 2×2+1 tiles) for handling high-resolution images.


RICE-ViT Vision Encoder
Qwen2.5-7B Language Model
Benchmark Results
Read the two Avg columns first to compare OCR gains and general vision quality at a glance.
Model Configuration   OCR & Document Understanding   General Vision Understanding
Method   Vision Tower
InfoVQA
DocVQA
ChartQA
TextVQA
OCRBench
OCRBenchV2
LiveXivQA
OCR Avg
AI2D
MMB-EN
MME-Cog
MME-Per
POPE
RealworldQA
MMStar
Other Avg
CLIP ViT-L-14-336px 38.9 75.2 66.5 62.5 52.5 23.0 47.4 52.3 73.2 74.6 48.0 75.6 88.8 63.7 49.0 67.6
MLCD ViT-L-14-336px 43.5 76.5 67.8 61.7 53.1 24.0 48.4 53.6 77.0 76.4 54.1 79.9 88.7 61.1 51.0 69.7
AIMv2 ViT-L-14-336px 35.4 77.2 72.7 65.9 57.2 23.9 47.3 54.2 75.4 78.6 48.3 75.0 88.4 62.2 50.2 68.3
RICE-ViT ViT-L-14-336px 45.2 79.2 72.3 65.9 57.5 24.1 48.9 56.2 77.9 76.6 54.6 80.7 88.5 63.1 51.8 70.5
DFN5B ViT-H-14-378px 38.6 70.9 64.4 59.4 47.3 21.9 46.2 49.8 73.5 73.4 45.8 76.9 88.6 59.9 49.1 66.7
SigLIP ViT-SO400M-14-384px 41.4 76.7 69.3 64.7 55.4 24.0 48.4 54.3 76.2 77.0 46.1 79.9 88.8 63.7 47.3 68.4
SigLIPv2 ViT-SO400M-14-384px 43.7 79.1 70.2 66.2 58.7 25.4 48.6 56.0 77.0 77.1 46.6 80.4 89.3 63.4 52.8 69.5
RICE-ViT ViT-L-14-378px 48.1 82.6 75.1 66.2 58.8 25.8 49.5 58.0 76.5 77.6 54.1 79.0 89.1 62.9 51.2 70.1
SigLIPv2 ViT-SO400M-16-560px 50.2 86.2 77.4 70.2 62.7 26.5 52.9 60.9 77.0 76.5 53.5 79.9 89.3 68.2 53.1 71.1
RICE-ViT ViT-L-14-560px 53.2 87.4 78.1 69.0 60.7 26.1 53.0 61.1 76.9 78.6 56.3 79.3 88.9 65.1 50.5 70.8
Qwen-ViT from Qwen2.5-VL 7B ViT-H-14-560px 55.9 85.8 78.8 73.7 66.2 26.8 53.4 62.9 78.8 78.4 62.0 80.8 88.6 64.2 55.0 72.5
RICE-ViT from OV-1.5 3B ViT-L-14-560px 53.7 87.1 81.9 73.8 73.3 30.4 53.6 64.8 80.3 79.6 58.6 82.2 89.0 67.3 56.6 73.4

Referring Image Segmentation

Table 2. Performance comparison of referring image segmentation across vision-language models. Results are reported as IoU scores (%) on the refCOCO, refCOCO+, and refCOCOg benchmarks. Our RICE vision encoder consistently outperforms all competing approaches, achieving state-of-the-art results across all benchmarks when integrated with Qwen2.5-7B in the LLaVA-NeXT framework.


RICE-ViT Vision Encoder
Qwen2.5-7B Language Model
SAM (Segment Anything)
Segmentation Results
RICE is highlighted against prior methods and two LLaVA baselines so you can scan the strongest IoU gains quickly.
Model Configuration   refCOCO   refCOCO+   refCOCOg
Vision Tower   LLM   Method   val   testA   testB   val   testA   testB   val   test
Previous Methods
CLIP Vicuna-7B GLaMM 79.5 83.2 76.9 72.6 78.7 64.6 74.2 74.9
CLIP Vicuna-7B VisionLLMv2 79.2 82.3 77.0 68.9 75.8 61.8 73.3 74.8
CLIP Vicuna-7B LLaVA-G-7B 77.1 68.8 71.5
CLIP LLaMA2-13B GSVA 79.2 81.7 77.1 70.3 73.8 63.6 75.7 77.0
CLIP LLaMA2-13B PixelLM-7B 73.0 66.3 69.3
ConvNext-L InternLM2-7B OMG-LLaVA 77.2 79.8 74.1 68.7 73.0 61.6 71.7 71.9
InternViT2.5-300M InternLM2.5-7B Sa2VA 81.6 76.2 78.7
InternViT2.5-6B InternLM2.5-20B Sa2VA 82.5 78.8 79.7
LLaVA-1.5 Framework
CLIP Vicuna-7B LISA 74.9 79.1 72.3 65.1 70.8 58.1 67.9 70.6
RICE Vicuna-7B LISA 76.3 80.3 75.1 67.4 72.7 60.6 69.0 73.4
Avg: +2.00 ↑ Improvement over CLIP   +1.4 +1.2 +2.8 +2.3 +1.9 +2.5 +1.1 +2.8
LLaVA-NeXT Framework
CLIP Qwen2.5-7B LISA 81.8 84.0 79.1 76.6 80.5 70.9 77.3 78.5
MLCD Qwen2.5-7B LISA 82.8 84.6 80.2 77.4 81.6 73.1 78.5 79.7
RICE Qwen2.5-7B LISA 83.5 85.3 81.7 79.4 82.8 75.4 79.8 80.4
Avg: +2.45 ↑ Improvement over CLIP   +1.7 +1.3 +2.6 +2.8 +2.3 +4.5 +2.5 +1.9
Avg: +1.30 ↑ Improvement over MLCD   +0.7 +0.7 +1.5 +2.0 +1.2 +2.3 +1.3 +0.7
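The "Avg" rows can be reproduced as a plain mean of the eight per-split deltas; a quick sanity check on the numbers above:

```python
# Per-split IoU improvements copied from Table 2
clip_llava15   = [1.4, 1.2, 2.8, 2.3, 1.9, 2.5, 1.1, 2.8]  # RICE vs. CLIP (LLaVA-1.5)
clip_llavanext = [1.7, 1.3, 2.6, 2.8, 2.3, 4.5, 2.5, 1.9]  # RICE vs. CLIP (LLaVA-NeXT)
mlcd_llavanext = [0.7, 0.7, 1.5, 2.0, 1.2, 2.3, 1.3, 0.7]  # RICE vs. MLCD (LLaVA-NeXT)

def avg(deltas):
    """Mean over the refCOCO / refCOCO+ / refCOCOg splits, rounded like the table."""
    return round(sum(deltas) / len(deltas), 2)

print(avg(clip_llava15), avg(clip_llavanext), avg(mlcd_llavanext))  # 2.0 2.45 1.3
```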

Detection Probe

We evaluate RICE against several state-of-the-art pretrained vision encoders across multiple benchmarks. All evaluations are conducted using the Cascade Mask R-CNN framework implemented in Detectron2. Experiments are performed on the COCO and LVIS validation sets, and on the Roboflow100 benchmarks. RICE achieves superior performance across all evaluation metrics, highlighting its strong representational quality for both natural images and specialized domains.


RICE-ViT Vision Encoder
Cascade Mask R-CNN Detector
COCO · LVIS · RF100
Focus on the final AVG column to compare cross-domain robustness after checking COCO and LVIS detection quality.
Configuration   COCO   LVIS   Roboflow100 Benchmarks
Method Arch Res Det AP Seg AP Det AP Seg AP
Aerial
Video Games
Microscopic
Under Water
Documents
Electromagnetic
Real world
AVG
DINOv2 ViT-B/14 518 31.6 24.3 18.7 14.1 2.3 14.3 10.6 19.9 18.8 15.3 26.8 15.4
SigLIP ViT-B/16 512 35.0 28.1 21.8 17.3 9.4 29.5 20.0 29.4 23.4 18.6 38.0 24.1
MLCD ViT-B/16 512 35.6 28.6 22.1 17.8 11.4 19.9 14.9 21.0 13.3 15.8 25.0 17.3
RICE ViT-B/16 512 38.9 31.5 26.5 21.4 14.9 31.7 23.4 30.7 27.0 18.7 39.1 26.5
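The trailing AVG column is consistent with an unweighted mean over the seven Roboflow100 domains; for example, the RICE row:

```python
# RICE ViT-B/16 per-domain APs from the Roboflow100 columns of Table 3
rice_domains = {
    "Aerial": 14.9, "Video Games": 31.7, "Microscopic": 23.4,
    "Under Water": 30.7, "Documents": 27.0, "Electromagnetic": 18.7,
    "Real world": 39.1,
}
rice_avg = round(sum(rice_domains.values()) / len(rice_domains), 1)
print(rice_avg)  # 26.5
```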

Tracking Probe

We evaluate the temporal matching capability of local features within the general video object tracking framework OSTrack, adopting an attention probing approach to compare the four pre-trained models. Two standard vision transformer blocks are inserted between the frozen backbone and the prediction head to enhance information exchange between the template and search images. As shown in Table 4, RICE achieves the best performance across all metrics on LaSOT, TrackingNet, GOT-10k, and TNL2k.

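The probing setup described above can be sketched as a frozen feature extractor, two trainable transformer blocks mixing template and search tokens, and a small head. The dimensions, the identity backbone, and the box head below are placeholders for illustration, not the actual OSTrack configuration.

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Frozen backbone + two trainable ViT-style blocks + a lightweight head."""
    def __init__(self, backbone: nn.Module, dim: int = 256):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # keep pretrained features fixed
            p.requires_grad = False
        # Two standard transformer blocks exchange information between
        # template and search tokens; only these and the head are trained.
        self.blocks = nn.Sequential(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
        )
        self.head = nn.Linear(dim, 4)  # toy box-regression head

    def forward(self, template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
        z = self.backbone(template)  # (B, Nz, D) frozen template features
        x = self.backbone(search)    # (B, Nx, D) frozen search features
        tokens = self.blocks(torch.cat([z, x], dim=1))
        return self.head(tokens[:, z.shape[1]:].mean(dim=1))  # pool search tokens

# Identity stands in for a frozen pretrained encoder emitting 256-d tokens
probe = AttentionProbe(backbone=nn.Identity(), dim=256)
out = probe(torch.randn(2, 16, 256), torch.randn(2, 64, 256))
print(out.shape)  # torch.Size([2, 4])
```

Because the backbone is frozen, differences in tracking metrics reflect how well each pretrained model's local features support temporal matching, not the capacity of the probe itself.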

RICE-ViT Vision Encoder
OSTrack Tracker
LaSOT · TrackingNet · GOT-10k · TNL2k
Read each benchmark block left to right: RICE consistently lifts Suc., Pre., and Norm. Pre. together.
  LaSOT TrackingNet GOT-10k TNL2k
Method Suc. Pre. Norm. Pre. Suc. Pre. Norm. Pre. AO SR-0.5 SR-0.75 Suc. Pre. Norm. Pre.
DINOv2 55.11 54.99 65.52 71.20 64.70 77.70 53.60 64.90 35.50 41.95 36.03 57.40
SigLIP 55.52 56.16 65.33 72.60 66.70 78.70 53.50 63.10 35.40 43.90 39.06 59.03
MLCD 58.05 60.75 68.31 75.30 70.20 80.80 53.80 62.80 39.70 45.22 40.62 60.64
RICE 60.24 63.16 69.66 76.30 71.80 81.30 55.40 63.50 41.60 45.70 41.95 61.18

Visualization

Semantic Feature Visualization
Figure 2. Semantic feature visualization using 2048-resolution images as input to ViT-B/16. Token features are projected onto RGB channels via PCA. Sequential frames (arranged vertically) show consistent attention on salient objects (ice skaters, deers, motorcyclists, cyclists), with stable color patterns maintained throughout the sequence.

Citation

If you find this work useful, please cite our paper:

@inproceedings{yinxie_2025_rice,
  title     = {Region-based Cluster Discrimination for Visual Representation Learning},
  author    = {Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle = {ICCV},
  year      = {2025}
}