Use this quickstart to load the released RICE-ViT checkpoint, prepare an input image, and extract visual features in a few lines.
```python
from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
import torch

# Load the released checkpoint and its matching processor.
model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

# Fetch a sample COCO image and preprocess it for the encoder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Run a forward pass without tracking gradients and take the token features.
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state[0]
print(features.shape)
```

The script prints the shape of the extracted feature tensor, e.g. `torch.Size([...])`, so you can quickly confirm the encoder is working.
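Beyond checking the shape, you will often want to separate the global token from the spatial patch features. Below is a minimal sketch, assuming the encoder prepends a single class token and the remaining tokens form a square patch grid (e.g. (560 / 14)² = 1600 patches for the 560px checkpoint); pass it the batched `outputs.last_hidden_state`:

```python
import torch

def split_tokens(last_hidden_state: torch.Tensor):
    """Split encoder output into a global token and a spatial feature grid."""
    global_token = last_hidden_state[:, 0]    # (batch, dim) class/global token
    patch_tokens = last_hidden_state[:, 1:]   # (batch, n_patches, dim)
    side = int(patch_tokens.shape[1] ** 0.5)  # e.g. 40 for a 560px input
    grid = patch_tokens.reshape(patch_tokens.shape[0], side, side, -1)
    return global_token, grid
```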
RICE-ViT is used as the vision encoder in the following MLLMs:
| Model | Organization | Downloads / month |
|---|---|---|
| LLaVA-OneVision-1.5-8B-Instruct | LMMs-Lab | 16,562 |
| VAETKI-VL-7B-A1B | NC-AI Consortium | 949 |
| Innovator-VL-8B-Instruct | InnovatorLab | 64 |
The Margin-based Vision Transformer (MVT) series represents a family of state-of-the-art vision encoders designed for universal visual representation learning. The latest version, RICE (Region-based Cluster Discrimination), advances visual understanding by processing diverse semantic regions within images using a single forward pass.
RICE introduces a novel approach to visual representation learning that jointly captures general visual semantics (objects, scenes), OCR semantics (text within images), and unified representations that seamlessly integrate both modalities. This enables superior performance across multiple vision tasks including image retrieval, visual question answering, and multimodal understanding.
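To make the region idea concrete, the sketch below shows one illustrative way to pool a region-level embedding from a patch-feature grid such as the one returned by `split_tokens` above. The pixel-space `box` argument and the 14px patch size are assumptions matching the patch14 checkpoints; RICE's actual learned region heads are not reproduced here.

```python
import torch

def region_feature(grid: torch.Tensor, box, patch_size: int = 14) -> torch.Tensor:
    """Average-pool patch features inside a pixel-space box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = (int(v // patch_size) for v in box)    # pixels -> patch coords
    return grid[0, y0:y1 + 1, x0:x1 + 1].mean(dim=(0, 1))   # (dim,) region embedding
```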
Comparison of RICE-ViT with other vision encoders using the LLaVA-NeXT framework. All models are evaluated using identical configurations: Qwen2.5-7B as the language model, LLaVA-NeXT training data, and the same training pipeline. We adopt LLaVA-NeXT's tiling strategy (up to 2×2+1 tiles) for handling high-resolution images.
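For reference, this kind of tiling can be sketched as follows. It is a rough illustration of LLaVA-NeXT-style tiling (a 2×2 grid of local crops plus one global view, each at the encoder's native resolution), not the framework's exact preprocessing code:

```python
from PIL import Image

def tile_image(image: Image.Image, tile_size: int = 560, grid: int = 2):
    """Split a high-resolution image into grid x grid local tiles plus one global view."""
    resized = image.resize((tile_size * grid, tile_size * grid))
    tiles = [
        resized.crop((x * tile_size, y * tile_size,
                      (x + 1) * tile_size, (y + 1) * tile_size))
        for y in range(grid) for x in range(grid)
    ]
    tiles.append(image.resize((tile_size, tile_size)))  # the "+1" global view
    return tiles  # 2x2 + 1 = 5 crops, each encoded independently
```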
In the table below, columns InfoVQA through OCR Avg report OCR & document understanding benchmarks, and columns AI2D through Other Avg report general vision understanding benchmarks.

| Method | Vision Tower | InfoVQA | DocVQA | ChartQA | TextVQA | OCRBench | OCRBenchV2 | LiveXivQA | OCR Avg | AI2D | MMB-EN | MME-Cog | MME-Per | POPE | RealworldQA | MMStar | Other Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-L-14-336px | 38.9 | 75.2 | 66.5 | 62.5 | 52.5 | 23.0 | 47.4 | 52.3 | 73.2 | 74.6 | 48.0 | 75.6 | 88.8 | 63.7 | 49.0 | 67.6 |
| MLCD | ViT-L-14-336px | 43.5 | 76.5 | 67.8 | 61.7 | 53.1 | 24.0 | 48.4 | 53.6 | 77.0 | 76.4 | 54.1 | 79.9 | 88.7 | 61.1 | 51.0 | 69.7 |
| AIMv2 | ViT-L-14-336px | 35.4 | 77.2 | 72.7 | 65.9 | 57.2 | 23.9 | 47.3 | 54.2 | 75.4 | 78.6 | 48.3 | 75.0 | 88.4 | 62.2 | 50.2 | 68.3 |
| RICE-ViT | ViT-L-14-336px | 45.2 | 79.2 | 72.3 | 65.9 | 57.5 | 24.1 | 48.9 | 56.2 | 77.9 | 76.6 | 54.6 | 80.7 | 88.5 | 63.1 | 51.8 | 70.5 |
| DFN5B | ViT-H-14-378px | 38.6 | 70.9 | 64.4 | 59.4 | 47.3 | 21.9 | 46.2 | 49.8 | 73.5 | 73.4 | 45.8 | 76.9 | 88.6 | 59.9 | 49.1 | 66.7 |
| SigLIP | ViT-SO400M-14-384px | 41.4 | 76.7 | 69.3 | 64.7 | 55.4 | 24.0 | 48.4 | 54.3 | 76.2 | 77.0 | 46.1 | 79.9 | 88.8 | 63.7 | 47.3 | 68.4 |
| SigLIPv2 | ViT-SO400M-14-384px | 43.7 | 79.1 | 70.2 | 66.2 | 58.7 | 25.4 | 48.6 | 56.0 | 77.0 | 77.1 | 46.6 | 80.4 | 89.3 | 63.4 | 52.8 | 69.5 |
| RICE-ViT | ViT-L-14-378px | 48.1 | 82.6 | 75.1 | 66.2 | 58.8 | 25.8 | 49.5 | 58.0 | 76.5 | 77.6 | 54.1 | 79.0 | 89.1 | 62.9 | 51.2 | 70.1 |
| SigLIPv2 | ViT-SO400M-16-560px | 50.2 | 86.2 | 77.4 | 70.2 | 62.7 | 26.5 | 52.9 | 60.9 | 77.0 | 76.5 | 53.5 | 79.9 | 89.3 | 68.2 | 53.1 | 71.1 |
| RICE-ViT | ViT-L-14-560px | 53.2 | 87.4 | 78.1 | 69.0 | 60.7 | 26.1 | 53.0 | 61.1 | 76.9 | 78.6 | 56.3 | 79.3 | 88.9 | 65.1 | 50.5 | 70.8 |
| Qwen-ViT from Qwen2.5-VL 7B | ViT-H-14-560px | 55.9 | 85.8 | 78.8 | 73.7 | 66.2 | 26.8 | 53.4 | 62.9 | 78.8 | 78.4 | 62.0 | 80.8 | 88.6 | 64.2 | 55.0 | 72.5 |
| RICE-ViT from OV-1.5 3B | ViT-L-14-560px | 53.7 | 87.1 | 81.9 | 73.8 | 73.3 | 30.4 | 53.6 | 64.8 | 80.3 | 79.6 | 58.6 | 82.2 | 89.0 | 67.3 | 56.6 | 73.4 |
Table 2. Performance comparison of referring image segmentation across vision-language models. Results are reported as IoU scores (%) on the refCOCO, refCOCO+, and refCOCOg benchmarks. Our RICE vision encoder consistently outperforms all competing approaches, achieving state-of-the-art results across all benchmarks when integrated with Qwen2.5-7B in the LLaVA-NeXT framework.
| Vision Tower | LLM | Method | refCOCO val | refCOCO testA | refCOCO testB | refCOCO+ val | refCOCO+ testA | refCOCO+ testB | refCOCOg val | refCOCOg test |
|---|---|---|---|---|---|---|---|---|---|---|
| Previous Methods | ||||||||||
| CLIP | Vicuna-7B | GLaMM | 79.5 | 83.2 | 76.9 | 72.6 | 78.7 | 64.6 | 74.2 | 74.9 |
| CLIP | Vicuna-7B | VisionLLMv2 | 79.2 | 82.3 | 77.0 | 68.9 | 75.8 | 61.8 | 73.3 | 74.8 |
| CLIP | Vicuna-7B | LLaVA-G-7B | 77.1 | – | – | 68.8 | – | – | 71.5 | – |
| CLIP | LLaMA2-13B | GSVA | 79.2 | 81.7 | 77.1 | 70.3 | 73.8 | 63.6 | 75.7 | 77.0 |
| CLIP | LLaMA2-13B | PixelLM-7B | 73.0 | – | – | 66.3 | – | – | 69.3 | – |
| ConvNext-L | InternLM2-7B | OMG-LLaVA | 77.2 | 79.8 | 74.1 | 68.7 | 73.0 | 61.6 | 71.7 | 71.9 |
| InternViT2.5-300M | InternLM2.5-7B | Sa2VA | 81.6 | – | – | 76.2 | – | – | 78.7 | – |
| InternViT2.5-6B | InternLM2.5-20B | Sa2VA | 82.5 | – | – | 78.8 | – | – | 79.7 | – |
| LLaVA-1.5 Framework | ||||||||||
| CLIP | Vicuna-7B | LISA | 74.9 | 79.1 | 72.3 | 65.1 | 70.8 | 58.1 | 67.9 | 70.6 |
| RICE | Vicuna-7B | LISA | 76.3 | 80.3 | 75.1 | 67.4 | 72.7 | 60.6 | 69.0 | 73.4 |
| Avg: +2.00 ↑ Improvement over CLIP | | | +1.4 | +1.2 | +2.8 | +2.3 | +1.9 | +2.5 | +1.1 | +2.8 |
| LLaVA-NeXT Framework | ||||||||||
| CLIP | Qwen2.5-7B | LISA | 81.8 | 84.0 | 79.1 | 76.6 | 80.5 | 70.9 | 77.3 | 78.5 |
| MLCD | Qwen2.5-7B | LISA | 82.8 | 84.6 | 80.2 | 77.4 | 81.6 | 73.1 | 78.5 | 79.7 |
| RICE | Qwen2.5-7B | LISA | 83.5 | 85.3 | 81.7 | 79.4 | 82.8 | 75.4 | 79.8 | 80.4 |
| Avg: +2.45 ↑ Improvement over CLIP | | | +1.7 | +1.3 | +2.6 | +2.8 | +2.3 | +4.5 | +2.5 | +1.9 |
| Avg: +1.30 ↑ Improvement over MLCD | | | +0.7 | +0.7 | +1.5 | +2.0 | +1.2 | +2.3 | +1.3 | +0.7 |
We evaluate RICE against several state-of-the-art pretrained vision encoders across multiple benchmarks. All evaluations are conducted using the Cascade Mask R-CNN framework implemented in Detectron2. Experiments are performed on the COCO and LVIS validation sets, and on the Roboflow100 benchmarks. RICE achieves superior performance across all evaluation metrics, highlighting its strong representational quality for both natural images and specialized domains.
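For orientation, a hedged sketch of what such an evaluation setup looks like in Detectron2 is shown below. The config file and dataset names are standard Detectron2 entries, while the custom ViT backbone adapter the experiments would need is only indicated in a comment, since the authors' exact integration code is not shown here:

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
# Start from Detectron2's stock Cascade Mask R-CNN configuration.
cfg.merge_from_file(
    model_zoo.get_config_file("Misc/cascade_mask_rcnn_R_50_FPN_3x.yaml")
)
cfg.DATASETS.TRAIN = ("coco_2017_train",)
cfg.DATASETS.TEST = ("coco_2017_val",)
# Swapping the ResNet backbone for a pretrained ViT requires registering a
# custom Backbone subclass via detectron2's BACKBONE_REGISTRY (omitted here).
```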
In the table below, columns Aerial through Real World report the seven Roboflow100 domain groups, and Avg is their average.

| Method | Arch | Res | COCO Det AP | COCO Seg AP | LVIS Det AP | LVIS Seg AP | Aerial | Video Games | Microscopic | Underwater | Documents | Electromagnetic | Real World | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv2 | ViT-B/14 | 518 | 31.6 | 24.3 | 18.7 | 14.1 | 2.3 | 14.3 | 10.6 | 19.9 | 18.8 | 15.3 | 26.8 | 15.4 |
| SigLIP | ViT-B/16 | 512 | 35.0 | 28.1 | 21.8 | 17.3 | 9.4 | 29.5 | 20.0 | 29.4 | 23.4 | 18.6 | 38.0 | 24.1 |
| MLCD | ViT-B/16 | 512 | 35.6 | 28.6 | 22.1 | 17.8 | 11.4 | 19.9 | 14.9 | 21.0 | 13.3 | 15.8 | 25.0 | 17.3 |
| RICE | ViT-B/16 | 512 | 38.9 | 31.5 | 26.5 | 21.4 | 14.9 | 31.7 | 23.4 | 30.7 | 27.0 | 18.7 | 39.1 | 26.5 |
We evaluate the temporal matching capability of local features within the general video object tracking framework OSTrack, adopting an attention probing approach to compare the four pre-trained models. Two standard vision transformer blocks are inserted between the frozen backbone and the prediction head to enhance information exchange between the template and search images. As shown in Table 4, RICE achieves the best performance across all metrics on LaSOT, TrackingNet, GOT-10k, and TNL2k.
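The probing setup can be sketched roughly as below; all module names and sizes are illustrative assumptions (a frozen token-producing backbone, two trainable transformer blocks, and a toy box head), not OSTrack's actual implementation:

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Two trainable transformer blocks between a frozen backbone and a box head."""

    def __init__(self, backbone: nn.Module, dim: int = 1024):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False  # probe only: the pretrained encoder stays frozen
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=16, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 4)  # toy (x, y, w, h) regression head

    def forward(self, template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = self.backbone(template)  # template-image tokens
            x = self.backbone(search)    # search-region tokens
        fused = self.fusion(torch.cat([z, x], dim=1))  # joint attention over both
        return self.head(fused.mean(dim=1))            # pooled features -> box
```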
| Method | LaSOT Suc. | LaSOT Pre. | LaSOT Norm. Pre. | TrackingNet Suc. | TrackingNet Pre. | TrackingNet Norm. Pre. | GOT-10k AO | GOT-10k SR-0.5 | GOT-10k SR-0.75 | TNL2k Suc. | TNL2k Pre. | TNL2k Norm. Pre. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv2 | 55.11 | 54.99 | 65.52 | 71.20 | 64.70 | 77.70 | 53.60 | 64.90 | 35.50 | 41.95 | 36.03 | 57.40 |
| SigLIP | 55.52 | 56.16 | 65.33 | 72.60 | 66.70 | 78.70 | 53.50 | 63.10 | 35.40 | 43.90 | 39.06 | 59.03 |
| MLCD | 58.05 | 60.75 | 68.31 | 75.30 | 70.20 | 80.80 | 53.80 | 62.80 | 39.70 | 45.22 | 40.62 | 60.64 |
| RICE | 60.24 | 63.16 | 69.66 | 76.30 | 71.80 | 81.30 | 55.40 | 63.50 | 41.60 | 45.70 | 41.95 | 61.18 |
If you find this work useful, please cite our paper:
```bibtex
@inproceedings{yinxie_2025_rice,
  title     = {Region-based Cluster Discrimination for Visual Representation Learning},
  author    = {Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle = {ICCV},
  year      = {2025}
}
```