Preprint 2026

OneVision-Encoder

Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence
LLaVA-OneVision Community

Proposes codec-aligned sparsity as a foundational principle, focusing computation exclusively on the 3.1%–25% of visual regions rich in signal entropy. Achieves a 4.1% average improvement over Qwen3-ViT on video understanding, demonstrating that efficiency and accuracy are positively correlated.

1 Introduction

Hypothesis. Artificial general intelligence is, at its core, a compression problem, and effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. Yet modern vision architectures have strayed from this principle. Visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static background rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video codecs.

Method. OneVision-Encoder encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OneVision-Encoder abandons uniform computation to focus exclusively on the 3.1%-25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, OneVision-Encoder employs a shared 3D RoPE and is trained with a large-scale cluster discrimination objective over more than one million semantic concepts, jointly capturing object permanence and motion dynamics.
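To make the shared 3D RoPE over irregular token layouts concrete, here is a minimal sketch: each retained patch keeps its (t, y, x) coordinate, and rotary embeddings are applied independently per axis over disjoint feature sub-dimensions. The axis split (32/16/16) and all sizes are illustrative assumptions, not the paper's configuration.

```python
import torch

def rope_angles(pos, dim, base=10000.0):
    """Rotary angles for one axis: pos (N,) -> (N, dim/2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(pos.float(), inv_freq)

def apply_rope(x, angles):
    """Rotate consecutive feature pairs of x (N, D) by angles (N, D/2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Irregular token layout: each selected patch keeps its (t, y, x) coordinate,
# so position encoding survives arbitrary sparsification of the grid.
coords = torch.tensor([[0, 2, 3], [5, 2, 3], [9, 7, 1]])  # (N, 3)
d_t, d_y, d_x = 32, 16, 16        # per-axis sub-dims; sum = head dim 64 (toy)
q = torch.randn(coords.shape[0], d_t + d_y + d_x)

q_rot = torch.cat([
    apply_rope(q[:, :d_t], rope_angles(coords[:, 0], d_t)),
    apply_rope(q[:, d_t:d_t + d_y], rope_angles(coords[:, 1], d_y)),
    apply_rope(q[:, d_t + d_y:], rope_angles(coords[:, 2], d_x)),
], dim=-1)
```

Because the rotation depends only on each token's own coordinates, tokens kept by codec selection need no dense grid or padding to be positioned correctly.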

Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated. By resolving the dichotomy between dense grids and sparse semantics, OneVision-Encoder redefines the performance frontier. When integrated into large multimodal models, it consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and pretraining data. Notably, on video understanding tasks, OneVision-Encoder achieves an average improvement of 4.1% over Qwen3-ViT. Under attentive probing, it achieves state-of-the-art representation quality, with 17.1% and 8.1% Top-1 accuracy improvements over SigLIP2 and DINOv3, respectively, on Diving-48 under identical patch budgets. These results demonstrate that codec-aligned, patch-level sparsity is not an optimization trick, but a foundational principle for next-generation visual generalists, positioning OneVision-Encoder as a scalable engine for universal multimodal intelligence.

2 Codec-Style Patch Selection

Traditional video understanding models process frames by uniform temporal sampling—selecting evenly spaced frames regardless of content. This approach treats all spatial regions equally, wasting computation on redundant background pixels that remain static across frames.

Inspired by HEVC video compression, our codec-style approach identifies and processes only the patches that carry meaningful temporal changes. Just as video codecs encode motion vectors and residuals rather than full frames, we select patches based on their information density—preserving the dynamic, semantically rich regions while discarding redundant static content.

Codec-Style Input. Left: Reference frame (t=1) with all patches. Right: Three animated blocks showing consecutive frames (t=2,3,4 → t=5,6,7 → ...), cycling through t=2 to t=64. Each frame shows only salient patches at their spatial positions. The result: 75-98% fewer patches while retaining the information that matters.

Traditional Frame Sampling. Uniformly samples 4 frames and processes all patches from each. Notice the redundancy: static backgrounds, repeated textures, and unchanging regions are processed multiple times across frames—wasting computation on pixels that add no new information.

Video → 3D Vision Transformer pipeline: HEVC codec decomposition → sparse patch selection → OneVision-Encoder.

Multi-modal vision input: images, uniform frames, or codec-aligned tokens all feed the same Vision Transformer via 3D RoPE.

3 Video Processing Pipeline

The visualization below demonstrates our video processing pipeline on a synthesized video scene. Panel 1 shows the dense video, panel 2 shows uniform frame sampling from that same stream, panel 3 visualizes the fused heat map computed from motion vectors and residuals, and panel 4 keeps only the final ViT patches selected from that heat.

Residual energy:  E_res(x, y) = clip(|Y(x, y) − 128| / p95, 0, 1)
Motion vector:    E_mv(x, y) = clip(√(v_x² + v_y²) / p95, 0, 1)
Fused heat:       H(x, y) = w_mv · E_mv(x, y) + w_res · E_res(x, y)
Patch mask:       M(x, y) = 1{H(x, y) > τ}
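These formulas map directly to code. A minimal sketch, assuming p95 denotes the 95th percentile of each energy field; the weights w_mv = 0.6, w_res = 0.4 and threshold τ = 0.3 are illustrative placeholders, not the paper's values:

```python
import numpy as np

def saliency_mask(Y, vx, vy, w_mv=0.6, w_res=0.4, tau=0.3):
    """Fuse residual energy and motion-vector magnitude into a patch mask.

    Y      : residual luma plane (H, W), values in [0, 255]
    vx, vy : per-patch motion-vector components (H, W)
    """
    # E_res: deviation of the residual from the neutral value 128, p95-normalized
    e_res = np.abs(Y.astype(np.float32) - 128.0)
    e_res = np.clip(e_res / (np.percentile(e_res, 95) + 1e-8), 0.0, 1.0)

    # E_mv: motion-vector magnitude, p95-normalized
    mag = np.sqrt(vx.astype(np.float32) ** 2 + vy.astype(np.float32) ** 2)
    e_mv = np.clip(mag / (np.percentile(mag, 95) + 1e-8), 0.0, 1.0)

    heat = w_mv * e_mv + w_res * e_res   # H(x, y)
    return heat, heat > tau              # M(x, y) = 1{H(x, y) > tau}

rng = np.random.default_rng(0)
Y = rng.integers(0, 256, size=(16, 16)).astype(np.float32)
vx, vy = rng.normal(size=(2, 16, 16))
heat, mask = saliency_mask(Y, vx, vy)
```

Both energies come for free from the codec bitstream, so the heat map costs only a weighted sum and a threshold per patch.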
Panel 1 (Dense): original video; all patches visible, smooth motion.
Panel 2 (Sampled): uniform frame sampling; periodic freezes, sub-sampled updates.
Panel 3 (Heat): temporal saliency detection; heatmap concentrated near motion regions.
Panel 4 (Selected): ViT patch selection; only salient patches are kept for the encoder.

All four panels come from the same synthesized video. Uniform sampling freezes that stream in time, while the heat map and ViT token selection are computed from the formula block above: residual energy + motion-vector magnitude → weighted heat → patch mask.

4 Global Contrastive Learning

Standard contrastive learning (e.g., CLIP) is limited by batch size—negative samples are drawn only from the current batch, typically 32K-64K examples. This creates a narrow view of the embedding space and leads to suboptimal representations. Our approach maintains a global concept bank of 2M clustered centers, enabling each training sample to contrast against a diverse, representative set of negatives regardless of batch composition. This produces more discriminative embeddings with better-separated semantic clusters.
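A minimal sketch of cluster discrimination against a global concept bank: each embedding is classified against all K centers, so negatives no longer depend on batch composition. Sizes and the temperature below are toy assumptions, and the bank here is random, whereas the actual bank holds roughly 2M clustered centers.

```python
import torch
import torch.nn.functional as F

def cluster_discrimination_loss(feats, labels, bank, temperature=0.07):
    """Classify each embedding against a global bank of cluster centers.

    feats  : (B, D) image/video embeddings
    labels : (B,)   index of each sample's assigned concept cluster
    bank   : (K, D) concept-center bank; K can far exceed the batch size
    """
    feats = F.normalize(feats, dim=-1)
    centers = F.normalize(bank, dim=-1)
    logits = feats @ centers.t() / temperature  # (B, K) cosine similarities
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
K, D, B = 4096, 256, 8            # toy sizes; the real bank is ~2M concepts
bank = torch.randn(K, D)
feats = torch.randn(B, D)
labels = torch.randint(0, K, (B,))
loss = cluster_discrimination_loss(feats, labels, bank)
```

Compared to in-batch contrast, the effective negative set is the whole bank, which is what yields the better-separated semantic clusters described above.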

5 LMM Probe Results

Comparison of different vision encoders on multimodal benchmarks. All models are evaluated on a unified multimodal setting using Qwen3-4B-Instruct2507 as the language backbone. OV-Encoder-Lang denotes the language-aligned variant of the OV-Encoder architecture. Qwen3-ViT is extracted from Qwen3-VL-4B. SigLIP2 uses siglip2-so400m-patch16-naflex. (Codec) indicates codec-guided visual encoding using motion vectors and residual signals, while (Frame) indicates frame-based visual encoding with dense spatial patchification. Bold values indicate the best performance under the same evaluation setting. Results reported in the left columns correspond to encoders trained with caption supervision, whereas results in the right columns correspond to encoders trained without caption supervision.

Table 3  Comparison on multimodal benchmarks with a unified LMM probe setting. All results use Qwen3-4B-Instruct2507 as the language backbone; the first two model columns (OV-Encoder-Lang, Qwen3-ViT) are trained with caption supervision, the last three without.

| Task  | Benchmark                | OV-Encoder-Lang (Codec) | Qwen3-ViT (Frame) | OV-Encoder (Codec) | OV-Encoder-Frame (Frame) | SigLIP2 (Frame) |
|-------|--------------------------|-------------------------|-------------------|--------------------|--------------------------|-----------------|
| Video | MVBench                  | 53.2                    | 47.4              | 52.4               | 49.8                     | 47.2            |
| Video | MLVU-dev                 | 47.4                    | 47.2              | 46.3               | 49.4                     | 48.4            |
| Video | NExT-QA (MC)             | 76.1                    | 70.1              | 75.6               | 71.9                     | 70.6            |
| Video | VideoMME                 | 54.1                    | 47.2              | 53.4               | 49.3                     | 46.8            |
| Video | Perception Test          | 60.6                    | 57.1              | 60.3               | 56.7                     | 56.0            |
| Video | TOMATO                   | 21.8                    | 22.2              | 22.2               | 21.8                     | 22.3            |
| Video | LongVideoBench-Val-Video | 51.6                    | 45.0              | 50.4               | 45.5                     | 45.2            |
| Image | AI2D                     | 80.2                    | 77.8              | 75.7               | 76.5                     | 78.6            |
| Image | ChartQA                  | 80.1                    | 79.6              | 76.5               | 77.8                     | 76.4            |
| Image | DocVQA                   | 83.2                    | 85.1              | 78.4               | 79.5                     | 75.0            |
| Image | InfoVQA                  | 51.6                    | 49.0              | 43.1               | 45.5                     | 42.0            |
| Image | MMBench-EN               | 80.2                    | 79.4              | 77.2               | 78.5                     | 79.6            |
| Image | OCRBench                 | 657                     | 706               | 605                | 630                      | 621             |
| Image | OCRBench v2              | 30.8                    | 30.6              | 26.3               | 26.1                     | 26.1            |
| Image | MMStar                   | 56.6                    | 56.6              | 52.1               | 54.3                     | 55.0            |
| Image | RealWorldQA              | 66.1                    | 63.3              | 60.8               | 61.2                     | 62.1            |


6 Attentive Probe Results

Performance comparison of different vision encoders using Attentive Probe evaluation. Models are evaluated using single clip input and trained for 10 epochs across 8 action recognition datasets. Results show average performance and per-dataset scores for 8-frame and 16-frame configurations.

OV-Encoder (Codec) refers to a variant of OV-Encoder that replaces traditional frame sampling with codec-style input, where dense full-frame inputs are substituted by codec-guided patch reorganization. Under the same attentive probe setting and token budget, patches are selectively reallocated across the input clip based on codec-native motion vectors and residuals, without changing the backbone architecture or training protocol. This results in stronger performance on motion-sensitive datasets, particularly Diving48 and Perception Test.
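For readers unfamiliar with the protocol, an attentive probe can be sketched as a single learnable query that cross-attends over the frozen backbone's patch tokens, followed by a linear classifier; only the probe is trained. The dimensions below are toy assumptions, not the evaluated configuration.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """One learnable query cross-attends over frozen patch tokens; a linear
    head classifies the pooled vector. Only the probe's parameters train."""
    def __init__(self, dim, num_classes, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                  # tokens: (B, N, D), frozen
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return self.head(pooled.squeeze(1))

probe = AttentiveProbe(dim=64, num_classes=48)  # Diving48 has 48 classes
logits = probe(torch.randn(2, 2048, 64))        # 2048 tokens = 8-frame budget
```

Because the backbone stays frozen, the probe measures representation quality directly, which is why codec-guided patch reallocation shows up so clearly on motion-sensitive datasets.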

Table 4  Comparison with state-of-the-art methods on video understanding benchmarks. We report top-1 accuracy (%) using an attentive probe with frozen backbones, evaluated under two input configurations: 8 Frames / 2048 Patches and 16 Frames / 4096 Patches. For OV-Encoder (Codec), inputs are constructed using Dense Video-codec Patchification, which selectively encodes temporally salient patches from dense video inputs under the corresponding patch budgets.

| Method             | Backbone | Res. | Avg. | SSV2 | Diving48 | Perception Test | CharadesEgo | Epic-Verb | Epic-Noun | Kinetics-400 | HMDB51 |
|--------------------|----------|------|------|------|----------|-----------------|-------------|-----------|-----------|--------------|--------|
| 8 Frames / 2048 Patches |     |      |      |      |          |                 |             |           |           |              |        |
| CLIP               | ViT-L/14 | 224  | 50.5 | 48.2 | 46.6     | 52.2            | 10.8        | 52.8      | 36.1      | 79.3         | 78.0   |
| SigLIP             | ViT-L/16 | 256  | 50.1 | 50.7 | 43.9     | 48.9            | 10.9        | 52.2      | 39.1      | 78.2         | 77.0   |
| MetaCLIP           | ViT-L/14 | 224  | 48.5 | 50.6 | 28.9     | 49.8            | 10.4        | 54.1      | 37.1      | 79.6         | 77.1   |
| MetaCLIP2          | ViT-L/14 | 224  | 50.2 | 47.2 | 48.0     | 47.7            | 11.0        | 48.0      | 40.9      | 82.4         | 76.3   |
| AIMv2              | ViT-L/14 | 224  | 53.8 | 55.1 | 43.6     | 55.1            | 12.0        | 56.6      | 45.6      | 81.1         | 81.3   |
| SigLIP2            | ViT-L/16 | 256  | 53.1 | 52.6 | 50.1     | 52.7            | 11.6        | 54.2      | 43.8      | 80.9         | 79.1   |
| DINOv3             | ViT-L/14 | 224  | 58.0 | 57.4 | 58.6     | 59.3            | 13.2        | 62.5      | 51.7      | 82.9         | 78.6   |
| OV-Encoder (Frame) | ViT-L/14 | 224  | 58.4 | 57.7 | 57.6     | 58.3            | 12.1        | 61.4      | 52.5      | 84.3         | 83.1   |
| OV-Encoder (Codec) | ViT-L/14 | 224  | 60.2 | 58.5 | 67.2     | 60.0            | 12.3        | 62.3      | 53.9      | 84.4         | 83.4   |
| 16 Frames / 4096 Patches |    |      |      |      |          |                 |             |           |           |              |        |
| SigLIP             | ViT-L/16 | 256  | 52.8 | 52.7 | 54.7     | 51.0            | 11.7        | 54.1      | 40.2      | 79.1         | 78.8   |
| MetaCLIP2          | ViT-L/14 | 224  | 51.0 | 49.3 | 42.1     | 51.1            | 11.2        | 49.2      | 43.2      | 84.0         | 78.2   |
| AIMv2              | ViT-L/14 | 224  | 56.4 | 57.2 | 55.7     | 56.4            | 12.4        | 58.3      | 46.2      | 82.2         | 82.6   |
| SigLIP2            | ViT-L/16 | 256  | 55.7 | 58.2 | 56.7     | 53.3            | 11.9        | 56.4      | 45.2      | 82.7         | 81.2   |
| DINOv3             | ViT-L/14 | 224  | 59.1 | 58.3 | 61.3     | 60.8            | 14.0        | 63.2      | 51.9      | 83.9         | 79.7   |
| OV-Encoder (Frame) | ViT-L/14 | 224  | 59.9 | 58.7 | 63.2     | 60.3            | 12.6        | 62.9      | 54.5      | 85.1         | 81.6   |
| OV-Encoder (Codec) | ViT-L/14 | 224  | 61.5 | 60.1 | 69.4     | 60.9            | 12.9        | 63.3      | 54.4      | 85.4         | 85.3   |

7 Patch-Efficient Video Understanding Comparison

Efficiency analysis comparing SigLIP2 with dense full-frame patch processing and OV-Encoder (Codec) under a fixed token budget. It is important to emphasize that OV-Encoder (Codec) does not perform temporal downsampling of the input video. All results are obtained from the same 64-frame (16384 patches) source video, where codec-native motion vectors and residuals are used to selectively extract a fixed number of spatiotemporal patches distributed across the entire temporal extent.

For a fair comparison, SigLIP2 is evaluated under the same token budgets and adopts a traditional frame sampling strategy, where each group of 256 patches corresponds to a contiguous RGB frame. Under a fixed token budget, OV-Encoder (Codec) redistributes patches across time while preserving their spatial positions, enabling long-range temporal coverage. As a result, it outperforms SigLIP2 on Diving48 and Perception Test while reducing patch processing by 75.0%–96.9% compared to dense processing of 16,384 patches.
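The reduction percentages follow directly from the fixed budgets. Below is a sketch of the arithmetic, plus fixed-budget top-k selection over hypothetical per-patch saliency scores (random scores stand in for the codec-derived heat):

```python
import numpy as np

# Patch-reduction arithmetic for the fixed budgets in Table 5:
# the dense source is 64 frames x 256 patches per frame = 16384 patches.
DENSE = 64 * 256
reductions = {b: 100.0 * (1 - b / DENSE) for b in (512, 1024, 2048, 4096)}
# {512: 96.875, 1024: 93.75, 2048: 87.5, 4096: 75.0} percent fewer patches

# A fixed budget is spent across the whole clip rather than on whole frames:
# rank every spatiotemporal patch by a saliency score and keep the top-k.
scores = np.random.default_rng(0).random(DENSE)  # hypothetical per-patch scores
budget = 512
keep = np.sort(np.argsort(scores)[-budget:])     # indices of retained patches
```

Since selection ranks patches globally across all 64 frames, even the smallest budget retains coverage of the full temporal extent, unlike frame sampling, which discards entire time spans.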

Table 5  Patch-efficient comparison under a fixed video token budget. We compare dense frame sampling against codec-guided patch selection on motion-sensitive benchmarks. SigLIP2 rows use traditional frame sampling (dense patch processing, 256 patches per frame); OV-Encoder (Codec) rows select N patches from the 16384-patch dense source, with the reduction shown in parentheses.

| Dataset         | Model                                | 512 Patches    | 1024 Patches   | 2048 Patches   | 4096 Patches   |
|-----------------|--------------------------------------|----------------|----------------|----------------|----------------|
| Diving48        | SigLIP2 (ViT-L/16, 256px)            | 28.1           | 48.7           | 50.1           | 56.7           |
| Diving48        | OV-Encoder (Codec) (ViT-L/14, 224px) | 46.5 (96.9% ↓) | 54.9 (93.8% ↓) | 67.2 (87.5% ↓) | 69.4 (75.0% ↓) |
| Perception Test | SigLIP2 (ViT-L/16, 256px)            | 38.7           | 50.1           | 52.7           | 53.3           |
| Perception Test | OV-Encoder (Codec) (ViT-L/14, 224px) | 50.5 (96.9% ↓) | 58.6 (93.8% ↓) | 60.0 (87.5% ↓) | 60.9 (75.0% ↓) |

* Percentages under OV-Encoder-Codec indicate patch reduction relative to dense processing of all 16384 patches.

'16384 patches → N patches' indicates Codec-Style Patch Selection, where motion-relevant patches are selectively retained instead of temporal frame sampling.

8 Contributors

Core Contributors

Feilong Tang, Xiang An, Yunyao Yan, Yin Xie

Contributors

Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng

9 BibTeX

@article{onevision_encoder_2026,
  title={OneVision Encoder},
  author={LMMs Lab, Glint Lab, AIM for Health Lab, MVP Lab},
  journal={arXiv preprint},
  year={2026}
}

If you find our work useful, please consider citing our paper.