Uncompressed video is enormous. A single second of 1080p 60fps video requires roughly 350 MB of storage. To stream video over the internet, we must eliminate redundancy. Video exhibits strong temporal redundancy: adjacent frames are usually nearly identical, with only a few objects moving.
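The ~350 MB figure is easy to verify with back-of-the-envelope arithmetic, assuming 8-bit RGB (3 bytes per pixel); real capture pipelines often use YUV 4:2:0, which halves this to 1.5 bytes per pixel:

```python
# Raw data rate of 1080p 60fps video, assuming 8-bit RGB (3 bytes/pixel).
width, height, fps = 1920, 1080, 60
bytes_per_pixel = 3
bytes_per_second = width * height * bytes_per_pixel * fps
print(bytes_per_second / 1e6)  # ≈ 373 MB/s, the same order as the 350 MB quoted
```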
Encoders therefore categorize video frames into different types:

I-frame (Intra-coded frame, keyframe): Carries complete image information, like a standalone JPEG. It has the lowest compression ratio, but it is the foundation for independent decoding.

P-frame (Predictive frame): Does not store the full picture. It references previous frames and records only the "differences" from them, which saves a large amount of space.
Since P-frames record only differences, how does the encoder find them? Through Motion Estimation: the encoder splits the current frame into small blocks (macroblocks) and searches a local region of the reference frame (the previous frame) for the most similar block.

The spatial offset of that matching block relative to the current block is the Motion Vector (MV).
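The search described above can be sketched as an exhaustive block search minimizing the Sum of Absolute Differences (SAD). This is a toy version for illustration; real encoders use fast diamond or hexagon search patterns rather than brute force:

```python
import numpy as np

def motion_search(cur_block, ref_frame, bx, by, search=8):
    """Find the offset (dx, dy) in the reference frame that minimizes
    the SAD against the current block. (bx, by) is the block's top-left
    corner; `search` is the search radius in pixels."""
    n = cur_block.shape[0]
    H, W = ref_frame.shape
    best = (0, 0, float("inf"))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + n > W or y + n > H:
                continue
            cand = ref_frame[y:y + n, x:x + n]
            sad = int(np.abs(cur_block.astype(int) - cand.astype(int)).sum())
            if sad < best[2]:
                best = (dx, dy, sad)
    return best  # motion vector (dx, dy) and its SAD

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (64, 64)).astype(np.uint8)
cur = ref[24:40, 27:43]  # the "current" block: ref content shifted 3 px right
print(motion_search(cur, ref, 24, 24))  # → (3, 0, 0): a perfect match
```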
Even after finding the best matching block, minor differences usually remain between the predicted block and the current block (due to lighting changes, deformation, and so on). The pixel-difference matrix between the two is called the Residual.

For a P-frame, what the encoder actually writes to the file is: motion vectors + the compressed residual. At decode time, adding the residual back onto the predicted block reconstructs the current frame.
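The encode/decode symmetry can be shown in a few lines. This sketch is lossless; real codecs transform and quantize the residual, so reconstruction is only approximate:

```python
import numpy as np

# P-frame reconstruction sketch: decoder output = predicted block + residual.
# `pred` is the block fetched from the reference frame at the motion vector;
# `cur` is the true current block. Only the residual is transmitted.
rng = np.random.default_rng(0)
pred = rng.integers(0, 256, (16, 16)).astype(np.int16)
cur = np.clip(pred + rng.integers(-4, 5, (16, 16)), 0, 255).astype(np.int16)

residual = cur - pred            # what the encoder writes (before transform+quant)
reconstructed = pred + residual  # what the decoder computes
assert np.array_equal(reconstructed, cur)
```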
Video frames are organized into a GOP (Group of Pictures). A typical GOP starts with an I-frame, followed by multiple P-frames (and sometimes bi-directionally predicted B-frames).

GOP length matters: shorter GOPs enable faster stream loading and seeking, because decoding must start from an I-frame; longer GOPs deliver higher compression ratios.
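The seeking constraint follows directly from the dependency chain: to display any frame, a decoder must rewind to the nearest preceding I-frame. A minimal sketch:

```python
def decode_start(frame_types, target):
    """Return the index where decoding must begin to display frame `target`:
    the most recent I-frame at or before it, since each P-frame depends
    on its predecessor."""
    for i in range(target, -1, -1):
        if frame_types[i] == "I":
            return i
    raise ValueError("stream must start with an I-frame")

# Two GOPs of length 8: I P P P P P P P | I P P P P P P P
types = (["I"] + ["P"] * 7) * 2
print(decode_start(types, 5))   # 0 -> six frames must be decoded to show frame 5
print(decode_start(types, 10))  # 8 -> seek lands close to the second I-frame
```

A shorter GOP shrinks the worst-case distance between `target` and the returned index, which is exactly why it makes seeking cheaper.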
Click a frame to see its reference dependencies.
Putting all of these concepts together, we can see exactly how a P-frame is encoded and transmitted: it first finds the MV via motion estimation, then computes the residual, and finally encodes both into the binary bitstream.
We have now covered the core concepts of video codecs: motion vectors, residuals, I/P-frames, and the GOP structure. How do these concepts apply to modern AI video understanding? OneVision-Encoder (arXiv:2602.08683) offers an elegant answer: directly leveraging the codec's MV and residual signals to guide sparse visual-token selection.

Traditional video understanding models usually employ uniform frame sampling (e.g., 4-8 frames) and process every patch of each frame, wasting significant compute on static backgrounds.

OneVision-Encoder proposes the core principle of "Dense Frames Sparsely": massively increase the number of sampled frames (e.g., 64) for very high temporal coverage, while selecting only the globally most informative 3%-25% of patches for computation. This yields stronger video understanding at an unchanged total compute budget.
How do we decide which patches carry important information? OneVision-Encoder extracts Motion Vectors (MV) and Residuals directly from the underlying video codec bitstream as prior knowledge.

After extracting the bitstream via cv_reader, we compute the residual energy and the motion energy (after subtracting global camera motion), each normalized by its 95th percentile, and fuse the two with a weighted sum into a per-region "heat" score.
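The fusion step can be sketched as follows. The 0.5/0.5 weights and the median-based camera-motion compensation are illustrative assumptions, not the project's exact values:

```python
import numpy as np

def p95_norm(x):
    """Normalize a map by its 95th percentile and clip to [0, 1]."""
    p = np.percentile(x, 95)
    return np.clip(x / (p + 1e-8), 0.0, 1.0)

def heat_score(residual_energy, mv_energy, w_res=0.5, w_mv=0.5):
    """Fuse per-block residual energy and (camera-compensated) motion
    energy into a heat score. Weights here are illustrative."""
    return w_res * p95_norm(residual_energy) + w_mv * p95_norm(mv_energy)

def compensate(mv_field):
    """Remove global camera motion by subtracting the median MV
    (a simple stand-in for a full global-motion model)."""
    return mv_field - np.median(mv_field, axis=(0, 1), keepdims=True)

rng = np.random.default_rng(0)
heat = heat_score(rng.random((36, 36)), rng.random((36, 36)))
print(heat.shape)  # (36, 36), values in [0, 1]
```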
To fit Vision Transformers (ViTs), each video frame is first resized proportionally so that its longer side is 576 pixels, then center-padded into a standard 576×576 square input.

The frame is then divided into 16×16 patches, yielding 36×36 = 1296 patches in total. Patches that fall inside the padded area (the padding mask) have their scores forced to negative infinity (excluding them from selection), while a valid patch's score is the sum of the pixel energies inside it.
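Pooling pixel energies into patch scores and applying the padding mask is a pair of array operations. In this sketch the valid region (rows 64-512 after center padding a 16:9 frame) is assumed for illustration:

```python
import numpy as np

# Pool a 576x576 pixel-energy map into 36x36 patch scores (16x16 patches),
# then mask padded patches with -inf so they can never be selected.
energy = np.random.default_rng(0).random((576, 576))
patch = energy.reshape(36, 16, 36, 16).sum(axis=(1, 3))  # sum per 16x16 patch

# Assumed letterbox: valid content occupies pixel rows 64..512 after
# center padding, i.e. patch rows 4..31 (64//16 = 4, 512//16 = 32).
valid = np.zeros((36, 36), dtype=bool)
valid[4:32, :] = True
patch = np.where(valid, patch, -np.inf)
print(patch.shape, int(np.isinf(patch).sum()))  # (36, 36), 8 * 36 = 288 masked
```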
This is the most elegant step of the mechanism: across the 64 frames, frame 0 (the I-frame) keeps all 1296 patches to provide full spatial scene context.

The remaining patch-selection budget (e.g., the equivalent of 7 frames, i.e., 7 × 1296 = 9072 patches) is contested in a Global Competition across the remaining 63 frames. Frames with intense motion automatically receive more patches, while static frames receive almost none.
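The global competition is a single top-k over all 63 frames' scores pooled together, which is what lets busy frames outbid static ones. A minimal sketch:

```python
import numpy as np

# Frame 0 keeps all 1296 patches; the remaining budget (7 frames' worth,
# 9072 patches) is a global top-k over frames 1..63 pooled together.
rng = np.random.default_rng(0)
scores = rng.random((64, 1296))
budget = 7 * 1296                    # 9072 patches for frames 1..63

flat = scores[1:].ravel()            # pool 63 frames into one ranking
keep_idx = np.argpartition(flat, -budget)[-budget:]
frame_of = keep_idx // 1296 + 1      # map flat index back to frame number

per_frame = np.bincount(frame_of, minlength=64)
per_frame[0] = 1296                  # the I-frame keeps everything
print(per_frame.sum())               # 1296 + 9072 = 10368 patches total
```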
The selected sparse patches cannot be fed into the model directly. They are densely packed in raster order into several 576×576 mosaic images.

For instance, a budget equivalent to 8 frames is reassembled into 8 complete images. To preserve spatiotemporal information, the system also generates a positions_thw.npy recording each selected patch's original (t, h, w) coordinates, which feed the ViT's 3D Rotary Position Embedding (3D RoPE).
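The packing bookkeeping can be sketched as below. The slot layout and array names other than `positions_thw` are illustrative assumptions:

```python
import numpy as np

# Mosaic packing sketch: selected (t, h, w) patch coordinates are laid out
# in raster order onto 36x36-patch canvases; positions_thw keeps the
# original coordinates so 3D RoPE still sees where each patch came from.
selected = [(t, h, w) for t in range(8) for h in range(36) for w in range(36)]
selected = selected[:10368]            # an 8-frame-equivalent budget

mosaics = len(selected) // (36 * 36)   # 10368 / 1296 = 8 full images
positions_thw = np.array(selected, dtype=np.int32)  # saved alongside mosaics
slots = [(i // 1296, (i % 1296) // 36, (i % 1296) % 36)
         for i in range(len(selected))]  # (mosaic, row, col) of each patch
print(mosaics, positions_thw.shape)    # 8 (10368, 3)
```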
| Concept | Traditional | OneVision-Encoder |
|---|---|---|
| Frame sampling | 4-8 frames | 64 frames |
| Patch selection | All patches per frame | Global TopK (3%-25%) |
| Spatial context | Complete per frame | First frame complete (I-frame) |
| Temporal coverage | Low (sparse frames) | High (dense frames) |
| Core principle | Dense Frames Densely | Dense Frames Sparsely |
| Concept | Purpose | Characteristics |
|---|---|---|
| I-frame | Independent decoding starting point | Largest size, no dependencies |
| P-frame | Records visual changes | Small size, depends on previous frames |
| Motion Vector (MV) | Describes spatial block movement | 2D (x, y) offset |
| Residual | Corrects prediction errors | Pixel-difference matrix, highly compressible after transform and quantization |
Overview: Bitcost is an improved scoring method from the lmms-eval-ov2 evaluation pipeline (see mv_res_score_map() in scoring.py). It leverages the codec's Rate-Distortion Optimization (RDO) bit-allocation data directly to evaluate the importance of image regions.
11.2 RDO Theory

In video compression, the encoder performs Rate-Distortion Optimization when deciding how to partition macroblocks. Its core formula is:

J = D + λ · R
Here J is the Lagrangian cost, D is the distortion, R is the number of bits consumed (the rate), and λ is the Lagrange multiplier. The key insight: the encoder automatically allocates more bits to visually complex, information-rich regions. Therefore, a higher bitcost indicates a region with richer visual information.
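The role of λ is easiest to see in a toy mode decision. The candidate modes and their (D, R) numbers below are illustrative, not taken from any real encoder:

```python
# RDO mode decision sketch: pick the coding mode minimizing J = D + lambda*R.
def rdo_choose(modes, lam):
    """modes: list of (name, distortion, rate_bits). Returns the mode with
    the smallest Lagrangian cost J = D + lam * R."""
    return min(modes, key=lambda m: m[1] + lam * m[2])

candidates = [
    ("skip",      900.0,   2),   # cheap in bits, high distortion
    ("16x16 MV",  220.0,  40),
    ("8x8 split",  90.0, 150),   # accurate but expensive in bits
]
print(rdo_choose(candidates, lam=1.0)[0])   # low lambda favors quality
print(rdo_choose(candidates, lam=10.0)[0])  # high lambda favors fewer bits
```

With lam=1.0 the fine "8x8 split" wins (J = 90 + 150 = 240); with lam=10.0 the coarser "16x16 MV" wins (J = 220 + 400 = 620), showing how λ trades distortion against rate.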
11.3 H.264 Macroblocks vs. H.265 Quad-Tree

H.264 uses fixed 16×16 macroblocks, while H.265 (HEVC) introduces Coding Tree Units (CTUs) that support recursive quad-tree splitting from 64×64 down to 8×8. Click the button below to animate the quad-tree split.
11.4 Bitcost Normalization Pipeline

After the raw bit costs are extracted, they pass through a strict 4-step normalization pipeline.
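A sketch of such a pipeline, assuming the steps named in the comparison table below (log1p → P99 clipping → min-max) plus a final resample from the codec's block grid to the ViT patch grid as an assumed fourth step; the project's exact steps may differ:

```python
import numpy as np

def normalize_bitcost(raw, grid=(36, 36)):
    """Assumed 4-step normalization of raw per-block bit costs:
    1. log1p to compress dynamic range, 2. clip outliers at P99,
    3. min-max to [0, 1], 4. nearest-neighbor resample to the patch grid."""
    x = np.log1p(raw.astype(np.float64))
    x = np.minimum(x, np.percentile(x, 99))
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)
    ih = np.arange(grid[0]) * x.shape[0] // grid[0]
    iw = np.arange(grid[1]) * x.shape[1] // grid[1]
    return x[np.ix_(ih, iw)]

# e.g. an 18x32 grid of 64x64 CTU bit costs resampled to 36x36 patches
raw = np.random.default_rng(0).integers(0, 5000, (18, 32))
heat = normalize_bitcost(raw)
print(heat.shape)  # (36, 36), values in [0, 1]
```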
11.5 Three-Level Grid Hierarchy

To support different encoders, the evaluation pipeline uses a fallback hierarchy: it prefers the high-resolution CTU grid and degrades to the MB or Sub-MB grid when that is unavailable.
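The fallback logic amounts to trying grids in order of preference. The level names and availability flags here are illustrative, not the pipeline's real API:

```python
# Fallback sketch: use the most informative bit-allocation grid the
# bitstream actually provides, in order of preference.
def pick_grid(available):
    """available: dict of {level_name: bool}. Returns the first supported
    level in preference order, or raises if none is present."""
    for level in ("ctu", "mb", "sub_mb"):
        if available.get(level):
            return level
    raise RuntimeError("no bit-allocation grid available")

print(pick_grid({"ctu": False, "mb": True}))  # H.264 stream -> "mb"
print(pick_grid({"ctu": True}))               # H.265 stream -> "ctu"
```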
11.6 Bitcost vs. MV+Residual Comparison
| Dimension | Bitcost | MV+Residual |
|---|---|---|
| Signal source | RDO bit allocation (single signal) | MV magnitude + residual energy (two signals) |
| Normalization | log1p → P99 → MinMax | P95 percentile clipping |
| Spatial resolution | Adaptive (64×64 → 4×4) | Fixed macroblock grid |
| Encoder dependence | H.264/H.265 auto-detect | Codec-agnostic |
| Computational complexity | Lower (single stream) | Higher (dual-stream fusion) |
| Best for | Production evaluation | Research & visualization |
11.7 Interactive Canvas: Bitcost Heatmap

The canvas below generates synthetic bitcost data and renders a heatmap after applying the full 4-step normalization. Toggle the encoder to see the grid granularity change.
"The art of video compression lies in balance: preserving the most valuable visual information within a limited bit budget."
Technical documentation based on the OneVision-Encoder project