Research Note · Video Understanding · Subtitle Evaluation

Subtitle Evaluation in Video Understanding Benchmarks:
From VideoMME to VideoMME-v2

A concrete walkthrough of how VideoMME / VideoMME-v2 and mainstream evaluation stacks parse, align, and inject subtitles into prompts—and why that changes what video understanding scores actually measure.

Apr 8, 2026 Benchmark Analysis Subtitle Pipeline ~12 min

1. Introduction — Why Subtitle Evaluation Matters

In the rapid iteration of Multimodal Large Language Models (MLLMs) such as GPT-4o, Gemini 1.5 Pro, Qwen-VL, and InternVL, video understanding is evolving from merely "watching" (visual frame processing) to deeper levels of "listening" and "reading". Video is inherently a highly multimodal medium, and subtitles, as the direct carrier of textual information, not only contain exact dialogues but also serve as a crucial dimension for connecting plots, understanding human intentions, and performing multi-hop logical reasoning.

Different model families exhibit vast differences in handling subtitles. For instance, GPT-4o demonstrates strong native audio-visual fusion, while most open-source models currently rely on externally extracted text subtitles concatenated via Prompts. This divergence has directly birthed an urgent need for subtitle evaluation benchmarks.

Article Roadmap:
1. Review how mainstream benchmarks design subtitle tests (VideoMME, MMOU, etc.).
2. Deep dive into the progressive complex-reasoning mechanism of the next-gen VideoMME-v2.
3. Analyze at the code level (lmms_eval, VLMEvalKit) how frameworks inject subtitles into vision models.
4. Discuss the dataset-level subtitle-processing paradigms of cutting-edge models like Qwen3-VL.

2. Overview of Benchmark Subtitle Support

Different benchmarks emphasize varying aspects of video understanding, which is directly reflected in their subtitle support. From early pure-visual evaluations to the inclusion of SRT texts, and now word-level timestamps and native audio understanding, the difficulty of benchmarks is constantly escalating.

| Benchmark | Release Date | Subtitle Support & Format | Eval Mode & Features | Paper |
| --- | --- | --- | --- | --- |
| VideoMME | 2024.05 | Explicit (SRT file) | Dual-track testing: w/ subs vs w/o subs; first systematic subtitle evaluation | 2405.21075 |
| VideoMME-v2 | 2026.04 | Word-level JSONL | 8 options, non-linear scoring, Concatenated vs Interleaved injection | 2604.05015 |
| MMOU | 2026.03 | Implicit (raw audio stream) | No text files; forces end-to-end omni speech-vision alignment | 2603.14145 |
| MLVU | 2024.06 | No subtitles (pure visual) | Focuses on pure visual plot understanding in extremely long videos (3 min-2 hr) | 2406.04306 |

Trend Analysis: As seen in the table, the evaluation paradigm is shifting from treating subtitles as a "textual appendage" to treating them as a "native modality". Early benchmarks either ignored subtitles (MLVU) or simply used SRT as a text supplement (VideoMME). New generation benchmarks pursue extreme temporal precision (word-level timestamps in VideoMME-v2) or completely abandon text for true multimodal end-to-end audio understanding (MMOU).

3. VideoMME — First Systematic Evaluation

A flagship benchmark for video understanding, VideoMME pioneered the use of subtitles as an independent control variable. It comprises a dataset of 900 videos covering 6 major domains (Knowledge, Film, Sports, Artistic, Life, Science).

In terms of duration, the dataset is meticulously divided into three tiers: short (< 2min), medium (4-15min), and long (30-60min) videos. To ensure evaluation quality, VideoMME provides 744 standard SRT formatted subtitle files. Notably, for long videos, considering the hallucinations and drift of auto-generated subtitles (ASR), the team almost exclusively utilized high-quality human-proofread subtitles.

900 Total Videos · 744 SRT Subtitle Files · 6 Domain Categories

VideoMME employs a strict parallel testing mode. When subtitles are enabled, the Prompt template used is extremely direct:

VideoMME Subtitle Prompt (Text)
# The subtitles are injected directly before the question
"This video's subtitles are listed below:\n{subtitles}\n\nQuestion: {question}\nOptions: {options}"

Key Finding: Research results clearly show that for dialogue-dense videos (e.g., interviews, movie clips), introducing high-quality human subtitles boosts the accuracy of mainstream open-source MLLMs by 12% - 15% on average. However, when using low-quality auto-generated subtitles, model performance can actually degrade. This reveals a strong "text dependency bias" in LLMs—when visual signals conflict with text signals, models tend to blindly trust the text.

4. VideoMME-v2 — Next-Gen Eval (2604.05015)

In April 2026, the VideoMME team launched the massive upgrade: VideoMME-v2. This version completely restructures the evaluation system, containing 800 curated videos and 800 precise subtitle files, formulating 3,200 meticulously annotated QA pairs at the cost of a staggering 3,300 human-hours. It introduces three revolutionary innovations:

A. Progressive Multi-Level System

VideoMME-v2 abandons flat questioning, dissecting video understanding into three strictly progressive cognitive levels. Only when a model succeeds at the base level can it prove high-level reasoning, heavily penalizing "guessing":

  • L1: Information Aggregation - Tests basic recognition and extraction of facts from subtitles and visual entities.
  • L2: Temporal Understanding - Examines chronological order, causality, and temporal alignment between spoken lines and specific frames.
  • L3: Temporal Complex Reasoning - Requires multi-hop logical inference across multiple segments of a long video, combining context.

B. Grouped Non-Linear Scoring

To further eliminate inflated scores from random guessing, v2 expands the number of multiple-choice options from 4 to 8 (A-H). More impressively, it introduces a unique scoring formula:

$$ \text{Score}_{\text{group}} = \left(\frac{N_{\text{correct}}}{N_{\text{total}}}\right)^2 $$

Under this mechanism, multiple related questions for a single video are bundled into a "Group". The system enforces First-error truncation: if a model fails an L1 question, subsequent L2/L3 questions in that group are marked wrong regardless of the guess. This exponential penalty curve ($x^2$) forces models to exhibit high internal consistency.
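
A minimal sketch of this scoring rule as described above; the function name and input shape are my own, not from the VideoMME-v2 codebase:

```python
def group_score(answers):
    """Grouped non-linear score with first-error truncation.

    `answers` lists per-question correctness for one group,
    ordered L1 -> L2 -> L3.  Everything after the first error
    is discarded, then the correct ratio is squared.
    """
    n_total = len(answers)
    n_correct = 0
    for ok in answers:
        if not ok:
            break  # first-error truncation: later answers don't count
        n_correct += 1
    return (n_correct / n_total) ** 2
```

With three questions per group, two hits followed by a miss score (2/3)² ≈ 0.44, while a miss on the L1 question zeroes the group outright, no matter what was guessed afterwards.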

C. Word-level Timestamps & Dual Injection Modes

Format-wise, v2 abandons coarse SRT for Word-level timestamped JSONL. In evaluation frameworks (e.g., VLMEvalKit), it provides various Config variants (like Video-MME-v2_64frame_subs) and defines two distinct modes for fusing subtitles with video frames:

[Figure: in the Concatenated mode, all sampled frames are fed to the LLM followed by the full subtitle text ("Hello world, today we..."); in the Interleaved mode, each frame is paired with the words spoken during it ("Hello" with Frame 1, "world, today" with Frame 2).]
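
The two injection modes can be sketched as follows, assuming frames are placeholder strings and the word-level JSONL yields entries with hypothetical "text"/"start" fields:

```python
def build_concatenated(frames, words):
    """Concatenated: all frames first, then the full transcript."""
    transcript = " ".join(w["text"] for w in words)
    return frames + [f"Subtitles: {transcript}"]

def build_interleaved(frames, words, frame_windows):
    """Interleaved: each frame is followed by the words whose
    start timestamp falls inside that frame's time window."""
    out = []
    for frame, (t0, t1) in zip(frames, frame_windows):
        out.append(frame)
        spoken = [w["text"] for w in words if t0 <= w["start"] < t1]
        if spoken:
            out.append(" ".join(spoken))
    return out
```

The interleaved variant is what the word-level timestamps buy: with only coarse SRT cues, a subtitle line can span many frames and the per-frame pairing above would be ambiguous.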

5. MMOU — The Implicit Omni-Modal Path (2603.14145)

If VideoMME strives to perfect text subtitles, then MMOU, released by NVIDIA, represents an entirely different path toward AGI: abandoning text to embrace raw audio.

MMOU encompasses 9,038 videos and 15,000 questions across 6 capability dimensions. Its defining feature is the refusal to provide any subtitle files. The model must process video frames and the accompanying Raw Audio stream directly, like a human. This means the model must internally perform simultaneous Speech Recognition (ASR) and multimodal joint semantic reasoning.

This implicit understanding paradigm introduces novel question structures. Common test formats include:
"When the speaker says [specific audio segment], what object is shown on the right side of the screen?"
Here, the audio signal becomes the "temporal anchor" for the visual frame sequence. This tests not just whether the model "heard it", but whether it can align the audio stream with the pixel stream at millisecond precision.
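
To make the contrast concrete, here is a hypothetical MMOU-style item; the field names are illustrative, not MMOU's actual schema:

```python
# Hypothetical MMOU-style item: the question is anchored to a raw-audio span,
# so the model must first recognize the speech, then align it with the frames.
item = {
    "video": "clip_0042.mp4",
    "audio_anchor": {"start": 12.4, "end": 14.1},  # seconds into the audio track
    "question": ("When the speaker says the anchored phrase, "
                 "what object is shown on the right side of the screen?"),
    "options": ["A. A whiteboard", "B. A robot arm",
                "C. A monitor", "D. A plant"],
}
# Note there is no "subtitle" field anywhere: the text exists only as sound.
```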

6. Tech Decoded: Subtitle Pipeline in lmms_eval

At the engineering level, how do evaluation frameworks couple isolated subtitle files with video frames? In the widely used lmms_eval suite, this is accomplished via precise timestamp calculation and Prompt template injection.

By comparing its YAML configurations (videomme.yaml vs videomme_w_subtitle.yaml), we can clearly see the subtitle stream switch. When enabled, the system uses a hand-written regex parser to read the SRT and applies np.linspace for frame alignment based on the model's frame_num parameter.

lmms_eval/utils.py — Subtitle Frame Alignment (Python)
# Generate uniform frame indices based on model config (e.g. frame_num=768 for Qwen3-VL)
uniform_sampled_frames = np.linspace(0, total_frame - 1, frame_num, dtype=int).tolist()
subtitle_by_frame_idx = []

for frame_idx in uniform_sampled_frames:
    for idx, title in enumerate(subtitle_by_frame):
        # Check if the uniformly sampled frame falls within the subtitle's timestamp window
        if frame_idx < title[1] and frame_idx >= title[0]:
            subtitle_by_frame_idx.append(idx)

# Subtitle text is then actually inserted into the prompt string
prompt_template = "This video's subtitles are listed below:\n{subs}\nQuestion..."

Live Demo: Data Pipeline from SRT to Prompt

Step 1 · Parse the raw SRT file

1
00:00:01,000 --> 00:00:03,500
Hello everyone, welcome.

2
00:00:04,000 --> 00:00:07,200
Today we discuss video understanding.

3
00:00:08,500 --> 00:00:12,000
Let's look at how models process subtitles.

Step 2 · Separate timestamps from text (parsing & extraction)

{(1.0, 3.5): "Hello everyone, welcome.",
 (4.0, 7.2): "Today we discuss video understanding.",
 (8.5, 12.0): "Let's look at how models process subtitles."}

Step 3 · np.linspace frame sampling & alignment

Sampled frames: Frame 0, Frame 1, ..., Frame 47, ..., Frame 98, ..., Frame N
t=1.2s → window [1.0, 3.5] ✓
t=2.3s → window [1.0, 3.5] ✓
t=9.1s → window [8.5, 12.0] ✓

Step 4 · Prompt construction

<frame_0>, <frame_1>, ... <frame_N>
"This video's subtitles are listed below:"
- Hello everyone, welcome.
- Today we discuss video understanding.
- Let's look at how models process subtitles.
"Question: What is discussed in this video?"
"A. Cooking B. Video understanding C. Sports"
"Answer with the option letter only."

Step 5 · Into the model (Qwen3-VL / GPT-4o)

Output: B
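
The steps above condense into a stdlib-only sketch (helper names are mine; the real pipeline samples frames with np.linspace rather than using fixed timestamps):

```python
import re

SRT = """1
00:00:01,000 --> 00:00:03,500
Hello everyone, welcome.

2
00:00:04,000 --> 00:00:07,200
Today we discuss video understanding.
"""

def parse_srt(content):
    """Steps 1-2: split SRT blocks, convert timestamps to seconds."""
    pattern = (r'\d+\n(\d{2}:\d{2}:\d{2},\d{3}) --> '
               r'(\d{2}:\d{2}:\d{2},\d{3})\n(.+?)(?=\n\n|\Z)')
    def sec(ts):
        h, m, s_ms = ts.split(':')
        s, ms = s_ms.split(',')
        return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000
    return {(sec(a), sec(b)): t.strip()
            for a, b, t in re.findall(pattern, content, re.DOTALL)}

def matched_subtitles(subs, sample_times):
    """Step 3: keep subtitles whose window covers a sampled timestamp."""
    return [text for (start, end), text in sorted(subs.items())
            if any(start <= t < end for t in sample_times)]

def build_prompt(texts, question, options):
    """Step 4: assemble the VideoMME-style prompt string."""
    listed = "\n".join(f"- {t}" for t in texts)
    return (f"This video's subtitles are listed below:\n{listed}\n\n"
            f"Question: {question}\nOptions: {options}")

subs = parse_srt(SRT)
texts = matched_subtitles(subs, sample_times=[1.2, 5.0])
prompt = build_prompt(texts, "What is discussed in this video?",
                      "A. Cooking B. Video understanding C. Sports")
```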

7. VLMEvalKit vs lmms_eval: Deep Architecture Comparison

Unlike the manual parsing in lmms_eval, the VLMEvalKit led by the OpenCompass team adopts a fundamentally different engineering approach. The two frameworks diverge at every layer — from subtitle parsing to timeline alignment, configuration management, and error handling. Let's dissect them at code level.

A. Full Technical Comparison

| Pipeline Stage | lmms_eval | VLMEvalKit |
| --- | --- | --- |
| File parser | Manual regex & string split | Robust pysubs2 library |
| Timeline alignment | np.linspace(0, total_frames, n) | pysubs2.make_time(fps, frames) (absolute time) |
| Config layer | Standalone YAML files (videomme_w_subtitle) | Python partial() dynamic binding |
| Format support | SRT only | SRT / ASS / SSA / VTT (pysubs2 native) |
| HTML tag handling | ❌ None; dirty tags leak into text | ✅ Auto-strips <font>, <i>, etc. |
| Encoding tolerance | UTF-8 hardcoded; BOM may break | Auto-detect (chardet) |
| Time precision | Frame-index relative (int) | Absolute milliseconds (ms) |
| VideoMME-v2 support | Requires JSONL parser extension | Native config variants (Video-MME-v2_*_subs) |

B. 字幕解析器:手工正则 vs pysubs2

这是两个框架最根本的分歧。lmms_eval 使用手写正则逐行切割 SRT 文件,手动将时间戳字符串 HH:MM:SS,mmm 转换为秒数;而 VLMEvalKit 直接调用 pysubs2 的高级 API,一行完成全部解析。

B. Subtitle Parser: Hand-written Regex vs pysubs2

This is the most fundamental divergence. lmms_eval splits SRT files line-by-line with hand-crafted regex, manually converting HH:MM:SS,mmm timestamp strings into seconds; VLMEvalKit calls the pysubs2 high-level API to parse everything in one line.

lmms_eval — Hand-written SRT Parser (Python)
# load_subtitles() in lmms_eval/utils.py (simplified)
import re

def load_subtitles(srt_path):
    with open(srt_path, 'r') as f:
        content = f.read()

    # Hand-crafted regex: block index, start/end timestamps, text
    pattern = (r'(\d+)\n(\d{2}:\d{2}:\d{2},\d{3}) --> '
               r'(\d{2}:\d{2}:\d{2},\d{3})\n(.+?)(?=\n\n|\Z)')
    blocks = re.findall(pattern, content, re.DOTALL)

    # Manual timestamp-to-seconds conversion
    def to_seconds(ts):
        h, m, rest = ts.split(':')
        s, ms = rest.split(',')
        return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

    return {(to_seconds(start), to_seconds(end)): text.strip()
            for _, start, end, text in blocks}
VLMEvalKit — pysubs2 One-liner (Python)
# video_dataset_config.py
import pysubs2

# Entire parsing: one line
subs = pysubs2.load(srt_path)

# Handles SRT, ASS, SSA, VTT automatically
# Strips HTML styling tags on load
# Auto-detects file encoding

# Time alignment via absolute milliseconds
frame_time = pysubs2.make_time(fps=fps, frames=frame_idx)

# Match subtitle to frame; .plaintext is the tag-free text
matched = [s.plaintext for s in subs
           if s.start <= frame_time < s.end]

C. Configuration Philosophy

The two frameworks also differ in how they toggle subtitle mode. lmms_eval uses declarative YAML: creating standalone YAML files for each eval variant; VLMEvalKit uses programmatic Python: dynamically binding parameters via functools.partial().

lmms_eval — YAML Config (YAML)
# videomme_w_subtitle.yaml
include: videomme.yaml

metadata:
  load_subtitle: true
  subtitle_dir: /data/videomme/srt/

# Separate file per variant
# videomme.yaml → no subs
# videomme_w_subtitle.yaml → subs
VLMEvalKit — Python partial() (Python)
# Programmatic config
from functools import partial

DATASETS = {
  'Video-MME': partial(
    VideoMME,
    use_subtitle=False,
    nframe=128
  ),
  'Video-MME_sub': partial(
    VideoMME,
    use_subtitle=True,
    nframe=128
  ),
}

D. Full Data Flow Comparison

The walkthrough below traces the same SRT file through both frameworks: first lmms_eval's manual pipeline, then VLMEvalKit's pysubs2 pipeline:

lmms_eval pipeline:

Step 1 · Input
1
00:00:01,000 --> 00:00:03,500
Hello everyone, welcome.

Step 2 · Regex parse: re.findall(r'(\d+)\n...') → manual string split
Step 3 · Time convert: int(h)*3600 + int(m)*60 + int(s) + int(ms)/1000 → {(1.0, 3.5): "Hello..."}
Step 4 · Frame align: np.linspace(0, N, frame_num) → loop & window match
Step 5 · Prompt concat: "This video's subtitles..." + matched subtitle text
⚠️ Regex-dependent

VLMEvalKit pipeline:

Step 1 · Input (dirty)
1
00:00:01,000 --> 00:00:03,500
<font>Hello everyone</font>

Step 2 · pysubs2.load(): subs = pysubs2.load(path) → auto format detect, HTML tags stripped ✓
Step 3 · make_time(): pysubs2.make_time(fps=30, frames=idx) → absolute ms timestamps
Step 4 · Match & extract: [s.text for s in subs if s.start <= t < s.end]
Step 5 · Prompt fill: "Subtitles: ..." + clean subtitle text
✅ Production-grade

E. Dirty Data Handling in Practice

Real-world SRT files frequently contain HTML styling tags. Here's how the same dirty subtitle is handled differently:

Input SRT line:
<font color="#ffffff"><i>Hello everyone, welcome to the show.</i></font>

lmms_eval output:
<font color="#ffffff"><i>Hello everyone, welcome to the show.</i></font>
❌ HTML tags leak into the prompt and may confuse the model

VLMEvalKit output:
Hello everyone, welcome to the show.
✅ Clean text: pysubs2 strips all tags
Key Takeaways:
• lmms_eval excels at transparency: all logic is visible, ideal for quick customization and academic research. But its hand-crafted regex breaks on dirty data, non-SRT formats, and BOM encodings.
• VLMEvalKit excels at engineering robustness: pysubs2 natively supports multiple formats and encodings and strips HTML tags. The partial() config pattern is also more Pythonic for large-scale evaluation.
• The two are complementary: lmms_eval's declarative YAML favors reproducible research; VLMEvalKit's programmatic config favors production deployment.
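
For pipelines stuck on the lmms_eval path, a minimal stdlib cleaner can be bolted on before prompt concatenation. This is a workaround of my own, not part of either framework, and cruder than pysubs2:

```python
import re

def strip_markup(text):
    """Drop HTML-style tags and SSA override blocks from a subtitle line."""
    text = re.sub(r'<[^>]+>', '', text)       # <font ...>, <i>, </i>, ...
    text = re.sub(r'\{\\[^}]*\}', '', text)   # {\i1}, {\an8}, ...
    return text.strip()

dirty = '<font color="#ffffff"><i>Hello everyone, welcome to the show.</i></font>'
clean = strip_markup(dirty)  # "Hello everyone, welcome to the show."
```

It will not fix encoding issues or non-SRT formats, but it closes the tag-leak gap shown above.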

8. Deep Case: The Minimalist Philosophy of Qwen3-VL

When examining today's strongest open-source audio-visual model, Qwen3-VL, a surprising fact emerges: at the network architecture level, Qwen3-VL includes no special module for subtitles at all.

Why are no architectural changes needed? Because Qwen3-VL's long-context tokenizer natively handles interleaved text and image tokens, subtitle processing can be abstracted entirely into a dataset-construction problem.

During evaluation, Qwen3-VL pushes the sampled frame count to the extreme: frame_num=768 (conventional models typically sample 8 to 32 frames, and VLMEvalKit defaults to a conservative 128). The dense frame sequence and thousands of words of subtitle text are simply concatenated and fed to the model. This minimalist brute-force approach suggests that a sufficiently strong base model needs no bespoke subtitle plumbing.

Qwen3-VL Final Injected Prompt Structure (Text)
<|vision_start|><|frame_0|><|vision_end|>
<|vision_start|><|frame_1|><|vision_end|>
... (up to 768 frames) ...

This video's subtitles are listed below:
[00:00:01] Hello everyone, welcome to the test.
[00:00:05] Today we are exploring MLLMs.

Question: What is the speaker discussing?
Options:
A. Baking a cake
B. MLLMs
C. Sports
Answer with the option letter only.

9. Allen AI's Molmo2: Integrating Subtitles at Training Time

Unlike approaches that merely concatenate subtitles during evaluation, Allen AI's Molmo2 adopts a completely new paradigm: treating subtitle processing as a core training concern, rather than just an engineering mechanism for evaluation. By introducing specific subtitle QA datasets during the training phase, Molmo2 natively understands the correlation between subtitles and video frames, enabling the model to have "seen" and "read" subtitles during its learning process.

Molmo2-4B

The smallest and most efficient variant. Based on Qwen3-4B-Instruct and SigLIP2 vision encoder, ideal for resource-constrained scenarios.

Base: Qwen3-4B
Vision: SigLIP2

Molmo2-8B

The flagship model striking the best balance between size and performance. Based on Qwen3-8B and SigLIP2, outperforming peers.

Base: Qwen3-8B
Vision: SigLIP2

Molmo2-O-7B

A fully open variant under the Apache 2.0 license, built entirely on the open-source OLMo3-7B-Instruct.

Base: OLMo3-7B
Vision: SigLIP2
Pre-Training

Vision-language alignment and basic understanding using massive image-text pairs.

SFT

Supervised fine-tuning. Covers instruction following, general image-text QA, and VQA.

Long-Context SFT

Long-context fine-tuning. Extends seq_len to 36,864 tokens, supporting up to 384 frames for long video understanding and subtitle QA.
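
A back-of-envelope check on that budget (my arithmetic; it ignores subtitle, question, and special tokens):

```python
seq_len = 36_864   # Long-Context SFT sequence length
max_frames = 384   # maximum sampled frames

per_frame = seq_len // max_frames
# At the frame cap this leaves at most 96 tokens per frame,
# before any subtitle text or question tokens are accounted for.
```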

Key breakthrough: In Stage 3, Molmo2 incorporates subtitle QA training data (Molmo2SynCaptionsSubtitleQA, CinepileHf(with_subtitle=True), TVQA(with_subtitle=True)), enabling the model to learn subtitle-video correlation during training.

Unlike lmms_eval, Molmo2 uses a custom subtitles.json format (downloaded from HuggingFace repo allenai/Molmo2-VideoMME), attaching subtitles as {(start, end): text} dictionaries to each example, and distinguishing subtitle/no-subtitle modes via the style field.

Molmo2 VideoMME Dataset Processing (Python)
# Excerpt from Molmo2's dataset code; df holds the VideoMME metadata table
class VideoMME(DatasetBase):
    def __init__(self, split, duration="all",
                 difficulty="all", with_subtitle=False):
        self.with_subtitle = with_subtitle

    def load(self):
        # Custom subtitles.json downloaded from HuggingFace
        subtitles = json.loads(read_file(
            os.path.join(self.data_path, "subtitles.json")))

        for idx, row in df.iterrows():
            example = {
                "question": question,
                "video": video_path,
                "style": "video_eval_multiple_choice"
            }
            if self.with_subtitle and row["videoID"] in subtitles:
                # Attach {(start, end): text} windows and switch the style tag
                subtitle = {}
                for i, entry in subtitles[row["videoID"]].items():
                    subtitle[(entry["start"], entry["end"])] = entry["text"]
                example["subtitle"] = subtitle
                example["style"] = "video_eval_multiple_choice_w_subtitle"

Molmo2 explicitly divides evaluation tasks into "with subtitle" and "no subtitle" groups, supporting selective execution via --tasks=video_subtitle or --tasks=video_no_subtitle CLI arguments, achieving clean isolation of subtitle testing.

Task Groups Configuration (YAML)
LONG_VIDEO_SUBTITLES:
  - video_mme_w_subtitle
  - long_video_bench_w_subtitle
  - lvbench

LONG_VIDEO_NO_SUBTITLES:
  - mlvu_mc
  - video_mme
  - long_video_bench_no_subtitle
  - vixmo_caps_eval2:test
  - video_eval_pro_mc:test
| Dimension | Qwen3-VL | Molmo2 |
| --- | --- | --- |
| Subtitles in training | ❌ Injected only at eval | ✅ Subtitle QA in training data |
| Subtitle format | SRT (via lmms_eval) | Custom subtitles.json |
| Max frames | 768 | 384 |
| Long context | Native support | Context parallel (sharded across GPUs) |

10. Conclusion: Crossroads on the Path to AGI

Looking back at the evolution from VideoMME to VideoMME-v2, and on to MMOU, we clearly see that the exploration of audio and text in video evaluation has branched into three mainstream paradigms:

  1. Explicit Text Stream: Like VideoMME, utilizing externally acquired SRT/JSONL as a solid prompt backbone. Ideal for scenarios requiring extreme precision in knowledge extraction.
  2. Pure Visual Stream: Like MLVU, deliberately stripping away audio and subtitles to force models to enhance their insight into pure visual plots and subtle actions in long videos.
  3. Implicit Omni Stream: Like MMOU, representing the ultimate future form. Models must confront noisy raw audio streams, extracting truth from the chaos of audio-visual fusion.

Future research leaves an exciting open question: as model OCR capability improves, will future MLLMs simply "read" burned-in subtitles from the frames themselves, making external subtitle files obsolete? We will be watching.
