Speech / ASR / Tooling

FunASR 调研:从原理、模型家族到部署实践

FunASR Survey: Principles, Model Family, and Deployment

如果只看一句话:FunASR 不是单一 ASR 模型,而是一套把 VAD、ASR、标点恢复、时间戳、说话人能力和运行时部署串起来的语音工具箱。 它试图把学术模型和工业落地之间的那一段“最后一公里”补齐。

In one sentence: FunASR is not a single ASR model, but a speech toolkit that connects VAD, ASR, Punctuation Restoration, Timestamps, Speaker Diarization, and Runtime Deployment. It bridges the "last mile" between academic models and industrial deployment.

Apr 8, 2026 FunASR ASR Paraformer SenseVoice ~14 min
  • 31:Fun-ASR-Nano 语言覆盖数 / Languages supported by Fun-ASR-Nano
  • 220M:Paraformer 主力模型量级 / Paraformer main model size
  • 0.4M:fsmn-vad 量级(前置模块)/ fsmn-vad size (frontend module)
  • RTF 0.0076:官方 GPU 长音频单线程指标 / Official GPU single-thread RTF on long audio
先讲结论 TL;DR

1. FunASR 是什么:它解决的不是“识别”,而是“整条链路”

1. What is FunASR: Solving the "Pipeline", Not Just "Recognition"

官方 README 对 FunASR 的定位很直接:它希望在语音识别的学术研究和工业应用之间搭桥。 所以你会发现它不是只给你一个 checkpoint,然后让你自己拼剩下的一切;相反,它把工程里常见的一整条链路都显式放到了工具层里。

The official README describes FunASR explicitly as a bridge between academic speech research and industrial application. Instead of giving you a raw checkpoint and leaving you to build the rest, it explicitly integrates the entire common engineering pipeline into the toolkit layer.

官方原话的重点其实只有两个:“工业级模型”和“研究到生产的桥梁”。 这意味着 FunASR 评价自己的方式,不只是 paper 指标,而是“这个能力能不能真正被接到业务里”。

The official description focuses on two keywords: Industrial-grade Models and Bridge from Research to Production. This means FunASR evaluates itself not just by paper metrics, but by whether the capabilities can be successfully integrated into real business logic.

它覆盖哪些能力

What It Covers

  • ASR(自动语音识别)
  • VAD(语音端点检测)
  • 标点恢复 / ITN
  • 时间戳预测
  • 说话人确认 / 分离
  • 关键词唤醒、多说话人 ASR、情感识别等
  • ASR (Automatic Speech Recognition)
  • VAD (Voice Activity Detection)
  • Punctuation Restoration / ITN
  • Timestamp Prediction
  • Speaker Verification / Diarization
  • Keyword Spotting, Multi-talker ASR, Emotion Recognition, etc.

它和“单模型 demo”最大的差别

Difference from "Single Model Demos"

  • 统一的 AutoModel 接口
  • 离线与流式能力同时存在
  • 支持 ModelScope / Hugging Face / OpenAI 等模型来源
  • 提供 runtime SDK、服务部署、ONNX 导出
  • Unified AutoModel API
  • Both offline and streaming capabilities
  • Supports hubs like ModelScope / Hugging Face / OpenAI
  • Provides runtime SDKs, service deployment, and ONNX exports
一个很重要的认知 An Important Insight

很多开源 ASR 项目更像“训练框架”或者“论文复现集合”;FunASR 更像是一个偏生产可用的语音能力拼装层。这也解释了为什么它会提供 Windows SDK 和高并发部署说明。

Many open-source ASR projects act as "training frameworks" or "paper reproduction collections"; FunASR is more of a production-oriented assembly layer for speech capabilities. This also explains why it ships Windows SDKs and high-concurrency deployment guides.

2. 核心原理:统一接口背后,其实是一条可组合的语音流水线

2. Core Principles: A Composable Speech Pipeline Behind a Unified API

官方示例里最常见的写法是 from funasr import AutoModel。这背后的设计哲学很清楚: 把“模型选择”和“链路编排”收敛到一个入口,让用户可以按需把 VAD、ASR、标点、说话人、时间戳拼起来。

The most common idiom in the official examples is from funasr import AutoModel. The design philosophy is clear: Converge "model selection" and "pipeline orchestration" into a single entry point, allowing users to combine VAD, ASR, Punctuation, Diarization, and Timestamps on demand.

流水线各阶段如下:The pipeline stages are:

  • 输入音频:单文件、列表、流式 chunk / Input audio: single file, list, or stream chunk
  • VAD:切分长音频,降低负担 / VAD: segment long audio to reduce load
  • ASR:离线或流式解码 / ASR: offline or streaming decode
  • 后处理:标点、ITN、热词 / Post-processing: punctuation, ITN, hotwords
  • 补充能力:时间戳、说话人等 / Add-ons: timestamps, speakers, etc.
  • 服务输出:脚本、ONNX、SDK / Output: script, ONNX, SDK

2.1 长音频为什么先过 VAD

2.1 Why Long Audio Goes Through VAD First

启用 vad_model 后,VAD 会先把长音频切成更短的片段,再送给 ASR。这有两个直接收益:规避显存爆炸和高延迟,以及让离线模型能处理任意长的播客等音频。

When vad_model is enabled, VAD first cuts long audio into shorter segments before feeding them to the ASR. This yields two direct benefits: avoiding OOM and latency spikes, and letting offline models handle arbitrarily long audio such as podcasts.
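为了说明 VAD 在链路中的“角色”,下面是一个纯能量阈值的切分示意。注意:这只是概念演示,fsmn-vad 本身是神经网络模型,并非这种简单阈值法。 To illustrate the *role* VAD plays, here is an energy-threshold sketch. This is conceptual only; fsmn-vad itself is a neural model, not this naive heuristic.

```python
# Illustrative only: fsmn-vad is a neural model; this energy-threshold
# sketch just shows what VAD contributes -- turning one long signal
# into short segments that fit an ASR model's memory/latency budget.

def naive_vad_segments(samples, frame_len=160, threshold=0.01, max_gap=3):
    """Return (start, end) sample ranges of speech-like regions."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    active = [sum(x * x for x in f) / max(len(f), 1) > threshold for f in frames]
    segments, start, gap = [], None, 0
    for i, is_speech in enumerate(active):
        if is_speech:
            start = i if start is None else start
            gap = 0
        elif start is not None:
            gap += 1
            if gap > max_gap:  # close the segment after a long enough pause
                segments.append((start * frame_len, (i - gap + 1) * frame_len))
                start, gap = None, 0
    if start is not None:
        segments.append((start * frame_len, len(samples)))
    return segments

# Silence, a burst of "speech", silence, another burst:
audio = [0.0] * 800 + [0.5] * 800 + [0.0] * 800 + [0.5] * 800
print(naive_vad_segments(audio))  # → [(800, 1600), (2400, 3200)]
```

真实链路里,这些 (start, end) 区间会逐段送入 Paraformer 做批量解码。 In the real pipeline, these (start, end) ranges are fed segment by segment into Paraformer for batched decoding.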

2.2 流式识别的关键:Chunk 与 Cache

2.2 The Key to Streaming: Chunk and Cache

在 streaming 示例里,FunASR 使用 chunk_size、cache 等参数组织推理。它并不是简单切片,而是显式建模流式上下文:chunk_size 决定出字粒度和未来视野,而 cache 保留历史状态。

In streaming examples, FunASR uses parameters like chunk_size and cache. It doesn't just slice audio; it explicitly models streaming context: chunk_size dictates emission granularity and look-ahead, while cache preserves historical state.
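官方流式示例里常见的 chunk_size=[0, 10, 5] 可以换算成具体的延迟数字。下面的算式基于对官方示例的通常解读(每个单位约等于 16 kHz 下的一个 60 ms 帧,即 960 采样点),具体语义请以你所用版本的文档为准。 The common chunk_size=[0, 10, 5] from official streaming examples translates into concrete latency numbers. The arithmetic below follows the usual reading of the official examples (each unit is roughly one 60 ms frame at 16 kHz, i.e., 960 samples); verify against the docs of your installed version.

```python
# chunk_size = [0, 10, 5] as seen in the official streaming example:
# the middle number is the emission window, the last is encoder lookahead,
# each unit commonly read as one 60 ms frame (960 samples at 16 kHz).
FRAME_MS = 60
SAMPLE_RATE = 16_000

chunk_size = [0, 10, 5]
window_ms = chunk_size[1] * FRAME_MS     # how often partial text is emitted
lookahead_ms = chunk_size[2] * FRAME_MS  # future context the encoder may see
# samples to feed per generate() call:
chunk_stride = chunk_size[1] * SAMPLE_RATE * FRAME_MS // 1000

print(window_ms, lookahead_ms, chunk_stride)  # → 600 300 9600
```

也就是说,出字间隔约 600ms,额外付出约 300ms 的“向右看”延迟,以换取更稳的边界判断。 In other words, text is emitted roughly every 600 ms, plus about 300 ms of lookahead latency traded for more stable boundary decisions.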

3. 模型家族:分工明确的组件矩阵

3. Model Family: A Matrix of Specialized Components

FunASR 把不同任务拆成了多个可搭配的模型,你可以按场景裁剪能力和成本。

FunASR splits tasks into multiple composable models, letting you tailor capabilities and costs to your scenario.

Paraformer (Non-Autoregressive)

核心非自回归端到端识别模型。高精度、高效率,是中文离线/流式的主力。 Core Non-Autoregressive E2E recognition model. High accuracy and efficiency, the main workhorse for Chinese offline/streaming.


SenseVoice

偏“语音理解基础模型”。包含多语种识别 (LID)、情绪识别 (SER)、声学事件 (AED)。

Leans towards "Speech Understanding Foundation Model". Includes LID, Emotion Recognition (SER), and Acoustic Events (AED).

Fun-ASR-Nano

端到端大模型,支持 31 种语言和低延迟实时转写。

Large E2E model supporting 31 languages and low-latency real-time transcription.

周边模块

Peripheral Modules

fsmn-vad(端点), ct-punc(标点), fa-zh(时间戳), cam++(说话人)。

fsmn-vad (VAD), ct-punc (Punctuation), fa-zh (Timestamps), cam++ (Diarization).

4. 安装与上手

4. Installation & Getting Started

pip3 install -U funasr modelscope huggingface_hub

4.1 一行代码构建完整流水线

4.1 Complete Pipeline in One Line

from funasr import AutoModel

# paraformer-zh 负责识别,fsmn-vad 负责切分,ct-punc 负责标点
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")
# batch_size_s 按“音频总秒数”做动态 batch;hotword 注入热词偏置
res = model.generate(input="audio.wav", batch_size_s=300, hotword="魔搭")

4.2 SenseVoice 语音理解示例

4.2 SenseVoice Speech Understanding Example

# language="auto" 触发语种识别(LID);use_itn=True 输出数字等的书面形式
model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad")
res = model.generate(input="audio.mp3", language="auto", use_itn=True)
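SenseVoice 的原始输出会把语种、情绪、事件等控制标记以 <|...|> 的形式内联在文本前。官方示例通常用库自带的后处理函数清洗;下面是一个自包含的示意,演示这些标记的结构(示例字符串和标记集合均为假设,实际以所用模型版本为准)。 SenseVoice's raw output inlines language, emotion, and event control tokens as <|...|> markers before the text. Official examples typically clean this with the library's own postprocess helper; below is a self-contained sketch of the tag structure (the sample string and tag set are assumptions; check your model version).

```python
import re

# SenseVoice emits special tokens in-line (language, emotion, event) before
# the transcript text. A raw result string looks roughly like the example
# below; the exact tag set depends on the model version (assumed here).
raw = "<|zh|><|NEUTRAL|><|Speech|><|woitn|>大家好欢迎收听本期节目"

def split_sensevoice_tags(text):
    """Separate <|...|> control tokens from the transcript body."""
    tags = re.findall(r"<\|([^|]+)\|>", text)
    body = re.sub(r"<\|[^|]+\|>", "", text)
    return tags, body

tags, body = split_sensevoice_tags(raw)
print(tags)   # → ['zh', 'NEUTRAL', 'Speech', 'woitn']
print(body)   # → 大家好欢迎收听本期节目
```

生产中建议直接使用 FunASR 自带的后处理工具,而不是自己写正则。 In production, prefer FunASR's bundled postprocess utilities over hand-rolled regexes.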

5. 流式与部署 (Runtime & ONNX)

5. Streaming & Deployment (Runtime & ONNX)

离线处理架构 Offline Architecture

适合长视频、会议纪要,有极高的吞吐量与时间戳对齐精度。 Ideal for long videos and meeting minutes: high throughput and precise timestamp alignment.

  • 完整音频文件 / Full audio file
  • VAD 端点检测切分 / VAD segmentation
  • Paraformer 批量推理 / Paraformer batch inference
  • 标点 & 时间戳对齐 / Punctuation & timestamp alignment

流式处理架构 Streaming Architecture

适合直播字幕、语音助手。低延迟,边说边出字。 Ideal for live subtitles and voice assistants: low latency, text appears as you speak.

  • 音频流 / Audio stream
  • 分块 / Chunking
  • 流式模型 + 缓存记忆 / Streaming model + cache
  • 增量文本输出 / Incremental text output

高并发 Runtime 服务 High-Concurrency Runtime Service

工业级在线部署,处理成百上千的并发请求。 Industrial online deployment handling hundreds of concurrent streams.

  • 客户端(gRPC/WebSocket)/ Client (gRPC/WebSocket)
  • 负载均衡与线程池 / Load balancer & thread pool
  • C++ SDK 推理内核 / C++ SDK inference core
  • 2-pass 融合输出 / 2-pass fusion output

FunASR 提供了专门的 Runtime 库用于生产部署,支持 C++ SDK、Docker、高并发调度,并支持导出为 ONNX 以实现跨平台 CPU 推理。

FunASR provides a dedicated Runtime repository for production deployment, supporting C++ SDKs, Docker, high-concurrency scheduling, and ONNX exports for cross-platform CPU inference.

为什么 runtime 重要 Why Runtime Matters

因为真正上线时,问题不只是“能不能识别”,而是“能不能长期稳定地识别很多路请求”。 Because in production, the problem is not merely "can it recognize speech," but "can it keep recognizing many concurrent requests reliably over time."

2-pass 的意义 Why 2-pass Matters

在线阶段先给出低延迟结果,句末或完整段落再做精修,这通常比只追求单次输出更贴近真实产品体验。 The online stage emits low-latency partial results first, then refines them at sentence end or full segment completion, which is usually much closer to real product behavior than a single-shot output.
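2-pass 的控制流可以用一个极简示意表达:流式通道逐 chunk 出 partial,检测到句末后用离线通道整句精修并替换。注意:这里的两个“识别器”都是占位函数,并非 FunASR 的真实调用。 The 2-pass control flow can be sketched minimally: the streaming path emits partials per chunk, and at sentence end an offline path re-decodes and replaces the whole sentence. Both "recognizers" here are stand-ins, not real FunASR calls.

```python
# Conceptual 2-pass flow: a fast streaming pass emits partial text per chunk;
# when a sentence boundary is detected, a slower offline pass re-decodes the
# whole sentence and replaces the partial hypothesis.

def two_pass(chunks, streaming_decode, offline_decode, is_sentence_end):
    """Yield (kind, text): 'partial' during the sentence, 'final' at its end."""
    pending = []
    for chunk in chunks:
        pending.append(chunk)
        yield "partial", streaming_decode(pending)
        if is_sentence_end(chunk):
            yield "final", offline_decode(pending)  # refined replacement
            pending = []

# Toy stand-ins: streaming is "lossy" (lowercase), offline is "accurate".
events = list(two_pass(
    ["Hello", "world."],
    streaming_decode=lambda cs: " ".join(cs).lower(),
    offline_decode=lambda cs: " ".join(cs),
    is_sentence_end=lambda c: c.endswith("."),
))
print(events)
```

真实服务里,“final 替换 partial”这一步正是用户看到字幕“先出后改”的原因。 In a real service, this "final replaces partial" step is exactly why live subtitles appear first and then get corrected.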
# 导出 ONNX / Export to ONNX
funasr-export ++model=paraformer ++quantize=false ++device=cpu

5.1 流式 chunk 演示器

5.1 Streaming Chunk Demo

下面这个交互块不是在跑真实声学模型,而是在概念层面模拟 FunASR 流式推理时的窗口移动。 你可以把蓝色块理解为已经进入历史缓存的 chunk,把高亮框理解为当前推理窗口,把输出区理解为持续累积的 partial text。

The interactive block below does not run a real acoustic model; it simulates the concept of how a FunASR streaming window advances. Think of the blue blocks as chunks that have already entered history cache, the highlighted frame as the current inference window, and the output area as incrementally accumulated partial text.

图例 Legend:缓存 Cache(历史 chunk)· 当前 Current(推理窗口)· 右看 Lookahead(可见的未来帧)
Legend: Cache (history chunks) · Current (inference window) · Lookahead (visible future frames)
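同样的窗口移动逻辑,也可以用几行纯 Python 表达(无声学模型,只有窗口记账)。 The same window-advancing logic can be written in a few lines of pure Python (no acoustics, just window bookkeeping).

```python
# A pure-Python stand-in for the demo above: each step moves the inference
# window one chunk to the right; everything left of it becomes "cache",
# and the output string grows incrementally.

def simulate_stream(chunks, lookahead=1):
    states = []
    output = ""
    for i, chunk in enumerate(chunks):
        cache = chunks[:i]                        # history the model remembers
        future = chunks[i + 1:i + 1 + lookahead]  # frames it may peek at
        output += chunk                           # pretend-decode of this window
        states.append({"cache": list(cache), "current": chunk,
                       "lookahead": list(future), "partial": output})
    return states

steps = simulate_stream(["你好", ",世", "界"], lookahead=1)
print(steps[-1]["partial"])  # → 你好,世界
```

真实流式推理中,“cache”不是原始音频,而是编码器/解码器的隐状态,但移动窗口的直觉是一致的。 In real streaming inference, the "cache" holds encoder/decoder hidden states rather than raw audio, but the moving-window intuition is the same.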

怎么理解 chunk_size

How to read chunk_size

它控制一次前向推理处理多大时间窗,以及允许看多少未来信息。窗更大,通常更稳;窗更小,通常更快。

It controls how much time each forward pass consumes and how much future context is visible. Larger windows are usually more stable; smaller windows are usually faster.

怎么理解 is_final

How to read is_final

它可以理解为“告诉模型流结束了,可以把尾巴上的字吐干净了”。没有 final flush,最后几个 token 可能还挂在缓存里。

Think of it as telling the model: “the stream has ended—flush the tail.” Without a final flush, the last few tokens may remain buffered.

5.2 我会怎么理解部署选型

5.2 How I think about deployment choices

6. 场景模型推荐器(交互式)

6. Scenario Model Recommender (Interactive)

按三个问题筛选:Filter by three questions:

  • 你的主要应用场景是什么? What is your primary use case?
  • 你需要“边说边出字”(低延迟流式)吗? Do you need real-time streaming (low latency)?
  • 你需要区分“谁在说话”吗(说话人分离)? Do you need speaker diarization (who spoke when)?

🎯 SenseVoice

它是 FunASR 家族中的“语音理解基础模型”。不仅能转写多语言,还能识别情绪(开心/生气)和声音事件(笑声/掌声)。It's the "Speech Understanding Foundation Model" in the FunASR family. It handles multiple languages, emotion detection, and acoustic events (laughter/applause).

🎯 Fun-ASR-Nano

适合大范围国际化业务,支持 31 种语言的端到端转写。Ideal for broad internationalization, supporting E2E transcription across 31 languages.

🎯 Paraformer-Streaming / 2-Pass Runtime

你应该使用带有 -streaming 后缀的模型,或直接部署官方的 C++ Runtime 在线服务。注意配置好 chunk_size 以平衡延迟与精度。You should use the -streaming models, or deploy the official C++ Runtime online service. Make sure to tune chunk_size to balance latency and accuracy.

🎯 Paraformer + cam++ (说话人分离)

在标准的离线转写管道上叠加 cam++ 模型。它会提取说话人声纹特征并进行聚类,输出带角色标签的纪要。Stack the cam++ model on top of the standard offline pipeline. It extracts speaker voiceprints and clusters them, outputting role-tagged transcripts.

🎯 Paraformer + VAD + Punc (经典离线管道)

最稳健的工业组合。VAD 切分长音频,Paraformer 主力识别,最后用 CT-Punc 加上标点,FA-ZH 对齐时间戳。The most robust industrial combo. VAD segments audio, Paraformer recognizes, CT-Punc adds punctuation, and FA-ZH aligns timestamps.
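上面的推荐卡片可以压缩成一棵很小的决策树。下面的映射是我对卡片内容的归纳,不是官方 API,优先级取舍仅供参考。 The recommendation cards above collapse into a tiny decision tree. The mapping below is my reading of the cards, not an official API; the priority order is only a suggestion.

```python
# A sketch of the wizard's decision tree. Model names follow the cards above;
# the branching priority is an editorial assumption, not FunASR logic.

def recommend(multilingual=False, needs_emotion=False,
              streaming=False, diarization=False):
    if needs_emotion:
        return "SenseVoice"
    if multilingual:
        return "Fun-ASR-Nano"
    if streaming:
        return "Paraformer-Streaming / 2-pass Runtime"
    if diarization:
        return "Paraformer + cam++"
    return "Paraformer + fsmn-vad + ct-punc"

print(recommend(streaming=True))    # → Paraformer-Streaming / 2-pass Runtime
print(recommend(diarization=True))  # → Paraformer + cam++
```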


根据您的具体需求,选择最合适的 FunASR 组件组合。点击下方选项卡查看详细架构与参数建议:

Select the most appropriate FunASR component combination based on your specific requirements. Click the tabs below to view detailed architectures and parameter recommendations:

长音频高精转写管道 (会议纪要、播客、视频字幕)

High-Precision Pipeline for Long Audio (Meetings, Podcasts, Subs)

这是目前最稳健的离线组合。利用 VAD 切割长语音,Paraformer 主力识别,最后用 CT-Punc 和 FA-ZH 赋予文本标点与精确的时间戳对齐。

The most robust offline stack. VAD segments long audio, Paraformer recognizes the text, and CT-Punc & FA-ZH provide punctuation and precise timestamp alignment.

paraformer-zh fsmn-vad ct-punc fa-zh
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")

低延迟流式听写管道 (直播字幕、语音助手)

Low-latency Streaming Pipeline (Live Subs, Voice Assistants)

使用带有流式后缀的 Paraformer 模型。需要精确控制 chunk_size 与 cache,以在延迟(通常设定为 600ms)和识别精度间取得平衡。

Utilizes the streaming variant of Paraformer. Requires precise control over chunk_size and cache to balance latency (typically ~600ms) and accuracy.

paraformer-zh-streaming Chunking (600ms) Cache Management
model = AutoModel(model="paraformer-zh-streaming")
res = model.generate(input=chunk, cache=cache, is_final=is_final, chunk_size=[0, 10, 5])

泛语音理解基础模型 (跨语种转写、视频事件分析)

Universal Speech Understanding (Cross-lingual, Event Analysis)

如果音频包含中英混杂,甚至日韩小语种,或者你需要知道说话者的情绪(开心/生气)与环境音(鼓掌/笑声),请放弃 Paraformer,直接上 SenseVoice。

If your audio contains mixed languages, or you need to detect emotion (happy/angry) and ambient sounds (applause/laughter), switch from Paraformer to SenseVoice.

SenseVoiceSmall fsmn-vad Emotion / LID
model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad")
res = model.generate(input="audio.wav", language="auto", use_itn=True)

结构化会议记录 (谁在什么时候说了什么)

Structured Meeting Minutes (Who said what, and when)

在标准 ASR 管道上额外叠加 cam++ 模型。它会在识别文本的同时,提取说话人特征(Speaker Embeddings)并进行聚类,输出带 SPK-ID 的结构化日志。

Adds the cam++ model on top of the standard ASR pipeline. It extracts Speaker Embeddings and performs clustering, outputting structured logs with SPK-IDs.

paraformer-zh cam++ (Speaker) Clustering
model = AutoModel(model="paraformer-zh", spk_model="cam++", vad_model="fsmn-vad")
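拿到带说话人标签的结果后,通常还要格式化成可读的纪要。下面假设结果里有一个 sentence_info 列表,每项含 spk/start/end/text 字段(这是对当前 FunASR 输出结构的理解,字段名请以你安装的版本为准)。 After getting speaker-tagged results, you usually format them into readable minutes. The sketch below assumes a sentence_info list whose items carry spk/start/end/text fields (my understanding of current FunASR output; verify field names against your installed release).

```python
# Formatting a diarization result into readable minutes. The sentence_info
# structure (spk/start/end/text, timestamps in milliseconds) is an assumption
# about the spk_model pipeline's output -- check your FunASR version.

def to_minutes(sentence_info):
    lines = []
    for seg in sentence_info:
        start_s = seg["start"] / 1000  # assumed: timestamps in milliseconds
        lines.append(f"[{start_s:07.2f}s] Speaker {seg['spk']}: {seg['text']}")
    return "\n".join(lines)

demo = [
    {"spk": 0, "start": 0, "end": 2300, "text": "大家好,开始今天的会议。"},
    {"spk": 1, "start": 2500, "end": 4100, "text": "好的,我先汇报进展。"},
]
print(to_minutes(demo))
```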

7. 优势与局限

7. Advantages and Limitations

优势

Pros

  • 从模型到部署链路非常完整。
  • 统一 AutoModel 显著降低组装难度。
  • 原生支持 VAD、标点与热词,极具工程实用性。
  • 官方维护 C++ Runtime 与 ONNX 支持。
  • Complete pipeline from model to deployment.
  • Unified AutoModel drastically reduces boilerplate.
  • Native support for VAD, Punctuation, and Hotwords.
  • Officially maintained C++ Runtime and ONNX export.

局限

Cons

  • 能力过广,初学者容易被参数海洋淹没。
  • 文档与版本迭代极快,部分示例偶尔脱节。
  • 统一 API 封装过深,定制底层逻辑需要扒源码。
  • Broad capabilities can overwhelm beginners.
  • Fast iterations cause occasional documentation lag.
  • Deep API encapsulation makes low-level custom logic harder.

8. 总结

8. Conclusion

如果你的目标是处理长音频、生成带时间戳和标点的字幕、或者需要高并发的在线服务部署,FunASR 是目前最成熟的开源选择之一。它最大的价值在于将散落的论文能力,拧成了一股工业可用的绳。

If your goal is handling long audio, generating timestamped/punctuated subtitles, or deploying high-concurrency online services, FunASR is one of the most mature open-source choices. Its true value lies in weaving scattered academic capabilities into an industrial-strength rope.