FunASR 调研:从原理、模型家族到部署实践
FunASR Survey: Principles, Model Family, and Deployment
如果只看一句话:FunASR 不是单一 ASR 模型,而是一套把 VAD、ASR、标点恢复、时间戳、说话人能力和运行时部署串起来的语音工具箱。 它试图把学术模型和工业落地之间的那一段“最后一公里”补齐。
In one sentence: FunASR is not a single ASR model, but a speech toolkit that connects VAD, ASR, Punctuation Restoration, Timestamps, Speaker Diarization, and Runtime Deployment. It bridges the "last mile" between academic models and industrial deployment.
- FunASR 的核心价值不是“某一个模型多强”,而是统一接口 + 可组合流水线 + 可部署 runtime。
- 如果你做中文离线转写,最常见组合是:VAD + Paraformer + 标点 + 时间戳。
- 如果你做多语言/多任务语音理解,SenseVoice 更值得优先看。
- 如果你要实时上屏或服务化,FunASR 的流式模型、2-pass runtime、ONNX 导出才是它和普通 demo 仓库拉开差距的地方。
- FunASR's core value isn't just "one strong model", but its Unified API + Composable Pipeline + Deployable Runtime.
- For Chinese offline transcription, the go-to stack is: VAD + Paraformer + Punctuation + Timestamps.
- For multilingual/multi-task speech understanding, prioritize SenseVoice.
- For real-time streaming or production services, FunASR's streaming models, 2-pass runtime, and ONNX export set it apart from basic demo repos.
1. FunASR 是什么:它解决的不是“识别”,而是“整条链路”
1. What is FunASR: Solving the "Pipeline", Not Just "Recognition"
官方 README 对 FunASR 的定位很直接:它希望在语音识别的学术研究和工业应用之间搭桥。 所以你会发现它不是只给你一个 checkpoint,然后让你自己拼剩下的一切;相反,它把工程里常见的一整条链路都显式放到了工具层里。
The official README describes FunASR explicitly as a bridge between academic speech research and industrial application. Instead of giving you a raw checkpoint and leaving you to build the rest, it explicitly integrates the entire common engineering pipeline into the toolkit layer.
官方原话的重点其实只有两个:工业级模型 和 研究到生产的桥梁。 这意味着 FunASR 评价自己的方式,不只是 paper 指标,而是“这个能力能不能真正被接到业务里”。
The official description focuses on two keywords: Industrial-grade Models and Bridge from Research to Production. This means FunASR evaluates itself not just by paper metrics, but by whether the capabilities can be successfully integrated into real business logic.
它覆盖哪些能力
What It Covers
- ASR(自动语音识别)
- VAD(语音端点检测)
- 标点恢复 / ITN
- 时间戳预测
- 说话人确认 / 分离
- 关键词唤醒、多说话人 ASR、情感识别等
- ASR (Automatic Speech Recognition)
- VAD (Voice Activity Detection)
- Punctuation Restoration / ITN
- Timestamp Prediction
- Speaker Verification / Diarization
- Keyword Spotting, Multi-talker ASR, Emotion Recognition, etc.
它和“单模型 demo”最大的差别
Difference from "Single Model Demos"
- 统一的 AutoModel 接口
- 离线与流式能力同时存在
- 支持 ModelScope / Hugging Face / OpenAI 等模型来源
- 提供 runtime SDK、服务部署、ONNX 导出
- Unified AutoModel API
- Both offline and streaming capabilities
- Supports hubs like ModelScope / Hugging Face / OpenAI
- Provides runtime SDKs, service deployment, and ONNX exports
2. 核心原理:统一接口背后,其实是一条可组合的语音流水线
2. Core Principles: A Composable Speech Pipeline Behind a Unified API
官方示例里最常见的写法是 from funasr import AutoModel。这背后的设计哲学很清楚:
把“模型选择”和“链路编排”收敛到一个入口,让用户可以按需把 VAD、ASR、标点、说话人、时间戳拼起来。
The most common idiom in the official examples is from funasr import AutoModel. The design philosophy is clear:
Converge "model selection" and "pipeline orchestration" into a single entry point, allowing users to combine VAD, ASR, Punctuation, Diarization, and Timestamps on demand.
2.1 长音频为什么先过 VAD
2.1 Why Long Audio Goes Through VAD First
启用 vad_model 后,VAD 会先把长音频切成更短的片段,再送给 ASR。这有两个直接收益:避免显存溢出与高延迟,以及让离线模型能够处理任意长度的长音频(例如数小时的播客)。
When vad_model is enabled, VAD first cuts long audio into shorter segments before feeding them to the ASR model. This yields two direct benefits: it avoids out-of-memory errors and high latency, and it lets offline models handle arbitrarily long recordings such as multi-hour podcasts.
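fsmn-vad 这类端点检测模型的输出在概念上就是一组毫秒级区间 [[beg_ms, end_ms], ...];下面用纯 Python 示意如何按这类区间切分 16 kHz 波形(slice_by_vad 是假想的辅助函数,并非 FunASR API)。
Conceptually, a VAD model like fsmn-vad emits millisecond segments of the form [[beg_ms, end_ms], ...]; the sketch below shows slicing a 16 kHz waveform by such segments (slice_by_vad is a hypothetical helper, not a FunASR API).

```python
# 概念示意:按 VAD 片段(毫秒区间)切分 16 kHz 波形
# Conceptual sketch: slice a 16 kHz waveform by VAD segments given in ms.
SAMPLE_RATE = 16000

def slice_by_vad(waveform, segments_ms):
    """waveform: 采样点序列; segments_ms: [[beg_ms, end_ms], ...]"""
    clips = []
    for beg_ms, end_ms in segments_ms:
        beg = beg_ms * SAMPLE_RATE // 1000
        end = end_ms * SAMPLE_RATE // 1000
        clips.append(waveform[beg:end])
    return clips

# 60 秒占位音频中,假设 VAD 找到两段语音
audio = [0.0] * (60 * SAMPLE_RATE)
segs = [[1000, 4000], [10000, 12500]]
clips = slice_by_vad(audio, segs)
durations = [len(c) / SAMPLE_RATE for c in clips]  # 每段时长(秒)
```

切出的每个短片段独立送入 ASR,显存占用只和单段长度相关,而与整条音频的总时长无关。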
2.2 流式识别的关键:Chunk 与 Cache
2.2 The Key to Streaming: Chunk and Cache
在 streaming 示例里,FunASR 使用 chunk_size 和 cache 等参数组织推理。它并不是简单切片,而是显式建模流式上下文:chunk_size 决定出字粒度和未来视野,而 cache 保留历史状态。
In streaming examples, FunASR uses parameters like chunk_size and cache. It doesn't just slice audio; it explicitly models streaming context: chunk_size dictates emission granularity and look-ahead, while cache preserves historical state.
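按官方流式示例的参数口径(假设每个编码器帧对应 60ms 帧移),chunk_size=[0, 10, 5] 的时延可以这样换算。
Following the parameter convention in the official streaming examples (assuming a 60 ms frame shift per encoder frame), the latency implied by chunk_size=[0, 10, 5] works out as follows.

```python
# chunk_size = [0, 10, 5]: 中间的 10 表示一个 chunk 含 10 个 60ms 帧
# The middle value (10) means one chunk spans 10 frames of 60 ms each.
SAMPLE_RATE = 16000
FRAME_MS = 60                      # 每帧 60ms(官方示例口径)
chunk_size = [0, 10, 5]

chunk_ms = chunk_size[1] * FRAME_MS       # 一次前向覆盖的时间窗 ≈ 出字时延
chunk_stride = chunk_size[1] * 960        # 每个 chunk 的采样点数 (960 采样点 = 60ms @ 16kHz)
```

也就是说,[0, 10, 5] 对应约 600ms 的出字粒度;换成 [0, 8, 4] 则约 480ms,延迟更低但稳定性稍差。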
3. 模型家族:分工明确的组件矩阵
3. Model Family: A Matrix of Specialized Components
FunASR 把不同任务拆成了多个可搭配的模型,你可以按场景裁剪能力和成本。
FunASR splits tasks into multiple composable models, letting you tailor capabilities and costs to your scenario.
Paraformer
核心非自回归端到端识别模型。高精度、高效率,是中文离线/流式的主力。
Core Non-Autoregressive E2E recognition model. High accuracy and efficiency, the main workhorse for Chinese offline/streaming.
SenseVoice
偏“语音理解基础模型”。包含多语种识别 (LID)、情绪识别 (SER)、声学事件 (AED)。
Leans towards "Speech Understanding Foundation Model". Includes LID, Emotion Recognition (SER), and Acoustic Events (AED).
Fun-ASR-Nano
端到端大模型,支持 31 种语言和低延迟实时转写。
Large E2E model supporting 31 languages and low-latency real-time transcription.
周边模块
Peripheral Modules
fsmn-vad(端点), ct-punc(标点), fa-zh(时间戳), cam++(说话人)。
fsmn-vad (VAD), ct-punc (Punctuation), fa-zh (Timestamps), cam++ (Diarization).
4. 安装与上手
4. Installation & Getting Started
pip3 install -U funasr modelscope huggingface_hub
4.1 一行代码构建完整流水线
4.1 Complete Pipeline in One Line
from funasr import AutoModel

# VAD 切分 + Paraformer 识别 + CT-Punc 标点,一行拼出完整管道
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")
# batch_size_s: 按音频总时长(秒)做动态 batch;hotword: 热词激励
res = model.generate(input="audio.wav", batch_size_s=300, hotword="魔搭")
4.2 SenseVoice 语音理解示例
4.2 SenseVoice Speech Understanding Example
model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad")
# language="auto" 自动检测语种;use_itn=True 启用逆文本规整(数字、日期等)
res = model.generate(input="audio.mp3", language="auto", use_itn=True)
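需要注意,SenseVoice 的原始输出会内嵌 <|zh|><|NEUTRAL|><|Speech|> 这类富标签;FunASR 自带 rich_transcription_postprocess 做规整。下面只示意最朴素的去标签逻辑,真实后处理还会做表情符号映射等,此处仅为概念草图。
Note that SenseVoice's raw output embeds rich tags such as <|zh|><|NEUTRAL|><|Speech|>; FunASR ships rich_transcription_postprocess for cleanup. The sketch below only shows the naive tag-stripping step, not the full postprocessing.

```python
import re

# 去掉形如 <|...|> 的 SenseVoice 标签,只保留正文
# Strip SenseVoice tags of the form <|...|>, keeping only the transcript.
def strip_sensevoice_tags(text):
    return re.sub(r"<\|[^|]*\|>", "", text).strip()

raw = "<|zh|><|NEUTRAL|><|Speech|><|withitn|>今天天气不错。"
clean = strip_sensevoice_tags(raw)
```

如果你需要保留语种、情绪、事件信息,应该解析这些标签而不是直接丢弃。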
5. 流式与部署 (Runtime & ONNX)
5. Streaming & Deployment (Runtime & ONNX)
离线处理架构
Offline Architecture
适合长视频、会议纪要。极高的吞吐量与时间戳对齐精度。
Ideal for long videos and meeting minutes. High throughput and precise timestamp alignment.
流式处理架构
Streaming Architecture
适合直播字幕、语音助手。低延迟,边说边出字。
Ideal for live subtitles and voice assistants. Low latency, real-time output.
高并发 Runtime 服务
High-Concurrency Runtime Service
工业级在线部署。处理成百上千的并发请求。
Industrial-grade online deployment. Handles hundreds of concurrent streams.
FunASR 提供了专门的 Runtime 库用于生产部署,支持 C++ SDK、Docker、高并发调度,并支持导出为 ONNX 以实现跨平台 CPU 推理。
FunASR provides a dedicated Runtime repository for production deployment, supporting C++ SDKs, Docker, high-concurrency scheduling, and ONNX exports for cross-platform CPU inference.
# 导出 ONNX / Export to ONNX
funasr-export ++model=paraformer ++quantize=false ++device=cpu
5.1 流式 chunk 演示器
5.1 Streaming Chunk Demo
下面这个交互块不是在跑真实声学模型,而是在概念层面模拟 FunASR 流式推理时的窗口移动。 你可以把蓝色块理解为已经进入历史缓存的 chunk,把高亮框理解为当前推理窗口,把输出区理解为持续累积的 partial text。
The interactive block below does not run a real acoustic model; it simulates the concept of how a FunASR streaming window advances. Think of the blue blocks as chunks that have already entered history cache, the highlighted frame as the current inference window, and the output area as incrementally accumulated partial text.
怎么理解 chunk_size
How to read chunk_size
它控制一次前向推理处理多大时间窗,以及允许看多少未来信息。窗更大,通常更稳;窗更小,通常更快。
It controls how much time each forward pass consumes and how much future context is visible. Larger windows are usually more stable; smaller windows are usually faster.
怎么理解 is_final
How to read is_final
它可以理解为“告诉模型流结束了,可以把尾巴上的字吐干净了”。没有 final flush,最后几个 token 可能还挂在缓存里。
Think of it as telling the model: “the stream has ended—flush the tail.” Without a final flush, the last few tokens may remain buffered.
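上面这个概念演示可以用几行纯 Python 勾勒出来:窗口按 chunk 推进,cache 保留历史,is_final 时把尾巴冲刷干净。这里不调用任何真实声学模型,"识别"只是把 token 原样吐出。
The conceptual demo above can be sketched in a few lines of pure Python: the window advances chunk by chunk, cache holds history, and is_final flushes the tail. No real acoustic model is involved; "recognition" just echoes the tokens.

```python
# 纯 Python 概念模拟:窗口推进 + cache 累积 + is_final 冲刷
# Pure-Python simulation: window advance + cache + final flush.
def fake_streaming_asr(tokens, chunk_size=3):
    cache = {"pending": []}   # 模拟跨 chunk 的历史状态
    partial = []
    n = len(tokens)
    for i in range(0, n, chunk_size):
        chunk = tokens[i:i + chunk_size]
        is_final = i + chunk_size >= n
        cache["pending"].extend(chunk)
        if is_final:
            # final flush:把缓存里挂着的尾巴全部吐出
            partial.extend(cache["pending"])
            cache["pending"] = []
        else:
            # 非 final:留下最后一个 token,模拟"等未来上下文再定夺"
            partial.extend(cache["pending"][:-1])
            cache["pending"] = cache["pending"][-1:]
    return "".join(partial)

result = fake_streaming_asr(list("今天天气不错"))
```

如果去掉 is_final 分支,最后一个 token 会永远留在 cache 里,这正是 5.1 里说的"没有 final flush,尾巴吐不干净"。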
5.2 我会怎么理解部署选型
5.2 How I think about deployment choices
- 如果你只是离线批量转写,Python + AutoModel 往往已经够用。
- 如果你需要高并发、稳定服务、镜像化交付,应该尽早转向 runtime SDK / docker。
- 如果你的目标平台偏 CPU、跨平台或推理引擎统一,ONNX 路线通常更自然。
- If you only need offline batch transcription, Python + AutoModel is often enough.
- If you need concurrency, service stability, and containerized delivery, move early toward runtime SDK / docker.
- If your target platform is CPU-heavy, cross-platform, or uses a unified inference engine, ONNX is often the more natural path.
6. 场景模型推荐器(交互式)
6. Scenario Model Recommender (Interactive)
根据您的具体需求,选择最合适的 FunASR 组件组合。点击下方选项卡查看详细架构与参数建议:
Select the most appropriate FunASR component combination based on your specific requirements. Click the tabs below to view detailed architectures and parameter recommendations:
长音频高精转写管道 (会议纪要、播客、视频字幕)
High-Precision Pipeline for Long Audio (Meetings, Podcasts, Subs)
这是目前最稳健的离线组合。利用 VAD 切割长语音,Paraformer 主力识别,最后用 CT-Punc 和 FA-ZH 赋予文本标点与精确的时间戳对齐。
The most robust offline stack. VAD segments long audio, Paraformer recognizes the text, and CT-Punc & FA-ZH provide punctuation and precise timestamp alignment.
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")
低延迟流式听写管道 (直播字幕、语音助手)
Low-latency Streaming Pipeline (Live Subs, Voice Assistants)
使用带有流式后缀的 Paraformer 模型。需要精确控制 chunk_size 和 cache 以在延迟(通常设定为 600ms)和识别精度间取得平衡。
Utilizes the streaming variant of Paraformer. Requires precise control over chunk_size and cache to balance latency (typically ~600ms) and accuracy.
model = AutoModel(model="paraformer-zh-streaming")
# chunk_size=[0, 10, 5]: 10 帧 × 60ms ≈ 600ms 出字时延 / ~600 ms emission latency
res = model.generate(input=chunk, cache=cache, is_final=is_final, chunk_size=[0, 10, 5])
泛语音理解基础模型 (跨语种转写、视频事件分析)
Universal Speech Understanding (Cross-lingual, Event Analysis)
如果音频包含中英混杂,甚至日韩小语种,或者你需要知道说话者的情绪(开心/生气)与环境音(鼓掌/笑声),请放弃 Paraformer,直接上 SenseVoice。
If your audio contains mixed languages, or you need to detect emotion (happy/angry) and ambient sounds (applause/laughter), switch from Paraformer to SenseVoice.
model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad")
res = model.generate(input="audio.wav", language="auto", use_itn=True)
结构化会议记录 (谁在什么时候说了什么)
Structured Meeting Minutes (Who said what, and when)
在标准 ASR 管道上额外叠加 cam++ 模型。它会在识别文本的同时,提取说话人特征(Speaker Embeddings)并进行聚类,输出带 SPK-ID 的结构化日志。
Adds the cam++ model on top of the standard ASR pipeline. It extracts Speaker Embeddings and performs clustering, outputting structured logs with SPK-IDs.
model = AutoModel(model="paraformer-zh", spk_model="cam++", vad_model="fsmn-vad")
7. 优势与局限
7. Advantages and Limitations
优势
Pros
- 从模型到部署链路非常完整。
- 统一的 AutoModel 显著降低组装难度。
- 原生支持 VAD、标点与热词,极具工程实用性。
- 官方维护 C++ Runtime 与 ONNX 支持。
- Complete pipeline from model to deployment.
- Unified AutoModel drastically reduces boilerplate.
- Native support for VAD, punctuation, and hotwords.
- Officially maintained C++ Runtime and ONNX export.
局限
Cons
- 能力过广,初学者容易被参数海洋淹没。
- 文档与版本迭代极快,部分示例偶尔脱节。
- 统一 API 封装过深,定制底层逻辑需要扒源码。
- Broad capabilities can overwhelm beginners.
- Fast iterations cause occasional documentation lag.
- Deep API encapsulation makes low-level custom logic harder.
8. 总结
8. Conclusion
如果你的目标是处理长音频、生成带时间戳和标点的字幕、或者需要高并发的在线服务部署,FunASR 是目前最成熟的开源选择之一。它最大的价值在于将散落的论文能力,拧成了一股工业可用的绳。
If your goal is handling long audio, generating timestamped/punctuated subtitles, or deploying high-concurrency online services, FunASR is one of the most mature open-source choices. Its true value lies in weaving scattered academic capabilities into an industrial-strength rope.