Xiang An

Xiang An (Chinese: 安翔) specializes in computer vision and multimodal large models. His research is listed on Google Scholar, and his open-source projects are on GitHub (34,177+ stars in total). His current research focuses on building a next-generation Vision Transformer (ViT) to meet the pressing needs of modern multimodal large language models (MLLMs). He is also the #2 contributor to the InsightFace ecosystem (~27k⭐).


For a complete list of publications, see All Publications.



Publications §


The following is a selection of notable publications. For a complete list, see All Publications.


LLaVA-OneVision-2. Preprint, 2026. Turns codec-aligned sparsity into a fully open 8B-class multimodal training recipe, feeding the OneVision-Encoder codec-based dense video input to preserve the native temporal signal. A four-stage progressive training pipeline extends video understanding from 30-second clips to 15-minute footage.
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence. Preprint, 2026. Proposes codec-aligned sparsity as a foundational principle, focusing computation exclusively on the 3.1%–25% of visual regions rich in signal entropy. Achieves a 4.1% average improvement over Qwen3-ViT on video understanding, showing that efficiency and accuracy can be positively correlated. Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training. Preprint, 2025. A fully open-source multimodal training framework that releases all code, data, checkpoints, and training logs to democratize MLLM research. Introduces an improved open-source ViT and shows that simply scaling dense captions significantly improves overall multimodal task performance. Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Changrui Chen, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, Jiankang Deng, et al.
UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning. AAAI 2026 (Oral). Introduces the MLLM-as-a-Judge paradigm for universal multimodal embedding learning, using an MLLM to evaluate and guide embedding quality. Improves cross-modal retrieval by leveraging feedback signals richer than simple similarity metrics. Tiancheng Gu, Kaicheng Yang, Kaichen Zhang, Xiang An, Ziyong Feng, Yueyi Zhang, Weidong Cai, Jiankang Deng, Lidong Bing
Region-based Cluster Discrimination for Visual Representation Learning. ICCV 2025 (Highlight). Introduces region-level cluster discrimination for self-supervised visual representation learning, capturing fine-grained spatial relationships that global methods miss. Already adopted by state-of-the-art models including openPangu-VL-7B, LLaVA-OneVision-1.5, and Innovator-VL. Yin Xie, Kaicheng Yang, Xiang An (Project Leader), Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Jiankang Deng
Multi-label Cluster Discrimination for Visual Representation Learning. ECCV 2024. Proposes multi-label cluster discrimination, which assigns multiple cluster labels to each image to capture the multi-faceted semantic structure of visual scenes. Integrated into HuggingFace Transformers as an official model and adopted by the LLaVA-NeXT community. Xiang An, Kaicheng Yang, Xiangzi Dai, Ziyong Feng, Jiankang Deng
Unicom: Universal and Compact Representation Learning for Image Retrieval. ICLR 2023. Presents a universal and compact representation learning framework for large-scale image retrieval, achieving state-of-the-art results with significantly smaller models. Serves as a foundation for scalable image retrieval systems in both academic research and industrial applications. Xiang An, Jiankang Deng, Kaicheng Yang, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu
Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC. CVPR 2022. Solves the GPU memory bottleneck in large-scale face recognition by dynamically sampling identity subsets during training, enabling training on 10M+ identities on a single machine. Achieved 1st place in the NIST FRVT competition. The sampling process also acts as implicit regularization, improving training efficiency and model robustness at the same time. Xiang An, Jiankang Deng, Jia Guo, Ziyong Feng, Xuhan Zhu, Jing Yang, Tongliang Liu
Partial FC: Training 10 Million Identities on a Single Machine. ICCVW 2021. The original Partial FC method, which reduces the classification layer's memory from O(n) to O(k) through dynamic class center sampling, making million-scale face recognition training accessible. This foundational work has become a standard component of modern face recognition systems and of the InsightFace framework. Xiang An, Xuhan Zhu, Yuan Gao, Yang Xiao, Yongle Zhao, Ziyong Feng, Lan Wu, Bin Qin, Ming Zhang, Debing Zhang, Ying Fu
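
The two Partial FC entries above describe sampling a subset of class centers at each training step, so that the classification layer's memory scales with the sampled count k rather than the total identity count n. The sketch below illustrates that idea in plain Python; it is not the authors' implementation. The function name `sample_class_centers`, its parameters, and the specific policy (keep every class that appears in the batch, then fill the remaining slots with uniformly sampled other classes) are assumptions made for illustration.

```python
import random


def sample_class_centers(labels, num_classes, sample_rate, rng=None):
    """Choose the class centers that take part in one training step.

    Hypothetical helper: keeps every class present in the batch and fills
    the remaining slots with randomly chosen classes not in the batch.
    Returns (sampled_ids, remapped_labels), where remapped_labels index
    into sampled_ids.
    """
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    rng = rng or random.Random()

    # Classes that appear in this batch must always be kept.
    positives = sorted(set(labels))
    positive_set = set(positives)

    # k = number of centers used this step (never fewer than the positives).
    num_sampled = max(int(num_classes * sample_rate), len(positives))

    # Fill the remaining slots with classes absent from the batch.
    negatives = [c for c in range(num_classes) if c not in positive_set]
    extra = rng.sample(negatives, num_sampled - len(positives))

    sampled_ids = sorted(positives + extra)

    # Remap original labels to positions inside the sampled subset.
    index_of = {c: i for i, c in enumerate(sampled_ids)}
    remapped = [index_of[y] for y in labels]
    return sampled_ids, remapped


labels = [3, 7, 3, 42]
ids, remapped = sample_class_centers(
    labels, num_classes=1000, sample_rate=0.1, rng=random.Random(0)
)
print(len(ids))  # 100 centers used instead of 1000
```

Under these assumptions, only the `len(ids)` sampled centers would enter the softmax for this step, which is where the memory saving would come from.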

Awards & Competitions §


ICCV 2025 Outstanding Reviewer
CVPR 2024 Outstanding Reviewer
Ranked 1st in NIST FRVT Competition, Visa Track 1:1
Nominee, 2024 China Annual Power Figures (中国年度力量人物)
Ranked 1st in the graduate entrance examination (major subject)
First Place in Vehicle Re-Identification, PRCV 2019


Open Source §


InsightFace (Open Source Library). #2 contributor to the open-source 2D and 3D deep face analysis library. Author of Glint360K (the largest open-source face recognition training dataset) and Partial FC (which enables training on 10 million identities on a single machine). Also organized the Masked Face Recognition Challenge workshop at ICCV 2021.
LLaVA-OneVision-1.5 (Multimodal LLM Framework). Team leader of this fully open framework designed to democratize multimodal training. Released mid-training and instruction data for community use, and developed an offline sampling pack for efficient training. Implemented RiceViT with native resolution support.
OneVision-Encoder (Vision Encoder). Lead author of this next-generation vision encoder, which introduces codec-aligned sparsity as a foundational principle for multimodal intelligence. Achieves state-of-the-art performance on 16 image, video, and document understanding benchmarks while using substantially fewer visual tokens. Demonstrates a 4.1% average improvement over Qwen3-ViT on video understanding tasks.
UNICOM (Image Retrieval Framework). Lead author and maintainer of the Universal and Compact Representation Learning framework for universal image representations. Designed the novel cluster discrimination approach for representation learning. Developed the multi-label and region-based extensions, published at ECCV 2024 and ICCV 2025 (Highlight), respectively.
LLaVA-NeXT (Large Multimodal Model). Vision module contributor to the next-generation large multimodal model. Enhanced the OCR capability of the vision module for better text recognition in images. Optimized the visual encoder for processing text-rich and document images.
Urban Seg (Educational Project). Author and maintainer of this educational project for semantic segmentation of remote sensing and satellite imagery. Designed a simple single-file training approach for accessibility and integrated popular pretrained models. Created comprehensive tutorials and documentation for beginners.

External Links §


Citation Map §


City-level citing-author locations generated offline from Semantic Scholar + OpenAlex.



This page is styled after Wikipedia.
