Xiang An
Xiang An (Chinese: 安翔) is a Research Scientist and Team Lead of the Multimodal Large Model Group at GlintLab, specializing in computer vision and multimodal large models. His research is listed on Google Scholar, and his open-source projects are on GitHub (34,177+ stars in total). His current research focuses on building the next-generation Vision Transformer (ViT) to address urgent needs in modern MLLMs. He is also the second-largest contributor to the InsightFace ecosystem (~27k⭐).
For a complete list of publications, see All Publications.
Publications
The following is a selection of notable publications.
- OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence Preprint 2026 Homepage Paper Code Bilibili YouTube 我爱计算机视觉 Proposes codec-aligned sparsity as a foundational principle, focusing computation exclusively on the 3.1%–25% of visual regions rich in signal entropy. Achieves a 4.1% average improvement over Qwen3-ViT on video understanding, demonstrating that efficiency and accuracy can be positively correlated.
- LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training Preprint 2025 Homepage RL Page Paper Code 搜狐科技 机器之心 A fully open-source multimodal training framework that releases all code, data, checkpoints, and training logs to democratize MLLM research. Introduces a stronger open-source ViT and demonstrates that simply scaling dense captions significantly improves overall multimodal task performance.
- UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning AAAI 2026 (Oral) Homepage Code Introduces the MLLM-as-a-Judge paradigm for universal multimodal embedding learning, using large language models to evaluate and guide embedding quality. Achieves superior cross-modal retrieval by leveraging rich feedback signals beyond simple similarity metrics.
- Region-based Cluster Discrimination for Visual Representation Learning ICCV 2025 (Highlight) Homepage Paper Code Introduces region-level cluster discrimination for self-supervised visual representation learning, capturing fine-grained spatial relationships that global methods miss. Already adopted by state-of-the-art models including openPangu-VL-7B, LLaVA-OneVision-1.5, and Innovator-VL.
- Multi-label Cluster Discrimination for Visual Representation Learning ECCV 2024 Code 🤗 Transformers MLCD-Seg Proposes multi-label cluster discrimination that assigns multiple cluster labels per image, capturing the multi-faceted semantic structure of visual scenes. Integrated into HuggingFace Transformers as an official model and adopted by the LLaVA-NeXT community.
- Unicom: Universal and Compact Representation Learning for Image Retrieval ICLR 2023 Code Presents a universal and compact representation learning framework for large-scale image retrieval, achieving state-of-the-art results with significantly smaller model sizes. Serves as the foundation for scalable image retrieval systems in both academic research and industrial applications.
- Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC CVPR 2022 Code MXNet PyTorch 知乎 Solves the GPU memory bottleneck in large-scale face recognition by dynamically sampling identity subsets during training, enabling 10M+ identities on a single machine. Achieved 1st place in the NIST FRVT competition, with the sampling process also acting as implicit regularization that simultaneously improves training efficiency and model robustness.
- Partial FC: Training 10 Million Identities on a Single Machine ICCVW 2021 Code MXNet PyTorch 知乎 The original Partial FC method, which reduces the classification layer's memory from O(n) to O(k) through dynamic class-center sampling, making million-scale face recognition training accessible. This foundational work has become a standard component in modern face recognition systems and the InsightFace framework.
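The codec-aligned sparsity idea above, keeping computation only on the small fraction of visual regions rich in signal entropy, can be illustrated with a toy sketch. The entropy proxy below (a per-patch intensity-histogram Shannon entropy), the patch size, the keep ratio, and all function names are assumptions made for illustration, not the paper's actual scoring method.

```python
import numpy as np

def patch_entropy(patch, bins=16):
    """Shannon entropy of a patch's intensity histogram (a crude stand-in
    for a signal-entropy score)."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_sparse_patches(image, patch=8, keep_ratio=0.25):
    """Split an image into non-overlapping patches and keep only the
    highest-entropy fraction, mimicking computation focused on
    information-rich regions."""
    h, w = image.shape
    patches = (image.reshape(h // patch, patch, w // patch, patch)
                    .transpose(0, 2, 1, 3)
                    .reshape(-1, patch * patch))
    scores = np.array([patch_entropy(p) for p in patches])
    k = max(1, int(len(patches) * keep_ratio))
    keep = np.argsort(scores)[-k:]          # indices of top-k patches
    return patches[keep], keep

rng = np.random.default_rng(0)
img = np.zeros((32, 32))
img[:, 16:] = rng.random((32, 16))  # right half textured, left half flat
kept, idx = select_sparse_patches(img, patch=8, keep_ratio=0.25)
print(kept.shape)  # (4, 64) -- only 4 of the 16 patches are kept
```

With a 25% keep ratio, all selected patches come from the textured right half; the flat left half scores zero entropy and is skipped entirely.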
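The Partial FC mechanism described above, replacing the full n-way classification layer with a sampled subset of k class centers at each step, can be sketched as follows. This is a minimal NumPy reimplementation of the sampling idea only; the function name, sampling rate, and the matrix-multiply stand-in for the GPU logit computation are assumptions, not the authors' code.

```python
import numpy as np

def partial_fc_step(features, labels, centers, sample_rate=0.1, rng=None):
    """One Partial FC sampling step (illustrative sketch).

    Instead of computing logits against all n class centers, keep the
    centers of classes present in the batch (the positives) and fill up
    to k = n * sample_rate with randomly sampled negative centers.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = centers.shape[0]
    positives = np.unique(labels)
    k = max(int(n * sample_rate), len(positives))
    negatives = np.setdiff1d(np.arange(n), positives)
    sampled_neg = rng.choice(negatives, size=k - len(positives), replace=False)
    active = np.concatenate([positives, sampled_neg])
    # Remap original labels to indices within the sampled center subset.
    remap = {c: i for i, c in enumerate(active)}
    new_labels = np.array([remap[l] for l in labels])
    logits = features @ centers[active].T  # (batch, k) instead of (batch, n)
    return logits, new_labels

# Usage: 1,000 classes, but each step touches only ~100 centers.
rng = np.random.default_rng(0)
centers = rng.normal(size=(1000, 64))
feats = rng.normal(size=(8, 64))
labels = rng.integers(0, 1000, size=8)
logits, new_labels = partial_fc_step(feats, labels, centers, 0.1, rng)
print(logits.shape)  # (8, 100)
```

Because only the k sampled centers are materialized per step, the classification layer's memory footprint scales with k rather than n, which is what makes training against 10M+ identities feasible on one machine.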
Awards & Competitions
- ICCV 2025 Outstanding Reviewer
- CVPR 2024 Outstanding Reviewer
- Ranked 1st in NIST FRVT Competition, Visa Track 1:1
- Nominated for the 2024 China Power Figures of the Year (中国年度力量人物)
- Ranked 1st in the graduate entrance examination (major)
- First Place in Vehicle Re-Identification, PRCV 2019
Open Source
- InsightFace Open Source Library
- LLaVA-OneVision-1.5 Multimodal LLM Framework
- OneVision-Encoder Vision Encoder
- UNICOM Image Retrieval Framework
- LLaVA-NeXT Large Multimodal Model
- Urban Seg Educational Project
External Links
This page is styled after Wikipedia.