Xiang An
Xiang An (Chinese: 安翔) specializes in computer vision and multimodal large models. You can find his research on Google Scholar and his open-source projects on GitHub (with a total of 34,177+ stars). His current research focuses on building the next-generation Vision Transformer (ViT) to address urgent needs in modern MLLMs. He is also the #2 contributor to the InsightFace ecosystem (~27k⭐).
For a complete list of publications, see All Publications.
Publications
The following is a selection of notable publications. For a complete list, see All Publications.
- LLaVA-OneVision-2. Preprint 2026. Links: Homepage, Code, 🤗 Models. Turns codec-aligned sparsity into a fully open 8B-class multimodal training recipe: the OneVision-Encoder receives codec-based dense video input, which preserves the video's native temporal signal. A four-stage progressive training pipeline extends video understanding from 30-second clips to 15-minute footage.
- OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence. Preprint 2026. Links: Homepage, Paper, Code, Bilibili, YouTube, 我爱计算机视觉. Proposes codec-aligned sparsity as a foundational principle, focusing computation exclusively on the 3.1%–25% of visual regions that are rich in signal entropy. Reports a 4.1% average improvement over Qwen3-ViT on video understanding, suggesting that efficiency and accuracy can be positively correlated.
- LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training. Preprint 2025. Links: Homepage, RL Page, Paper, Code, Megatron-NeMo-AutoModel, 搜狐科技, 机器之心. A fully open-source multimodal training framework that releases all code, data, checkpoints, and training logs to democratize MLLM research. Introduces an improved open-source ViT and shows that simply scaling up dense captions significantly improves overall multimodal task performance.
- UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning. AAAI 2026 (Oral). Links: Homepage, Code. Introduces the MLLM-as-a-Judge paradigm for universal multimodal embedding learning, in which a multimodal large language model evaluates and guides embedding quality. Improves cross-modal retrieval by drawing on feedback signals that are richer than simple similarity metrics.
- Region-based Cluster Discrimination for Visual Representation Learning. ICCV 2025 (Highlight). Links: Homepage, Paper, Code. Introduces region-level cluster discrimination for self-supervised visual representation learning, capturing fine-grained spatial relationships that global methods miss. Already adopted by models including openPangu-VL-7B, LLaVA-OneVision-1.5, and Innovator-VL.
- Multi-label Cluster Discrimination for Visual Representation Learning. ECCV 2024. Links: Code, 🤗 Transformers, MLCD-Seg. Proposes multi-label cluster discrimination, which assigns multiple cluster labels to each image to capture the multi-faceted semantic structure of visual scenes. Integrated into Hugging Face Transformers as an official model and adopted by the LLaVA-NeXT community.
- Unicom: Universal and Compact Representation Learning for Image Retrieval. ICLR 2023. Links: Code. Presents a universal and compact representation learning framework for large-scale image retrieval, achieving state-of-the-art results with significantly smaller models. Serves as a foundation for scalable image retrieval systems in both academic research and industrial applications.
- Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC. CVPR 2022. Links: Code, MXNet, PyTorch, 知乎. Solves the GPU memory bottleneck in large-scale face recognition by dynamically sampling a subset of identities at each training step, which makes it possible to train on 10M+ identities on a single machine. The sampling process also acts as implicit regularization, improving training efficiency and model robustness at the same time. Achieved 1st place in the NIST FRVT competition.
- Partial FC: Training 10 Million Identities on a Single Machine. ICCVW 2021. Links: Code, MXNet, PyTorch, 知乎. Introduces the original Partial FC method, which reduces the classification layer's memory from O(n) to O(k) through dynamic class center sampling, making million-scale face recognition training accessible. This foundational work has become a standard component of modern face recognition systems and of the InsightFace framework.
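The dynamic class center sampling described in the two Partial FC entries above can be illustrated with a minimal sketch. It assumes that n is the total number of identities and k the number of class centers kept per step, that every identity present in the batch is always kept, and that the remaining slots are filled with randomly drawn negatives. The function name `sample_class_centers` and its parameters are hypothetical, and the code is not taken from the InsightFace implementation, which works on GPU tensors.

```python
import random


def sample_class_centers(num_classes, batch_labels, sample_rate, rng):
    """Pick the class centers used for one training step (illustrative only).

    Returns (sampled_ids, remapped_labels): the sorted ids of the k kept
    centers, and each batch label rewritten as an index into that list.
    """
    positive_set = set(batch_labels)
    # Assumption: keep at least every class that appears in the batch.
    k = max(int(num_classes * sample_rate), len(positive_set))

    # Fill the remaining slots with random negatives (rejection sampling,
    # so the cost does not grow with num_classes).
    negatives = set()
    while len(positive_set) + len(negatives) < k:
        candidate = rng.randrange(num_classes)
        if candidate not in positive_set:
            negatives.add(candidate)

    sampled_ids = sorted(positive_set | negatives)
    index_of = {class_id: i for i, class_id in enumerate(sampled_ids)}
    remapped_labels = [index_of[label] for label in batch_labels]
    return sampled_ids, remapped_labels


# Usage: 1,000 identities, 10% sampling rate, a batch of four labels.
rng = random.Random(0)
labels = [3, 3, 998, 42]
sampled_ids, remapped = sample_class_centers(1000, labels, 0.1, rng)
print(len(sampled_ids))  # 100 centers take part in this step instead of 1000
```

In this sketch the logits and the softmax would be computed only against the `sampled_ids` centers, so the per-step cost of the classification layer depends on k rather than on n.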
Awards & Competitions
- ICCV 2025 Outstanding Reviewer
- CVPR 2024 Outstanding Reviewer
- Ranked 1st in NIST FRVT Competition, Visa Track 1:1
- Nominee, 2024 China Annual Power Figures (中国年度力量人物)
- Ranked 1st in the graduate entrance examination (major)
- First Place in Vehicle Re-Identification, PRCV 2019
Open Source
- InsightFace: Open Source Library
- LLaVA-OneVision-1.5: Multimodal LLM Framework
- OneVision-Encoder: Vision Encoder
- UNICOM: Image Retrieval Framework
- LLaVA-NeXT: Large Multimodal Model
- Urban Seg: Educational Project
External Links
Citation Map
City-level citing-author locations generated offline from Semantic Scholar + OpenAlex.
This page is styled after Wikipedia.