Each GPU computes feature embeddings for its local batch. To compute the global loss, these embeddings are gathered onto every GPU via AllGather, so each GPU sees the full global batch of features.
The massive 10M-class weight matrix W (C×d) is partitioned evenly across GPUs: each GPU holds C/N of the class centers (about 156K centers with N=64 GPUs).
Instead of computing logits against all of its local classes, each GPU uses only the positive classes present in the batch plus a small random subset of negatives, controlled by the sample rate r.
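The sampling step can be sketched as follows. This is a minimal NumPy illustration of the idea, not the official implementation; the function and variable names are invented for this example:

```python
import numpy as np

def sample_classes(local_labels, num_local_classes, sample_rate, rng):
    """Keep the positive classes present in this batch, then fill the
    remaining budget (num_local_classes * sample_rate) with random negatives."""
    num_sampled = max(int(num_local_classes * sample_rate), 1)
    positives = np.unique(local_labels)                # must always be kept
    num_neg = max(num_sampled - len(positives), 0)
    # negatives: random classes not already in the positive set
    candidates = np.setdiff1d(np.arange(num_local_classes), positives)
    negatives = rng.choice(candidates, size=num_neg, replace=False)
    sampled = np.concatenate([positives, negatives])
    # remap each original label to its index inside the sampled subset
    remap = {c: i for i, c in enumerate(sampled)}
    new_labels = np.array([remap[c] for c in local_labels])
    return sampled, new_labels

rng = np.random.default_rng(0)
labels = np.array([3, 17, 3, 99])                      # local batch labels
sampled, new_labels = sample_classes(labels, num_local_classes=1000,
                                     sample_rate=0.1, rng=rng)
```

Only the `sampled` rows of the local weight shard participate in this step's forward and backward pass, which is what shrinks the logits memory from BS×C/N to BS×C·r/N.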
Cosine similarity between the L2-normalized embeddings and the sampled, normalized class centers is computed, and the ArcFace additive angular margin is applied to each sample's target logit.
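Applying the margin to the sampled logits might look like this. This is a NumPy sketch using the common ArcFace defaults (scale=64, margin=0.5); names are illustrative:

```python
import numpy as np

def arcface_logits(embeddings, centers, labels, margin=0.5, scale=64.0):
    """Cosine similarity between L2-normalized embeddings and class centers,
    with the additive angular margin cos(theta + m) applied to target logits."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    cos = e @ w.T                                      # (batch, num_sampled)
    rows = np.arange(len(labels))
    theta = np.arccos(np.clip(cos[rows, labels], -1.0, 1.0))
    cos[rows, labels] = np.cos(theta + margin)         # penalize target logit
    return scale * cos

# Tiny worked example: 2 samples, 3 sampled centers, 2-d features
emb = np.array([[1.0, 0.0], [0.0, 1.0]])
centers = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, 0.0]])
labels = np.array([0, 1])
logits = arcface_logits(emb, centers, labels)
```

The margin shrinks only the target-class logit, forcing the embedding to be closer than `margin` radians to its center before the logit recovers.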
Gradients flow backward, are synchronized via AllReduce, and only the sampled weight subset is updated.
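The final step, touching only the sampled rows of the local weight shard, can be sketched as a sparse in-place update (plain NumPy standing in for the optimizer step; in the real trainer this happens after the AllReduce):

```python
import numpy as np

num_local_classes, dim, lr = 1000, 4, 0.1
W = np.zeros((num_local_classes, dim))            # local weight shard (C/N x d)
sampled_idx = np.array([3, 17, 99])               # classes sampled this step
grad_sub = np.ones((len(sampled_idx), dim))       # gradients exist only for these rows
W[sampled_idx] -= lr * grad_sub                   # all other rows stay untouched
```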
References:
- "Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC" — arXiv:2203.15565 (CVPR 2022)
- "Partial FC: Training 10 Million Identities on a Single Machine" — arXiv:2010.05222
| | Data Parallel | Model Parallel | PartialFC (r=0.1) |
|---|---|---|---|
| Weight Memory / GPU | 19.07 GB (full C×d on every GPU) | 298 MB (C/N columns per GPU) | 298 MB (same as Model Parallel) |
| Logits Memory | BS×C = 156 GB (all classes) | BS×C/N = 2.44 GB (sharded but still large) | BS×C·r/N = 250 MB (sampled subset only) |
| Phase 1: AllGather Features | — | 384 KB / GPU | 384 KB / GPU |
| Phase 3: Softmax Comm | — | 32 KB (2 scalars / sample) | 32 KB (2 scalars / sample) |
| Phase 4: Gradient Sync | 2×\|W\| ≈ 38 GB (full AllReduce) | 505 MB (AllReduce feature grads) | 488 MB (AllReduce sampled grads) |
| Total Comm / step | ~38 GB | ~1.01 GB | ~0.98 GB |
| Throughput | OOM (cannot train 10M classes) | 4,840 img/s (64 GPUs) | 17,819 img/s (64 GPUs, 3.7× speedup) |
* Based on N=64 GPUs, BS=64/GPU, feature_dim=512, C=10M, sample_rate=0.1
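Two of the table's headline numbers can be sanity-checked directly from this configuration. Assumptions: fp32 storage (4 bytes per value), and the table's "GB" figures read as GiB (2³⁰ bytes):

```python
# Configuration from the table footnote
N, BS, d, C = 64, 64, 512, 10_000_000
BYTES = 4                                     # fp32

# Data Parallel weight memory: every GPU stores the full C x d matrix
w_full_gib = C * d * BYTES / 2**30            # ~19.07 GiB

# Distributed softmax communication: 2 scalars per sample
# (local max and local exp-sum) across the global batch of N*BS samples
softmax_kb = 2 * (N * BS) * BYTES / 1024      # 32 KB
```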