Each GPU computes feature embeddings for its local batch. To compute the global loss, these embeddings are gathered onto every GPU via AllGather, so each GPU sees the full global batch of features.
The massive 10M-class weight matrix W (C×d) is partitioned evenly across GPUs: each GPU holds C/N of the class centers (about 156K centers with N=64 GPUs).
Instead of computing logits against all of its local classes, each GPU uses only the positive classes present in the batch plus a small random subset of negatives, controlled by the sample rate r.
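The sampling step can be sketched as follows. This is a minimal NumPy illustration of the idea, not the official implementation; the function and variable names are invented for this example:

```python
import numpy as np

def sample_classes(local_labels, num_local_classes, sample_rate, rng):
    """Keep the positive classes present in this batch, then fill the
    remaining budget (num_local_classes * sample_rate) with random negatives."""
    num_sampled = max(int(num_local_classes * sample_rate), 1)
    positives = np.unique(local_labels)                # must always be kept
    num_neg = max(num_sampled - len(positives), 0)
    # negatives: random classes not already in the positive set
    candidates = np.setdiff1d(np.arange(num_local_classes), positives)
    negatives = rng.choice(candidates, size=num_neg, replace=False)
    sampled = np.concatenate([positives, negatives])
    # remap each original label to its index inside the sampled subset
    remap = {c: i for i, c in enumerate(sampled)}
    new_labels = np.array([remap[c] for c in local_labels])
    return sampled, new_labels

rng = np.random.default_rng(0)
labels = np.array([3, 17, 3, 99])                      # local batch labels
sampled, new_labels = sample_classes(labels, num_local_classes=1000,
                                     sample_rate=0.1, rng=rng)
```

Only the `sampled` rows of the local weight shard participate in this step's forward and backward pass, which is what shrinks the logits memory from BS×C/N to BS×C·r/N.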
Cosine similarity between the L2-normalized embeddings and the sampled, normalized class centers is computed, and the ArcFace additive angular margin is applied to each sample's target logit.
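Applying the margin to the sampled logits might look like this. This is a NumPy sketch using the common ArcFace defaults (scale=64, margin=0.5); names are illustrative:

```python
import numpy as np

def arcface_logits(embeddings, centers, labels, margin=0.5, scale=64.0):
    """Cosine similarity between L2-normalized embeddings and class centers,
    with the additive angular margin cos(theta + m) applied to target logits."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    cos = e @ w.T                                      # (batch, num_sampled)
    rows = np.arange(len(labels))
    theta = np.arccos(np.clip(cos[rows, labels], -1.0, 1.0))
    cos[rows, labels] = np.cos(theta + margin)         # penalize target logit
    return scale * cos

# Tiny worked example: 2 samples, 3 sampled centers, 2-d features
emb = np.array([[1.0, 0.0], [0.0, 1.0]])
centers = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, 0.0]])
labels = np.array([0, 1])
logits = arcface_logits(emb, centers, labels)
```

The margin shrinks only the target-class logit, forcing the embedding to be closer than `margin` radians to its center before the logit recovers.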
Gradients flow backward, are synchronized via AllReduce, and only the sampled weight subset is updated.
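The final step, touching only the sampled rows of the local weight shard, can be sketched as a sparse in-place update (plain NumPy standing in for the optimizer step; in the real trainer this happens after the AllReduce):

```python
import numpy as np

num_local_classes, dim, lr = 1000, 4, 0.1
W = np.zeros((num_local_classes, dim))            # local weight shard (C/N x d)
sampled_idx = np.array([3, 17, 99])               # classes sampled this step
grad_sub = np.ones((len(sampled_idx), dim))       # gradients exist only for these rows
W[sampled_idx] -= lr * grad_sub                   # all other rows stay untouched
```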
References:
- "Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC" — arXiv:2203.15565 (CVPR 2022)
- "Partial FC: Training 10 Million Identities on a Single Machine" — arXiv:2010.05222
| | Data Parallel | Model Parallel | PartialFC (r=0.1) |
|---|---|---|---|
| Weight Memory / GPU | 19.07 GB (full C×d on every GPU) | 298 MB (C/N columns per GPU) | 298 MB (same as Model Parallel) |
| Logits Memory | BS×C = 156 GB (all classes) | BS×C/N = 2.44 GB (sharded but still large) | BS×C·r/N = 250 MB (sampled subset only) |
| Phase 1: AllGather Features | — | 384 KB / GPU | 384 KB / GPU |
| Phase 3: Softmax Comm | — | 32 KB (2 scalars / sample) | 32 KB (2 scalars / sample) |
| Phase 4: Gradient Sync | 2×\|W\| ≈ 38 GB (full AllReduce) | 505 MB (AllReduce feature grads) | 488 MB (AllReduce sampled grads) |
| Total Comm / step | ~38 GB | ~1.01 GB | ~0.98 GB |
| Throughput | OOM (cannot train 10M classes) | 4,840 img/s (64 GPUs) | 17,819 img/s (64 GPUs, 3.7× speedup) |
* Based on N=64 GPUs, BS=64/GPU, feature_dim=512, C=10M, sample_rate=0.1
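Two of the table's headline numbers can be sanity-checked directly from this configuration. Assumptions: fp32 storage (4 bytes per value), and the table's "GB" figures read as GiB (2³⁰ bytes):

```python
# Configuration from the table footnote
N, BS, d, C = 64, 64, 512, 10_000_000
BYTES = 4                                     # fp32

# Data Parallel weight memory: every GPU stores the full C x d matrix
w_full_gib = C * d * BYTES / 2**30            # ~19.07 GiB

# Distributed softmax communication: 2 scalars per sample
# (local max and local exp-sum) across the global batch of N*BS samples
softmax_kb = 2 * (N * BS) * BYTES / 1024      # 32 KB
```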