Prophet Exclusive · The Leading Voice in Magical Tech

PartialFC: Training 10 Million Identities on a Single Machine

Phase 1: AllGather Embeddings

Each GPU computes feature embeddings for its local batch. To compute the global loss, embeddings from all GPUs are gathered together.
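In a PyTorch implementation this phase would typically map to `torch.distributed.all_gather`; the sketch below is a hypothetical single-process simulation (the helper name and shapes are illustrative, not the paper's code) showing the resulting data layout:

```python
import numpy as np

def all_gather(local_embeddings):
    """Simulate AllGather: every rank receives the concatenation of
    all ranks' local embedding batches."""
    gathered = np.concatenate(local_embeddings, axis=0)
    return [gathered.copy() for _ in local_embeddings]

# 4 simulated GPUs, batch of 2 per GPU, 8-dim features
rng = np.random.default_rng(0)
local = [rng.standard_normal((2, 8)) for _ in range(4)]
gathered = all_gather(local)
print(gathered[0].shape)  # every rank now holds the global batch: (8, 8)
```

After this step each GPU can compare the full global batch against its own shard of class centers.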

Phase 2: Class Center Partitioning

The massive 10M-class weight matrix is partitioned evenly across the N GPUs, so each GPU holds only C/N class centers (roughly 2.5M on a 4-GPU machine).
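A minimal sketch of the partitioning arithmetic (the helper name is hypothetical; a 4-GPU split is assumed to match the ~2.5M figure above):

```python
C, N = 10_000_000, 4   # 10M classes split across 4 GPUs (illustrative)

def local_class_range(rank, num_classes, world_size):
    """Contiguous shard of the class-center matrix owned by `rank`;
    the last rank absorbs any remainder."""
    per_gpu = num_classes // world_size
    start = rank * per_gpu
    end = num_classes if rank == world_size - 1 else start + per_gpu
    return start, end

start, end = local_class_range(0, C, N)
print(end - start)  # 2500000 class centers on GPU 0
```

A contiguous split keeps the mapping from a global class id to its owning rank a single integer division.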

Phase 3: PartialFC Sampling

Instead of computing logits against all 2.5M local classes, each GPU samples the positive classes plus a small random negative subset.
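The sampling rule can be sketched as follows (hypothetical helper `partial_fc_sample`; numpy stands in for GPU tensors): keep every positive class that lands in this rank's shard, then fill the remaining budget of `rate × num_local` columns with random negatives.

```python
import numpy as np

def partial_fc_sample(labels, local_start, local_end, num_local, rate, rng):
    """Keep all positives that fall in this rank's shard, then pad with
    randomly chosen negatives up to rate * num_local sampled columns."""
    in_shard = (labels >= local_start) & (labels < local_end)
    positives = np.unique(labels[in_shard]) - local_start   # local indices
    num_sample = max(int(rate * num_local), len(positives))
    neg_pool = np.setdiff1d(np.arange(num_local), positives)
    negatives = rng.choice(neg_pool, num_sample - len(positives), replace=False)
    return np.concatenate([positives, negatives])

rng = np.random.default_rng(0)
idx = partial_fc_sample(np.array([3, 7, 42]), 0, 1000, 1000, 0.1, rng)
print(len(idx))  # 100 sampled columns instead of all 1000
```

Positives are never dropped, so every ground-truth class still contributes to the loss; only the set of competing negatives is subsampled.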

Phase 4: Forward Pass

The cosine similarity between the normalized embeddings and the normalized sampled class centers is computed, and the ArcFace additive angular margin is applied to the target logits before the softmax cross-entropy.
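A numpy sketch of this forward step (illustrative function and default hyperparameters s=64, m=0.5 are common ArcFace choices, assumed here rather than taken from this article):

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, s=64.0, m=0.5):
    """Cosine similarity of L2-normalized features vs. class centers,
    with an additive angular margin m on the target class, scaled by s."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    W = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = E @ W.T                                   # (batch, num_sampled)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    target = np.zeros_like(cos, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    cos = np.where(target, np.cos(theta + m), cos)  # margin on positives only
    return s * cos

rng = np.random.default_rng(0)
logits = arcface_logits(rng.standard_normal((4, 16)),   # 4 embeddings
                        rng.standard_normal((10, 16)),  # 10 sampled centers
                        np.array([0, 1, 2, 3]))
print(logits.shape)  # (4, 10)
```

Because both operands are normalized, the dot product is exactly cos θ, which is what the angular margin operates on.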

Phase 5: Backward & Update

Gradients flow backward, are synchronized via AllReduce, and only the sampled weight subset is updated.
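The two halves of this phase, synchronizing the shared feature gradients and touching only the sampled class centers, can be sketched in a single-process simulation (helper names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def all_reduce_mean(grads):
    """Simulate AllReduce on the feature gradients: average across ranks,
    so every rank ends up with the same synchronized tensor."""
    mean = np.mean(grads, axis=0)
    return [mean.copy() for _ in grads]

synced = all_reduce_mean([rng.standard_normal((2, 8)) for _ in range(4)])

# Sparse update: only the sampled class centers receive gradients.
W_local = np.zeros((1000, 8))            # this rank's shard of W
sampled_idx = np.array([3, 7, 42])       # columns picked in the sampling phase
grad_W = rng.standard_normal((3, 8))     # gradients exist only for sampled rows
W_local[sampled_idx] -= 0.1 * grad_W     # plain SGD step on the sampled rows
print(np.count_nonzero(W_local.any(axis=1)))  # 3 rows changed, rest untouched
```

The unsampled rows of W never appear in the graph, which is what keeps both the optimizer state traffic and the gradient AllReduce small.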

- 10M+ Identities · Scalable Training
- 4× Memory Reduction · Partial Sampling
- No Accuracy Loss · Proven at Scale
# PartialFC Training Step
embeddings = AllGather(local_embeddings)                 # Phase 1: gather the global batch
sampled_W = sample(W_local, positive_ids)                # Phase 3: positives + random negatives
logits = normalize(embeddings) @ normalize(sampled_W).T  # Phase 4: cosine similarity
loss = ArcFace_CrossEntropy(logits, labels)              # Phase 4: angular margin + softmax CE
loss.backward()                                          # Phase 5
AllReduce(gradients)                                     # Phase 5: sync across GPUs
optimizer.step()                                         # Phase 5: only sampled W updated

References:
- "Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC" — arXiv:2203.15565 (CVPR 2022)
- "Partial FC: Training 10 Million Identities on a Single Machine" — arXiv:2010.05222

Communication & Memory Comparison

| | Data Parallel | Model Parallel | PartialFC (r=0.1) |
|---|---|---|---|
| Weight memory / GPU | W: 19.07 GB (full C×d on every GPU) | W: 298 MB (C/N columns per GPU) | W: 298 MB (same as Model Parallel) |
| Logits memory | BS×C = 156 GB (all classes) | BS×C/N = 2.44 GB (sharded but still large) | BS×C·r/N = 250 MB (sampled subset only) |
| Phase 1: AllGather features | n/a | 384 KB / GPU | 384 KB / GPU |
| Phase 4: softmax comm | n/a | 32 KB (2 scalars / sample) | 32 KB (2 scalars / sample) |
| Phase 5: gradient sync | 2×\|W\| ≈ 38 GB (full AllReduce) | 505 MB (AllReduce feature grads) | 488 MB (AllReduce sampled grads) |
| Total comm / step | ~38 GB | ~1.01 GB | ~0.98 GB |
| Throughput | OOM (cannot train 10M classes) | 4,840 img/s (64 GPUs) | 17,819 img/s (64 GPUs, 3.7× speedup) |

* Based on N=64 GPUs, BS=64/GPU (global batch 4,096), feature_dim=512, C=10M, sample_rate=0.1
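Under these settings, several of the headline numbers fall out of a quick back-of-the-envelope check (fp32, 4 bytes per element, binary GB/KB; this is my arithmetic, not code from the article):

```python
GiB, KiB = 1024**3, 1024
N, BS_per_gpu, d, C = 64, 64, 512, 10_000_000
BS = N * BS_per_gpu                  # global batch = 4096

w_bytes = C * d * 4                  # fp32 class-center matrix W
print(round(w_bytes / GiB, 2))       # 19.07 -> data-parallel W per GPU
print(round(w_bytes / N / GiB, 3))   # 0.298 -> per-GPU shard (the "298 MB")
print(BS * 2 * 4 // KiB)             # 32    -> softmax comm: 2 scalars / sample
print(round(2 * w_bytes / GiB, 1))   # 38.1  -> DP gradient AllReduce, ~2x |W|
```

The "2 scalars per sample" term is what makes the model-parallel softmax so cheap: each rank only exchanges a local max and a local partial sum per sample instead of full logit rows.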