IPDPS 2026
Collective communication is becoming increasingly important in supercomputer workloads as the share of AI-related jobs grows. However, existing libraries that provide collective support, such as NCCL, RCCL, and Cray-MPICH, exhibit several performance and scalability limitations on modern GPU supercomputers. To address these challenges, we introduce the Performant Collective Communication Library (PCCL), targeted specifically at distributed deep learning (DL) workloads. PCCL provides highly optimized implementations of the key collectives used in distributed DL: all-gather, reduce-scatter, and all-reduce. PCCL uses hierarchical algorithms and adaptive dispatching to scale efficiently to thousands of GPUs. It achieves substantial speedups over RCCL on 2048 GCDs of Frontier: up to 168× for reduce-scatter, 33× for all-gather, and 10× for all-reduce. More modest but still significant gains of up to 5.7× are observed on Perlmutter. These gains translate directly into end-to-end performance improvements for production DL workloads: up to 4.9× speedup over RCCL in DeepSpeed ZeRO-3 training, and up to 2.4× speedup over RCCL in DDP training.
Communication overheads in parallel applications consume an increasing fraction of the overall execution time as applications scale to more nodes and GPUs. In deep learning (DL), communication is typically composed of collective operations such as all-gather, all-reduce, and reduce-scatter.
The message sizes in DL applications are significantly larger than in traditional HPC — from tens to hundreds of MBs — and current implementations of collective libraries struggle to maintain high performance in this regime.
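For readers less familiar with these operations, here is a minimal pure-Python sketch of what each collective computes. This models semantics only, with no actual communication or library calls; buffer contents are illustrative.

```python
# Pure-Python sketch of the three collectives' semantics (no MPI/NCCL here):
# each "rank" holds one local buffer, and we compute what every rank ends
# up with after the collective completes.

def all_gather(bufs):
    """Every rank receives the concatenation of all ranks' buffers."""
    gathered = [x for buf in bufs for x in buf]
    return [list(gathered) for _ in bufs]

def reduce_scatter(bufs):
    """Buffers are summed element-wise; rank i keeps chunk i of the result."""
    p = len(bufs)
    total = [sum(vals) for vals in zip(*bufs)]
    chunk = len(total) // p
    return [total[i * chunk:(i + 1) * chunk] for i in range(p)]

def all_reduce(bufs):
    """Every rank receives the full element-wise sum
    (equivalent to reduce-scatter followed by all-gather)."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]

# Example: 2 ranks, 4 elements each.
bufs = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(all_gather(bufs)[0])    # [1, 2, 3, 4, 10, 20, 30, 40]
print(reduce_scatter(bufs))   # [[11, 22], [33, 44]]
print(all_reduce(bufs)[0])    # [11, 22, 33, 44]
```

In DL training these buffers are gradient or parameter shards of tens to hundreds of MBs, which is the regime PCCL targets.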
Existing libraries suffer from three specific bottlenecks in this regime:

- RCCL and Cray-MPICH rely on ring algorithms whose latency grows linearly with the process count, causing degradation at scale.
- Existing libraries fail to fully utilize all available Network Interface Cards (NICs), leaving inter-node bandwidth on the table.
- Cray-MPICH performs reductions on the CPU, creating a compute bottleneck that limits effective throughput.
PCCL introduces three core innovations to overcome the limitations of existing communication libraries for large-message, large-scale DL workloads.
A two-level design separates collective operations into inter-node and intra-node phases using sub-communicators. GPU-vendor libraries (NCCL/RCCL) handle intra-node communication, while Cray-MPICH handles inter-node phases, maximizing NIC utilization by scheduling all inter-node transfers concurrently.
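The two-level split can be sketched as follows. `split_communicators` is a hypothetical helper, not PCCL's API; it mirrors what an `MPI_Comm_split` by node id and by local rank produces.

```python
# Sketch of the two-level communicator layout (hypothetical helper, not the
# PCCL API): world ranks are split into intra-node groups (color = node id)
# and inter-node groups (color = local rank), as MPI_Comm_split would.

def split_communicators(world_size, gpus_per_node):
    intra = {}  # node id -> ranks sharing that node (vendor library phase)
    inter = {}  # local rank -> one rank per node (MPI phase; one inter-node
                # group per GPU/NIC keeps all NICs busy concurrently)
    for rank in range(world_size):
        node, local = divmod(rank, gpus_per_node)
        intra.setdefault(node, []).append(rank)
        inter.setdefault(local, []).append(rank)
    return intra, inter

intra, inter = split_communicators(world_size=16, gpus_per_node=8)
print(intra[0])  # [0, 1, 2, 3, 4, 5, 6, 7] -- ranks on node 0
print(inter[0])  # [0, 8] -- one rank per node for local rank 0
```

With 8 GPUs per node, this yields 8 concurrent inter-node groups, which is how all inter-node transfers can be scheduled at once across all NICs.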
Because Cray-MPICH only provides the ring algorithm, PCCL implements its own inter-node algorithms:

- PCCL_ring: an optimized ring-based all-gather/reduce-scatter with pipeline overlap and CUDA/ROCm kernels for GPU-side reductions.
- PCCL_rec: recursive doubling/halving with O(log P) latency, dramatically improving scalability at large process counts.
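To illustrate why recursive doubling scales better, here is a plain-Python simulation of a recursive-doubling all-gather, the standard formulation of the pattern behind PCCL_rec (a semantic sketch, not PCCL's implementation): in round k, rank r exchanges everything it has accumulated so far with partner r XOR 2^k, so the data held doubles each round and P ranks finish in log2(P) rounds instead of the P-1 steps a ring needs.

```python
# Simulation of recursive-doubling all-gather on P ranks (P a power of two).
# Each round, every rank swaps its accumulated chunks with a partner at
# distance 2^k, so coverage doubles per round: O(log P) rounds total.

def recursive_doubling_all_gather(chunks):
    p = len(chunks)                                # power of two for this sketch
    have = {r: {r: chunks[r]} for r in range(p)}   # rank -> {source: chunk}
    rounds, dist = 0, 1
    while dist < p:
        nxt = {r: dict(have[r]) for r in range(p)}
        for r in range(p):
            partner = r ^ dist                     # pairwise exchange partner
            nxt[r].update(have[partner])           # receive partner's chunks
        have = nxt
        dist *= 2
        rounds += 1
    # Assemble each rank's buffer in source-rank order.
    return [[x for s in range(p) for x in have[r][s]] for r in range(p)], rounds

result, rounds = recursive_doubling_all_gather([[i] for i in range(8)])
print(rounds)     # 3 rounds for 8 ranks (a ring would take 7 steps)
print(result[5])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

At 2048 GCDs the gap is 11 rounds versus 2047 ring steps, which is the scalability difference the O(log P) claim refers to.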
An SVM-based classifier selects the best-performing backend (PCCL_ring, PCCL_rec, or the vendor library) at runtime, based on message size, GPU count, and observed system state. This ensures optimal performance across the full parameter space, from bandwidth-bound large messages to latency-sensitive scenarios at high process counts.
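To make the dispatch idea concrete, the sketch below uses a one-vs-rest linear decision function, which is what a trained linear SVM reduces to at inference time. All feature choices, weights, and thresholds here are invented for illustration and are not PCCL's trained model.

```python
import math

# Hypothetical linear decision functions (one-vs-rest): the features are
# log2(message bytes) and log2(GPU count); the backend with the highest
# score wins. All weights are invented for this sketch, not PCCL's model.
WEIGHTS = {
    "vendor":    (-1.0,  0.0,  24.0),  # small messages: defer to NCCL/RCCL
    "PCCL_ring": ( 0.5, -1.0,  -4.0),  # large messages, modest GPU counts
    "PCCL_rec":  ( 0.3,  1.0, -12.0),  # large messages at high GPU counts
}

def choose_backend(msg_bytes, num_gpus):
    s, g = math.log2(msg_bytes), math.log2(num_gpus)
    return max(WEIGHTS,
               key=lambda n: WEIGHTS[n][0] * s + WEIGHTS[n][1] * g + WEIGHTS[n][2])

print(choose_backend(64 * 1024, 8))             # vendor
print(choose_backend(128 * 1024 * 1024, 8))     # PCCL_ring
print(choose_backend(128 * 1024 * 1024, 2048))  # PCCL_rec
```

Because the classifier is just a few dot products, this selection adds negligible overhead per collective call.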
On Frontier, PCCL dramatically outperforms RCCL and Cray-MPICH for large-message all-gather, reduce-scatter, and all-reduce operations. Speedups increase with process count, reaching up to 33× over RCCL for all-gather and 168× for reduce-scatter at 2048 GCDs.
On Perlmutter with NVIDIA A100 GPUs, PCCL achieves up to 5.7× speedup over NCCL and 15× over Cray-MPICH for all-gather and reduce-scatter. All-reduce performance is comparable to NCCL as both use log-latency algorithms.
The collective speedups translate directly to faster large-scale DL training. We benchmark GPT-3-style models (7B and 13B parameters) with DeepSpeed ZeRO-3 and a 1.3B-parameter model with PyTorch DDP on both Frontier and Perlmutter.
| Workload | System | Model | Max Speedup over RCCL/NCCL |
|---|---|---|---|
| DeepSpeed ZeRO-3 | Frontier | GPT-3 13B | 4.9× |
| DeepSpeed ZeRO-3 | Frontier | GPT-3 7B | 3.6× |
| DeepSpeed ZeRO-3 | Perlmutter | GPT-3 7B | 1.37× |
| PyTorch DDP | Frontier | GPT-3 1.3B | 2.4× |
- Systematic analysis of Cray-MPICH, NCCL, and RCCL on Perlmutter and Frontier for all-gather and reduce-scatter in parallel deep learning workloads, revealing specific performance bottlenecks.
- Highly optimized implementations of all-gather, reduce-scatter, and all-reduce in a new library focused on large messages (>10 MB) and large GPU counts, using hierarchical algorithms and CUDA/ROCm kernels.
- 6–160× speedups over RCCL and 28–70× over Cray-MPICH for all-gather on 2048 GCDs of Frontier, with consistent improvements across message sizes and GPU architectures.
- Large-scale benchmarking of multi-billion-parameter LLM training with DeepSpeed ZeRO-3 and PyTorch DDP, demonstrating the real-world impact of collective communication improvements on training throughput.
@inproceedings{singh2026pccl,
  title     = {The Big Send-off: Scalable and Performant Collectives for Deep Learning},
  author    = {Singh, Siddharth and Pradeep, Keshav and Singh, Mahua and Wei, Cunyang and Bhatele, Abhinav},
  booktitle = {Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
  year      = {2026},
}