
IPDPS 2026

The Big Send-off: Scalable and Performant
Collectives for Deep Learning

Siddharth Singh
University of Maryland
ssingh37@umd.edu
Keshav Pradeep
University of Maryland
keshprad@umd.edu
Mahua Singh
IIT Guwahati
s.mahua@iitg.ac.in
Cunyang Wei
University of Maryland
cunyang@umd.edu
Abhinav Bhatele
University of Maryland
bhatele@cs.umd.edu
Keywords: Collective Communication · Distributed Deep Learning · GPU Supercomputers · Hierarchical Algorithms · Adaptive Dispatching · LLM Training

Abstract

Collective communication is becoming increasingly important in supercomputer workloads as the share of AI-related jobs grows. However, existing libraries that provide collective support — such as NCCL, RCCL, and Cray-MPICH — exhibit several performance and scalability limitations on modern GPU supercomputers. To address these challenges, we introduce the Performant Collective Communication Library (PCCL), targeted specifically at distributed deep learning (DL) workloads. PCCL provides highly optimized implementations of the key collectives used in distributed DL: all-gather, reduce-scatter, and all-reduce. PCCL uses hierarchical algorithms and adaptive dispatching to scale efficiently to thousands of GPUs. It achieves substantial speedups over RCCL on 2048 GCDs of Frontier — up to 168× for reduce-scatter, 33× for all-gather, and 10× for all-reduce. More modest but still significant gains of up to 5.7× are observed on Perlmutter. These gains translate directly into end-to-end improvements for production DL workloads: up to 4.9× speedup over RCCL in DeepSpeed ZeRO-3 training, and up to 2.4× speedup over RCCL in DDP training.

  • 168× speedup over RCCL for reduce-scatter (Frontier)
  • 33× speedup over RCCL for all-gather (Frontier)
  • 4.9× end-to-end speedup with DeepSpeed ZeRO-3
  • Evaluated at scale on up to 2048 GPUs/GCDs

Motivation

Communication overheads in parallel applications consume an increasing fraction of overall execution time as applications scale to more nodes and GPUs. In deep learning (DL), communication is typically composed of collective operations such as all-gather, all-reduce, and reduce-scatter.

Message sizes in DL applications are significantly larger than in traditional HPC, often tens to hundreds of megabytes, and current collective libraries struggle to maintain high performance in this regime.

We identify three bottlenecks in existing libraries:

  • Ring algorithm bottleneck: RCCL and Cray-MPICH rely on ring algorithms whose latency grows linearly with process count, causing degradation at scale (the standard cost model below makes this precise).
  • NIC underutilization: existing libraries fail to fully utilize all available Network Interface Cards (NICs), leaving inter-node bandwidth on the table.
  • CPU-bound reductions: Cray-MPICH performs reductions on the CPU, creating a compute bottleneck that limits effective throughput.
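To see why the latency term dominates at scale, consider the standard α-β cost model from the MPI literature (α: per-message latency, β: per-byte transfer time) for an all-gather that produces an m-byte result on P processes. This is textbook analysis, not taken from the paper:

% Standard alpha-beta cost model for all-gather on P processes.
% Ring: P-1 steps, each moving m/P bytes.
% Recursive doubling: log2(P) steps, same total bytes moved.
T_{\mathrm{ring}}(m, P) = (P - 1)\,\alpha + \tfrac{P-1}{P}\, m\,\beta
T_{\mathrm{rec}}(m, P)  = \log_2(P)\,\alpha + \tfrac{P-1}{P}\, m\,\beta

Both algorithms move the same (P−1)/P · m bytes; only the latency term differs, growing linearly with P for the ring but logarithmically for recursive doubling. At P = 2048, that is 2047 latency hops versus 11.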

Figure: All-gather performance with RCCL (Frontier), Cray-MPICH (Frontier), and NCCL (Perlmutter) for 64 and 128 MB output buffers; no library achieves the ideal (flat horizontal) scaling behavior.

Design of PCCL

PCCL introduces three core innovations to overcome the limitations of existing communication libraries for large-message, large-scale DL workloads.

Hierarchical Algorithms

A two-level design separates collective operations into inter-node and intra-node phases using sub-communicators. GPU-vendor libraries (NCCL/RCCL) handle intra-node communication, while Cray-MPICH handles the inter-node phase, maximizing NIC utilization by scheduling all inter-node transfers concurrently. For all-gather, the steps are as follows (a sketch in code follows the list):

  1. Inter-node all-gather across sub-communicators
  2. Intra-node all-gather within each node
  3. Device-local transpose/shuffle kernel
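As a rough illustration of the two-level structure, here is a minimal host-side sketch. This is not PCCL's actual code: the function name and float payloads are ours, ranks are assumed node-contiguous, and PCCL performs the intra-node phase with NCCL/RCCL and the transpose with a GPU kernel rather than on the host.

// Minimal host-side sketch of a two-level hierarchical all-gather.
// Assumes node-contiguous global ranks and float payloads.
#include <mpi.h>
#include <algorithm>
#include <vector>

void hierarchical_allgather(const float* sendbuf, float* recvbuf,
                            int count, MPI_Comm world) {
  int rank, size;
  MPI_Comm_rank(world, &rank);
  MPI_Comm_size(world, &size);

  // Intra-node communicator: all ranks sharing a node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, rank,
                      MPI_INFO_NULL, &node_comm);
  int local_rank, local_size;
  MPI_Comm_rank(node_comm, &local_rank);
  MPI_Comm_size(node_comm, &local_size);

  // Inter-node communicator: one rank per node with the same local rank.
  // Each local rank drives its own NIC, so all inter-node all-gathers
  // below proceed concurrently.
  MPI_Comm cross_comm;
  MPI_Comm_split(world, local_rank, rank, &cross_comm);
  int num_nodes;
  MPI_Comm_size(cross_comm, &num_nodes);

  // Step 1: concurrent inter-node all-gathers.
  std::vector<float> tmp((size_t)count * num_nodes);
  MPI_Allgather(sendbuf, count, MPI_FLOAT,
                tmp.data(), count, MPI_FLOAT, cross_comm);

  // Step 2: intra-node all-gather of the node-gathered chunks.
  MPI_Allgather(tmp.data(), count * num_nodes, MPI_FLOAT,
                recvbuf, count * num_nodes, MPI_FLOAT, node_comm);

  // Step 3: blocks now sit in (local_rank, node) order; transpose them
  // into global-rank order. PCCL does this with a device-local kernel.
  std::vector<float> out((size_t)count * size);
  for (int l = 0; l < local_size; ++l)
    for (int n = 0; n < num_nodes; ++n) {
      const float* src = recvbuf + ((size_t)l * num_nodes + n) * count;
      float* dst = out.data() + ((size_t)n * local_size + l) * count;
      std::copy(src, src + count, dst);
    }
  std::copy(out.begin(), out.end(), recvbuf);

  MPI_Comm_free(&node_comm);
  MPI_Comm_free(&cross_comm);
}

Because each of the M local ranks runs its inter-node all-gather on its own sub-communicator, all M NICs on a node are driven simultaneously, which is the source of the bandwidth gains described above.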

Custom Ring & Recursive Algorithms

Because Cray-MPICH provides only the ring algorithm, PCCL implements its own inter-node algorithms (the recursive variant is sketched after the list):

  • PCCL_ring — optimized ring-based all-gather/reduce-scatter with pipelined overlap of transfers and CUDA/ROCm kernels for GPU-side reductions.
  • PCCL_rec — recursive doubling/halving with O(log P) latency, dramatically improving scalability at large process counts.
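A minimal sketch of the recursive-doubling idea behind PCCL_rec, on host buffers for brevity. This is hypothetical code, not PCCL's: it assumes a power-of-two process count and omits the pipelining and GPU-side work of the real implementation.

// Recursive-doubling all-gather: log2(P) exchange steps instead of the
// ring's P-1. At distance d, a rank swaps its d contiguous blocks with
// partner (rank XOR d), doubling the data it holds each step.
#include <mpi.h>
#include <cstring>

void recdoubling_allgather(const float* sendbuf, float* recvbuf,
                           int count, MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  // Start with our own block placed at our own offset.
  std::memcpy(recvbuf + (size_t)rank * count, sendbuf,
              (size_t)count * sizeof(float));

  for (int dist = 1; dist < size; dist <<= 1) {
    int partner = rank ^ dist;
    // Offsets of the dist-block groups currently held by each side.
    size_t my_off   = (size_t)(rank    / dist) * dist * count;
    size_t peer_off = (size_t)(partner / dist) * dist * count;
    // Exchange my dist blocks for my partner's dist blocks.
    MPI_Sendrecv(recvbuf + my_off,   dist * count, MPI_FLOAT, partner, 0,
                 recvbuf + peer_off, dist * count, MPI_FLOAT, partner, 0,
                 comm, MPI_STATUS_IGNORE);
  }
}

The same total bytes move as in a ring, but the latency term shrinks from (P−1)·α to log2(P)·α, which matches the scaling advantage PCCL_rec shows at high process counts.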

ML-guided Adaptive Dispatching

An SVM-based classifier selects the best-performing backend (PCCL_ring, PCCL_rec, or the vendor library) at runtime, based on message size, GPU count, and observed system state. This keeps performance near-optimal across the full parameter space — from bandwidth-bound large messages to latency-sensitive scenarios at high process counts. A simplified stand-in for this dispatch logic is sketched below.
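For intuition only: the following hypothetical dispatch layer uses the same features the paper names (message size, GPU count), but the coefficients and thresholds are invented placeholders, not the paper's trained SVM, and Backend/select_backend are names we made up.

// Hypothetical dispatch sketch. Picks a backend from message size and
// GPU count; the linear decision functions stand in for SVM weights
// and are entirely made up.
#include <cmath>
#include <cstddef>

enum class Backend { PcclRing, PcclRec, Vendor };

Backend select_backend(std::size_t msg_bytes, int num_gpus) {
  const double f_size = std::log2((double)msg_bytes);  // bandwidth regime
  const double f_gpus = std::log2((double)num_gpus);   // latency regime

  // Invented linear decision boundaries (placeholder coefficients).
  const double pccl_vs_vendor = 0.5 * f_size + 0.9 * f_gpus - 14.0;
  const double ring_vs_rec    = 0.8 * f_size - 1.2 * f_gpus + 3.0;

  if (pccl_vs_vendor < 0.0)
    return Backend::Vendor;                       // small message / few GPUs
  return (ring_vs_rec > 0.0) ? Backend::PcclRing  // bandwidth-bound
                             : Backend::PcclRec;  // latency-sensitive
}

In a real system the decision function would be trained offline on measured collective timings and consulted once per (collective, message size, communicator) tuple, so dispatch adds negligible runtime overhead.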

Figure: Two-level hierarchical all-gather in PCCL for a system with N nodes and M GPUs per node. Step 1: concurrent inter-node all-gathers; Step 2: intra-node all-gather; Step 3: device-local transpose.
Figure: ML-guided dispatch in PCCL, choosing the best-performing backend (PCCL_ring, PCCL_rec, or the vendor library) at runtime based on message size and GPU count.

Experimental Platforms

Frontier (AMD)
  • AMD MI250X GPUs (8 GCDs / node)
  • Dragonfly network topology (Slingshot)
  • Scale: 32 – 2048 GCDs (4 – 256 nodes)
  • Baselines: RCCL and Cray-MPICH vs. PCCL
Perlmutter (NVIDIA)
  • NVIDIA A100 GPUs (4 GPUs / node)
  • Dragonfly network topology (Slingshot)
  • Scale: 32 – 2048 GPUs (8 – 512 nodes)
  • Baselines: NCCL and Cray-MPICH vs. PCCL

Performance Results

On Frontier, PCCL dramatically outperforms RCCL and Cray-MPICH for large-message all-gather, reduce-scatter, and all-reduce operations. Speedups increase with process count, reaching up to 33× over RCCL for all-gather and 168× for reduce-scatter at 2048 GCDs.

Figure: All-gather on Frontier (256 & 512 MB buffers).
Figure: Reduce-scatter on Frontier (256 & 512 MB buffers).
Figure: Speedup heatmap, PCCL vs. RCCL all-gather (Frontier).
Figure: Speedup heatmap, PCCL vs. RCCL reduce-scatter (Frontier).

Peak speedups over RCCL at 2048 GCDs: 33× for all-gather, 168× for reduce-scatter, and 10× for all-reduce.

On Perlmutter with NVIDIA A100 GPUs, PCCL achieves up to 5.7× speedup over NCCL and 15× over Cray-MPICH for all-gather and reduce-scatter. All-reduce performance is comparable to NCCL's, since both libraries use logarithmic-latency algorithms.

Figure: Speedup heatmap, PCCL vs. NCCL all-gather (Perlmutter).
Figure: Speedup heatmap, PCCL vs. NCCL all-reduce (Perlmutter).

Peak all-gather speedups on Perlmutter: 5.7× over NCCL and 15× over Cray-MPICH.

These collective speedups translate directly into faster large-scale DL training. We benchmark GPT-3-style models (7B and 13B parameters) with DeepSpeed ZeRO-3, and a 1.3B-parameter model with PyTorch DDP, on both Frontier and Perlmutter.

Figure: DeepSpeed ZeRO-3 strong scaling of GPT-3 7B & 13B on Frontier.
Figure: DeepSpeed ZeRO-3 strong scaling of GPT-3 7B & 13B on Perlmutter.
Figure: PyTorch DDP strong scaling of GPT-3 1.3B on Frontier.
Workload            System       Model        Max speedup over RCCL/NCCL
DeepSpeed ZeRO-3    Frontier     GPT-3 13B    4.9×
DeepSpeed ZeRO-3    Frontier     GPT-3 7B     3.6×
DeepSpeed ZeRO-3    Perlmutter   GPT-3 7B     1.37×
PyTorch DDP         Frontier     GPT-3 1.3B   2.4×

Key Contributions

1. Empirical Analysis of Existing Libraries

Systematic analysis of Cray-MPICH, NCCL, and RCCL on Perlmutter and Frontier for all-gather and reduce-scatter in parallel deep learning workloads, revealing specific performance bottlenecks.

2. PCCL: A New Collective Communication Library

Highly optimized implementations of all-gather, reduce-scatter, and all-reduce in a new library focused on large messages (>10 MB) and large GPU counts, using hierarchical algorithms and CUDA/ROCm kernels.

3. Substantial Performance Improvements

6–160× speedups over RCCL and 28–70× over Cray-MPICH for all-gather on 2048 GCDs of Frontier, with consistent improvements across message sizes and GPU architectures.

4. End-to-end DL Training Validation

Large-scale benchmarking of multi-billion-parameter LLM training with DeepSpeed ZeRO-3 and PyTorch DDP, demonstrating real-world impact of collective communication improvements on training throughput.

BibTeX

@inproceedings{singh2026pccl,
  title     = {The Big Send-off: Scalable and Performant
               Collectives for Deep Learning},
  author    = {Singh, Siddharth and Pradeep, Keshav
               and Singh, Mahua and Wei, Cunyang
               and Bhatele, Abhinav},
  booktitle = {Proceedings of the IEEE International Parallel and
               Distributed Processing Symposium (IPDPS)},
  year      = {2026},
}