
IPDPS 2026

The Big Send-off: Scalable and Performant
Collectives for Deep Learning

Siddharth Singh
University of Maryland
ssingh37@umd.edu
Keshav Pradeep
University of Maryland
keshprad@umd.edu
Mahua Singh
IIT Guwahati
s.mahua@iitg.ac.in
Cunyang Wei
University of Maryland
cunyang@umd.edu
Abhinav Bhatele
University of Maryland
bhatele@cs.umd.edu
Keywords: Collective Communication · Distributed Deep Learning · GPU Supercomputers · Hierarchical Algorithms · Adaptive Dispatching · LLM Training

Abstract

Collective communication is becoming increasingly important in supercomputer workloads as the share of AI-related jobs grows. However, existing libraries that provide collective support — such as NCCL, RCCL, and Cray-MPICH — exhibit several performance and scalability limitations on modern GPU supercomputers. To address these challenges, we introduce the Performant Collective Communication Library (PCCL), targeted specifically at distributed deep learning (DL) workloads. PCCL provides highly optimized implementations of the key collectives used in distributed DL: all-gather, reduce-scatter, and all-reduce. PCCL uses hierarchical algorithms and adaptive dispatching to scale efficiently to thousands of GPUs. It achieves substantial speedups over RCCL on 2048 GCDs of Frontier — up to 168× for reduce-scatter, 33× for all-gather, and 10× for all-reduce. More modest but still significant gains of up to 5.7× are observed on Perlmutter. These gains translate directly into end-to-end improvements for production DL workloads: up to 4.9× speedup over RCCL in DeepSpeed ZeRO-3 training, and up to 2.4× speedup over RCCL in DDP training.

  • 168× speedup over RCCL for reduce-scatter (Frontier)
  • 33× speedup over RCCL for all-gather (Frontier)
  • 4.9× end-to-end speedup with DeepSpeed ZeRO-3
  • Evaluated at scale on up to 2048 GPUs/GCDs

Motivation

Communication overheads in parallel applications consume an increasing fraction of overall execution time as applications scale to more nodes and GPUs. In deep learning (DL), communication is typically composed of collective operations such as all-gather, all-reduce, and reduce-scatter.

Message sizes in DL applications are significantly larger than in traditional HPC, often tens to hundreds of megabytes, and current collective libraries struggle to maintain high performance in this regime.

We identify three bottlenecks in existing libraries:

  • Ring algorithm bottleneck: RCCL and Cray-MPICH rely on ring algorithms whose latency grows linearly with process count, causing degradation at scale (the standard cost model below makes this precise).
  • NIC underutilization: existing libraries fail to fully utilize all available Network Interface Cards (NICs), leaving inter-node bandwidth on the table.
  • CPU-bound reductions: Cray-MPICH performs reductions on the CPU, creating a compute bottleneck that limits effective throughput.
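To see why the latency term dominates at scale, consider the standard α-β cost model from the MPI literature (α: per-message latency, β: per-byte transfer time) for an all-gather that produces an m-byte result on P processes. This is textbook analysis, not taken from the paper:

% Standard alpha-beta cost model for all-gather on P processes.
% Ring: P-1 steps, each moving m/P bytes.
% Recursive doubling: log2(P) steps, same total bytes moved.
T_{\mathrm{ring}}(m, P) = (P - 1)\,\alpha + \tfrac{P-1}{P}\, m\,\beta
T_{\mathrm{rec}}(m, P)  = \log_2(P)\,\alpha + \tfrac{P-1}{P}\, m\,\beta

Both algorithms move the same (P−1)/P · m bytes; only the latency term differs, growing linearly with P for the ring but logarithmically for recursive doubling. At P = 2048, that is 2047 latency hops versus 11.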

Figure: All-gather performance with RCCL (Frontier), Cray-MPICH (Frontier), and NCCL (Perlmutter) for 64 and 128 MB output buffers; no library achieves the ideal (flat horizontal) scaling behavior.

Design of PCCL

PCCL introduces three core innovations to overcome the limitations of existing communication libraries for large-message, large-scale DL workloads.

Hierarchical Algorithms

A two-level design separates collective operations into inter-node and intra-node phases using sub-communicators. GPU-vendor libraries (NCCL/RCCL) handle intra-node communication, while Cray-MPICH handles the inter-node phase, maximizing NIC utilization by scheduling all inter-node transfers concurrently. For all-gather, the steps are as follows (a sketch in code follows the list):

  1. Inter-node all-gather across sub-communicators
  2. Intra-node all-gather within each node
  3. Device-local transpose/shuffle kernel
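As a rough illustration of the two-level structure, here is a minimal host-side sketch. This is not PCCL's actual code: the function name and float payloads are ours, ranks are assumed node-contiguous, and PCCL performs the intra-node phase with NCCL/RCCL and the transpose with a GPU kernel rather than on the host.

// Minimal host-side sketch of a two-level hierarchical all-gather.
// Assumes node-contiguous global ranks and float payloads.
#include <mpi.h>
#include <algorithm>
#include <vector>

void hierarchical_allgather(const float* sendbuf, float* recvbuf,
                            int count, MPI_Comm world) {
  int rank, size;
  MPI_Comm_rank(world, &rank);
  MPI_Comm_size(world, &size);

  // Intra-node communicator: all ranks sharing a node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, rank,
                      MPI_INFO_NULL, &node_comm);
  int local_rank, local_size;
  MPI_Comm_rank(node_comm, &local_rank);
  MPI_Comm_size(node_comm, &local_size);

  // Inter-node communicator: one rank per node with the same local rank.
  // Each local rank drives its own NIC, so all inter-node all-gathers
  // below proceed concurrently.
  MPI_Comm cross_comm;
  MPI_Comm_split(world, local_rank, rank, &cross_comm);
  int num_nodes;
  MPI_Comm_size(cross_comm, &num_nodes);

  // Step 1: concurrent inter-node all-gathers.
  std::vector<float> tmp((size_t)count * num_nodes);
  MPI_Allgather(sendbuf, count, MPI_FLOAT,
                tmp.data(), count, MPI_FLOAT, cross_comm);

  // Step 2: intra-node all-gather of the node-gathered chunks.
  MPI_Allgather(tmp.data(), count * num_nodes, MPI_FLOAT,
                recvbuf, count * num_nodes, MPI_FLOAT, node_comm);

  // Step 3: blocks now sit in (local_rank, node) order; transpose them
  // into global-rank order. PCCL does this with a device-local kernel.
  std::vector<float> out((size_t)count * size);
  for (int l = 0; l < local_size; ++l)
    for (int n = 0; n < num_nodes; ++n) {
      const float* src = recvbuf + ((size_t)l * num_nodes + n) * count;
      float* dst = out.data() + ((size_t)n * local_size + l) * count;
      std::copy(src, src + count, dst);
    }
  std::copy(out.begin(), out.end(), recvbuf);

  MPI_Comm_free(&node_comm);
  MPI_Comm_free(&cross_comm);
}

Because each of the M local ranks runs its inter-node all-gather on its own sub-communicator, all M NICs on a node are driven simultaneously, which is the source of the bandwidth gains described above.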

Custom Ring & Recursive Algorithms

Because Cray-MPICH provides only the ring algorithm, PCCL implements its own inter-node algorithms (the recursive variant is sketched after the list):

  • PCCL_ring — optimized ring-based all-gather/reduce-scatter with pipelined overlap of transfers and CUDA/ROCm kernels for GPU-side reductions.
  • PCCL_rec — recursive doubling/halving with O(log P) latency, dramatically improving scalability at large process counts.
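A minimal sketch of the recursive-doubling idea behind PCCL_rec, on host buffers for brevity. This is hypothetical code, not PCCL's: it assumes a power-of-two process count and omits the pipelining and GPU-side work of the real implementation.

// Recursive-doubling all-gather: log2(P) exchange steps instead of the
// ring's P-1. At distance d, a rank swaps its d contiguous blocks with
// partner (rank XOR d), doubling the data it holds each step.
#include <mpi.h>
#include <cstring>

void recdoubling_allgather(const float* sendbuf, float* recvbuf,
                           int count, MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  // Start with our own block placed at our own offset.
  std::memcpy(recvbuf + (size_t)rank * count, sendbuf,
              (size_t)count * sizeof(float));

  for (int dist = 1; dist < size; dist <<= 1) {
    int partner = rank ^ dist;
    // Offsets of the dist-block groups currently held by each side.
    size_t my_off   = (size_t)(rank    / dist) * dist * count;
    size_t peer_off = (size_t)(partner / dist) * dist * count;
    // Exchange my dist blocks for my partner's dist blocks.
    MPI_Sendrecv(recvbuf + my_off,   dist * count, MPI_FLOAT, partner, 0,
                 recvbuf + peer_off, dist * count, MPI_FLOAT, partner, 0,
                 comm, MPI_STATUS_IGNORE);
  }
}

The same total bytes move as in a ring, but the latency term shrinks from (P−1)·α to log2(P)·α, which matches the scaling advantage PCCL_rec shows at high process counts.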

ML-guided Adaptive Dispatching

An SVM-based classifier selects the best-performing backend (PCCL_ring, PCCL_rec, or the vendor library) at runtime, based on message size, GPU count, and observed system state. This keeps performance near-optimal across the full parameter space — from bandwidth-bound large messages to latency-sensitive scenarios at high process counts. A simplified stand-in for this dispatch logic is sketched below.
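For intuition only: the following hypothetical dispatch layer uses the same features the paper names (message size, GPU count), but the coefficients and thresholds are invented placeholders, not the paper's trained SVM, and Backend/select_backend are names we made up.

// Hypothetical dispatch sketch. Picks a backend from message size and
// GPU count; the linear decision functions stand in for SVM weights
// and are entirely made up.
#include <cmath>
#include <cstddef>

enum class Backend { PcclRing, PcclRec, Vendor };

Backend select_backend(std::size_t msg_bytes, int num_gpus) {
  const double f_size = std::log2((double)msg_bytes);  // bandwidth regime
  const double f_gpus = std::log2((double)num_gpus);   // latency regime

  // Invented linear decision boundaries (placeholder coefficients).
  const double pccl_vs_vendor = 0.5 * f_size + 0.9 * f_gpus - 14.0;
  const double ring_vs_rec    = 0.8 * f_size - 1.2 * f_gpus + 3.0;

  if (pccl_vs_vendor < 0.0)
    return Backend::Vendor;                       // small message / few GPUs
  return (ring_vs_rec > 0.0) ? Backend::PcclRing  // bandwidth-bound
                             : Backend::PcclRec;  // latency-sensitive
}

In a real system the decision function would be trained offline on measured collective timings and consulted once per (collective, message size, communicator) tuple, so dispatch adds negligible runtime overhead.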

Figure: Two-level hierarchical all-gather in PCCL for a system with N nodes and M GPUs per node. Step 1: concurrent inter-node all-gathers; Step 2: intra-node all-gather; Step 3: device-local transpose.
Figure: ML-guided dispatch in PCCL, choosing the best-performing backend (PCCL_ring, PCCL_rec, or the vendor library) at runtime based on message size and GPU count.

Experimental Platforms

Frontier (AMD)
  • AMD MI250X GPUs (8 GCDs / node)
  • Dragonfly network topology (Slingshot)
  • Scale: 32 – 2048 GCDs (4 – 256 nodes)
  • Baselines: RCCL and Cray-MPICH vs. PCCL
Perlmutter (NVIDIA)
  • NVIDIA A100 GPUs (4 GPUs / node)
  • Dragonfly network topology (Slingshot)
  • Scale: 32 – 2048 GPUs (8 – 512 nodes)
  • Baselines: NCCL and Cray-MPICH vs. PCCL

Performance Results

On Frontier, PCCL dramatically outperforms RCCL and Cray-MPICH for large-message all-gather, reduce-scatter, and all-reduce operations. Speedups increase with process count, reaching up to 33× over RCCL for all-gather and 168× for reduce-scatter at 2048 GCDs.

Figure: All-gather on Frontier (256 & 512 MB buffers).
Figure: Reduce-scatter on Frontier (256 & 512 MB buffers).
Figure: Speedup heatmap, PCCL vs. RCCL all-gather (Frontier).
Figure: Speedup heatmap, PCCL vs. RCCL reduce-scatter (Frontier).

Peak speedups over RCCL at 2048 GCDs: 33× for all-gather, 168× for reduce-scatter, and 10× for all-reduce.

On Perlmutter with NVIDIA A100 GPUs, PCCL achieves up to 5.7× speedup over NCCL and 15× over Cray-MPICH for all-gather and reduce-scatter. All-reduce performance is comparable to NCCL's, since both libraries use logarithmic-latency algorithms.

Figure: Speedup heatmap, PCCL vs. NCCL all-gather (Perlmutter).
Figure: Speedup heatmap, PCCL vs. NCCL all-reduce (Perlmutter).

Peak all-gather speedups on Perlmutter: 5.7× over NCCL and 15× over Cray-MPICH.

These collective speedups translate directly into faster large-scale DL training. We benchmark GPT-3-style models (7B and 13B parameters) with DeepSpeed ZeRO-3, and a 1.3B-parameter model with PyTorch DDP, on both Frontier and Perlmutter.

Figure: DeepSpeed ZeRO-3 strong scaling of GPT-3 7B & 13B on Frontier.
Figure: DeepSpeed ZeRO-3 strong scaling of GPT-3 7B & 13B on Perlmutter.
Figure: PyTorch DDP strong scaling of GPT-3 1.3B on Frontier.
Workload            System       Model        Max speedup over RCCL/NCCL
DeepSpeed ZeRO-3    Frontier     GPT-3 13B    4.9×
DeepSpeed ZeRO-3    Frontier     GPT-3 7B     3.6×
DeepSpeed ZeRO-3    Perlmutter   GPT-3 7B     1.37×
PyTorch DDP         Frontier     GPT-3 1.3B   2.4×

Key Contributions

1. Empirical Analysis of Existing Libraries

Systematic analysis of Cray-MPICH, NCCL, and RCCL on Perlmutter and Frontier for all-gather and reduce-scatter in parallel deep learning workloads, revealing specific performance bottlenecks.

2. PCCL: A New Collective Communication Library

Highly optimized implementations of all-gather, reduce-scatter, and all-reduce in a new library focused on large messages (>10 MB) and large GPU counts, using hierarchical algorithms and CUDA/ROCm kernels.

3. Substantial Performance Improvements

6–160× speedups over RCCL and 28–70× over Cray-MPICH for all-gather on 2048 GCDs of Frontier, with consistent improvements across message sizes and GPU architectures.

4. End-to-end DL Training Validation

Large-scale benchmarking of multi-billion-parameter LLM training with DeepSpeed ZeRO-3 and PyTorch DDP, demonstrating real-world impact of collective communication improvements on training throughput.

BibTeX

@inproceedings{singh2026pccl,
  title     = {The Big Send-off: Scalable and Performant
               Collectives for Deep Learning},
  author    = {Singh, Siddharth and Pradeep, Keshav
               and Singh, Mahua and Wei, Cunyang
               and Bhatele, Abhinav},
  booktitle = {Proceedings of the IEEE International Parallel and
               Distributed Processing Symposium (IPDPS)},
  year      = {2026},
}