
IPDPS 2026

The Case of the Elusive Application Performance
on Production GPU Supercomputers

Cunyang Wei
University of Maryland
Keshav Pradeep
University of Maryland
Abhinav Bhatele
University of Maryland
Performance Variability · GPU Supercomputers · Dragonfly Networks · AI Workloads · Network Congestion · Machine Learning

Abstract

Modern HPC facilities increasingly rely on GPU-accelerated clusters to drive both scientific computing and AI workloads. Performance variability is a critical issue in these systems, undermining efficiency and performance reproducibility. While prior studies have extensively analyzed variability in CPU-centric supercomputers, similar large-scale investigations on GPU clusters are lacking.

To address this gap, we set up a longitudinal experiment on two state-of-the-art GPU-based supercomputers: NERSC's Perlmutter and ORNL's Frontier. We benchmark several representative HPC and AI applications and collect detailed performance data, including network counters, profiling output, and job scheduler logs. We analyze this data to identify the impact of compute performance variations, allocated node topology, and network conditions on overall runtime variability. We also use a machine learning-based approach to identify potential correlations among these factors and to forecast performance variability. We provide actionable insights for both system administrators and users to mitigate or predict performance variations in GPU-accelerated HPC environments.

Contributions

📊

First GPU-Scale Longitudinal Study

Longitudinal study on two major GPU-based supercomputers (Perlmutter and Frontier), yielding the first comprehensive dataset of performance measurements across diverse applications and system states over an extended period at large scale.

🔍

In-Depth Variability Analysis

Key observations providing in-depth analysis of inherent hardware differences and the impacts of concurrent jobs, job placement, and network conditions on performance across both HPC and AI workloads.

🤖

ML-Based Performance Prediction

Machine learning methods to both identify critical metrics that influence performance variability and predict performance variations based on system status across traditional HPC workloads and deep learning applications.

💡

Actionable Insights

Practical guidance for system administrators and users to predict and mitigate performance variations, including strategies for network-aware scheduling and early job cancellation.

Study Design

Longitudinal experiments on two flagship GPU supercomputers over ~4 months in 2024–2025

Perlmutter
NERSC
Nodes: 1,792 GPU nodes
GPUs: 4× NVIDIA A100 (40/80 GB) per node
CPU: AMD EPYC 7763 Milan (64-core)
Network: HPE Slingshot-11, 3-hop Dragonfly
NIC: 4× Cassini NICs (100 GB/s per node)
Frontier
OLCF (first exascale system)
Nodes: 9,408 compute nodes
GPUs: 4× AMD MI250X (8 GCDs, 64 GB per GCD) per node
CPU: AMD EPYC Trento (64-core)
Network: HPE Slingshot-11, Dragonfly
NIC: 4× Slingshot NICs (100 GB/s per node)

Applications Studied

HPC
AMG2023

Parallel algebraic multigrid solver from the Hypre library for 3D linear systems. High MPI communication density (~74–84% of runtime in MPI).

MPI (Waitall, Isend, Irecv, Allreduce)
HPC
MILC

MIMD Lattice Computation code for quantum chromodynamics (QCD) gauge generation. Uniform stencil computation and communication patterns.

MPI (Allreduce, Test)
AI
DeepCAM

Exascale deep learning for climate analytics using CNNs to segment extreme weather patterns. Gordon Bell Prize-recognized workload, MLPerf benchmark.

NCCL/RCCL (Allreduce)
AI
nanoGPT

Optimized GPT implementation using AxoNN 4D tensor parallelism framework with sub-communicator collectives for reduced variability.

NCCL/RCCL (Allgather, Allreduce)

Runtime Variability Over Time

Performance of four HPC and AI applications relative to their best observed runtime, measured over four months on 64 nodes of each system.
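The normalization used in these plots is simply each run's time divided by the application's best observed time, giving a slowdown factor of 1.0 for the fastest run. A minimal sketch, with made-up sample runtimes:

```python
# Run-to-run variability relative to the best observed runtime,
# as plotted above. Runtimes (seconds) are illustrative, not measured.
def relative_variability(runtimes):
    """Return each run's slowdown factor vs. the fastest observed run."""
    best = min(runtimes)
    return [t / best for t in runtimes]

amg_runs = [102.4, 98.7, 131.0, 105.2]   # hypothetical AMG2023 runs
slowdowns = relative_variability(amg_runs)
print([round(s, 2) for s in slowdowns])  # fastest run maps to 1.0
```

The maximum of this list (here ~1.33×) is what the captions below report as "up to N× variability".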

Variability over time on Perlmutter

Perlmutter (NERSC): Run-to-run variability of AMG2023, MILC, DeepCAM, and nanoGPT relative to each application's best observed runtime. nanoGPT reaches up to 1.4× and AMG2023 up to 1.3× variability.

Variability over time on Frontier

Frontier (OLCF): DeepCAM shows up to 2.6× variability (with outliers reaching 3.2×) and MILC up to 1.8×. AMG2023 variability dropped significantly after a Slingshot Host Software upgrade (v11.0.2) on January 14, 2025, which fixed a libfabric regression.

Takeaway: All four applications show substantial run-to-run variability. Frontier generally exhibits higher and more erratic variability than Perlmutter, especially for AI workloads. Some variability is traceable to specific system events (e.g., software updates).

Application Breakdown Analysis

Execution time dissected into compute vs. communication components, from fastest to slowest observed runs. Communication time — not compute — drives variability across all applications.

HPC Applications: AMG2023 & MILC

AMG2023 MPI breakdown Perlmutter

AMG2023 — Perlmutter: ~74% of runtime in MPI; slowest runs have ~40% more communication

AMG2023 MPI breakdown Frontier

AMG2023 — Frontier: ~84% of runtime in MPI; improvement visible post Jan 14 patch

MILC MPI breakdown Perlmutter

MILC — Perlmutter: ~56% in MPI; Allreduce and Test are the most variable routines

MILC MPI breakdown Frontier

MILC — Frontier: slowest communication up to 122% higher than fastest runs

AI Applications: DeepCAM & nanoGPT

DeepCAM breakdown Perlmutter

DeepCAM — Perlmutter: Allreduce up to 4× slower in worst runs vs. best

DeepCAM breakdown Frontier

DeepCAM — Frontier: Allreduce up to 24× slower in worst vs. best; compute is stable

nanoGPT breakdown Perlmutter

nanoGPT — Perlmutter: Allreduce 3× slower in worst vs. best; Allgather also variable

nanoGPT breakdown Frontier

nanoGPT — Frontier: 4D sub-communicator design reduces Allreduce variability significantly

Allreduce Distribution — Long Tails on Perlmutter

Allreduce is the single most variable routine across all applications. These violin plots show the full distribution of MPI and NCCL Allreduce times — including the long-tail outlier runs that can be dramatically slower than typical.

MPI Allreduce distribution Perlmutter

MPI Allreduce distribution across AMG2023 and MILC on Perlmutter — note long upper tails

NCCL Allreduce distribution Perlmutter

NCCL Allreduce distribution across DeepCAM and nanoGPT on Perlmutter

Takeaway: Performance variability arises primarily due to slowdowns in collective communication — in particular Allreduce, Test, and Waitall. Compute time is stable. Some routines exhibit strong long-tail effects.

Impact of Concurrent Jobs

Performance degradation correlates with a specific subset of "Top Users" running communication-intensive jobs — not the overall number of concurrent users or jobs.
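Identifying these "Top Users" amounts to aggregating allocated nodes per user over the jobs concurrent with a measurement. A sketch of that aggregation; the `(user, nodes)` record format is a hypothetical simplification of scheduler logs:

```python
# Aggregate concurrently allocated nodes per user and rank the heaviest.
# The (user, nodes) tuples are a made-up stand-in for Slurm job records.
from collections import Counter

def top_users(concurrent_jobs, k=3):
    """Sum allocated nodes per user across concurrent jobs; return top-k."""
    totals = Counter()
    for user, nodes in concurrent_jobs:
        totals[user] += nodes
    return totals.most_common(k)

jobs = [("u1", 512), ("u2", 64), ("u1", 256), ("u3", 1024), ("u2", 32)]
print(top_users(jobs, k=2))  # heaviest node consumers first
```

In the study, the users whose concurrent node counts correlate with slowdowns are further filtered to those running communication-intensive codes.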

Total concurrent nodes vs AMG2023 runtime on Perlmutter

AMG2023 on Perlmutter: runtime vs. total concurrent allocated nodes. Weak overall correlation — raw system load is not the driver.

Total concurrent nodes vs DeepCAM runtime on Frontier

DeepCAM on Frontier: runtime vs. total concurrent allocated nodes. Similar pattern — total load alone doesn't predict degradation.

Top-user concurrent nodes vs AMG2023 runtime on Perlmutter

AMG2023 on Perlmutter: runtime vs. nodes allocated to communication-heavy "Top Users." Clear positive correlation emerges — these users are the culprits.

Top-user concurrent nodes vs AMG2023 runtime on Frontier

AMG2023 on Frontier: same pattern holds — concurrent Top-User jobs correlate strongly with runtime degradation.

Takeaway: A small subset of "Top Users" running communication-intensive applications creates network congestion that degrades performance for other jobs. Administrators should monitor and throttle such jobs — not rely solely on system-wide utilization metrics.

Impact of Job Placement

Does allocating nodes across more dragonfly groups (fragmented placement) hurt performance? The answer is mostly no — temporal network conditions matter far more than topology.
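Measuring placement fragmentation requires mapping each allocated node to its dragonfly group and counting distinct groups. A sketch under an assumed contiguous mapping; `NODES_PER_GROUP` is a made-up constant, and the real node-to-group mapping on Perlmutter and Frontier comes from the systems' Slingshot topology data:

```python
# Count how many dragonfly groups a job's allocation spans.
# NODES_PER_GROUP is illustrative; real systems publish the actual
# node-to-group cabling, which need not be a simple integer division.
NODES_PER_GROUP = 128

def dragonfly_groups(node_ids, nodes_per_group=NODES_PER_GROUP):
    """Map node IDs to group indices and count distinct groups."""
    return len({nid // nodes_per_group for nid in node_ids})

allocation = [5, 6, 130, 131, 900, 901, 902, 903]
print(dragonfly_groups(allocation))  # spans 3 groups under this mapping
```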

Dragonfly groups vs nanoGPT runtime on Perlmutter

nanoGPT on Perlmutter: number of dragonfly groups assigned to the job vs. runtime. No strong trend — placement fragmentation does not significantly hurt performance.

Dragonfly groups vs DeepCAM runtime on Frontier

DeepCAM on Frontier: same analysis. High variance exists within any given number of groups, indicating other factors (network congestion) dominate.

Takeaway: The number of dragonfly groups allocated to a job does not significantly influence application performance. The dragonfly network's adaptive routing absorbs topology-related variability; temporal network congestion is the dominant factor.

GPU Compute Variability

GEMM (FP16 matrix multiply) benchmarks reveal inherent GPU performance differences — but these do NOT explain application-level variability.
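The probe itself is just a timed, repeated matrix multiply. A CPU-side sketch with NumPy in FP32; the study's actual benchmark runs FP16 GEMMs on each GPU, and these sizes are kept small for illustration:

```python
# CPU stand-in for the per-GPU GEMM probe: time repeated matrix
# multiplies and report the spread relative to the fastest sample.
import time
import numpy as np

def gemm_probe(n=256, trials=5, seed=0):
    """Return per-trial GEMM times and the max slowdown vs. the fastest."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n), dtype=np.float32)
    b = rng.standard_normal((n, n), dtype=np.float32)
    times = []
    for _ in range(trials):
        t0 = time.perf_counter()
        a @ b
        times.append(time.perf_counter() - t0)
    return times, max(times) / min(times)

times, spread = gemm_probe()
print(f"max/min slowdown across trials: {spread:.2f}x")
```

Running this probe per GPU, per node, and system-wide yields the three variability views shown below.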

GEMM per-GPU variability on Perlmutter

Per-GPU: most GPUs within 2.5% of their own best (only 0.066% outliers)

GEMM per-node variability on Perlmutter

Within-node: GPUs on the same node can differ by up to 10–17%

System-wide GEMM variability on Perlmutter

System-wide: up to 28%; A100 80GB nodes ~7% faster with lower variance

Takeaway (Perlmutter): Individual GPU performance is stable over time, but system-wide variability reaches 28%. Despite this, GPU heterogeneity shows weak correlation (Spearman r ≈ 0.07) with application runtime — the network is the culprit.

GEMM per-GPU variability on Frontier

Per-GCD: up to 12% fluctuation, higher than Perlmutter's A100s

GEMM per-node variability on Frontier

Within-node variability across MI250X GCDs

System-wide GEMM variability on Frontier

System-wide: up to 15% variability with fewer extreme outliers than Perlmutter

Takeaway (Frontier): MI250X GPUs show greater per-GPU variability than Perlmutter's A100s, but with fewer extreme outliers system-wide. Critically, GPU performance does not correlate with application-level slowdowns — network conditions dominate.

Do Slow GPUs in an Allocation Hurt Performance?

We examined whether having more "slow" GPUs (bottom 1% system-wide GEMM performance) in a job's allocation correlates with longer application runtime.
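The correlation test here is Spearman's rank coefficient. A self-contained NumPy sketch (equivalent to `scipy.stats.spearmanr`); the data points below are illustrative, not the paper's measurements:

```python
# Spearman rank correlation between slow-GPU count and runtime,
# implemented with NumPy only. Sample data is made up for illustration.
import numpy as np

def _ranks(x):
    """1-based average ranks, handling ties."""
    order = np.argsort(x, kind="stable")
    sx = np.asarray(x, dtype=float)[order]
    ranks = np.empty(len(sx))
    i = 0
    while i < len(sx):
        j = i
        while j + 1 < len(sx) and sx[j + 1] == sx[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1  # average rank of the tie group
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the rank vectors."""
    return float(np.corrcoef(_ranks(x), _ranks(y))[0, 1])

slow_gpus = [0, 1, 1, 2, 3, 0, 2]             # slow GPUs in allocation
runtime   = [100, 98, 104, 101, 99, 103, 97]  # seconds (made up)
print(round(spearman(slow_gpus, runtime), 2))
```

A coefficient near 0, as observed for MILC on both systems, means allocation "slowness" carries essentially no signal about runtime.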

Slow GPU count vs MILC runtime on Perlmutter

MILC on Perlmutter: Spearman r ≈ 0.07 — number of slow GPUs in allocation has negligible correlation with application runtime.

Slow GPU count vs MILC runtime on Frontier

MILC on Frontier: same weak correlation (r ≈ 0.08) — hardware heterogeneity is not responsible for application-level slowdowns.

Takeaway: Even with notable system-wide GPU heterogeneity, the number of slow GPUs in an allocation does not predict application slowdowns. Performance variability stems from network congestion, not compute hardware differences.

ML-Based Performance Prediction

XGBoost regression trained on system metrics accurately predicts runtime variability

Step 1 — NIC Counter Correlation with Runtime

Spearman's correlation coefficients between application execution times and NIC counter values (mean and max). Positive = higher counter values → slower runtime. Key counters: rh:sct_timeouts (retries from packet loss), hni_rx_paused_0 (receive path backpressure).
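Each counter is sampled repeatedly over a run and reduced to mean and max before correlating with runtime. A sketch of that feature reduction; the counter names follow the paper, but the sampled values are illustrative:

```python
# Reduce per-run NIC counter time series to the mean/max features used
# for correlation and model training. Sample values are made up.
import numpy as np

def counter_features(samples):
    """Map {counter: [samples over the run]} -> {counter_mean/_max: value}."""
    feats = {}
    for name, series in samples.items():
        arr = np.asarray(series, dtype=float)
        feats[f"{name}_mean"] = float(arr.mean())
        feats[f"{name}_max"] = float(arr.max())
    return feats

run = {
    "hni_rx_paused_0": [0, 12, 48, 30],  # receive-path pause events
    "rh:sct_timeouts": [0, 0, 3, 1],     # retries from packet loss
}
print(counter_features(run))
```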

NIC counter correlation heatmap Perlmutter

Spearman correlation heatmap between application runtime and NIC counter values on Perlmutter. While individual counters show partial signal, multi-counter ML models are needed to capture the full complexity of variability.

Step 2 — XGBoost Feature Importances

After training the XGBoost model on placement, GEMM, Allreduce, and NIC counter features, we examine which features matter most for runtime prediction on each system.

Feature importances XGBoost

Perlmutter (left): hni_rx_paused_0_mean and NCCL Allreduce (2GB) dominate — NIC receive-path stalls and network-wide congestion are the key signals.
Frontier (right): lpe_net_match_request_0, atu_cache_hit_derivative1, and parbs_tarb_pi_non_posted_blocked_cnt dominate — local data movement bottlenecks drive Frontier variability.

Step 3 — Prediction Accuracy (MAPE & Direction Accuracy)

We evaluate prediction quality with two metrics: MAPE (lower is better) and Direction Accuracy (DA ≈ 1.0 means the model correctly identifies when performance is improving vs. degrading). Results shown for incremental feature subsets.
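Both metrics are easy to compute from paired actual/predicted runtimes. A sketch; the DA formulation below (sign agreement of consecutive-run changes) is one plausible reading of "correctly identifies improving vs. degrading", and the sample runtimes are made up:

```python
# MAPE and Direction Accuracy (DA), the two evaluation metrics above.
# DA here: fraction of consecutive runs where the predicted change
# moves in the same direction as the actual change.
import numpy as np

def mape(actual, predicted):
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs((a - p) / a)))

def direction_accuracy(actual, predicted):
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    same = np.sign(np.diff(a)) == np.sign(np.diff(p))
    return float(np.mean(same))

actual    = [100, 110, 105, 130, 128]  # runtimes in seconds (made up)
predicted = [102, 108, 107, 125, 126]
print(f"MAPE={mape(actual, predicted):.3f}  "
      f"DA={direction_accuracy(actual, predicted):.2f}")
```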

MAPE and Direction Accuracy across feature sets

MAPE and DA for all four applications on both systems, across four feature combinations: Placement only → + GEMM → + Allreduce → + NIC Counters (full). Adding NIC counters yields the largest accuracy gains, especially for high-variability apps like DeepCAM. Without NIC counters, DA can fall near chance level; with them it approaches 1.0.

Step 4 — Predicted vs. Actual Runtime

Scatter plots of predicted vs. actual runtimes using the full feature set (placement + GEMM + Allreduce + NIC counters). Points close to the diagonal indicate accurate predictions.

Predicted vs actual runtime Perlmutter

Perlmutter: predicted vs. actual runtime for all four applications. The model tracks absolute performance well for most runs; some deviations arise from lack of router (Rosetta) counters.

Predicted vs actual runtime Frontier

Frontier: predicted vs. actual runtime. Model generalizes well even for MILC with only 7 training samples — demonstrating cross-application transferability.

Takeaway: NIC counters are the most critical feature class for accurate prediction. A single cross-application model outperforms per-application models because all workloads share the same underlying network bottlenecks. Traffic saturation stalls Perlmutter NICs; local data movement blockage drives Frontier variability.

Implications & Recommendations

⚙️

For System Administrators

  • Proactive monitoring: Collect NIC counters (e.g., hni_rx_paused, rh:sct_timeouts) in real time via LDMS. Apply predictive models to warn performance-sensitive users of impending degradation.
  • Limit concurrent heavy jobs: Monitor "Top Users" running communication-intensive applications. Throttle or isolate them to dedicated dragonfly groups once their concurrent node counts approach critical thresholds.
  • Universal prediction model: Train a cross-application model using system-wide network counter data to predict ongoing degradation for any workload without per-application tuning.
  • Keep system software updated: A Slingshot Host Software update on Frontier (v11.0.2) fixed a libfabric regression that was responsible for significant variability in AMG2023.
🧑‍💻

For Users

  • Early variability detection: At the start of each job allocation, run a brief Allreduce benchmark and GEMM probe. Feed these measurements into a pre-trained model to predict whether the current network conditions will cause degraded performance.
  • Cancel early to save node-hours: If significant performance degradation is predicted from the initial probe, cancel and requeue rather than waste node-hours on a slow run.
  • Training data is not required per-application: Our cross-application model demonstrates that even users with few profiling samples can benefit from a shared system model.
  • Allreduce is your canary: The Allreduce micro-benchmark is the best single proxy for expected application variability. Monitoring it at scale is lightweight and informative.
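The probe-and-cancel workflow above can be sketched as follows. Everything here is a placeholder: `predict_slowdown` stands in for the pre-trained model, and the baselines and 1.3× threshold are arbitrary examples, not recommendations from the paper:

```python
# User-side probe-and-cancel sketch: run quick Allreduce/GEMM probes at
# allocation start, predict the slowdown, and requeue if degradation looms.
def predict_slowdown(allreduce_ms, gemm_tflops):
    """Placeholder model: compare probes against assumed healthy baselines."""
    baseline_allreduce_ms, baseline_tflops = 2.0, 100.0  # made-up baselines
    net = allreduce_ms / baseline_allreduce_ms   # network congestion signal
    gpu = baseline_tflops / gemm_tflops          # compute slowdown signal
    return max(net, gpu)

def should_cancel(allreduce_ms, gemm_tflops, threshold=1.3):
    """Cancel and requeue when predicted slowdown exceeds the threshold."""
    return predict_slowdown(allreduce_ms, gemm_tflops) > threshold

print(should_cancel(allreduce_ms=5.2, gemm_tflops=98.0))  # congested network
print(should_cancel(allreduce_ms=2.1, gemm_tflops=99.0))  # healthy conditions
```

In practice the decision model would be the trained XGBoost predictor fed with the probe measurements and current NIC counters.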

BibTeX

@inproceedings{wei2026elusiveperformance,
  title     = {The Case of the Elusive Application Performance on
               Production {GPU} Supercomputers},
  author    = {Wei, Cunyang and Pradeep, Keshav and Bhatele, Abhinav},
  booktitle = {Proceedings of the IEEE International Parallel and
               Distributed Processing Symposium (IPDPS)},
  year      = {2026},
}