Modern HPC facilities increasingly rely on GPU-accelerated clusters to drive both scientific computing and AI workloads. Performance variability is a critical issue in these systems, undermining efficiency and performance reproducibility. While prior studies have extensively analyzed variability in CPU-centric supercomputers, similar large-scale investigations on GPU clusters are lacking.
To address this gap, we set up a longitudinal experiment on two state-of-the-art GPU-based supercomputers: NERSC's Perlmutter and ORNL's Frontier. We benchmark several representative HPC and AI applications and collect detailed performance data, including network counters, profiling output, and job scheduler logs. We analyze this data to identify the impact of compute performance variations, allocated node topology, and network conditions on overall runtime variability. We also use a machine-learning-based approach to identify potential correlations between these factors and to forecast performance variability. Finally, we provide actionable insights for both system administrators and users to mitigate or predict performance variations in GPU-accelerated HPC environments.
Longitudinal study on two major GPU-based supercomputers (Perlmutter and Frontier), yielding the first comprehensive dataset of performance measurements across diverse applications and system states over an extended period at large scale.
Key observations providing in-depth analysis of inherent hardware differences and the impacts of concurrent jobs, job placement, and network conditions on performance across both HPC and AI workloads.
Machine learning methods to both identify critical metrics that influence performance variability and predict performance variations based on system status across traditional HPC workloads and deep learning applications.
Practical guidance for system administrators and users to predict and mitigate performance variations, including strategies for network-aware scheduling and early job cancellation.
Longitudinal experiments on two flagship GPU supercomputers over ~4 months in 2024–2025
Parallel algebraic multigrid solver from the Hypre library for 3D linear systems. High MPI communication density (~74–84% of runtime in MPI).
MIMD Lattice Computation code for quantum chromodynamics (QCD) gauge generation. Uniform stencil computation and communication patterns.
Exascale deep learning for climate analytics using CNNs to segment extreme weather patterns. Gordon Bell Prize-recognized workload, MLPerf benchmark.
Optimized GPT implementation using AxoNN 4D tensor parallelism framework with sub-communicator collectives for reduced variability.
Performance of four HPC and AI applications relative to their best observed runtime, measured over four months on 64 nodes of each system.
Perlmutter (NERSC): Run-to-run variability of AMG2023, MILC, DeepCAM, and nanoGPT relative to each application's best observed runtime. nanoGPT reaches up to 1.4× and AMG2023 up to 1.3× variability.
Frontier (OLCF): DeepCAM shows up to 2.6× variability (with outliers reaching 3.2×) and MILC up to 1.8×. AMG2023 variability dropped significantly after a Slingshot Host Software upgrade (v11.0.2) on January 14, 2025, which fixed a libfabric regression.
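The variability figures above (e.g., "up to 3.2×") are runtimes normalized to each application's best observed run. A minimal sketch of that normalization, using made-up runtimes rather than measured data:

```python
def normalized_slowdowns(runtimes):
    """Express each run's time relative to the best (fastest) observed run."""
    best = min(runtimes)
    return [t / best for t in runtimes]

# Hypothetical runtimes in seconds (illustrative, not measured data).
runs = [100.0, 120.0, 260.0, 320.0]
print(normalized_slowdowns(runs))  # best run is 1.0x; slowest is 3.2x the best
```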
Execution time dissected into compute vs. communication components, from fastest to slowest observed runs. Communication time — not compute — drives variability across all applications.
AMG2023 — Perlmutter: ~74% of runtime in MPI; slowest runs have ~40% more communication
AMG2023 — Frontier: ~84% of runtime in MPI; improvement visible post Jan 14 patch
MILC — Perlmutter: ~56% in MPI; Allreduce and Test are the most variable routines
MILC — Frontier: slowest communication up to 122% higher than fastest runs
DeepCAM — Perlmutter: Allreduce up to 4× slower in worst runs vs. best
DeepCAM — Frontier: Allreduce up to 24× slower in worst vs. best; compute is stable
nanoGPT — Perlmutter: Allreduce 3× slower in worst vs. best; Allgather also variable
nanoGPT — Frontier: 4D sub-communicator design reduces Allreduce variability significantly
Allreduce is the single most variable routine across all applications.
These violin plots show the full distribution of MPI and NCCL Allreduce times, including the long-tail outlier runs that can be dramatically slower than typical.
MPI Allreduce distribution across AMG2023 and MILC on Perlmutter — note long upper tails
NCCL Allreduce distribution across DeepCAM and nanoGPT on Perlmutter
The most variable routines are Allreduce, Test, and Waitall.
Compute time is stable. Some routines exhibit strong long-tail effects.
Performance degradation correlates with a specific subset of "Top Users" running communication-intensive jobs — not the overall number of concurrent users or jobs.
AMG2023 on Perlmutter: runtime vs. total concurrent allocated nodes. Weak overall correlation — raw system load is not the driver.
DeepCAM on Frontier: runtime vs. total concurrent allocated nodes. Similar pattern — total load alone doesn't predict degradation.
AMG2023 on Perlmutter: runtime vs. nodes allocated to communication-heavy "Top Users." Clear positive correlation emerges — these users are the culprits.
AMG2023 on Frontier: same pattern holds — concurrent Top-User jobs correlate strongly with runtime degradation.
Does allocating nodes across more dragonfly groups (fragmented placement) hurt performance? The answer is mostly no — temporal network conditions matter far more than topology.
nanoGPT on Perlmutter: number of dragonfly groups assigned to the job vs. runtime. No strong trend — placement fragmentation does not significantly hurt performance.
DeepCAM on Frontier: same analysis. High variance exists within any given number of groups, indicating other factors (network congestion) dominate.
GEMM (FP16 matrix multiply) benchmarks reveal inherent GPU performance differences — but these do NOT explain application-level variability.
Per-GPU: most GPUs within 2.5% of their own best (only 0.066% outliers)
Within-node: GPUs on the same node can differ by up to 10–17%
System-wide: up to 28%; A100 80GB nodes ~7% faster with lower variance
Per-GCD: up to 12% fluctuation, higher than Perlmutter's A100s
Within-node variability across MI250X GCDs
System-wide: up to 15% variability with fewer extreme outliers than Perlmutter
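The spread percentages above are peak-to-trough differences relative to the best observed rate. A minimal sketch of that computation, with hypothetical per-GCD GEMM rates rather than measured values:

```python
def relative_spread(rates):
    """Peak-to-trough variability as a fraction of the best observed rate."""
    return (max(rates) - min(rates)) / max(rates)

# Hypothetical per-GCD GEMM rates in TFLOP/s (illustrative, not measured).
node_rates = [170.0, 168.5, 161.0, 155.0, 172.0, 166.0, 158.0, 163.0]
print(f"within-node spread: {relative_spread(node_rates):.1%}")
```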
We examined whether having more "slow" GPUs (bottom 1% system-wide GEMM performance) in a job's allocation correlates with longer application runtime.
MILC on Perlmutter: Spearman r ≈ 0.07 — number of slow GPUs in allocation has negligible correlation with application runtime.
MILC on Frontier: same weak correlation (r ≈ 0.08) — hardware heterogeneity is not responsible for application-level slowdowns.
XGBoost regression trained on system metrics accurately predicts runtime variability
Spearman's correlation coefficients between application execution times and NIC counter values (mean and max). Positive = higher counter values → slower runtime.
Key counters: rh:sct_timeouts (retries from packet loss) and hni_rx_paused_0 (receive path backpressure).
Spearman correlation heatmap between application runtime and NIC counter values on Perlmutter. While individual counters show partial signal, multi-counter ML models are needed to capture the full complexity of variability.
After training the XGBoost model on placement, GEMM, Allreduce, and NIC counter features, we examine which features matter most for runtime prediction on each system.
Perlmutter (left): hni_rx_paused_0_mean and NCCL Allreduce (2GB) dominate; NIC receive-path stalls and network-wide congestion are the key signals.
Frontier (right): lpe_net_match_request_0, atu_cache_hit_derivative1, and parbs_tarb_pi_non_posted_blocked_cnt dominate; local data movement bottlenecks drive Frontier variability.
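The feature-importance ranking above can be reproduced in miniature on synthetic data. This sketch uses scikit-learn's GradientBoostingRegressor as a stand-in for the paper's XGBoost model; the feature names echo the text, but the data and the dependence of runtime on the NIC counter are fabricated for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the four feature groups: placement, GEMM,
# Allreduce, and NIC counters. Not real system data.
X = rng.normal(size=(n, 4))
feature_names = ["num_groups", "gemm_tflops",
                 "allreduce_2gb_s", "hni_rx_paused_0_mean"]

# Construct runtimes that depend mostly on the NIC counter, mimicking
# the dominance of hni_rx_paused_0_mean observed on Perlmutter.
y = 100 + 20 * X[:, 3] + 2 * X[:, 2] + rng.normal(scale=1.0, size=n)

# The paper trains XGBoost; gradient boosting here is a stand-in.
model = GradientBoostingRegressor(random_state=0).fit(X, y)
for name, imp in sorted(zip(feature_names, model.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name:>22s}: {imp:.3f}")
```

The dominant feature in the printed ranking is the one the target was built from, which is exactly how the paper reads off which system metrics drive variability.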
We evaluate prediction quality with two metrics: MAPE (lower is better) and Direction Accuracy (DA ≈ 1.0 means the model correctly identifies when performance is improving vs. degrading). Results shown for incremental feature subsets.
MAPE and DA for all four applications on both systems, across four feature combinations: Placement only → + GEMM → + Allreduce → + NIC Counters (full). Adding NIC counters yields the largest accuracy gains, especially for high-variability apps like DeepCAM. Without NIC counters, DA can fall near chance level; with them it approaches 1.0.
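Both evaluation metrics are simple to state. The MAPE below is the standard definition; the Direction Accuracy formulation (sign agreement between consecutive runs, with ties counted as misses) is an assumption on my part, since the exact definition isn't spelled out here:

```python
def mape(actual, predicted):
    """Mean absolute percentage error (lower is better)."""
    return sum(abs(a - p) / abs(a)
               for a, p in zip(actual, predicted)) / len(actual)

def direction_accuracy(actual, predicted):
    """Fraction of consecutive run pairs where the predicted runtime moves
    in the same direction (faster/slower) as the actual runtime.
    Ties count as misses in this formulation."""
    pairs = zip(zip(actual, actual[1:]), zip(predicted, predicted[1:]))
    hits = sum(1 for (a0, a1), (p0, p1) in pairs
               if (a1 - a0) * (p1 - p0) > 0)
    return hits / (len(actual) - 1)

# Hypothetical runtimes in seconds; DA = 1.0 means every up/down
# move in actual runtime was predicted with the right sign.
actual = [100.0, 120.0, 110.0, 150.0]
predicted = [98.0, 125.0, 112.0, 140.0]
print(f"MAPE = {mape(actual, predicted):.3f}, "
      f"DA = {direction_accuracy(actual, predicted):.2f}")
```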
Scatter plots of predicted vs. actual runtimes using the full feature set (placement + GEMM + Allreduce + NIC counters). Points close to the diagonal indicate accurate predictions.
Perlmutter: predicted vs. actual runtime for all four applications. The model tracks absolute performance well for most runs; some deviations arise from lack of router (Rosetta) counters.
Frontier: predicted vs. actual runtime. Model generalizes well even for MILC with only 7 training samples — demonstrating cross-application transferability.
Monitor congestion-related NIC counters (hni_rx_paused, rh:sct_timeouts) in real time via LDMS. Apply predictive models to warn performance-sensitive users of impending degradation.
@inproceedings{wei2026elusiveperformance,
title = {The Case of the Elusive Application Performance on
Production {GPU} Supercomputers},
author = {Wei, Cunyang and Pradeep, Keshav and Bhatele, Abhinav},
booktitle = {Proceedings of the IEEE International Parallel and
Distributed Processing Symposium (IPDPS)},
year = {2026},
}