HPE Cray Supercomputing Programming Environment Getting Started Administrator Guide: HPE Cray Supercomputing EX Systems (26.03-Rev. A) S-9935

Record of revision

This chapter provides a record of updates and revisions to this guide.

Release updates

New in the CPE 26.03 (Rev. A) publication

New in this 26.03 release

New in this 25.09 (Rev. A) release

New in this 25.09 release

  • Issued the first version of HPE Cray Supercomputing Programming Environment Getting Started Administrator Guide: HPE Cray Supercomputing EX Systems (25.09) S-9935.

Revision history

| Publication Title | Date |
| --- | --- |
| HPE Cray Supercomputing Programming Environment Getting Started Administrator Guide: HPE Cray Supercomputing EX Systems (26.03) S-9935 | April 2026 |
| HPE Cray Supercomputing Programming Environment Getting Started Administrator Guide: HPE Cray Supercomputing EX Systems (25.09-Rev. A) S-9935 | December 2025 |
| HPE Cray Supercomputing Programming Environment Getting Started Administrator Guide: HPE Cray Supercomputing EX Systems (25.09) S-9935 | September 2025 |

Document conventions

This section defines the documentation conventions used throughout the guide, including typographic styles for code, commands, paths, and the backslash as the shell line-continuation character. It explains command-prompt notation, showing how the host and account are indicated (root prompts end with #, non-root prompts use account@hostname>) and lists node abbreviations (CN, NCN, AN, UAN) with example prompts for specific node types and Kubernetes contexts. This section also provides a simple three-step workflow and a reminder to verify pasted commands.

Typographical and command prompt conventions

This section describes the typographical and command prompt conventions used in this guide.

Typographical conventions

| Type | Convention Description |
| --- | --- |
| This style | Indicates program code, reserved words, library functions, command-line prompts, screen output, file/path names, variables, and other software constructs. |
| \ (backslash) | When inserted at the end of a command line, indicates the Linux shell line continuation character (lines joined by a backslash are parsed as a single line). |
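As a small illustration of the backslash convention, the sketch below splits one command across several lines. The variable name and the words passed to printf are placeholders chosen for this example only.

```shell
# A trailing backslash joins the next physical line to the current one,
# so the shell parses the four lines below as a single printf command.
# The backslash must be the very last character on its line.
joined=$(printf '%s %s %s\n' \
  alpha \
  beta \
  gamma)

echo "$joined"
```

The script prints `alpha beta gamma`, exactly as if the command had been typed on one line.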

Command prompt conventions

Host name and account in command prompts: The host name in a command prompt indicates where the command must be run. The account that must run the command is also indicated in the prompt.

  • The root or super-user account always has the # character at the end of the prompt.

  • Any non-root account is indicated with account@hostname>. A user account that is neither root nor crayadm is shown as user.

| Command Prompt | Definition |
| --- | --- |
| user@login> | Run the command on any login node as any non-root user. |
| hostname# | Run the command on the specified system as root. |
| user@hostname> | Run the command on the specified system as any non-root user. |
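To make the prompt notation concrete, the following sketch decodes a prompt string into the account and host it implies. The function name describe_prompt is hypothetical and exists only for this illustration; it is not a CPE command.

```shell
# describe_prompt: translate a prompt string written in this guide's
# notation into a statement of which account runs the command and where.
describe_prompt() {
  case "$1" in
    *@*">")
      account=${1%%@*}   # text before the first @
      host=${1#*@}       # text after the first @ ...
      host=${host%?}     # ... minus the trailing >
      echo "run as $account on $host"
      ;;
    *"#")
      # A prompt ending in # denotes the root account.
      echo "run as root on ${1%?}"
      ;;
    *)
      echo "unrecognized prompt: $1" >&2
      return 1
      ;;
  esac
}

describe_prompt 'user@login>'
describe_prompt 'hostname#'
```

The two calls print `run as user on login` and `run as root on hostname`.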

Copying and pasting text from this document

Using the Copy and Paste functions from a PDF is unreliable. Although copying and pasting a command line typically works, copying and pasting formatted file content (for example, JSON, YAML) typically fails. To ensure that file content is copied and pasted correctly while performing the procedures in this guide:

  1. Copy the content from the PDF.

  2. Paste it into a plain-text editor and add the necessary formatting.

  3. Copy the content from the editor and paste it into the console.

Tip: As a best practice, double-check copied/pasted commands for correctness, as some commands may not render correctly in the PDF.
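One way to act on this tip is to scan pasted text for bytes outside printable ASCII, since PDF viewers sometimes substitute typographic quotes or dashes for the plain characters a shell expects. The sketch below is an illustration only; the function name find_non_ascii is hypothetical and is not part of CPE.

```shell
# find_non_ascii: print, with line numbers, every input line containing a
# byte that is neither printable ASCII nor ordinary whitespace.
# Exit status is 0 when at least one such line is found, 1 otherwise.
find_non_ascii() {
  LC_ALL=C grep -n '[^[:print:][:space:]]'
}

# The first command is clean, so grep finds nothing and "clean" is printed.
printf 'module load PrgEnv-cray\n' | find_non_ascii || echo "clean"

# The second command contains typographic double quotes (their UTF-8
# bytes are written here as octal escapes), so the line is reported.
printf 'echo \342\200\234hello\342\200\235\n' | find_non_ascii
```

To check a saved snippet, redirect the file into the function, for example `find_non_ascii < pasted.txt` (a hypothetical file name).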

About the HPE Cray Supercomputing Programming Environment

Welcome to the HPE Cray Supercomputing Programming Environment (CPE) Software suite, a complete software solution for application development and the application development lifecycle. CPE, offered in an integrated and user-friendly environment, provides a suite of programmer tools and libraries that support the development, optimization, and execution of high performance computing (HPC) applications for HPE Cray Supercomputing EX systems. These systems comprise multiple components, including compute nodes, high-speed interconnects, storage solutions, cooling and power infrastructure, comprehensive system management software, security features, and other integral components and tools. CPE enables scientists, researchers, engineers, and other users to effectively leverage the advanced capabilities of these systems. Together, CPE and its compatible systems meet the computational needs of the applications developed on them and deliver the performance, scalability, and flexibility required for HPC applications.

This administrator guide provides details for installing, configuring, updating, maintaining, monitoring, and troubleshooting CPE. As an administrator, you are also responsible for managing and maintaining system security, licensing, system tuning, benchmarking, planning, and various user support tasks. This guide assists you with these and other related tasks.

For the latest version and revisions of this CPE guide, go to the HPE Support Center website and search for the part number of this document (S-9935). For additional information on how to use CPE or details regarding CPE components and modules, see the CPE Online Documentation website. See also the Documentation and support chapter for additional CPE resources and information.

About the CPE Software suite

CPE comprises a set of tools and toolkits that collectively provides a comprehensive environment for developing, optimizing, and running high performance applications on HPE Cray Supercomputing EX systems. The CPE Software suite includes:

  • HPE Cray Compiling Environment (CCE)

  • HPE Cray Debugging Support Tools

  • HPE Cray Environment (CENV) Setup and Compiling Support Tools

  • HPE Cray Message Passing Toolkit (MPT)

  • HPE Cray Performance, Measurement, and Analysis Tools (CPMAT)

  • HPE Cray Scientific and Math Libraries (CSML)

| Tool/Toolkit Name | What it is | Description |
| --- | --- | --- |
| CCE | A suite of compilers optimized for HPE Cray Supercomputing EX systems, including support for languages such as C, C++, and Fortran. | Compiles your code into executable programs that take full advantage of the architecture and capabilities of HPE Cray Supercomputing EX systems. CCE is designed to generate highly optimized code, ensuring that your applications run efficiently. |
| Debugging Tools | A set of tools for diagnosing and troubleshooting issues in your code. | Helps you identify and fix bugs in your applications. These tools provide features, such as breakpoints, variable inspection, and call stack tracing, which are essential for debugging complex parallel applications. |
| CENV | Tools and utilities designed to help users configure their programming environment and manage the compilation of their applications. | Simplifies the setup and configuration of CPE on HPE Cray Supercomputing EX systems. These tools help ensure that the necessary libraries, compilers, and environment variables are correctly set up, making it easier for users to compile and run their applications efficiently. |
| MPT | A set of libraries and tools that assist in the development of parallel applications using the Message Passing Interface (MPI) standard. | Enables efficient communication between multiple processes running on different nodes of an HPE Cray Supercomputing EX system, which is crucial for HPC applications. This toolkit supports scalable and high performance data exchange, essential for tasks that require coordination and data sharing among numerous processors. |
| CPMAT | A collection of tools designed to help you measure, analyze, and optimize the performance of your applications running on HPE Cray systems. | Ensures that your applications are running efficiently by identifying performance bottlenecks and providing insights into how to improve computational performance. This suite includes tools for profiling, tracing, and in-depth performance analysis. |
| CSML | A collection of high performance mathematical and scientific libraries. | Provides pre-optimized routines for common mathematical and scientific computations, such as linear algebra, fast Fourier transforms, and more. These libraries help you achieve better performance and accuracy in your scientific applications without having to develop complex algorithms from scratch. |

After developing code in programming languages like Fortran, C, or C++, you can use HPE-optimized compilers to convert your code into executable programs. Additionally, you can use CPE tools to test the performance of, streamline, and debug your applications. With CPE, you manage your software environment by:

  • Using its various modules,

  • Submitting jobs to the job scheduler and running applications on HPE Cray EX supercomputers, and

  • Using debugging and performance analysis tools.

CPE components allow you to run applications efficiently and correctly.

Understanding the key CPE components

The CPE Software suite comprises specific components and tools designed to maximize developer productivity, application scalability, and code performance. It includes compilers, analyzers, optimized libraries, and debuggers.

Figure: CPE Components (CPE and third-party components)

The CPE Software suite also provides a variety of parallel programming models that allow you to make appropriate choices based on the nature of existing and new applications. CPE uses build environment containers, which provide the ability to compile applications, launch jobs, and track job status. Containers enable you to store and retrieve files from both the local and shared system storage.

CPE components (by category) include:

Compilers

  • HPE Cray Compiling Environment (CCE): High-performance compilers for Fortran, C, and C++ that are optimized for HPE Cray Supercomputing EX system architectures. These compilers include advanced optimization features and support for parallel programming models, such as Open Multi-Processing (OpenMP), Open Accelerators (OpenACC), Heterogeneous-Compute Interface for Portability (HIP), and Partitioned Global Address Space (PGAS) languages (such as Coarray Fortran and Unified Parallel C).

  • Third-Party Compilers: Support for other industry-standard compilers, such as GNU Compiler Collection (GCC), Intel, NVIDIA, and AMD compilers.

Programming models

| Model Name | Description |
| --- | --- |
| HPE Cray Message Passing Toolkit (MPT) | Libraries and tools for parallel programming using the Message Passing Interface (MPI) standard, which is widely used for distributed memory parallelism. |
| OpenMP | Support for shared memory parallelism and GPU offloading using the OpenMP standard, which allows developers to parallelize and offload code using directives and APIs. |
| OpenACC | Support for GPU offloading using the OpenACC standard, which allows developers to parallelize and offload code using directives and APIs. |
| CUDA | Support for NVIDIA GPU offloading using the CUDA programming model. |
| HIP | Support for AMD GPU offloading using the HIP programming model. |
| Partitioned Global Address Space (PGAS) | Support for PGAS languages like Coarray Fortran and Unified Parallel C (UPC). |
| OpenSHMEM | As a programming library, simplifies and enhances the way you write parallel programs and allows you to manage data efficiently across multiple processors, ensuring that your high-performance applications run as fast and effectively as possible. |

Scientific and mathematical libraries

| Library Name | Library Description |
| --- | --- |
| HPE Cray LibSci (cray-libsci) | A library providing highly optimized and scalable mathematical routines, such as BLAS, LAPACK, and ScaLAPACK, aimed at enhancing the performance of linear algebra and other numerical computations on HPE Cray Supercomputing EX systems. |
| HPE Cray FFTW (cray-fftw) | Libraries for performing Fast Fourier Transforms (FFTs), based on FFTW3. |
| HPE Cray LibSci ACC (cray-libsci_acc) | An extension of HPE Cray LibSci that includes GPU-accelerated versions of mathematical routines, designed to leverage GPU hardware for improved performance in scientific computations on HPE Cray EX supercomputing systems with GPUs. |
| HPE Cray HDF5 (cray-hdf5 and cray-hdf5-parallel) | Libraries for managing and storing large scientific data sets in Hierarchical Data Format (HDF5), with parallel I/O capabilities to enhance performance and scalability on distributed HPE Cray Supercomputing EX systems. |
| HPE Cray NetCDF (cray-netcdf and cray-netcdf-hdf5parallel) | Libraries supporting the creation, access, and sharing of array-oriented scientific data in the Network Common Data Form (NetCDF), with parallel I/O support to improve scalability and performance on large-scale HPE Cray Supercomputing EX systems. |
| HPE Cray Parallel NetCDF (cray-parallel-netcdf) | A high-performance parallel I/O library for NetCDF files, enabling efficient handling and management of large, distributed data sets in scientific applications on Cray systems. |

Environment setup tools

HPE Cray Environment Setup and Compilation Support (CENV) is a CPE software package with tools and libraries specifically designed to support compilation and environment setup. It includes compiler drivers and CPE API (craype-api).

Performance Analysis Tools

  • HPE Cray Performance Measurement & Analysis Tools (CPMAT)/HPE Cray Performance Analysis Tools (CrayPAT): A suite of tools for profiling and analyzing the performance and behavior of applications and a Performance API (PAPI). This includes pat_build for instrumenting applications, pat_report for generating performance reports, and HPE Cray Apprentice3 for visualizing performance data.

  • HPE Cray Apprentice3: Provides performance analysis with event tracing and graphical data visualization. HPE Cray Apprentice3 provides enhanced scalability, an improved user interface, and advanced metrics for more detailed and efficient performance analysis.

Debugging Tools

  • HPE Cray Distributed Debugging Tool (DDT): An advanced debugging tool for parallel applications, supporting MPI, OpenMP, and hybrid applications.

  • gdb4hpc: GNU Debugger (GDB)-based HPC debugger with support for debugging serial and parallel applications.

  • Valgrind4hpc: A parallel debugging tool used to detect memory leaks and parallel application errors.

  • Sanitizers4hpc: A parallel debugging tool used to detect memory access or leak issues at runtime using information from LLVM sanitizers.

  • Stack Trace Analysis Tool (STAT): A single merged stack backtrace tool used to analyze application behavior at the function level. Helps track down the cause of crashes.

  • Abnormal Termination Processing (ATP): A scalable core file generation and analysis tool for analyzing crashes, with a selection algorithm to determine which core files to dump. ATP helps to determine the cause of crashes.

  • Cray Comparative Debugger (CCDB): Not a traditional debugger, but rather a tool to run and step through two versions of the same application side-by-side to help determine where they diverge.

All CPE debugger tools support C/C++, Fortran, and Unified Parallel C (UPC).

Development Environment

  • Environment Modules: A system for managing and configuring the user environment, allowing you to easily load and switch between different software packages and versions.

  • Build and Configuration Tools: Tools for building and configuring applications, including support for makefiles and CMake.

Application Porting and Optimization

  • HPE Parallel Application Launch Service (PALS): An automation tool for starting, managing, and optimizing the placement of parallel applications on HPE Cray Supercomputing EX systems, ensuring efficient resource utilization.

  • CrayPAT-lite: A lightweight version of CrayPAT for quick performance assessments and application tuning.

Understanding CPE modules

CPE modules are used in conjunction with RHEL and SLES to streamline and manage the software development environment on HPE Cray Supercomputing EX systems. As part of the CPE environment, you can load, unload, and switch one or more modules to efficiently manage the software stack required for your specific applications and development tasks. Modules can comprise CPE base, library-related, or tools-related modules. Loading a module automatically sets environment variables, paths, and other settings, allowing you to focus on development rather than environment configuration. Modules allow you to easily switch between different versions of compilers, libraries, and tools, enabling you to test and validate your applications against multiple configurations. Using modules helps ensure compiler and library compatibility and consistent dependencies:

  • Library Compatibility: Many high-performance computing (HPC) applications depend on specific versions of libraries and tools. Understanding which modules to load ensures that all dependencies are compatible, reducing runtime errors and conflicts.

  • Compiler Consistency: Different modules may provide different versions of compilers. Ensuring you use consistent compilers across your development and production environments can prevent compatibility issues.

Debugging tools are also available for diagnosing and optimizing your applications and providing critical insights into your application’s performance and behavior. Performance analysis tools help identify bottlenecks and optimize code, which is crucial in high-performance computing environments where efficiency is paramount.

Modules are essential for several reasons. They:

  • Simplify Environment Management

    HPE Cray Supercomputing EX systems often have complex software stacks with multiple compilers, libraries, and tools. Modules simplify the process of configuring the environment by allowing you to easily load and unload different software components without manual configuration of environment variables.

  • Allow for consistency

    Modules ensure that all users on the system have a consistent environment. This consistency is crucial for reproducibility of results, especially in a research or scientific computing context.

  • Offer flexibility

    Different applications and development tasks might require different versions of compilers, libraries, or tools. Modules provide a flexible way to switch between these versions without conflicts.

  • Provide optimization

    CPE modules are optimized for the underlying hardware. Different programming environments and compilers are optimized for specific architectures and workloads. Loading appropriate modules helps to ensure that your code and applications are making the best use of system architecture and running efficiently. Properly loading and unloading modules helps manage system resources, ensuring that you are balancing the load and not overloading the system with unnecessary tools and libraries.

  • Are easy to use

    Modules abstract away the complexity of setting up and managing the environment. You can focus on development rather than expending excess time on configuration issues.

As you use CPE modules, keep in mind that many high-performance computing applications depend on specific versions of libraries and tools. Understanding which modules to load ensures that all dependencies are compatible, reducing runtime errors and conflicts. Understanding CPE modules and module commands is crucial for maximizing performance, ensuring compatibility, simplifying development and debugging, maintaining reproducibility, staying current with technological advancements, and fostering effective collaboration in high-performance computing environments.
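The load, unload, and switch operations described above might look like the following minimal sketch. It assumes a module tool that supports the swap and list subcommands, and it defaults to a dry run that only prints each command so the script can be read or tried on a machine without CPE. The helper names mod and switch_prgenv and the DRY_RUN variable are hypothetical.

```shell
# DRY_RUN=1 (the default here) prints each module command instead of
# running it; set DRY_RUN=0 on a system that provides the module command.
DRY_RUN=${DRY_RUN:-1}

mod() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "module $*"
  else
    module "$@"
  fi
}

# switch_prgenv: replace one programming environment with another and
# then show the resulting module list.
#   $1 = PrgEnv module currently loaded
#   $2 = PrgEnv module to load in its place
switch_prgenv() {
  mod swap "$1" "$2"
  mod list
}

switch_prgenv PrgEnv-cray PrgEnv-gnu
```

Keeping the switch in one function makes it easier to repeat the same sequence when validating an application against several programming environments.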

The following subsections provide information on commonly-used CPE modules, libraries, and tools.

Commonly-used CPE modules, module command names, and module compiler commands

Commonly-used CPE modules and module commands include:

| Module name | Module command name | CPE driver commands |
| --- | --- | --- |
| AMD compilers | PrgEnv-amd, rocm | ftn, cc, CC |
| AOCC | PrgEnv-aocc | ftn, cc, CC |
| CCE* | PrgEnv-cray | ftn, cc, CC |
| GCC** | PrgEnv-gnu | ftn, cc, CC |
| Intel compilers | PrgEnv-intel | ftn, cc, CC |
| NVIDIA | PrgEnv-nvhpc | ftn, cc, CC |

CPE driver commands are used in conjunction with module commands to construct build configurations.
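Because the same three driver commands appear under every programming environment, a build script can choose the driver from the source language alone and leave the choice of underlying compiler to whichever PrgEnv module is loaded. The sketch below illustrates this; the function name driver_for and the file-extension mapping are assumptions made for the example (cc for C, CC for C++, ftn for Fortran).

```shell
# driver_for: pick the CPE compiler driver for a source file based on its
# extension. The extension list is a common convention, not a CPE rule.
driver_for() {
  case "$1" in
    *.c)
      echo cc ;;
    *.C|*.cc|*.cpp|*.cxx)
      echo CC ;;
    *.f|*.F|*.f90|*.F90|*.f95|*.f03|*.f08)
      echo ftn ;;
    *)
      echo "no CPE driver known for: $1" >&2
      return 1 ;;
  esac
}

# Print the driver chosen for a few hypothetical source files.
for src in hello.c solver.cpp model.f90; do
  echo "$src -> $(driver_for "$src")"
done
```

The loop prints `hello.c -> cc`, `solver.cpp -> CC`, and `model.f90 -> ftn`.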

Commonly-used CPE library commands include:

| Library name | Module command name | Compiler commands |
| --- | --- | --- |
| DSMML | cray-dsmml | |
| Fast Fourier Transforms | cray-fftw | |
| HDF5 | cray-hdf5 | |
| HPE Cray LibSci*** | cray-libsci, cray-fftw3, cray-libsci_acc | |
| HPE Cray MPICH | cray-mpich | mpicc |
| Parallel NetCDF | cray-parallel-netcdf | gcc |

Commonly-used CPE tools and their commands include:

| Tool name | Module command name |
| --- | --- |
| HPE Cray Apprentice3 | app3 |
| Debuggers | gdb4hpc, valgrind4hpc, sanitizers4hpc |
| Distributed Debugging Tool | ddt |
| HPE CrayPAT | perftools |
| HPE CrayPAT Base | perftools-base |
| Huge pages | craype-hugepages**** |
| Perforce TotalView | totalview |

Commonly-used CPE performance analysis commands include:

| Tool name | Module command name |
| --- | --- |
| ATP | atp |
| Clang/Low Level Virtual Machine (LLVM) | clang, llvm |
| CrayPAT | craypat |
| TensorFlow | tensorflow |

Commonly-used CPE specialized environment commands include:

| Specialized environment name | Environment command name |
| --- | --- |
| OpenMPI | openmpi |
| OpenSHMEMX | cray-openshmemx |
| ROCm | cray-rocm |

* - Compiler-specific manpages include crayftn(1), craycc(1), and crayCC(1). Available only when the compiler module is loaded.

** - Compiler-specific manpages include gcc(1), gfortran(1), and g++(1). Available only when the compiler module is loaded.

*** - Related manpages include intro_libsci(3s) and intro_fftw3(3). Available only when the module is loaded. When the module for a CSML package (such as cray-libsci or cray-fftw) is loaded, all relevant headers and libraries for these packages are added to the compile and link lines of the cc, ftn, and CC CPE drivers. You must load the cray-hdf5 module (a dependency) before loading the cray-netcdf module.

**** - In addition to the default module systems, CPE offers Lmod as an alternate module management system. Lmod, a Lua-based module system, can load and unload modulefiles, handle path variables, and manage library and header files. (If you are using another Linux distribution, use the huge pages implementation appropriate for that distribution.) To use huge pages, load the appropriate craype-hugepages module at link time. Possible values include:

  • craype-hugepages128K

  • craype-hugepages512K

  • craype-hugepages2M

  • craype-hugepages4M

  • craype-hugepages8M

  • craype-hugepages16M

  • craype-hugepages32M

  • craype-hugepages64M

  • craype-hugepages128M

  • craype-hugepages256M

  • craype-hugepages512M

  • craype-hugepages1G

  • craype-hugepages2G
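Since only the sizes listed above correspond to a module, a script can validate the requested size before attempting a load. The following sketch does that; the function name hugepage_module is hypothetical.

```shell
# hugepage_module: map a huge page size to its craype-hugepages module
# name, rejecting any size that is not in the list above.
hugepage_module() {
  case "$1" in
    128K|512K|2M|4M|8M|16M|32M|64M|128M|256M|512M|1G|2G)
      echo "craype-hugepages$1"
      ;;
    *)
      echo "unsupported huge page size: $1" >&2
      return 1
      ;;
  esac
}

hugepage_module 2M
```

The call prints `craype-hugepages2M`. On a system with CPE, the result could be passed to the module command before relinking the application, for example `module load "$(hugepage_module 2M)"`.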

Viewing loaded modules

For example, to view the loaded modules and their versions:

user@hostname> module list
Currently Loaded Modules:
1) craype-x86-rome              5) xpmem/2.6.2-2.5_2.27__gd067c3f.shasta   9) cray-mpich/8.1.28
2) libfabric/1.15.2.0           6) cce/17.0.0                             10) cray-libsci/23.12.5
3) craype-network-ofi           7) craype/2.7.30                          11) PrgEnv-cray/8.5.0
4) perftools-base/23.12.0       8) cray-dsmml/0.2.2

Module versions are for example purposes only and may vary from those on the system.
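When the loaded versions need to be recorded, for instance to document a build, the listing can be turned into one name and version per line. The sketch below assumes output shaped like the example above (numbered entries such as `6) cce/17.0.0`); the function name loaded_modules is hypothetical, and real output may differ between module implementations.

```shell
# loaded_modules: read "module list" style text on standard input and
# print one "name<TAB>version" pair per line. Entries without a version
# (for example craype-x86-rome) are printed with "-" as the version.
loaded_modules() {
  awk '
    $1 ~ /^[0-9]+\)$/ {                    # only lines that start with "N)"
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^[0-9]+\)$/) continue    # skip the "N)" counters
        n = split($i, part, "/")
        printf "%s\t%s\n", part[1], (n > 1 ? part[2] : "-")
      }
    }'
}

# Demonstration on two lines copied from the example listing above.
loaded_modules <<'EOF'
Currently Loaded Modules:
1) craype-x86-rome              5) xpmem/2.6.2-2.5_2.27__gd067c3f.shasta   9) cray-mpich/8.1.28
2) libfabric/1.15.2.0           6) cce/17.0.0                             10) cray-libsci/23.12.5
EOF
```

On a live system the function could be fed with `module list 2>&1 | loaded_modules`; the redirection is there because some module implementations write the listing to standard error.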

Viewing available modules

For example, to view the available programming environment modules and their versions:

user@hostname> module avail PrgEnv
------------------------------------ /opt/cray/pe/modulefiles ------------------------------------
PrgEnv-amd/8.3.3      PrgEnv-cray-amd/8.4.0 (D)  PrgEnv-gnu/8.3.3        PrgEnv-nvhpc/8.4.0 (D)
PrgEnv-amd/8.4.0 (D)  PrgEnv-cray/8.3.3          PrgEnv-gnu/8.4.0 (D)    PrgEnv-nvidia/8.3.3
PrgEnv-aocc/8.3.3     PrgEnv-cray/8.4.0 (L,D)    PrgEnv-intel/8.3.3      PrgEnv-nvidia/8.4.0 (D)
PrgEnv-aocc/8.4.0 (D) PrgEnv-gnu-amd/8.3.3       PrgEnv-intel/8.4.0 (D)  PrgEnv-gnu-amd/8.4.0 (D)

Module versions are for example purposes only and may vary from those on the system.
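In listings of this style, an entry followed by a flag group containing D, such as (D) or (L,D), is taken here to be the default version, and L to mean currently loaded; that reading of the flags is an assumption based on the example output. Under that assumption, the sketch below extracts the default entries. The function name default_modules is hypothetical.

```shell
# default_modules: read "module avail" style text on standard input and
# print every entry that is immediately followed by a flag group
# containing D, for example "(D)" or "(L,D)".
default_modules() {
  awk '
    {
      for (i = 1; i < NF; i++) {
        if ($(i + 1) ~ /^\(([A-Za-z]+,)*D(,[A-Za-z]+)*\)$/) print $i
      }
    }'
}

# Demonstration on two lines copied from the example listing above.
default_modules <<'EOF'
PrgEnv-amd/8.4.0 (D)  PrgEnv-cray/8.3.3          PrgEnv-gnu/8.4.0 (D)    PrgEnv-nvidia/8.3.3
PrgEnv-aocc/8.3.3     PrgEnv-cray/8.4.0 (L,D)    PrgEnv-intel/8.3.3      PrgEnv-nvidia/8.4.0 (D)
EOF
```

A report like this can help confirm, after an update, that the default version of each programming environment is the one users are expected to pick up.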

Administrator responsibilities

The CPE administrator is responsible for continuously managing the evolving needs of the CPE software suite. This role encompasses a wide range of tasks throughout the lifecycle of the HPE Cray Supercomputing EX system, from initial installation to end-of-life. These tasks involve system setup, configuration, maintenance, performance optimization, user support, and decommissioning. As such, a key CPE administrator requirement is to thoroughly understand CPE-related product areas. An understanding of these areas ensures smooth operations, optimization, and support for the CPE users and their working supercomputing environment. Retaining CPE-related system knowledge equips the administrator to administer and optimize the HPE Cray Supercomputing EX environment effectively. Moreover, this knowledge base helps administrators to enable researchers and engineers to maximize their productivity and scientific output.

Administrator Focus Areas

Description

Details

CPE Architecture

- Understand how CPE integrates with hardware including interconnect (HPE Slingshot) and storage subsystems.

- Remain well-informed of the components of the CPE software suite, including:

• Compilers (HPE-specific Fortran, C, C++ compilers)

• Performance analysis tools (CrayPAT, Performance Tools)

• Debuggers (DDT, Perforce TotalView)

• Libraries (LibSci, MPI, Lustre)

• Environment management tools (module or Spack for environment variables/software versions)

Software Installation and Updates

- Install, configure, and update the CPE software suite to match system hardware and user requirements.

- Familiarize yourself with HPE Cray EX package management and repositories.

- Stay informed on patches, bug fixes, and HPE updates.

- Set up and manage licensing for proprietary software.

- Ensure compliance with licensing agreements.

- Manage dependencies:

• Resolve compatibility issues.

• Test and validate updates to avoid workflow disruption.

System Configuration and Customization

- Configure compilers, libraries, and tools to optimize performance.

- Customize environment modules for easy compiler/library access.

- Manage compiler flags, optimizations, and linking for architectures (x86, ARM, GPUs).

- Validation/testing:

• Perform system validation and benchmarking.

• Run test jobs and verify performance.

Performance Tuning and Optimization

- Use Cray performance tools (HPE CrayPAT) for analysis and optimization.

- Identify bottlenecks in MPI, OpenMP, hybrid parallel apps.

- Assist in optimizing apps for HPE Cray Supercomputing EX architecture (NUMA, memory hierarchy, interconnect).

Parallel Programming Models and Best Practices

- Be familiar with the supported models: MPI, PGAS (Coarray Fortran), and GPU models (OpenMP, OpenACC, CUDA, HIP).

- Learn best practices for writing/compiling parallel code on HPE Cray EX systems.

- Reference: Implementing and supporting parallel application best practices.

Monitoring and Logging

- Monitor health, usage, performance using HPE/third-party tools.

- Analyze logs/diagnostic outputs to resolve issues.

- Track usage statistics for planning/upgrades.

Security and Compliance

- Manage accounts, permissions, authentication (LDAP, Kerberos).

- Apply patches/updates to address vulnerabilities.

- Implement data protection/compliance measures.

Debugging and Troubleshooting

- Be proficient with debugging tools (DDT, Perforce TotalView).

- Troubleshoot job failures, compiler errors, runtime issues.

- Resolve hardware-software integration issues and bugs.

Job Scheduling and Resource Management

- Understand scheduler integration (Slurm, PBS Pro).

- Configure job submission scripts.

- Manage user priorities/resource allocation.

- Monitor load and optimize scheduling policies.

Documentation and Reporting

- Maintain documentation on configurations, software versions, customizations.

- Create user guides/cheat sheets.

- Generate reports on usage, performance, and maintenance activities.

User Support and Training

- Guide users on effective use of CPE tools.

- Assist in debugging/performance analysis.

- Organize/deliver training sessions and documentation.

Vendor and Community Collaboration

- Collaborate with HPE support.

- Participate in HPE/community training, webinars, conferences.

- Stay updated on HPC trends, practices, advancements.

End-of-Life (EOL) Management

- Assist migrating workflows/apps to new platforms.

- Ensure compliance/disposal of hardware/software licenses.

- Decommissioning:

• Plan and execute decommissioning when the system reaches EOL.

• Archive user data/system configurations.

Training and resources

  • HPE Cray official documentation and user guides. See Documentation and support for more information.

  • Online HPE Cray Supercomputing EX system training courses and certifications.

  • HPC community forums and mailing lists. For example, the Cray User Group (CUG).

  • Vendor support and knowledge base (HPE customer portal). See Documentation and support for more information.

Implementing and supporting parallel application best practices

Applying parallel coding best practices ensures that code is optimized for the unique architecture and capabilities of the HPE Cray Supercomputing EX system, enabling high performance and scalability for demanding computational workloads. Understanding these practices and sharing them among team members is integral to CPE administrator responsibilities. Best practices include:

  • Understanding the HPE Cray Supercomputing EX system architecture

  • Using the CPE tools

  • Writing efficient parallel code

  • Employing compiler optimization

  • Leveraging HPE Cray performance tools

  • Debugging parallel code

  • Scaling and testing code

  • Implementing hybrid parallelism

  • Planning efficient I/O functions

  • Documenting and controlling versions

  • Staying updated on HPE Cray-specific features

Understanding the HPE Cray Supercomputing EX system architecture

  • Know the hardware: Understand the architecture of the HPE Cray EX system, including:

    • Processor details (for example, AMD, Intel Xeon, or ARM-based processors).

    • GPU accelerators (if present, for example NVIDIA or AMD GPUs).

    • The high-speed HPE Slingshot interconnect.

    • NUMA (Non-Uniform Memory Access) characteristics.

  • Optimize for the interconnect: Take advantage of the low-latency, high-bandwidth Slingshot interconnect by optimizing communication patterns in your parallel code.

Using CPE tools

  • Compilers: Use the provided compilers optimized for HPE Cray systems:

    • HPE Cray Compilers: HPE Cray Fortran, C, C++.

    • Third-party compilers: GCC, Intel, AMD ROCm, NVIDIA HPC SDK (for GPU programming).

  • Libraries: Use pre-optimized libraries for scientific computing:

    • HPE Cray LibSci: Provides optimized BLAS, LAPACK, ScaLAPACK, FFT, and sparse solvers.

  • HPE Cray MPI: Optimized MPI implementation for inter-process communication.

  • Environment modules: Use the module command to load specific compiler versions, libraries, and tools:

    module load PrgEnv-cray
    module load cray-mpich
    module load cray-libsci
    

Writing efficient parallel code

  • Programming Models

    Choose the appropriate parallel programming model for the deployed workload:

    • MPI: For distributed-memory parallelism across nodes.

    • OpenMP: For shared-memory parallelism on a single node.

    • Hybrid MPI + OpenMP: To leverage both inter-node and intra-node parallelism.

    • OpenACC / CUDA / HIP: For GPU programming (if GPUs are present).

    • UPC: For PGAS programming if the workload benefits from one-sided communication.

  • Optimize Communication

    • Minimize communication overhead by reducing the frequency and size of MPI messages or other communication operations.

    • Use collectives (for example, MPI_Reduce, MPI_Bcast) instead of point-to-point communication wherever possible.

    • Overlap computation and communication using asynchronous communication (for example, MPI_Isend and MPI_Irecv).

  • Load Balancing

    • Ensure workloads are evenly distributed across processes and threads to minimize idle time.

    • Use domain decomposition or other problem-specific techniques to balance workloads.

  • Memory Usage

    • Optimize memory access patterns to minimize cache misses and NUMA penalties.

    • Use proper memory alignment and avoid false sharing in shared-memory programming.

    • Leverage the HPE Cray MEMKIND library for managing memory on nodes with High-Bandwidth Memory (HBM).

Employing compiler optimization

Best practices for compiler optimization include using compiler optimization flags, enabling auto-vectorization (and vectorizing manually where possible), and analyzing profiles and compiler-generated reports to identify and correct missed optimizations. As an administrator, you should educate users on compiler flags and provide performance feedback on the use of profiling tools. Options for compiler optimization include:

Option

Description

Compiler Flags

- Always enable optimization flags to take advantage of compiler optimizations for Cray EX systems.

- HPE Cray compilers: Use -O2 or -O3 for optimization, and -hfp3 for aggressive floating-point optimizations. For C/C++, use -O or -Ofast for optimization. Fortran defaults to -O2; use -hfp to control the floating-point optimization level.

- Debugging: Use -g to enable debugging symbols.

- Vectorization: Use -hvectorN (where N is a level from 0 through 3) to control CPU vectorization.

- Example: ftn -O3 -hfp3 -hvector3 my_program.f90 -o my_program

GPU-Specific Flags

- For GPU-accelerated codes, use compiler directives and flags to offload loops or computations to GPUs.

- Cray compilers: Use -fopenmp for OpenMP (C/C++/Fortran), -hacc for OpenACC (Fortran), and CC -x hip for HIP.

- NVIDIA compilers: Use -gpu flags with NVIDIA HPC SDK.

Profile-Driven Optimization

- Use CrayPAT to collect performance data and feed it back into the compiler for profile-guided optimization (PGO).

Leveraging HPE Cray performance tools

CrayPAT: Profile and analyze your application to identify bottlenecks in computation, memory access, and communication. Example usage:

module load perftools
pat_build -g mpi my_program
aprun -n 64 ./my_program+pat
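pat_report <experiment_data_directory>

In the above example, <experiment_data_directory> is the directory that the instrumented run creates; its name begins with my_program+pat.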

Debugging parallel code

  • Use HPE Cray-supported debuggers (for example, Cray DDT or Perforce TotalView) to debug MPI, OpenMP, or hybrid applications.

  • Debug runtime errors such as deadlocks, data races, and out-of-bounds memory accesses.

  • Use Cray’s statistical debugging tools to debug large-scale runs efficiently.

Scaling and testing code

  • Strong and Weak Scaling

    • Test your code for both strong scaling (fixed problem size, increasing cores) and weak scaling (problem size grows with core count).

    • Identify scaling limits and investigate bottlenecks (for example, communication or I/O).

  • Use Smaller Test Cases - Develop smaller test cases to validate correctness before scaling up to the full system.
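
The scaling checks above come down to simple arithmetic on measured run times. The following is a minimal sketch, assuming wall-clock times were collected by hand from test runs; strong_efficiency and weak_efficiency are hypothetical helper names, not CPE commands.

```shell
# Hypothetical helpers (not part of CPE) for interpreting scaling runs.

# Strong scaling efficiency: E = (T_base * P_base) / (T_p * P)
# Arguments: base_time base_procs time procs
strong_efficiency() {
    awk -v t1="$1" -v p1="$2" -v tp="$3" -v p="$4" \
        'BEGIN { printf "%.2f\n", (t1 * p1) / (tp * p) }'
}

# Weak scaling efficiency: E = T_base / T_p (work per process is constant)
# Arguments: base_time time
weak_efficiency() {
    awk -v t1="$1" -v tp="$2" 'BEGIN { printf "%.2f\n", t1 / tp }'
}
```

For example, strong_efficiency 100 1 25 8 describes a run that took 100 seconds on 1 process and 25 seconds on 8 processes. An efficiency well below 1.00 suggests that communication or I/O is limiting scaling and is worth profiling.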

Implementing hybrid parallelism

  • Take advantage of hybrid programming models (for example, MPI + OpenMP) to maximize the use of node-level shared memory and inter-node communication.

  • Use one MPI process per NUMA domain and multiple OpenMP threads per process to optimize performance.
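
As an illustration of the one-process-per-NUMA-domain guideline, the following sketch derives a per-node layout from the core and NUMA domain counts. It assumes one OpenMP thread per core and cores that divide evenly across domains; hybrid_layout is a hypothetical helper name, and the input values must come from your own node inventory.

```shell
# Hypothetical helper: derive a hybrid MPI + OpenMP layout for one node,
# assuming one MPI rank per NUMA domain and one OpenMP thread per core.
# Arguments: cores_per_node numa_domains_per_node
hybrid_layout() {
    cores=$1
    domains=$2
    # Reject layouts where cores do not divide evenly across NUMA domains.
    if [ "$domains" -le 0 ] || [ $((cores % domains)) -ne 0 ]; then
        echo "cores must divide evenly across NUMA domains" >&2
        return 1
    fi
    echo "ranks_per_node=$domains threads_per_rank=$((cores / domains))"
}
```

The threads_per_rank value is what a user would typically export as OMP_NUM_THREADS, and ranks_per_node is the per-node process count passed to the job launcher.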

Planning efficient I/O functions

  • Use parallel I/O libraries, such as HDF5, NetCDF, or MPI-IO to handle input/output efficiently at scale.

  • Avoid frequent small I/O operations; batch I/O to reduce overhead.

  • Optimize I/O patterns for the Lustre file system commonly used in Cray systems.

Documenting and controlling versions

  • Document compiler flags, runtime parameters, and environment settings for reproducibility.

  • Use version control systems (for example, Git) to track changes in your codebase.

Staying updated on HPE Cray-specific features

  • Regularly check for updates to the Cray Programming Environment and learn about new optimizations and tools.

  • Attend HPE-hosted webinars or training sessions to stay current with best practices.

CPE software download and installation

As an administrator, it is important to be aware of CPE updates and related systems. It is equally important to understand how these updates affect your environment and to plan how to implement them. CPE software can be obtained from the sources described in the following sections.

Before you download and install CPE software or updates, plan carefully and ensure that all prerequisites are met to avoid disruptions during the update process. Also ensure that you install the appropriate CPE version only on supported systems. Supported systems for this CPE release are detailed in Supported systems.

Prerequisites

  • You must have an HPE Passport account to access software from the My HPE Software Center.

  • You must have the appropriate administrator privileges to upload and install CPE software into your site system.

  • You should review information under the Prerequisites and Release Information tabs on the My HPE Software Center and ensure that you understand installation requirements and contents. Cited supporting software must be compatible with your HPC system environment.

  • Review release announcements/notes before installing new updates.

  • Ensure you have access to software repositories or the appropriate distribution channels for downloading updates. Also, ensure that the system’s network configuration allows access to the required repositories or download locations.

  • Verify that your credentials or entitlement keys are valid and properly configured.

  • Ensure the system meets the minimum hardware and software requirements for the new version of the CPE suite.

  • Verify that all dependencies (for example, specific versions of operating systems, compilers, or libraries) are in place before proceeding with the update. See Supported systems for CPE dependencies relative to systems with CSM or HPCM, or where CPE is installed on HPE Cray XD2000 systems.

  • Confirm that firewalls or security settings do not block access to update servers.

  • Verify that there is adequate disk space for the downloaded files and the installation process.
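
The disk-space prerequisite can be checked before a download starts. The following is a minimal sketch that uses the POSIX df output format; has_free_space is a hypothetical helper name, and the required size is an assumption that must be taken from the release information for the package you are downloading.

```shell
# Hypothetical helper: succeed if the file system that holds a directory has
# at least the requested number of free kilobytes.
# Arguments: directory required_kb
has_free_space() {
    # df -P prints one header line followed by one data line; field 4 is the
    # available space in 1024-byte blocks when -k is given.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$2" ]
}
```

For example, has_free_space /var/tmp 10485760 succeeds only if about 10 GB is free on the file system that holds /var/tmp. Both the path and the size are illustrative.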

Key considerations when downloading updates

Before executing the installation of any new update, consider:

Compatibility with System Configuration

  • Ensure the new version of the CPE software is compatible with the specific hardware, operating system version, and workload manager in use on your HPE Cray Supercomputing EX system. See Supported systems for CPE dependencies relative to systems with CSM or HPCM, or where CPE is installed on HPE Cray XD2000 systems.

  • Check the release notes or documentation for any hardware or software dependencies.

Carefully Review Release Announcements and Notes

  • Carefully examine the release notes for the new version to understand new features, bug fixes, and known issues.

  • Look for deprecated features or tools that might affect existing workflows.

Backup and System State

Create a backup of the current programming environment, including module files, configurations, and user applications. Document the current environment setup to facilitate rollback, if needed.

Change Management

  • Communicate the planned update with users and stakeholders, as the update may introduce changes to compilers, libraries, or tools that could affect user workflows.

  • Schedule updates during a maintenance window to minimize the impact on users.

Test Updates

  • Test the updated software in a controlled environment or on a test system before deploying it to production.

  • Verify that critical applications and workflows operate as expected with the new version.

Network and Download Requirements

  • Ensure that the system has stable internet connectivity for downloading updates from HPE repositories.

  • Verify that you have sufficient storage space for the downloaded software and any temporary files created during the installation.

Modules and User Environment

  • Check for changes in module files or naming conventions, as they may affect user scripts or workflows.

  • Update documentation or user guides if there are changes in the way modules are loaded or used.

Licensing Compliance

  • Confirm that you have valid licenses for the updated software components.

  • Ensure that any license servers or keys required for the CPE suite are properly configured and up to date.

Documentation and Tools

  • Have the relevant installation and upgrade documentation readily available for reference.

  • Ensure that required tools (for example, package managers, installation scripts) are installed and functional.

System Downtime Planning

  • Prepare for system downtime during the update process.

  • Allow for downtime, especially if the update requires restarting services or rebooting the system.

Addressing these considerations ensures a smooth and reliable update process for the CPE suite software.

Downloading CPE from the My HPE Software Center website

  1. Go to the HPE Support Center to access CPE software updates.

  2. Enter the name of the software needed (for example, Cray Programming Environment).

  3. Click Drivers and Software (either the tab near the top of right pane or from the left pane).

  4. Locate the software needed in the listed results.

  5. Click Obtain software. You are directed to the My HPE Software Center.

Downloading unofficial updates

HPE intermittently releases unofficial and unsupported pre-release updates. These unsupported releases occasionally address minor system bugs and can be downloaded from the CPE Online Documentation website under the How to Access our Token-Authenticated Package Repository page. Follow the instructions provided at the site to download this intermittent software.

CAUTION: Downloads from the CPE Online Documentation website are unofficial and unsupported by HPE. Use caution if downloading this pre-released software or software components.

Contact HPE support for additional details regarding software downloads from this site, as necessary. See Documentation and Support for information on contacting HPE support.

Determining system administrator status

Knowing who has administrator status and what privileges they hold in the HPE Cray Supercomputing Programming Environment is critical for safe, secure, and reliable operation. Use the procedures in this section to determine administrator status.

Prerequisite

You must have:

  • Root or an existing administrator account to make changes to user privileges.

  • Familiarity with the specific HPE Cray Supercomputing EX system, its architecture, and how administrative roles are managed.

  • An understanding of CPE and its components.

  • An understanding of differing administrator team roles.

  • The ability to log into a system management workstation or HPE service node. All administrative tasks require the system management workstation or HPE service node.

Procedure

  1. Ensure you have administrator access. Determining whether a new CPE user has administrator privileges for managing CPE involves checking their access to specific nodes, groups, commands, and files. This process varies slightly depending on the system architecture, terminals, and operating systems (for example, SLES or RHEL).

    a. Acquire SSH access to the appropriate node using a terminal application (for example, ssh on Linux/macOS or PuTTY on Windows).

    b. Log into the management node. For example:

    ssh username@<hostname>

    Replace <hostname> with the name of the management or login node (for example, smw01 for the SMW, or login01 for a login node).

    c. Verify the node type:

    hostname

  2. Check group memberships, and interpret results:

    groups

    If your username is part of the wheel, sudo, or crayadmin group, you likely have administrator privileges. If none of these groups are listed, you likely do not have administrative access.

  3. Test sudo access by running a test command:

    sudo ls /root

  4. If prompted, enter your user password. If the command succeeds, you have administrator privileges. If Permission denied or User is not in the sudoers file appears, you do not have administrative access.

  5. Test access to specific directories. For example:

    ls /opt/cray
    ls /etc/opt/cray
    

    If you can view the contents of these directories, you likely have administrative privileges.

  6. Check permissions. If you encounter Permission denied errors, you likely do not have the necessary privileges.

  7. Check module access:

    module list

    Administrative users should be able to load CPE modules.

  8. Load administrative modules:

    module load cpe

    If the module loads successfully, you likely have access to administrative tools. If you encounter errors, you may lack administrative privileges or the proper configuration.

  9. Verify access to the CPE tools:

    xtstat

    Administrative users generally have access to CPE-specific tools and commands. If the command works without errors, you likely have administrative privileges.

  10. Examine the tool configuration, and check whether you can access Cray-specific configuration files under /etc/opt/cray:

    ls /etc/opt/cray
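
The group-membership check in this procedure can be scripted. The following is a minimal sketch, assuming the administrative groups are the three named above (wheel, sudo, and crayadmin); group names vary by site, and has_admin_group is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if a space-separated group list contains one
# of the administrative groups named in this procedure.
# Argument: group list, for example the output of: id -Gn
has_admin_group() {
    for g in $1; do
        case "$g" in
            wheel|sudo|crayadmin) return 0 ;;
        esac
    done
    return 1
}
```

For example, has_admin_group "$(id -Gn)" succeeds when the current account belongs to one of these groups. As in the procedure, a match only indicates likely administrator access; the sudo and directory checks are still needed to confirm it.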

Setting up the initial administrator account

During initial installation of a new HPE Cray Supercomputing EX system with CPE, HPE sets up initial administrator privileges based on customer input. Assigning administrator privileges for CPE on an HPE Cray Supercomputing EX system is a key step during the installation process. CPE provides development tools and software for high-performance computing (HPC) workloads, and administrators need appropriate privileges to manage and configure this environment.

Note: The exact groups, permissions, and configuration steps may vary based on the specific version of CPE and the organization’s policies.

Prerequisite

Administrator access is required to initiate the procedures in this section.

Procedure

To set up an initial administrator account, HPE installers:

  1. Access the management node or system management interface to be able to manage the HPE Cray Supercomputing EX system. The installer uses secure credentials to access the management environment for configuration purposes.

  2. Identify or create a user account. This involves identifying the user account that will serve as the CPE Administrator. If an appropriate account does not already exist, the installer creates a new user account specifically for this role. This is typically done using standard Linux account management tools (useradd, passwd, and so forth) or through management scripts provided by HPE.

  3. Assign privileges to the user. Privilege assignment ensures that the user account has access to the necessary tools, software modules, and configuration files. This process may include:

    • Adding the user to administrator groups. The installer adds the user to specific system groups required for managing CPE. Common groups may include:

      • pe-admin: A group often associated with administrative access to CPE.

      • root or other system-level groups if broader administrative access is required.

      Commands to add a user to a group might include:

      usermod -aG pe-admin <username>

    • Granting access to CPE tools. This step ensures that a user can access and configure the CPE tools, such as compilers, libraries, and debugging utilities. This step might involve modifying environment variables, module paths, or configuration files located in directories, such as /opt/cray/pe/ or similar.

    • Access to file systems. The administrator must have access to relevant file systems where CPE tools and modules are installed. This may involve setting appropriate permissions on directories like:

      /opt/cray/pe/
      /etc/opt/cray/pe/
      
  4. Configure secure authentication for the administrator account to ensure that only authorized personnel can access CPE tools. This step can include setting up SSH key-based access, enforcing strong passwords, or enabling multi-factor authentication (MFA).

  5. Validate privileges. The installer tests the administrator account to ensure it has the required access and functionality to manage CPE. This validation step involves:

    a. Loading and unloading software modules (for example, using module load and module unload commands).

    b. Configuring compiler settings and library paths.

    c. Accessing debugging tools and performance analysis utilities.

  6. After the privileges are validated, the installer documents the setup process and provides the administrator credentials and relevant instructions to the designated CPE administrator. This documentation typically includes:

    • Account details.

    • Steps for managing CPE tools and modules.

    • Paths to configuration files and installed software.

  7. The installer ensures that the privilege assignment aligns with HPE documentation, best practices, and security requirements. Specific details may depend on the software version and organizational policies.

Setting up, managing, and maintaining CPE users and user groups

Prerequisites

  • Retain root or existing administrator privileges to set up or make changes to user privileges.

  • Retain login access to the system where CPE is installed, typically through SSH or another secure method.

  • Maintain familiarity with:

    • The specific HPE Cray Supercomputing EX system, its architecture, and how administrative roles are managed.

    • System authentication.

    • Job scheduling tools.

    • Role-based access control (RBAC) through configuration files or centralized authentication systems (such as LDAP, Active Directory)

    • Linux/Unix groups on the system (for example, crayadmin or similar groups)

Important: Before making system modifications, be sure to back up any configuration files or settings, particularly if modifying system-level or application-specific configurations.

Adding, deleting, and modifying configurations for CPE users

The following sections provide instructions for adding, deleting, and modifying configurations for CPE users.

Setting up a new CPE user

This procedure details how to set up a new CPE user. As you are completing this procedure, note that the exact commands and procedures may vary depending on the HPE Cray Supercomputing EX system configuration, authentication method (for example, LDAP, Kerberos), and job scheduler (for example, SLURM, PBS).

  1. Gather new user information and access requirements:

    • Username

    • Full name

    • Email address

    • Group or project association

    • Home directory requirements

    • Shell preferences (for example, bash, zsh)

  2. Use standard Linux commands to create the new user account. For example:

    sudo useradd -m -s /bin/bash -G <group> <username>

    Note:

    • -m: Creates a home directory for the user.

    • -s /bin/bash: Sets the user’s default shell.

    • -G <group>: Adds the user to a specific group (for example, a project group).

    • <username>: The system access name for the user.

  3. Set the user’s password:

    sudo passwd <username>

  4. Ensure the user has appropriate permissions to access the necessary directories and files:

    • Verify access to their home directory (for example, /home/<username>).

    • Configure access to shared project directories, if applicable.

  5. If the system uses resource allocation and limits (for example, through SLURM or PBS), assign the user to the correct groups for job scheduling and resource utilization.

    For SLURM-based systems, update the SLURM configuration to include the user in the appropriate account or partition. For example:

    sacctmgr add user <username> DefaultAccount=<account>

  6. Validate environment setup by ensuring that the user has access to CPE tools and modules. For example:

    • Check if the user can load CPE modules (for example, PrgEnv-cray, gcc, and so forth) through module load.

    • Verify paths to compilers and libraries are set correctly in their environment.

  7. Test and verify the account by logging in as the new user or ask them to log in to ensure:

    • Successful authentication.

    • Proper access to their home directory.

    • Ability to load necessary modules and submit jobs.

  8. Communicate credentials and guidelines. Provide the user with their login credentials, initial password, and instructions for accessing the system. Include guidelines for:

    • Changing their password.

    • Using modules to load tools.

    • Submitting jobs via the scheduler.

  9. After the account is active, monitor new user activity to ensure that:

    • They can run jobs successfully.

    • Their resource usage is within expected limits.

  10. Update system documentation or user management records to include new user information.
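
The account-creation commands in this procedure can be assembled and reviewed before anything is run. The following is a minimal sketch that only prints the commands; print_new_user_commands is a hypothetical helper name, the username check is an assumed site convention (lowercase letters, digits, underscore, and hyphen), and the SLURM line applies only to SLURM-based systems.

```shell
# Hypothetical helper: print (not run) the account-creation commands from
# this procedure so that they can be reviewed first.
# Arguments: username group slurm_account
print_new_user_commands() {
    # Assumed convention: names start with a lowercase letter or underscore
    # and contain only lowercase letters, digits, underscores, and hyphens.
    case "$1" in
        ''|*[!a-z0-9_-]*|[!a-z_]*)
            echo "invalid username: $1" >&2
            return 1
            ;;
    esac
    echo "sudo useradd -m -s /bin/bash -G $2 $1"
    echo "sudo passwd $1"
    echo "sacctmgr add user $1 DefaultAccount=$3"
}
```

For example, print_new_user_commands jdoe astro-research astro prints the three commands for a user named jdoe. The group and account names here are placeholders.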

Deleting an existing CPE user

This procedure details instructions for deleting an existing CPE user. Deleting an existing user from an HPE Cray Supercomputing EX system involves several steps to ensure that the user’s account is disabled, their files are handled appropriately, and any system records (for example, job scheduler configurations) are updated. To delete a CPE user:

  1. Confirm the username of the user to be removed. Gather any additional information about their account, such as:

    • Home directory location

    • Group memberships

    • Active jobs or queued jobs in the job scheduler

  2. If the system uses a job scheduler (for example, SLURM), check if the user has any active or pending jobs. For example, in SLURM:

    squeue -u <username>

    If active or queued jobs exist, coordinate with the user or relevant stakeholders to cancel them. To cancel jobs:

    scancel -u <username>

  3. Before permanently deleting the user, as a best practice, disable the account to prevent access while you handle their files and configurations. You can lock the account by running:

    sudo usermod -L <username>

    Alternatively, you can expire the account immediately:

    sudo chage -E 0 <username>

  4. If the user’s home directory or files need to be retained for archival or transfer purposes, back them up before deletion:

    tar -czvf /backup/location/<username>.tar.gz /home/<username>

  5. If you are ready to delete the account, remove the user and their home directory:

    sudo userdel -r <username>

    • The -r option removes the user’s home directory and mail spool.

    • If you do not want to delete their home directory, omit the -r flag.

  6. Remove the user from the job scheduler’s configuration. For example, in SLURM, you can remove the user from accounts or associations:

    sacctmgr delete user name=<username>

  7. If the user was part of specific groups (for example, project groups), remove their association from those groups:

    sudo gpasswd -d <username> <group>

    If the user was the only member of a specific group, consider deleting the group:

    sudo groupdel <group>

  8. Check for and remove any custom configurations or traces of the user, such as:

    • Entries in /etc/exports for NFS shares.

    • SSH keys in /etc/ssh/authorized_keys.

    • Resource allocation or quota configurations.

  9. Ensure that the user account and associated data have been removed:

    • Check for the username in the system:

      getent passwd <username>

    • Verify that the home directory or other files are no longer present.

  10. Update system documentation or user management records to reflect the removal of the user.
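
Because userdel -r is irreversible, it can help to confirm that the backup archive is readable before deleting the account, and to confirm afterward that the account is gone. The following is a minimal sketch under those assumptions; archive_and_verify and user_exists are hypothetical helper names, and the paths passed to them are site-specific.

```shell
# Hypothetical helper: archive a directory, then confirm that the archive can
# be read back before the account is deleted.
# Arguments: source_dir archive_path
archive_and_verify() {
    src_parent=$(dirname "$1")
    src_name=$(basename "$1")
    # -C stores paths relative to the parent directory of the source.
    tar -czf "$2" -C "$src_parent" "$src_name" || return 1
    # Listing the archive fails if it is truncated or corrupt.
    tar -tzf "$2" >/dev/null
}

# Hypothetical helper: succeed while the account still exists.
# After a completed deletion, this check is expected to fail.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}
```

For example, run archive_and_verify /home/<username> /backup/location/<username>.tar.gz before step 5, and confirm that user_exists <username> fails after step 9.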

Modifying configurations for an existing CPE user

Modifying the configurations of an existing CPE user on an HPE Cray Supercomputing EX system involves several steps, depending on the specific changes required. As a CPE administrator, you can adjust user settings related to account details, group memberships, job scheduler configurations, permissions, or environment variables.

Note: Always follow your organization’s policies and the official CPE documentation when performing user management tasks.

To modify an existing CPE user account:

  1. Determine the username of the user whose configurations need to be modified. Identify the specific changes required, such as:

    • Updating account information (for example, shell, home directory).

    • Modifying group memberships.

    • Adjusting job scheduling/resource allocations.

    • Changing environment variables or module configurations.

  2. To update basic user details like shell or home directory, use the usermod command:

    a. To change the user’s shell:

    sudo usermod -s /bin/zsh <username>

    b. To change the user’s home directory:

    sudo usermod -d /new/home/directory <username>

    c. To move the home directory, ensure the old files are transferred to the new location:

    sudo mv /home/<username> /new/home/directory
    sudo chown -R <username>:<group> /new/home/directory
    
  3. Add or remove the user from specific groups:

    a. To add the user to a group:

    sudo usermod -aG <group> <username>

    b. To remove the user from a group:

    sudo gpasswd -d <username> <group>

  4. If the system uses SLURM or another job scheduler, modify the user’s resource allocation, account, or partition access. For example, with SLURM:

    a. Change the user’s default account:

    sacctmgr modify user name=<username> set DefaultAccount=<new_account>

    b. Add the user to a new account:

    sacctmgr add user name=<username> Account=<new_account>

    c. Update resource limits by modifying resource limits associated with the user, such as CPU hours or memory allocations, through the SLURM database or configuration files.

  5. If the user needs access to new directories or files, adjust file system permissions using chmod or chown. For example, to grant access to a shared project directory:

    sudo chown <username>:<group> /path/to/project
    sudo chmod 770 /path/to/project
    
  6. If the user needs changes to their environment setup (for example, custom paths, module loading behavior), update their shell configuration files:

    • For bash: Modify /home/<username>/.bashrc or /home/<username>/.bash_profile.

    • For zsh: Modify /home/<username>/.zshrc.

    For example, add custom paths or default module loads:

    echo 'export PATH=/custom/software/bin:$PATH' >> /home/<username>/.bashrc
    echo 'module load PrgEnv-cray' >> /home/<username>/.bashrc
    

    If the system uses centralized module configurations, adjust the relevant files or scripts that define user-specific module loading behavior.

  7. Log in as the user or ask them to log in and verify that the changes are working as intended:

    • Check updated environment variables.

    • Confirm group memberships.

    • Test job submission and resource allocations.

  8. Update system documentation or user management records to reflect the changes made to the user’s configuration.

  9. Inform the user about the modifications and provide instructions if needed (for example, on new resource allocations or updated environment settings).
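
Appending to a shell configuration file with echo and >> adds a duplicate line each time the command is repeated. The following is a minimal sketch of an idempotent alternative; append_once is a hypothetical helper name, and it must be run by an account that can write to the target file.

```shell
# Hypothetical helper: append a line to a file only if that exact line is
# not already present, so that repeated runs do not duplicate entries.
# Arguments: line file
append_once() {
    # -x matches whole lines, -F treats the text literally, -q suppresses
    # output, and -e protects a line that begins with a hyphen.
    grep -qxF -e "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}
```

For example, append_once 'module load PrgEnv-cray' /home/<username>/.bashrc adds the line on the first run and leaves the file unchanged on later runs.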

Setting up a user as an administrator

The following procedure details how to add administrative privileges to a new or existing CPE user. Before completing this procedure, consider:

  • Security Impacts. Ensure that issuing administrative privileges is in line with your organization’s security policy.

  • CPE-specific Groups. Some HPE Cray Supercomputing EX systems may have specific administrative groups or roles. Be sure to check appropriate system documentation for any custom groups or access requirements. For documentation information, see Documentation and support.

  • Testing. After setup, test the user’s system capabilities to ensure that they have the required permissions without unnecessary access.

  1. Determine which node the user is to use:

    • Management Node (for example, SMW or HPE Service Node): Used for administrative tasks, such as system configuration, HPE Cray Supercomputing EX system software installation, and system monitoring.

    • Login Node: Used for accessing the programming environment and running user-level development tasks.

  2. Verify the hostname and confirm the node type:

    hostname

  3. Check if the user already exists:

    id username

  4. Do one of the following:

    • If the user does not exist, create the user by issuing:

      useradd -m -s /bin/bash username
      passwd username
      
    • If the user exists, continue with the next step.

  5. Log into the appropriate node using an account with root or administrator privileges:

    ssh root@<hostname>

    In the above example, <hostname> is the name of the management or login node. For example, to log in to the management node (SMW), where smw01 is a management node:

    ssh root@smw01

    To log in to the login node, where login01 is a login node:

    ssh root@login01

  6. Add the new or existing user to the appropriate administrative group. Do one of the following:

    • For SUSE Linux Enterprise Server (SLES):

      sudo usermod -aG groupname username
      

      In the above example, groupname is the name of the administrative group. SLES systems typically use the sudo or wheel group to grant administrative privileges. If the wheel group is not enabled for sudo access, ensure that the /etc/sudoers file contains an entry that enables it (edit with visudo).

    • For Red Hat Enterprise Linux (RHEL):

      a. Determine the appropriate administrative group. RHEL systems primarily use the wheel group to grant administrative privileges.

      b. Add the user to the wheel group:

      usermod -aG wheel username

      c. Ensure that the /etc/sudoers file contains an entry that grants sudo access to the wheel group (edit with visudo).

  7. Ensure that the user has access to CPE-specific administrative tools and configuration files. This action may involve granting permission to directories like /opt/cray, /etc/opt/cray, or other system directories where CPE is installed.

    sudo chown -R username /path/to/cray/directory
    sudo chmod -R 750 /path/to/cray/directory
    

    In the above example, /path/to/cray/directory is a specific path to a directory on the system.

  8. If additional HPE Cray-specific groups are defined (such as crayadmin), add the user to those groups:

    sudo usermod -aG crayadmin username

  9. Ensure that the user’s environment is set up to use CPE by updating shell configuration files (for example, .bashrc or .profile), including necessary module commands:

    module load cpe
    

    Check CPE-specific documentation for additional environment variables or modules that must be loaded.

  10. Test the user’s administrative access by switching to their account:

    su - username
    
  11. Confirm that the user can execute administrative commands, such as:

    sudo ls /root
    
  12. Confirm that the user can access and use CPE-specific tools.

  13. Document and record the user’s new administrative privileges for auditing and troubleshooting purposes.

Setting up and managing user groups

Setting up user groups in CPE involves creating and managing Linux groups, configuring access to shared resources, and integrating groups with the job scheduler (such as PBS or SLURM). User groups are essential for organizing users by projects, roles, or resource access requirements.

Note: If your system uses centralized authentication (for example, LDAP, Active Directory), group creation and management may need to be performed at the directory service level. Always refer to your organization’s policies and the official HPE Cray Supercomputing EX system and PBS documentation for best practices.

Planning for and setting up the user group with Slurm

  1. Plan the user group structure beforehand by determining the purpose and structure of the groups. For instance:

    • Are groups organized by projects, departments, or roles?

    • Will groups control access to specific directories, files, or resources?

    • Are there job scheduler partitions or accounts linked to these groups?

  2. Document the group names, their members, and their intended purpose.

  3. Create a new group:

    a. Use the groupadd command to create a new Linux group for the users:

    sudo groupadd <groupname>

    For example, to create a group for a project called astro-research, enter:

    sudo groupadd astro-research

    b. (Optional) Assign a specific Group ID (GID). You can specify a GID during group creation:

    sudo groupadd -g <GID> <groupname>

  4. Add users to the group:

    a. Add users to the group using the usermod command:

    sudo usermod -aG <groupname> <username>

    For example, to add a user jdoe to the astro-research group:

    sudo usermod -aG astro-research jdoe

    b. Use the groups command to verify the user’s group memberships:

    groups <username>

  5. If you are using PBS, configure queues based on user groups to control access to resources. To create or modify a PBS queue:

    a. Edit the PBS queue configuration to restrict access to a specific group:

    qmgr -c "create queue climate_queue queue_type=execution"
    qmgr -c "set queue climate_queue acl_group_enable=True"
    qmgr -c "set queue climate_queue acl_groups=climate-research"
    qmgr -c "set queue climate_queue enabled=True"
    qmgr -c "set queue climate_queue started=True"
    

    Note: In the above example, a queue named climate_queue is created, and access is restricted to users in the climate-research group by enabling group access control (acl_group_enable) and setting the acl_groups attribute.

    b. Set default queue resource limits for the queue:

    qmgr -c "set queue climate_queue resources_max.walltime=48:00:00"
    qmgr -c "set queue climate_queue resources_max.ncpus=64"
    qmgr -c "set queue climate_queue default_chunk.ncpus=1"
    

    c. Set global PBS server policies to control group-based resource access by enabling group access control at the server level:

    qmgr -c "set server acl_group_enable=True"

    d. Define global resource limits for groups. For example, to restrict the running jobs of the climate-research group to a maximum of 100 CPUs across the system, enter:

    qmgr -c "set server max_run_res.ncpus=[g:climate-research=100]"

  6. (Optional) If the group is to share a directory (for example, for project files), create the directory and configure permissions:

    sudo mkdir /shared/projects/astro-research
    sudo chown :astro-research /shared/projects/astro-research
    sudo chmod 770 /shared/projects/astro-research
    

    Note: In the above example, 770 grants full access to the group and the directory owner but denies access to others.

  7. (Optional) Set the setgid bit on the directory so that files created in it inherit the group ownership:

    sudo chmod g+s /shared/projects/astro-research

  8. If you are using Slurm, configure accounts to control resource access and allocations:

    a. Create or update an account in Slurm:

    sacctmgr add account name=<accountname> description="Astro Research Project"

    b. Associate the group with the Slurm account:

    c. Add users in the group to the corresponding Slurm account:

    sacctmgr add user name=<username> account=<accountname>

  9. Set resource limits for the group either at the system level (for example, through ulimit or cgroups) or in the job scheduler. For example, set Slurm QoS for a group to limit resource usage:

    sacctmgr add qos name=<qosname> maxtres=cpu=1000 maxtresperuser=cpu=100
    sacctmgr modify account name=<accountname> set qos=<qosname>
    
  10. Ensure that the group is functioning as intended:

    • Verify group memberships with groups <username>.

    • Check access to shared directories and resources.

    • Confirm users can submit jobs with the correct group/account settings in Slurm or PBS:

      sbatch --account=<accountname> jobscript.slurm

      or

      qsub -q climate_queue jobscript.pbs

      Search Slurm Workload Manager documentation for more information on sbatch and batch scripts.

    • Confirm that users not in the group are denied access to the queue or resources.

  11. Document the configuration, and maintain a record of the groups, their members, and their configurations:

    • Group names and members.

    • Associated PBS queues and resource limits.

    • Shared directories and file permissions.

    Where applicable, maintain a record of the groups, their members, and their purpose. Include:

    • Group name and GID.

    • Members of the group.

    • Associated Slurm accounts, QoS, or resource limits.

    • Shared directories and access permissions.

  12. Inform users about their group memberships, shared directory locations, and PBS queue configurations. Provide instructions on how to submit jobs to the appropriate queue.
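
The shared-directory steps above can be spot-checked with a small script. The following is a minimal sketch, not an HPE-provided tool: the function name check_shared_dir is hypothetical, and it assumes GNU coreutils stat and that the directory is expected to have mode 2770, which is the result of applying chmod 770 and chmod g+s as shown in the procedure.

```shell
# Hypothetical helper: report whether a directory has mode 2770
# (rwx for owner and group, no access for others, setgid bit set).
# Assumes GNU coreutils stat.
check_shared_dir() {
    dir=$1
    # %a prints the permission bits in octal, including the setgid digit.
    mode=$(stat -c '%a' "$dir") || return 2
    if [ "$mode" = "2770" ]; then
        echo "ok: $dir has mode $mode"
    else
        echo "mismatch: $dir has mode $mode, expected 2770"
        return 1
    fi
}
```

For example, running check_shared_dir /shared/projects/astro-research prints an ok line when both the permissions and the setgid bit are in place, and returns a nonzero status otherwise.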

Managing users for user groups

As you are managing user access within groups, note that:

  • If the system uses centralized authentication (such as LDAP, Active Directory), group membership changes may need to be made in the directory service instead of directly on the system. Consult your organization’s policies for managing groups in such environments.

  • You should always verify that users are added to the correct groups and that group permissions align with your organization’s security and resource access policies.

Adding users to a user group

  1. Identify the group and users:

    • Determine the name of the existing group to which users will be added.

    • Identify the usernames of the users to be added to the group.

    • Confirm the purpose of the group (for example, file sharing or job scheduler access) and ensure the users require access.

  2. Use the usermod or gpasswd command to add users to the group.

    • For a single user, enter:

      sudo usermod -aG <groupname> <username>

      Note: In the above example:

      • -aG: Appends (-a) the user to the supplementary group (-G) without removing them from other groups.

      • <groupname>: The name of the group.

      • <username>: The name of the user.

      Example:

      sudo usermod -aG research-group jdoe

    • For multiple users, enter:

      sudo gpasswd -M <user1>,<user2>,<user3> <groupname>

      This replaces the group’s membership with the specified users. If you want to add users without overriding current members, add them individually using usermod.

      Example:

      sudo gpasswd -M jdoe,asmith,kwong research-group

  3. After adding users, verify that they have been successfully added to the group.

    a. Check a user’s group memberships:

    groups <username>

    Example:

    groups jdoe

    b. Check the members of a specific group. To see all members of a group as recorded in the group database (including /etc/group), enter:

    getent group <groupname>

    Example:

    getent group research-group

  4. If the group controls access to resources, such as shared directories or job scheduler queues, ensure that new users can access those resources:

    • Shared Directories: Verify the users have the appropriate permissions for any shared directories associated with the group.

      ls -ld /path/to/shared/directory

      If needed, update directory permissions:

      sudo chown :<groupname> /path/to/shared/directory
      sudo chmod 770 /path/to/shared/directory
      
    • Job Scheduler (PBS): If the group is associated with a PBS queue, confirm that new users can access the queue. You may need to update PBS access controls:

      qmgr -c "set queue <queue_name> acl_user_enable=True"
      qmgr -c "set queue <queue_name> acl_users+=<username>"
      
  5. Ask users to test their access to shared directories, PBS queues, or other resources to confirm they have been properly added to the group.

  6. Document the changes. Update system documentation or user management records to reflect the changes to group membership. Record:

    • Group name.

    • Users added to the group.

    • Purpose of the group and associated resources.

  7. Inform the users of their new group membership and provide instructions if necessary (for example, how to access shared directories, submit jobs to specific queues, and so forth).
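
The membership check in step 3 can also be scripted. The sketch below is illustrative only: group_has_member is a hypothetical helper name, and it assumes the standard group entry layout of name:password:GID:member1,member2 that getent group prints. Users who belong to the group only through their primary GID do not appear in the member field, so this check covers supplementary membership only.

```shell
# Hypothetical helper: succeed if USER is listed in the member field of a
# group database entry (name:password:GID:member1,member2,...).
group_has_member() {
    entry=$1
    user=$2
    members=$(printf '%s\n' "$entry" | cut -d: -f4)
    # Surround the list with commas so only whole names match.
    case ",$members," in
        *",$user,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use on a live system (not executed here):
#   group_has_member "$(getent group research-group)" jdoe && echo "jdoe is a member"
```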

Removing users from groups

To remove a user from a group:

  1. Understand the context in which the users are being removed. In Slurm, user groups are typically managed through Linux user groups on the system. Slurm uses these groups to control access to resources through accounts and associations. To remove a user from a group associated with Slurm, primarily interact with the system’s user and group management tools.

  2. Identify the group and users:

    • Confirm the group name from which the users will be removed.

    • Identify the usernames of the users to be removed.

    • Verify that the users no longer require access to resources associated with the group (for example, shared directories, PBS queues).

  3. Remove the user(s) from the Linux group. If the user group in Slurm is linked to a Linux group, remove users from that group using standard Linux commands.

    • To remove one user, enter:

      sudo gpasswd -d <username> <groupname>

      In the above example, replace <username> with the username of the user you want to remove, and <groupname> with the name of the group.

      Example:

      sudo gpasswd -d jdoe research-group

    • To replace the entire membership of a group while excluding certain users, use the gpasswd command with the -M option to redefine the group membership:

      sudo gpasswd -M <remaining_user1>,<remaining_user2> <groupname>

      Example:

      sudo gpasswd -M asmith,kwong research-group

      The above command overwrites the group membership with only the specified users.

    • To edit the /etc/group file directly instead, open it in an editor:

      sudo nano /etc/group

  4. If you are editing /etc/group directly, locate the <groupname> entry, and remove the user’s name from the list of group members.

  5. Save and exit the file.

  6. If Slurm is configured to use its own account and association system, ensure the user’s association with the group is removed in Slurm. Use the sacctmgr command to manage Slurm accounts and associations. To remove a user from a Slurm account or group, enter:

    sacctmgr remove user where user=<username> account=<accountname>

    In the above example, replace <username> with the username of the user and <accountname> with the name of the Slurm account or group.

    Example:

    sacctmgr remove user where user=johndoe account=research

  7. Verify that the users are no longer part of the group, and revoke their access to any associated resources, including PBS queues:

    a. Check the user’s group memberships to confirm that a user is no longer a member of the group by issuing:

    groups <username>

    Example:

    groups jdoe

    b. Check members of a specific group to confirm the current membership of a group:

    getent group <groupname>

    Example:

    getent group research-group

    c. If the group is associated with specific resources, ensure that the user’s access to those resources is revoked, as applicable.

    For shared directories, verify that the user can no longer access shared directories associated with the group. If necessary, update permissions:

    sudo chmod 770 /shared/projects/research-group
    sudo chown :research-group /shared/projects/research-group
    

    For the PBS job scheduler, if the group is tied to a PBS queue, ensure that the user’s access to the queue is also revoked. For example:

    qmgr -c "set queue <queue_name> acl_users-=<username>"

    In the above example, replace <queue_name> with the name of the queue and <username> with the user being removed.

  8. Confirm the changes when prompted.

  9. Verify that the user has been removed from the group:

    • Check the Linux group membership by entering:

      groups <username>

      The above command should no longer list the specified <groupname>.

    • Check Slurm account associations, as applicable:

      sacctmgr show associations where user=<username>

      The system should no longer list the association with the specified <accountname> after issuing the above command.

  10. If the user has been removed from a group or account that controls access to compute resources, notify them to prevent confusion.

  11. Document updates. Include:

    • Group name.

    • Users removed.

    • Associated resources affected.

  12. Inform the affected users of the changes, especially if their access to specific resources (for example, directories or job queues) has been revoked.
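
Because gpasswd -M overwrites the whole membership, it can help to derive the remaining member list from the current group entry instead of typing it by hand. The following is a minimal sketch under that assumption; members_without is a hypothetical helper name, and it expects an entry in the name:password:GID:members form printed by getent group.

```shell
# Hypothetical helper: print the comma-separated member list of a group
# entry with one user removed. All other members are kept in order.
members_without() {
    entry=$1
    user=$2
    printf '%s\n' "$entry" | cut -d: -f4 | tr ',' '\n' |
        grep -v -x -F -- "$user" | paste -s -d, -
}

# Possible use on a live system (not executed here). Review the list first:
# an empty result would clear the group's membership entirely.
#   remaining=$(members_without "$(getent group research-group)" jdoe)
#   sudo gpasswd -M "$remaining" research-group
```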

Deleting unused groups

If a group is no longer needed, use:

sudo groupdel <groupname>

Common administrator customization and configuration tasks

CPE is designed to optimize and simplify the development, debugging, and execution of applications on HPE Cray supercomputers. As a system administrator, managing the CPE involves configuring, customizing, and maintaining the environment to meet the needs of users and workloads. Common administrative procedures include managing software modules, configuring compilers and libraries, customizing job environment settings, and optimizing system performance.

Prerequisites

To perform the procedures in this chapter, you must have:

  • Root or administrative access privilege to the HPE Cray Supercomputing EX system.

  • Access to the configuration files for Lmod (typically in /etc/modulefiles or /opt/modulefiles).

  • Access to HPE Cray Supercomputing EX system library directories.

  • Access to the module system configuration files (for example, /etc/profile.d/modules.sh or /etc/profile.d/lmod.sh).

  • Access to sample applications for profiling.

  • HPE CrayPAT module installed.

  • Root access to the Slurm configuration files (/etc/slurm/slurm.conf).

  • Familiarity with the module system (module command), wrapper scripts (such as cc, CC, ftn), and Slurm commands (for example, sbatch, srun).

Managing software modules

The CPE uses the Lmod module system to manage software environments, allowing users to load and unload specific versions of tools, compilers, and libraries. To manage software modules:

  1. Verify the module system configuration by checking which modulefile directories are on the module search path:

    echo $MODULEPATH

  2. Verify that the module system is operational:

    module avail

  3. Add a new module by copying or creating a modulefile for the software:

    nano /opt/modulefiles/software_name/version

    Example: gcc modulefile content:

    #%Module1.0
    proc ModulesHelp { } {
        puts stderr "This module loads GCC version x.y.z"
    }
    module-whatis "GCC Compiler x.y.z"
    setenv GCC_HOME /opt/gcc/x.y.z
    prepend-path PATH /opt/gcc/x.y.z/bin
    prepend-path LD_LIBRARY_PATH /opt/gcc/x.y.z/lib64
    
  4. Bypass the cached module list so that the new modulefile is visible, as applicable:

    module --ignore_cache avail

  5. Load the module, and verify that it works:

    module load software_name/version
    software_name --version
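
When several packages need similar modulefiles, the GCC example above can be generated rather than typed. The following is a minimal sketch, not part of CPE: make_modulefile is a hypothetical helper, the lib64 layout is taken from the example above, and the version used in the usage comment is a placeholder.

```shell
# Hypothetical helper: print a minimal Tcl modulefile modeled on the GCC
# example above. Arguments: software name, version, installation prefix.
make_modulefile() {
    name=$1
    version=$2
    prefix=$3
    # Build an environment variable name such as GCC_HOME from the name.
    home_var=$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]' | sed 's/-/_/g')_HOME
    cat <<EOF
#%Module1.0
proc ModulesHelp { } {
    puts stderr "This module loads $name version $version"
}
module-whatis "$name $version"
setenv $home_var $prefix
prepend-path PATH $prefix/bin
prepend-path LD_LIBRARY_PATH $prefix/lib64
EOF
}

# Possible use (placeholder version and paths, not executed here):
#   make_modulefile gcc x.y.z /opt/gcc/x.y.z > /opt/modulefiles/gcc/x.y.z
```

Review the generated file before installing it, because real packages often need additional variables or dependencies.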
    

Configuring default modules

Default modules can be configured to ensure that users have access to essential tools and libraries when they log in. To do so:

  1. Edit the default module configuration by modifying the system-wide module initialization file:

    nano /etc/profile.d/modules.sh

  2. Add commands to load default modules:

    module load cpe/22.10
    module load gcc/11.2.0
    
  3. Log in as a non-administrative user and verify that the modules are loaded by default:

    module list

If issues arise, revert the changes by editing the file again or restoring the previous configuration.
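
A login script that calls module unconditionally can print errors in shells where the module command has not been initialized. One way to guard against that is sketched below. This is an assumption about how a site might structure its profile script, not a CPE requirement, and load_default_modules is a hypothetical function name.

```shell
# Hypothetical helper for a profile script: load each named module, but
# only when the module command is actually defined in this shell.
load_default_modules() {
    if ! command -v module >/dev/null 2>&1; then
        return 0
    fi
    for m in "$@"; do
        # Keep going if one module is missing so the login still completes.
        module load "$m" || echo "warning: could not load $m" >&2
    done
}

# Possible use in /etc/profile.d/modules.sh (not executed here):
#   load_default_modules cpe/22.10 gcc/11.2.0
```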

Configuring compiler and library defaults

CPE includes compilers (such as the HPE Cray Compiler Environment (CCE), GCC, Intel, and so forth) and libraries (such as Cray LibSci, MPI, and so forth). Administrators can configure default versions or customize compiler options.

  1. Set the default compiler version. For module-based configurations:

    module unload cce gcc intel
    module load gcc/11.2.0
    
  2. Verify the default compiler:

    cc --version

  3. Customize compiler flags by editing the system-wide compiler wrapper configuration file (usually found in /opt/cray/pe):

    nano /opt/cray/pe/compilers/default/compiler_flags

  4. Add custom flags:

    export CFLAGS="-O3 -march=native"
    export LDFLAGS="-L/opt/cray/lib"
    
  5. Compile a sample program to verify the configuration:

    cc test_program.c -o test_program
    ./test_program
    

Customizing job scheduler integration

HPE Cray Supercomputing EX systems often use Slurm as the workload manager. To customize Slurm settings to optimize job submission and resource usage for the CPE:

  1. Edit the Slurm configuration:

    nano /etc/slurm/slurm.conf

    Example: Adding constraints for high-bandwidth memory (HBM):

    NodeName=cray[1-4] RealMemory=64000 Gres=hbm:16

  2. Apply the changes by restarting Slurm:

    systemctl restart slurmctld
    systemctl restart slurmd
    
  3. Submit a test job requesting HBM:

    sbatch --gres=hbm:4 test_job.sh

Managing HPE Cray-specific libraries

HPE Cray LibSci is a key library for scientific computing. To configure or update it:

  1. Check installed versions by listing available versions of HPE Cray LibSci:

    module avail cray-libsci

  2. Set default LibSci version by loading the desired version. For example:

    module load cray-libsci/25.03.1

  3. Verify the configuration by compiling and linking a sample program using HPE Cray LibSci:

    cc -o test_program test_program.c -lsci

Optimizing performance using HPE CrayPAT

HPE Cray Performance Measurement and Analysis Tools (CrayPAT) are used to profile and optimize applications.

  1. Load the HPE CrayPAT module:

    module load perftools

  2. Compile the application with instrumentation:

    cc -h profile_generate -o test_program test_program.c

  3. Run and collect data by executing the program to generate performance data:

    srun ./test_program

  4. Use HPE CrayPAT tools to analyze the collected data:

    pat_build -O test_program
    pat_report test_program.xf
    

Common security protocols

Setting up common CPE-related security protocols is a critical task for administrators to ensure the system is secure, logs are properly analyzed, and unauthorized access or malicious activity is detected and mitigated. This chapter details some of the basic areas that an administrator should focus on, the steps to set up and analyze security protocols, and the tools required for each procedure. This process includes securing user authentication, configuring logging and auditing, setting up network security, and monitoring for suspicious activity.

Basic and key areas where security protocols need to be established include:

Area

Description

User Authentication and Access Control

- Ensure secure login mechanisms (for example, SSH with key-based authentication). Restrict user access using PAM (Pluggable Authentication Module) and account policies.

- Enforce strong password policies.

Logging and Auditing

- Centralize system logs.

- Enable detailed auditing of system and user activities.

- Monitor logs for anomalies.

Network Security

- Restrict network access using firewalls or iptables.

- Configure secure communication protocols.

- Monitor network traffic for unauthorized access.

Software and Environment Security

- Protect CPE-related modules and libraries.

- Ensure system updates and patches are applied.

Monitoring and Intrusion Detection

- Set up intrusion detection systems (for example, fail2ban, auditd).

- Analyze logs for unusual activity.

User authentication and access control

The procedure in this section provides steps for securing user authentication and limiting access to authorized users only.

Prerequisites

You must have:

  • Root or administrator access to the system.

  • SSH installed and configured.

  • Access to ssh-keygen, passwd, and Pluggable Authentication Module (PAM) configuration files.

Procedure

To secure user authentication and limit access to authorized users:

  1. Log in to the management node:

    ssh admin@<hostname>

  2. Set up key-based SSH authentication:

    a. Generate SSH keys on the client machine:

    ssh-keygen -t rsa -b 4096

    b. Copy the public key to the HPE Cray Supercomputing EX system:

    ssh-copy-id user@<hostname>

    c. Disable password-based authentication in /etc/ssh/sshd_config:

    PasswordAuthentication no

    d. Restart the SSH service:

    sudo systemctl restart sshd

  3. Restrict root login by editing /etc/ssh/sshd_config to disable root login:

    PermitRootLogin no

  4. Enforce strong password policies by editing /etc/security/pwquality.conf to set password requirements:

     minlen = 12
     dcredit = -1
     ucredit = -1
     lcredit = -1
     ocredit = -1
    
  5. Test the changes:

    passwd <username>

  6. Limit user access by restricting system access to specific users using /etc/security/access.conf:

    -:ALL EXCEPT admin_user:ALL
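
After editing /etc/ssh/sshd_config, it can be useful to confirm which value a keyword actually takes. The sketch below is a simplified illustration: sshd_setting is a hypothetical helper, it relies on the sshd_config rule that the first value found for a keyword is the one used, and it does not handle Match blocks or Include directives. On a live system, sshd -T reports the effective configuration.

```shell
# Hypothetical helper: print the first uncommented value of KEY in an
# sshd_config-style file. Keywords are matched case-insensitively.
sshd_setting() {
    file=$1
    key=$2
    awk -v key="$key" '
        /^[ \t]*#/ { next }                               # skip comment lines
        tolower($1) == tolower(key) { print $2; exit }    # first match wins
    ' "$file"
}

# Possible use (not executed here):
#   [ "$(sshd_setting /etc/ssh/sshd_config PasswordAuthentication)" = "no" ] \
#       || echo "password authentication is not disabled"
```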

Enabling logging and auditing

This procedure details the steps for enabling logging and auditing of user and system activity.

Prerequisites

You must have:

  • Administrative access to the system.

  • Access to rsyslog, journalctl, and auditd tools.

Procedure

To enable logging and auditing:

  1. Enable system logging:

    sudo systemctl enable rsyslog
    sudo systemctl start rsyslog

  2. Check the configuration file /etc/rsyslog.conf to ensure log files are written to /var/log:

    tail -f /var/log/messages

  3. Enable persistent journal logs by configuring the journal to persist logs across reboots:

    sudo mkdir -p /var/log/journal
    sudo systemctl restart systemd-journald

  4. Ensure auditd is installed:

    sudo yum install audit
    sudo systemctl enable auditd
    sudo systemctl start auditd
    
  5. Add audit rules to /etc/audit/rules.d/audit.rules:

    -w /etc/passwd -p wa -k passwd_changes
    -w /etc/shadow -p wa -k shadow_changes
    -w /var/log/secure -p wa -k auth_logs
    
  6. Analyze the logs for anomalies. Use journalctl to view recent logs:

    journalctl -xe

  7. Search for specific keywords (such as error or failed):

    grep -i "error" /var/log/secure
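
Keyword searches become easier to act on when the matches are summarized. The following is a minimal sketch of one such summary. It assumes OpenSSH log messages of the form "Failed password for <user> from <address> port ...", and failed_logins_by_ip is a hypothetical helper name. The exact message format can vary by OpenSSH version and configuration.

```shell
# Hypothetical helper: count "Failed password" messages per source address
# in an authentication log, with the most frequent source listed first.
failed_logins_by_ip() {
    awk '/Failed password/ {
        # The source address is the word that follows "from".
        for (i = 1; i < NF; i++)
            if ($i == "from") { count[$(i + 1)]++; break }
    }
    END { for (ip in count) print count[ip], ip }' "$1" |
        sort -k1,1nr -k2,2
}

# Possible use (not executed here):
#   sudo sh -c '. ./failed_logins.sh; failed_logins_by_ip /var/log/secure'
```

A source address with many failures in a short period is a candidate for closer review with the audit and intrusion detection tools described in this chapter.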

Setting up network security

This CPE-related procedure provides instructions for setting up network security.

Prerequisites

You must have access to:

  • Network configuration files.

  • iptables, firewalld, iftop, and sar.

Procedure

  1. Use firewalld to allow only specific ports:

    sudo firewall-cmd --add-service=ssh --permanent
    sudo firewall-cmd --add-service=slurm --permanent
    sudo firewall-cmd --reload
    
  2. Use iptables to limit SSH attempts:

    sudo iptables -A INPUT -p tcp --dport 22 -m connlimit --connlimit-above 5 -j DROP

  3. Use iftop to monitor real-time network activity:

    sudo iftop -i eth0

    The iftop command is a real-time network traffic monitoring tool commonly used to observe and analyze network activity on a specific interface. By running sudo iftop -i <interface>, you can monitor bandwidth usage, including inbound and outbound traffic between nodes or external systems.

    Example Normal Report:

    This example normal report is for a supercomputing environment where nodes are exchanging data for workloads like parallel computations or file transfers. No anomalies are present.

            10.0.0.1                  =>       10.0.0.2        5.5Mb    5.6Mb    5.5Mb
                                        <=                    4.0Mb    4.1Mb    4.0Mb
            10.0.0.3                  =>       10.0.0.4        1.2Mb    1.0Mb    1.1Mb
                                        <=                    0.8Mb    0.7Mb    0.8Mb
            10.0.0.5                  =>       10.0.0.6        0.5Mb    0.5Mb    0.5Mb
                                       <=                    0.4Mb    0.4Mb    0.4Mb
    
    ----------------------------------------------------------------------------
    TX:       7.2Mb                 RX:         5.2Mb       TOTAL: 12.4Mb
    

    Report Explanation:

    • Traffic Patterns:

      • The source (10.0.0.1, 10.0.0.3, 10.0.0.5) and destination (10.0.0.2, 10.0.0.4, 10.0.0.6) nodes are communicating normally.

      • Bandwidth usage is proportional to expected workload, with no significant spikes.

    • Traffic Volume:

      • Outbound (TX) traffic is 7.2Mb, and inbound (RX) traffic is 5.2Mb.

      • The total bandwidth usage on the interface is 12.4Mb, which is reasonable for moderate workloads.

    • Steady Traffic: Bandwidth usage is consistent across time intervals (2s, 10s, 40s averages are similar).

    If anomalies are detected, resolve:

    • Irregular traffic patterns by investigating the process or application on 10.0.0.1, or check job logs, network configurations, or system resource usage for anomalies.

    • Unusual traffic patterns by using tools like netstat, tcpdump, or ss to identify the processes generating traffic, or checking application logs or job scheduler activity for anomalies.

    • Idle or under-utilized network activity by verifying whether the interface is correctly configured and active with ip link show eth0, or checking if jobs or applications are running that should generate traffic.

  4. Use sar to sample network interface statistics (in this example, five samples at one-second intervals):

    sar -n DEV 1 5

    Example Normal Report:

    In the example normal report, the system is handling steady network traffic with no apparent anomalies.

    12:00:01 AM IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
    12:00:02 AM eth0      1200      1100      3200      3100      0.00      0.00      0.00
    12:00:03 AM eth0      1250      1150      3300      3200      0.00      0.00      0.00
    12:00:04 AM eth0      1300      1200      3400      3300      0.00      0.00      0.00
    12:00:05 AM eth0      1290      1190      3380      3250      0.00      0.00      0.00
    Average:     eth0      1260      1160      3320      3210      0.00      0.00      0.00
    

    Report Heading

    Explanation

    rxpck/s, txpck/s

    - Packets received/transmitted per second. Normal values depend on the workload but should remain consistent during steady traffic.

    rxkB/s, txkB/s

    - Received and transmitted kilobytes per second. Normal values depend on the expected data transfer rates for the application.

    rxcmp/s, txcmp/s

    - Compressed packets. Typically 0.00 unless compression is enabled.

    Example Abnormal Report:

    For the example abnormal report, investigate the source of excessive outbound traffic (for example, application logs, intrusion detection). Check for network congestion or malicious activity.

    12:00:01 AM IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
    12:00:02 AM eth0        10     10000        50     10000      0.00      0.00      0.00
    12:00:03 AM eth0        15     12000        60     12000      0.00      0.00      0.00
    12:00:04 AM eth0        10     15000        50     15000      0.00      0.00      0.00
    12:00:05 AM eth0        12     14000        55     14000      0.00      0.00      0.00
    

    Report Heading

    Explanation

    High txpck/s and txkB/s

    - Extremely high transmission rates suggest excessive outbound traffic, possibly caused by a misconfigured application or a denial-of-service (DoS) attack.

    Low rxpck/s and rxkB/s

    - Very low inbound traffic may indicate a connectivity issue or an imbalance in communication.
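
The pattern in the abnormal report, where transmitted kilobytes far exceed received kilobytes, can be flagged automatically. The sketch below is illustrative only: flag_tx_imbalance is a hypothetical helper, the default ratio of 10 is an arbitrary assumption to tune for the site's workloads, and the column positions are located from the IFACE header so that both 12-hour and 24-hour timestamp layouts work.

```shell
# Hypothetical helper: read "sar -n DEV" output on stdin and print each
# sample whose txkB/s exceeds RATIO times its rxkB/s (default RATIO: 10).
flag_tx_imbalance() {
    ratio=${1:-10}
    awk -v ratio="$ratio" '
        # Header line: remember which column holds the interface name.
        { for (i = 1; i <= NF; i++) if ($i == "IFACE") { col = i; next } }
        col && $1 != "Average:" && NF >= col + 4 {
            rx = $(col + 3) + 0      # rxkB/s
            tx = $(col + 4) + 0      # txkB/s
            if (tx > ratio * (rx > 0 ? rx : 1))
                print $col, "rxkB/s=" rx, "txkB/s=" tx
        }'
}

# Possible use (not executed here):
#   sar -n DEV 1 5 | flag_tx_imbalance 10
```

Applied to the abnormal report above, every sample is flagged. Applied to the normal report, nothing is printed.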

Setting up software and environment security

The procedure in this section provides details on securing CPE-related modules and software, and ensuring that updates are applied.

Prerequisites

You must have:

  • Administrative access.

  • Access to the module command and the system package manager (yum or zypper).

Procedure

  1. Restrict access to critical modules by modifying permissions for sensitive modulefiles:

    chmod 750 /opt/modulefiles/cce
    chgrp admin_group /opt/modulefiles/cce
    
  2. Update HPE Cray Supercomputing EX software and dependencies:

    sudo yum update

  3. Check for missing or corrupted modules:

    module avail

Monitoring and intrusion detection

The procedure in this section provides details for detecting and mitigating unauthorized access or activity using monitoring and intrusion detection tools.

Prerequisites

You must have:

  • Administrative access.

  • Access to fail2ban and auditd.

Procedure

  1. Install fail2ban:

    sudo yum install fail2ban
    sudo systemctl enable fail2ban
    sudo systemctl start fail2ban
    
  2. Configure jail rules in /etc/fail2ban/jail.local:

    [sshd]
    enabled = true
    port = ssh
    filter = sshd
    logpath = /var/log/secure
    maxretry = 5
    
  3. Monitor intrusion attempts by viewing the status of the sshd jail, which lists currently banned IP addresses:

    sudo fail2ban-client status sshd

  4. Use ausearch to inspect suspicious activity recorded by auditd:

    sudo ausearch -k auth_logs

Common CPE monitoring tasks

Maintaining the health of CPE involves monitoring system health, analyzing log files for diagnostics, and tracking software usage to ensure optimal functionality and usage by users. This chapter details the most common administrative procedures for these tasks.

Analyzing log files and diagnostic outputs

Analyzing system and application logs is critical for diagnosing CPE issues, such as module failures, job errors, or hardware malfunctions.

Prerequisites

  • Administrative access to system logs.

  • Familiarity with log locations and tools (less, grep, journalctl).

Required tools/systems

  • System logs (/var/log), Slurm logs (/var/log/slurm/slurmctld.log), and CPE-specific logs (such as craype.log).

  • Log analysis tools: grep, less, journalctl.

Reviewing system logs

To review system logs:

  1. Locate general system logs:

    ls /var/log

  2. List CPE-specific logs (for example, craype.log):

    ls /opt/cray/logs

  3. Use grep to filter errors or warnings:

    grep -i error /var/log/messages

  4. Analyze recent system events:

    journalctl -xe

    Example: Problematic Output

    Oct 30 12:47:10 smw01 kernel: eth0: Link is Down
    Oct 30 12:47:12 smw01 kernel: eth0: Link is Up - 1Gbps/Full - flow control off
    Oct 30 12:47:12 smw01 systemd-networkd[112]: eth0: Lost carrier
    Oct 30 12:47:12 smw01 systemd-networkd[112]: eth0: Configured
    Oct 30 12:48:30 smw01 kernel: eth0: Link is Down
    Oct 30 12:48:45 smw01 kernel: eth0: Transmit queue timeout
    Oct 30 12:48:45 smw01 kernel: eth0: Reset adapter
    Oct 30 12:49:00 smw01 systemd-networkd[112]: eth0: Could not configure: Network unreachable
    

    The above example suggests intermittent connectivity issues:

    • Transmit Queue Timeout: Indicates that packets are queued for transmission but the system is unable to send them:

      Oct 30 12:48:45 smw01 kernel: eth0: Transmit queue timeout

      This issue could be caused by hardware issues (such as faulty NIC or cable) or excessive network congestion.

    • Reset Adapter: The kernel resets the network adapter to recover from the timeout:

      Oct 30 12:48:45 smw01 kernel: eth0: Reset adapter

    • Network Unreachable: Indicates that the system could not configure the network interface due to a lack of connectivity:

      Oct 30 12:49:00 smw01 systemd-networkd[112]: eth0: Could not configure: Network unreachable

To troubleshoot problematic output:

  1. Investigate hardware issues:

    a. Check the physical connection (for example, cables, switches, NICs).

    b. Use ip link to check the status of the interface:

    ip link show eth0

  2. Verify that the network interface is configured correctly:

    ip addr show eth0

  3. Restart the network service:

    sudo systemctl restart systemd-networkd

  4. Monitor for intermittent issues by continuously logging network-related events:

    journalctl -f -u systemd-networkd

  5. Investigate packet loss by using tools like ping or iperf to test connectivity and bandwidth.

  6. If the issue persists, replace the network adapter, cable, or switch connected to the affected interface.

Analyzing Slurm logs

  1. Check and inspect Slurm controller logs for job errors:

    less /var/log/slurm/slurmctld.log

  2. Filter for job errors, and search for specific job IDs:

    grep <JobID> /var/log/slurm/slurmctld.log

  3. Check logs from compute nodes for hardware or software issues.
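The filtering in steps 1 and 2 can be combined. The sketch below prints only the lines for one job that also mention an error or failure. It assumes that slurmctld writes job references as JobId=<n>, which can vary with the Slurm version and logging configuration; job_errors is a hypothetical helper name.

```shell
# Sketch: show error-like slurmctld lines for a single job.
# Assumes the log refers to jobs as "JobId=<n>"; adjust if yours differs.
job_errors() {
    # $1: numeric job ID, $2: log file (defaults to the path used above)
    grep -E "JobId=$1([^0-9]|\$)" "${2:-/var/log/slurm/slurmctld.log}" |
        grep -iE 'error|fail'
}

# Usage: job_errors 12345
```

The trailing pattern keeps a search for job 123 from also matching job 1234.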

Analyzing CPE-specific logs

To analyze logs:

  1. Locate and review HPE Cray-specific logs. CPE logs are typically located in /opt/cray/logs or /var/log/cray.

    ls /var/log/cray

  2. Search for issues in craype.log or craypat.log:

    grep -i error /var/log/craype.log

  3. Record errors and determine potential causes for troubleshooting.

    Example: Problematic Output #1

    Oct 30 12:45:01 smw01 craype[1234]: ERROR: Failed to load module 'cray-mpich': Module not found
    

To resolve Module not found issues (see issue directly above):

  1. Verify the availability of the missing module:

    module avail

  2. Check the modulefiles directory for the cray-mpich module:

    ls /opt/modulefiles/cray-mpich

  3. If the module is missing, reinstall the HPE Cray MPI library or restore its modulefile.

    Example: Problematic Output #2

    Oct 30 13:00:15 smw01 craype[5678]: ERROR: Compiler 'cc' failed with error code 127
    Oct 30 13:00:15 smw01 craype[5678]: ERROR: Unable to compile test program for compatibility check
    

To resolve compiler error issues (see issue directly above):

  1. Ensure the compiler module is loaded:

    module load cce

  2. Verify the compiler version:

    cc --version

  3. Check if the cc binary is installed and accessible in the PATH:

    which cc

  4. If the cce module is broken, consider reinstalling the HPE Cray compiler suite (CCE).
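Exit status 127 is the value a POSIX shell returns when it cannot find the command it was asked to run, so the log lines above usually mean that cc was not on the PATH of the process that invoked it. A minimal sketch of a check that reports this condition explicitly follows; require_tool is a hypothetical helper name.

```shell
# Sketch: report whether a tool resolves on the current PATH.
# Returns 127 when it does not, mirroring the shell's "command not found".
require_tool() {
    # $1: command name to look up, for example cc
    if tool_path=$(command -v "$1"); then
        printf '%s found at %s\n' "$1" "$tool_path"
    else
        printf '%s not found in PATH; check loaded modules\n' "$1" >&2
        return 127
    fi
}

# Usage: require_tool cc || module load cce
```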

Tracking CPE usage

Tracking how users interact with CPE modules, compilers, and libraries is important for resource planning and identifying underutilized or problematic software.

Prerequisites

  • Administrative access to the system.

  • Familiarity with Slurm accounting and module usage tracking mechanisms.

Required tools/systems

  • Slurm accounting (sacct).

  • Environment module usage logs (if enabled).

  • Performance tools (such as HPE CrayPAT).

Tracking module usage

  1. Enable module logging by adding the following line to the module initialization file (for example, /etc/profile.d/modules.sh):

    export MODULE_LOGFILE=/var/log/module_usage.log

  2. Review the log file for module usage:

    less /var/log/module_usage.log

  3. Search for a specific module:

    grep "cce" /var/log/module_usage.log

    Example: Module Usage Log Output

    Oct 30 12:00:01 user1 module: load cce/14.0.0
    Oct 30 12:00:05 user1 module: load cray-mpich/8.1.9
    Oct 30 12:00:10 user1 module: unload cce/14.0.0
    Oct 30 12:05:20 user2 module: load gcc/11.2.0
    Oct 30 12:10:15 user3 module: load cray-libsci/21.03.1
    Oct 30 12:15:00 user1 module: load perftools/21.08.0
    

    The above output reports:

    • Timestamp: The date and time when the module operation occurred.

    • Username: The user who executed the module command.

    • Operation: The module operation performed (for example, load, unload, swap).

    • Module Name and Version: The full name of the module (including version) being loaded, unloaded, or swapped.
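To turn a log like the one above into a per-module summary, the load operations can be counted with awk. This sketch assumes the field layout shown in the example output (month, day, time, user, the literal module:, operation, module name); module_load_counts is a hypothetical helper name, and the actual log format depends on how module logging is configured at your site.

```shell
# Sketch: count "load" operations per module/version.
# Assumes the field layout of the example log above.
module_load_counts() {
    # $1: module usage log file
    awk '$5 == "module:" && $6 == "load" { n[$7]++ }
         END { for (m in n) print n[m], m }' "$1" | sort -k1,1nr -k2
}

# Usage: module_load_counts /var/log/module_usage.log
```

Each output line is a count followed by the module name and version, with the most frequently loaded modules first.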

Analyzing job resource usage

  1. Use Slurm accounting to view job resource usage:

    sacct --format=JobID,User,Partition,AllocCPUs,Elapsed

  2. Create a usage summary:

    sacct -S 2023-10-01 -E 2023-10-31 --format=User,JobName,AllocCPUs -P
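The parsable output from step 2 is convenient to post-process. As a sketch, the function below sums AllocCPUs per user from pipe-delimited sacct -P output read on standard input. It assumes the column order User|JobName|AllocCPUs requested above and skips rows with an empty User field, which sacct typically prints for job steps; cpus_by_user is a hypothetical helper name.

```shell
# Sketch: total allocated CPUs per user from "sacct -P" output on stdin.
# Assumes columns User|JobName|AllocCPUs and a single header row.
cpus_by_user() {
    awk -F'|' 'NR > 1 && $1 != "" { cpu[$1] += $3 }
               END { for (u in cpu) print u, cpu[u] }' | sort
}

# Usage:
#   sacct -S 2023-10-01 -E 2023-10-31 --format=User,JobName,AllocCPUs -P | cpus_by_user
```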

Profiling application performance

  1. Enable HPE CrayPAT:

    module load perftools

  2. To instrument and run an application, recompile the application with profiling:

    cc -h profile_generate -o app app.c
    srun ./app
    
  3. Access and analyze the performance report:

    pat_report app.xf

Troubleshooting CPE

This section provides information on resolving common CPE issues. Should you encounter issues not included in this section, see Documentation and Support for additional resources and information on contacting HPE support.

Resolving the CCE PGAS error and dependency issue resulting in failed image builds

CPE releases prior to 26.03 supported OpenSHMEM libraries on HPCM systems. With CPE 26.03 and later, OpenSHMEM libraries are no longer supported on HPCM-based systems. This limitation can cause a failure during the installation procedure on HPCM-based systems.

Symptom

While building the CPE image on a system with HPCM, the image build fails with a dependency error. For example:

Problem: conflicting requests
  - nothing provides libsma.so.0()(64bit) needed by cce-21.0.0-pgas-ofi

Cause

The cce-21.0.0-pgas-ofi package requires libsma.so.0, which was previously provided by cray-dsmml. However, cray-dsmml is no longer part of CPE 26.03 (or later).

Resolution

To remediate the issue, do one of the following:

  • To resolve the package dependency for cce-21.0.0-pgas-ofi from the new OpenSHMEM package, acquire the new OpenSHMEM release media, install it, and enable the repository. Also ensure that the new OpenSHMEM library has been installed into the base compute image.

  • To resolve the package dependency for cce-21.0.0-pgas-ofi from an older OpenSHMEM package, acquire the CPE 25.09 release media, install it, and enable the repository.

  • Do not install cce-21.0.0-pgas-ofi.

Resolving an issue where MPICH generates an OFI failure

An attempt to register a memory buffer for off-node MPI communication results in an MPICH error message.

Symptom

If this issue occurs, the following error message appears:

MPICH ERROR
...
OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Bad address)

Cause

This error occurs because:

  • An invalid address was issued. The validity of the memory region passed to MPI communication should be verified.

  • An attempt to use a GPU buffer in an MPI call was made without setting MPICH_GPU_SUPPORT_ENABLED to 1.

  • An application error occurred, particularly when using GPU-aware MPI.

  • A buffer that is too short was passed to an MPI call.

Resolution

To resolve this error, perform one of the following:

  • Ensure the code is valid. The MPICH backtrace often includes the memory address in the call signature, which can make obvious issues apparent.

  • Verify the validity of the memory region passed to MPI communication.

  • If you are using GPU memory, set MPICH_GPU_SUPPORT_ENABLED to 1 (MPICH_GPU_SUPPORT_ENABLED=1).

  • If you are not using GPU memory, verify the location and length of the buffer passed to MPI.

Resolving issue where an incorrect MPICH version is linked in an application

An incorrect HPE Cray MPICH version is dynamically linked by the application.

Symptom

The output of module list shows one version of cray-mpich, but a different version is used when the program is executed.

Cause

The CPE module environment reflects the programming libraries that are used at build time, which are not necessarily the libraries that are resolved at run time.

Resolution

To make the runtime environment reflect the modules that are currently loaded, do one of the following:

  • Set LD_LIBRARY_PATH to:

    LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

  • “Hard-code” the CPE library version into the executable. The compiler driver (cc/ftn/CC) -add-rpath and -add-runpath options can be used for this purpose.

  • Change the HPE Cray MPICH default version. An administrator can execute the /opt/cray/pe/admin-pe/set_default_files/set_default_mpich_<VERSION> script in the appropriate CPE image.
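For the first option, note that an empty entry in LD_LIBRARY_PATH is interpreted by the dynamic linker as the current directory, so it is worth avoiding a stray leading or trailing colon when either variable is unset. The following sketch builds the value defensively; cray_runtime_path is a hypothetical helper name, and CRAY_LD_LIBRARY_PATH is assumed to be set by the loaded CPE modules.

```shell
# Sketch: build an LD_LIBRARY_PATH value with CRAY_LD_LIBRARY_PATH first,
# without introducing empty entries when either variable is unset.
cray_runtime_path() {
    printf '%s\n' "${CRAY_LD_LIBRARY_PATH:+$CRAY_LD_LIBRARY_PATH${LD_LIBRARY_PATH:+:}}${LD_LIBRARY_PATH:-}"
}

# Usage:
#   export LD_LIBRARY_PATH=$(cray_runtime_path)
#   ldd ./app | grep -i mpi    # confirm which MPI library is resolved
```

Running ldd on the executable before and after the change is a simple way to confirm that the expected cray-mpich version is now picked up.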

Resolving a bad address or bus error during MPI operations

An application encounters errors during an on-node MPI operation.

Symptom

An application encounters one of the following errors during an MPI call:

process_vm_readv: Bad address
Assertion failed in file ../src/mpid/ch4/shm/cray_common/cray_common_memops.c at line 461: 0

or

Bus error

Cause

This issue can occur if:

  • A bad memory address is encountered during an on-node MPI operation.

  • A GPU buffer is used in an MPI call without setting MPICH_GPU_SUPPORT_ENABLED to 1.

  • A buffer that is too short is passed to an MPI call.

  • MPICH_SMP_SINGLE_COPY_MODE=CMA is used, which is the default on RHEL.

  • MPICH_SMP_SINGLE_COPY_MODE=XPMEM is used, which is the default in USS.

Because bus errors can occur for other reasons, a debugger or core file may be necessary to confirm that the error occurs in an MPI call.

Resolution

If you are using GPU memory, set MPICH_GPU_SUPPORT_ENABLED to 1:

MPICH_GPU_SUPPORT_ENABLED=1

Handling an MPICH MPIDI_OFI_handle_cq_error error

An error message appears that seems to originate from MPICH, but it does not. It most likely indicates a fatal system error.

Symptom

The following error message appears:

MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - CANCELED)

Cause

This issue occurs if any of the following happens:

  • A node failure occurs.

  • A link failure occurs.

  • An invalid routing parameter (empty_route) is issued.

  • The Retry Handler (RH), unable to resend a message, cancels a job. The CXI provider RH process running on each node monitors traffic in and out of the Network Interface Cards (NICs). The RH process resends messages that were dropped or discarded for various reasons. If the RH cannot resend a message, it eventually cancels the job and issues the error message.

Resolution

Contact HPE support for additional assistance.

Resolving MPICH MPIDI OFI error

Symptom

The following error message appears:

MPICH ERROR
....
MPIDI_OFI_handle_cq_error(1062).....: OFI poll failed (ofi_events.c:1064:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)

Cause

These are typically secondary errors that occur when one or more ranks die (for example, because of a segfault or resource exhaustion). If the dying rank is communicating with other ranks at the time, errors occur.

Resolution

Debugging this issue requires you to locate and investigate the initial error signature and ignore secondary error signatures. Contact HPE support for additional assistance.

MPICH MPIDI OFI with a PKTBUF_ERROR error occurs

An MPICH MPIDI OFI/PKTBUF_ERROR error occurs as a result of a system configuration issue.

Symptom

The following error message appears:

MPICH ERROR
....
MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - PKTBUF_ERROR)

Cause

PKTBUF_ERROR usually signifies a system configuration issue, such as a Rosetta switch that is not programmed correctly or a mismatch between the Cassini and Rosetta device settings after an upgrade. If a node hits a PKTBUF_ERROR, it is generally not safe to run a job on that node again without rebooting it, as this error may leave the NIC in an unstable state.

Resolution

Contact HPE support for additional assistance.

Handling an issue where the Cassini Event Queue Overflows into CXI provider

In HPC environments utilizing the HPE Cray MPI and CXI provider, users might encounter a critical error related to the Cassini Event Queue overflow. This error is tied to the configuration of the CXI event queue, which plays a vital role in handling hardware-level communication events.

Symptom

The following error message appears:

libfabric:88194:cxi:core:cxip_cq_eq_progress():544<warn> Cassini Event Queue overflow detected.

Cause

This fatal error indicates that the job exceeded the capacity of the CXI provider’s event queue during execution. The Cassini Event Queue is directly connected to hardware and has a fixed maximum size specified at job launch. By default, the HPE Cray MPI sets this maximum size to 32,768 events. If the job generates more events than the queue can handle, an overflow occurs, leading to this error. Resizing the event queue dynamically is not feasible due to its hardware-level integration, making proper configuration essential during job initialization.

Resolution

To resolve the Cassini Event Queue overflow error, users can increase the maximum queue size at job launch by setting the FI_CXI_DEFAULT_CQ_SIZE environment variable to a higher value. For example, doubling the queue size to 65,536 can help accommodate larger workloads that exceed the default capacity. Use the following command to set the environment variable:

export FI_CXI_DEFAULT_CQ_SIZE=65536

Ensure this adjustment is made before launching the job to avoid encountering the overflow error.
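If the overflow recurs at the new size, the same step can be repeated. As a sketch, the helper below doubles whatever value is currently set, starting from the 32,768 default described above when the variable is unset. The name next_cq_size is hypothetical, and a suitable upper bound is workload-specific.

```shell
# Sketch: print double the current FI_CXI_DEFAULT_CQ_SIZE.
# Falls back to the 32768 default stated above when the variable is unset.
next_cq_size() {
    echo $(( ${FI_CXI_DEFAULT_CQ_SIZE:-32768} * 2 ))
}

# Usage (before launching the job):
#   export FI_CXI_DEFAULT_CQ_SIZE=$(next_cq_size)
```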

Resolving an issue where the CXI provider flow control is triggered due to an LE depletion

Users might encounter a fatal error stemming from the depletion of a hardware resource known as List Entries (LE). This issue forces the NIC into Software Endpoint (SE) mode, changing how tag-matching and rendezvous processing are handled.

Symptom

The error is indicated by the following warning:

libfabric:44928:1640991101:cxi:core:cxip_recv_pending_ptlte_disable():1135<warn> RXC (0x8b0:30:0): Flow control triggered due to failure to append LE. Software endpoint mode required.

This error signifies that the job has exhausted the available LEs, triggering flow control and forcing the CXI provider to transition the NIC into SE mode.

Cause

Each CXI endpoint is allocated approximately 16,000 LEs, a hardware resource used to manage communication events. The depletion of LEs can occur due to:

  • A flood of unexpected messages.

  • A large number of pre-posted receives.

When the pool of LEs is depleted, the CXI provider automatically transitions the NIC into SE mode. While rendezvous processing remains in hardware, tag-matching is moved to software, which can impact performance.

Resolution

The CXI provider offers environment variables that allow users to manage how the system transitions into SE mode and to optimize resource usage. The process has recently been updated (see the note on FI_CXI_MSG_OFFLOAD at the end of this section). To address the issue, perform the following:

Transition Mode Configuration:

Use the updated FI_CXI_RX_MATCH_MODE environment variable to specify how tag-matching should be handled. Options include:

  • Hardware: Tag-matching is done entirely in hardware.

  • Software: Tag-matching is done entirely in software.

  • Hybrid: A combination of hardware and software is used.

To set the mode, for example:

export FI_CXI_RX_MATCH_MODE=[hardware | software | hybrid]

Optimize Buffer Resources:

Configure supporting environment variables to ensure efficient allocation of hardware resources:

  • FI_CXI_REQ_BUF_SIZE: Defines the size of the request buffer.

  • FI_CXI_REQ_BUF_MIN_POSTED: Specifies the minimum number of pre-posted receives.

  • FI_CXI_REQ_BUF_MAX_COUNT: Limits the total number of buffers that can be allocated.

Note: The older FI_CXI_MSG_OFFLOAD=0 environment variable used to switch to SE mode has been deprecated and should no longer be used.
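Because a mistyped mode is easy to miss in a job script, the value can be validated before it is exported. This is a minimal sketch; set_rx_match_mode is a hypothetical helper name, and the accepted values are the three options listed above.

```shell
# Sketch: export FI_CXI_RX_MATCH_MODE only for a recognized value.
set_rx_match_mode() {
    # $1: hardware, software, or hybrid
    case "$1" in
        hardware|software|hybrid)
            export FI_CXI_RX_MATCH_MODE="$1"
            ;;
        *)
            echo "unknown match mode: $1" >&2
            return 1
            ;;
    esac
}

# Usage: set_rx_match_mode hybrid && srun ./app
```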

Resolving MPICH error with MPICH_SINGLE_HOST_ENABLED=0 on HPE Slingshot-11 networks

When running MPI applications configured with MPICH_SINGLE_HOST_ENABLED=0 on systems using the HPE Slingshot-11 network, users may encounter an error during MPI_Init. This issue arises due to the network security token requirement. These tokens are managed by the workload manager (WLM). Single-node jobs often lack these tokens by default, as they typically do not require access to the NIC.

Symptom

When MPICH_SINGLE_HOST_ENABLED=0 is set, MPI_Init fails with the following error message:

OFI fi_open domain failed (ofi_init.c:616:MPIDI_OFI_mpi_init_hook:Function not implemented)  

Cause

The HPE Slingshot-11 network enforces secure access to network resources through security tokens, which are distributed by WLM. These tokens are generally not allocated by default for single-node jobs, as such jobs do not require NIC access. Note that:

  • If the error occurs only on single-node jobs, the likely cause is the lack of Virtual Network Interface (VNI) allocation by the workload manager.

  • If the error occurs on jobs spanning two or more nodes, it may indicate a system configuration problem that requires administrative intervention.

Applications typically set MPICH_SINGLE_HOST_ENABLED=0 for specific reasons, such as enabling communication between MPI processes (for example, using MPI_Comm_accept) or for debugging purposes. Understanding application intent is essential for determining the correct resolution.

Resolution

To resolve the issue, take the following steps based on the workload manager and job configuration:

  1. Confirm the error scope:

    • Ensure the error does not appear on multi-node jobs. If it only occurs on single-node jobs, proceed with the steps below.

    • If the error persists in multi-node jobs, contact the system administrator to investigate possible configuration problems.

  2. Request VNI allocation:

    For jobs requiring communication between MPI processes or across job steps, request VNI allocation from the workload manager. This ensures secure access to the NIC and resolves the issue during MPI_Init.

    • For Slurm:

      a. For single-node jobs, add the --network single_node_vni option to the salloc, srun, or sbatch command.

      b. For communicating between job steps, add the --network single_node_vni,job_vni option.

      c. Ensure the system administrator has configured the Slurm Slingshot plugin correctly to support these options. For example:

      salloc --network single_node_vni  
      srun --network single_node_vni,job_vni  
      sbatch --network single_node_vni  
      
    • For PBS/PALS:

      Use the --single-node-vni option with aprun or mpiexec commands. For example:

      aprun --single-node-vni  
      mpiexec --single-node-vni
      
    • For Flux:

      The Flux workload manager does not currently support VNI allocation or enforcement. In this case, MPICH_SINGLE_HOST_ENABLED=0 should work without additional WLM options.

  3. Verify application intent:

    If the application sets MPICH_SINGLE_HOST_ENABLED=0 intentionally (for example, for MPI_Comm_accept), confirm its requirements. Users requesting communication between job steps must ensure consistent VNI allocation across all job steps in the allocation.

Addressing Slingshot network timeouts on HPE Cray MPI systems

Slingshot systems, integral to HPC environments, are designed to facilitate efficient communication for applications running across distributed nodes. However, certain applications may encounter network timeouts, which can impact communication performance.

Symptom

Applications running on Slingshot systems may experience network timeouts during execution. HPE Cray MPI tracks these timeouts and summarizes Cassini hardware counters for each job. If timeouts are detected, the following message appears during the MPI_Finalize phase:

[ MPICH Slingshot Network Summary: N network timeouts ]

These events could lead to lower-than-expected MPI communication performance, depending on application communication patterns.

Cause

Network timeouts are typically caused by “flapping links” within the Slingshot network. Flapping links are intermittent disruptions in network connections, which can lead to dropped packets and delays in communication. Applications that rely heavily on specific communication patterns may be more vulnerable to the performance impacts caused by these network issues.

Resolution

The HPE Slingshot-11 network is equipped to manage timeout events by automatically re-issuing affected network packets. While this mechanism helps mitigate immediate disruptions, applications may still experience reduced communication performance. To provide additional insight into network behavior and performance, users can collect Cassini hardware counters using the MPICH_OFI_CXI_COUNTER_REPORT variable. This feature is documented in the HPE Cray MPI man pages and allows administrators and users to monitor critical hardware metrics related to network activity.

Contact HPE support for additional assistance.
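When many jobs are reviewed, the timeout count can be pulled out of saved job output automatically. The sketch below extracts N from the summary line shown in the Symptom; network_timeouts is a hypothetical helper name, and the pattern assumes the summary line has exactly the form shown above.

```shell
# Sketch: print the timeout count from the MPICH Slingshot summary line.
# Reads job output on stdin; prints nothing if no summary line is present.
network_timeouts() {
    sed -n 's/.*MPICH Slingshot Network Summary: \([0-9][0-9]*\) network timeouts.*/\1/p'
}

# Usage (the file name is an example): network_timeouts < slurm-12345.out
```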

Addressing issues with fork() on HPE Slingshot-11 systems

The fork() system call is commonly used by applications to create child processes. However, on HPE Slingshot-11 systems, applications that rely on fork() may encounter issues under specific circumstances. These challenges arise whenever a child process attempts to access memory regions owned by its parent process after a fork() operation.

Symptom

Applications running on HPE Slingshot-11 systems may experience unexpected behavior or errors if using the fork() system call. These issues occur if the child process attempts to access memory regions that are allocated and owned by the parent process.

Cause

The root cause of these issues lies in how memory regions are handled during the fork() operation. On Slingshot-11 systems, the child process may encounter conflicts or access violations whenever it is interacting with memory regions managed by the parent process.

Resolution

To address this issue, configure specific runtime variables that ensure compatibility with fork() on Slingshot-11 systems. The following variables should be set in the runtime environment:

export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=0
export FI_CXI_DISABLE_CQ_HUGETLB=1

These settings help mitigate memory access problems between the parent and child processes following a fork() operation.

For systems running SLES15 SP4 or newer Linux kernels, some of the fork()-related issues have been addressed directly in the Linux kernel. As a result, users with updated Linux environments no longer need to set the CXI_FORK_SAFE runtime variables for applications that rely on fork(). This improvement simplifies application compatibility and eliminates the need for manual configuration in many cases.

Resolving GPU application hangs on HPE Slingshot-11 systems

HPE Slingshot-11 systems are designed to support high-performance computing applications, including GPU-enabled workloads. However, some GPU-enabled applications may encounter hangs or errors during execution. These issues are often accompanied by specific error messages in the system logs (dmesg) and are typically caused by configuration or runtime environment issues.

Symptom

GPU-enabled applications running on HPE Slingshot-11 systems may experience execution hangs, and the following error message is observed in the system logs (dmesg):

cxi core:cass vma write flag:22 VMA does not have write permissions  

This error indicates a problem with memory permissions or GPU-related configuration that prevents the application from functioning correctly.

Cause

The error is generally attributed to one or more user configuration mistakes, including:

  • GPU-Aware logic not enabled:

    The HPE Cray MPI GPU-aware logic was not enabled because the required runtime variable was missing in the job submission script. The variable that needs to be specified is:

    MPICH_GPU_SUPPORT_ENABLED=1

  • Managed memory support disabled:

    The application uses GPU Managed Memory regions, but HPE Cray MPI Managed Memory support was not properly enabled. By default, HPE Cray MPI supports Managed Memory regions, so this issue might arise if the default settings were altered.

  • Incorrect linking of GPU runtime library:

    The application executable was not correctly linked against the GPU runtime library. On systems with NVIDIA GPUs, this issue often occurs if the following command was excluded from the environment or job submission script:

    module load cudatoolkit

Resolution

To resolve GPU application hangs and related errors:

  1. Ensure GPU-aware logic is enabled by setting the required runtime variable in the job submission script:

    export MPICH_GPU_SUPPORT_ENABLED=1

  2. Verify that HPE Cray MPI Managed Memory support is enabled. Since Managed Memory regions are supported by default, users should check for any modifications that may have disabled this feature.

  3. Verify the GPU runtime library. For systems with NVIDIA GPUs, ensure the environment includes the following command:

    module load cudatoolkit
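On systems with NVIDIA GPUs, the first and third checks can be automated at the top of a job script. A minimal sketch follows; gpu_preflight is a hypothetical helper name, and the check relies on the LOADEDMODULES variable that Environment Modules and Lmod maintain as a colon-separated list of loaded modules.

```shell
# Sketch: warn about two common causes of GPU-related MPI hangs.
gpu_preflight() {
    gpu_rc=0
    if [ "${MPICH_GPU_SUPPORT_ENABLED:-0}" != "1" ]; then
        echo "MPICH_GPU_SUPPORT_ENABLED is not set to 1" >&2
        gpu_rc=1
    fi
    case ":${LOADEDMODULES:-}:" in
        *:cudatoolkit:*|*:cudatoolkit/*) ;;
        *)
            echo "cudatoolkit module does not appear to be loaded" >&2
            gpu_rc=1
            ;;
    esac
    return "$gpu_rc"
}

# Usage: gpu_preflight || exit 1
```

This does not cover the Managed Memory check in step 2, which depends on site-specific settings.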

Resolving gdb4hpc CTI launch errors: Issues with mpiexec and PBS

The gdb4hpc tool is used for debugging HPC applications in distributed environments. However, during application launch, users may encounter errors related to the mpiexec binary and its compatibility with the CTI (Cray Tools Interface) framework.

Symptom

When attempting to launch an application with gdb4hpc, the following error message appears:

gdb4hpc: launch ...  
Starting application, please wait...  
Failed to launch CTI app.  
CTI error: cti_launchAppBarrier: mpiexec was found at /opt/pbs/bin/mpiexec, but it is not a binary file. Tool launch requires direct access to the mpiexec binary. Ensure that the mpiexec binary is not wrapped by a script (tried HPCM / PALS).  

This error prevents the application from launching successfully and indicates that the mpiexec binary is improperly wrapped or incompatible with CTI tool requirements.

Cause

The error is caused by the following factors:

  • Incorrect mpiexec configuration:

    The mpiexec executable found in the PBS environment is not a binary file but a wrapper script. CTI tools require direct access to the mpiexec binary to properly launch applications.

  • PBS compatibility limitation:

    PBS does not support the MPIR protocol required for CTI application launches. Instead, HPE Cray Supercomputing EX environments require PALS (Process Management and Launch Services) as the launcher for tools like gdb4hpc.

Resolution

To resolve the error and enable successful application launches with gdb4hpc, users should take the following steps:

  1. Load the cray-pals module. Replace PBS’s default launcher with PALS by loading the cray-pals module. This can be done with the following command:

    module load cray-pals

  2. Verify mpiexec location. Ensure that the mpiexec binary provided by PALS is being used, rather than the wrapper script provided by PBS. This step guarantees compatibility with the CTI framework.

  3. Re-launch the application. After loading the cray-pals module, reattempt to launch the application using gdb4hpc. The tool should now have direct access to the compatible mpiexec binary, resolving the error.

If the issue persists, contact HPE support for additional assistance.
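To confirm step 2, it helps to check whether the mpiexec that resolves first on the PATH is a script. The sketch below treats any file whose first two bytes are #! as a wrapper script; is_wrapper_script is a hypothetical helper name, and this heuristic does not detect other kinds of wrappers.

```shell
# Sketch: succeed if the given file starts with a "#!" interpreter line.
is_wrapper_script() {
    # $1: path to the executable to inspect
    [ "$(head -c 2 "$1" 2>/dev/null)" = '#!' ]
}

# Usage:
#   if is_wrapper_script "$(command -v mpiexec)"; then
#       echo "mpiexec is a script; load cray-pals and check again" >&2
#   fi
```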

Resolving gdb4hpc launch timeout issues

The gdb4hpc debugger is a powerful tool for debugging HPC applications. However, users could encounter launch timeout issues where the debugger fails to connect to all ranks of a distributed job. This issue can stem from several configuration or system-level problems.

Symptom

When attempting to launch a job with gdb4hpc, the debugger times out while attempting to connect to application ranks, and the following messages appear:

Creating network... (timeout in 300 seconds)  
.............................  
0/100 ranks connected (timeout in 270 seconds)

The debugger fails to connect to the specified ranks, preventing the application from being debugged successfully.

Cause

Several steps must work correctly for gdb4hpc to launch a job inside the debugger. Timeout issues can arise because:

  • MPIR hooks were not enabled:

    The application must be launched with special MPIR hooks to stop the program on entry. Failure to enable these hooks can prevent the debugger from connecting.

  • Debugger processes were not started:

    A debugger process (dbgsrv) must be started for each application rank on each node. If these processes fail to initialize, the debugger cannot establish communication.

  • Communication network issues exist:

    A communication network must be built between the backend debugging processes (dbgsrv) and the gdb4hpc front-end. Network configuration issues can block this connection.

Resolution

To resolve gdb4hpc launch timeout issues, users can follow these steps:

  1. Verify environment configuration:

    a. For PBS Systems, ensure the cray-pals module is loaded. The PALS launcher is required for proper integration with gdb4hpc.

    module load cray-pals

    b. Attempt to launch the application outside of gdb4hpc, such as with the srun command, to confirm the issue is specific to the debugger.

  2. Enable debug logging:

    a. Launch the application with the --debug option to troubleshoot the issue further.

    b. For newer versions of gdb4hpc, set a logging directory directly in the launch command.

    c. For older versions, use environment variables to enable additional logging:

    export CTI_DEBUG=1  
    export CTI_LOG_DIR=<path_to_cross_mounted_directory>
    
  3. Run network and launch diagnostics. Check whether the network configuration is preventing the debugger from connecting. If network issues are suspected, consult system administration or HPE support to ensure proper connectivity between nodes.

  4. Work around the launch by attaching to a running job. If launch issues persist, bypass the launch process, and attach to a running job using the attach command in gdb4hpc.

    a. Launch the application as usual and retrieve the job ID.

    b. Use the attach command to connect to the running job. Refer to the help attach documentation within gdb4hpc for detailed instructions.

  5. Address timing issues. If the problem being debugged occurs faster than you can attach to the job, add a sleep command at the beginning of your application to delay execution and allow time for attachment.

Resolving running process issue in gdb4hpc with multi-threaded code

When debugging multi-threaded applications using gdb4hpc, developers might encounter a situation where the debugger reports process is running even after a stop is encountered. This behavior can be confusing and slows down the debugging process. Understanding the cause and applying a resolution can help streamline debugging in such scenarios.

Symptom

When a breakpoint or stop is encountered while debugging with gdb4hpc, the debugger outputs process is running even though execution should have paused. This issue is specific to multi-threaded code.

Cause

The root of the problem lies in the default gdb4hpc configuration. This configuration uses the gdb4hpc non-stop mode. In non-stop mode, the debugger does not automatically switch focus to the thread that has stopped. As a result, the stopped thread is not selected, and debugging commands behave as though the process is still running.

Resolution

To resolve the issue:

  1. Use the info threads command to list all threads and identify the one that has stopped.

  2. Select the stopped thread manually using the command t <thread-no>, where <thread-no> corresponds to the thread number shown in the output of info threads. Manually selecting the appropriate thread ensures that the debugger focuses on the stopped thread, allowing you to proceed with debugging effectively.

Resolving a breakpoint error in gdb4hpc debugging

While using gdb4hpc to debug applications, an error stating cannot get to initial breakpoint occurs. This issue prevents debugging from starting as expected and might be difficult to address. Understanding the underlying cause and applying the appropriate resolution can help ensure smooth debugging.

Symptom

The debugger fails to reach the initial breakpoint during program launch, resulting in the cannot get to initial breakpoint error. This prevents the user from effectively interacting with the program during debugging.

Cause

This issue typically arises for one of the following reasons:

  • The debugging information is incomplete or incorrect, making it difficult for the debugger to locate valid breakpoints.

  • The program being debugged is not an MPI program, which may lead to incompatibilities with gdb4hpc debugging mechanisms.

Resolution

To resolve the issue:

  1. Before launching the debugger, execute:

    maint set earlyentry on

    The maint set earlyentry on command instructs gdb4hpc to enable an early entry mode, allowing the debugger to function even when the initial breakpoint cannot be reached.

  2. Launch the program as usual.

  3. After the launch completes, manually set a breakpoint within your program at a desired location.

  4. Continue program execution, and proceed with debugging from the manually set breakpoint.

Enabling early entry mode and manually setting a breakpoint post-launch bypasses the issue.

Resolving unrecognized job ID error during a gdb4hpc attach with Slurm

While using gdb4hpc to debug applications in environments managed by Slurm, an unrecognized job id error might occur while attempting to attach to a running job. This issue can prevent the debugger from properly connecting to the desired process.

Symptom

During the process of attaching gdb4hpc to a job running on Slurm, the debugger outputs an unrecognized job id error. This issue prevents the successful attachment to the target job for debugging.

Cause

The error is typically related to how Slurm formats job IDs. In Slurm, job IDs include both a job identifier and a step identifier, formatted as <jobid>.<stepid>. If the step ID (often .0 for the first step) is not included in the attach command, gdb4hpc cannot recognize the job ID, resulting in the error.

Resolution

To resolve this issue:

  1. Find the correct job ID with its step ID by using the Slurm command:

    squeue -s

    This command lists the jobs and their associated step IDs.

  2. Update the attach command in gdb4hpc to include both the job ID and step ID. The format should be:

    attach $a{n} <jobid>.<stepid>

    Note: Replace <jobid> and <stepid> with the actual values from the squeue -s output. For additional information about the attach command, type help attach within gdb4hpc.

Including the step ID when specifying the job ID resolves the unrecognized job id error and successfully attaches gdb4hpc to the target Slurm job for debugging.
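The step ID requirement can be checked before running the attach command. The following is a minimal sketch, assuming a hypothetical helper named normalize_slurm_id (it is not part of gdb4hpc or Slurm). It appends .0 when only a job ID is given, which matches the common first-step case described above. Always confirm the real step ID with squeue -s.

```shell
#!/bin/sh
# Hypothetical helper: make sure a Slurm job ID carries a step ID.
# An argument of the form <jobid>.<stepid> is printed unchanged;
# a bare <jobid> gets .0 appended (often the ID of the first step).
normalize_slurm_id() {
    case "$1" in
        *.*) printf '%s\n' "$1" ;;
        *)   printf '%s\n' "$1.0" ;;
    esac
}

normalize_slurm_id 123456      # prints 123456.0
normalize_slurm_id 123456.2    # prints 123456.2
```

The printed value is what you would pass as <jobid>.<stepid> in the attach command. The job IDs shown are example values only.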

Resolving an unavailable GPU resources error for CUDA or ROCm debuggers in Slurm

When debugging GPU-enabled applications with tools like gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc in a Slurm-managed environment, the debugger might not recognize the allocated GPU resources. This problem can prevent effective debugging of CUDA or ROCm applications. Understanding the cause and applying the correct resolution ensures GPU resources are properly allocated to debugger tools.

Symptom

After launching a Slurm job with GPU resources specified using the --gres=gpu:X argument, the debugger tools fail to recognize or access the allocated GPU resources. This issue persists even if the --gpu option is used with the debugger.

Cause

Slurm manages GPU resources and determines their visibility to debugger tools. If the debugger tools are not explicitly informed about the GPU resource allocation (through the same GRES settings passed to the Slurm job), they cannot access the GPUs. This mismatch leads to debugger tools being unable to detect the GPUs.

Resolution

To resolve this issue:

  1. Ensure that your Slurm job is started with the appropriate GPU resource specification using the --gres=gpu:X argument, where X indicates the number of GPUs required.

  2. When launching debugger tools like gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc, set the environment variable CTI_SLURM_DAEMON_GRES to match the GPU resource allocation. For example:

    export CTI_SLURM_DAEMON_GRES=gpu:X

    Note: Replace X with the same GPU count specified during the Slurm job launch.

  3. Launch the debugger tool with the --gpu argument to ensure proper initialization of GPU debugging support.

By synchronizing the GPU resource settings between Slurm and debugger tools through the CTI_SLURM_DAEMON_GRES environment variable, GPU resources are correctly recognized and utilized for debugging CUDA or ROCm applications.
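One way to keep the two settings in sync is to build the GRES string once and reuse it. The following is a minimal sketch. The variable names gpu_count and gres are hypothetical, and the count of 4 is an example value only.

```shell
#!/bin/sh
# Example value only: the number of GPUs requested for the job.
gpu_count=4

# Build the GRES string once so that the Slurm job and the
# debugger daemons are given the same GPU allocation.
gres="gpu:${gpu_count}"
export CTI_SLURM_DAEMON_GRES="$gres"

# Show the two settings that must agree.
echo "Slurm argument:        --gres=$gres"
echo "CTI_SLURM_DAEMON_GRES: $CTI_SLURM_DAEMON_GRES"
```

Deriving both values from one variable avoids the mismatch described in the Cause section when the GPU count changes between runs.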

Resolving an undetected WLM error in debugger tools

When using debugging tools like gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc, an error might occur indicating that the WLM was not detected. This issue arises when the debugger tools cannot identify the WLM in use, which is essential for proper interaction with system utilities. Understanding the cause and providing the required configuration can resolve the error and enable the tools to function correctly.

Symptom

The debugger tools output an error similar to:

Launcher name was not found in PATH (tried system / WLM)

This error indicates the failure to detect an active WLM on the system, preventing the debugger from proceeding.

Cause

Debugger tools rely on a common library to automatically detect the WLM (for example, Slurm) running on the system. This detection process uses system paths and environment settings to identify the WLM. If the system is not configured with a recognized WLM or the detection process fails, the debugger cannot determine which utilities to use, resulting in the error.

Resolution

To resolve this issue, manually specify the WLM by setting the CTI_WLM_IMPL environment variable:

  1. Identify the WLM in use on the system (for example, Slurm, PALS, Flux, or ALPS).

  2. Set the CTI_WLM_IMPL environment variable to the corresponding WLM type. For example:

    export CTI_WLM_IMPL=slurm

    Note: Replace slurm with the appropriate value for your WLM (slurm, pals, flux, or alps).

  3. Retry launching the debugger tool. It should correctly identify the workload manager and proceed without errors.

Manually specifying the WLM type using the CTI_WLM_IMPL environment variable bypasses the detection issue and ensures that debugger tools like gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc operate as intended.
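A typo in the variable value is easy to miss, so it can help to validate the value before exporting it. The following is a minimal sketch, assuming a hypothetical helper named set_wlm. It accepts only the lowercase values named in this guide (slurm, pals, flux, alps, and ssh) and is not an exhaustive statement of what the tools accept.

```shell
#!/bin/sh
# Hypothetical helper: export CTI_WLM_IMPL only when the value is
# one of the WLM types named in this guide.
set_wlm() {
    case "$1" in
        slurm|pals|flux|alps|ssh)
            export CTI_WLM_IMPL="$1"
            ;;
        *)
            echo "unrecognized WLM value: $1" >&2
            return 1
            ;;
    esac
}

set_wlm slurm && echo "CTI_WLM_IMPL=$CTI_WLM_IMPL"
```

Because the helper returns a nonzero status for an unrecognized value, it can be used in a login script without silently exporting a misspelled setting.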

Resolving an unfound WLM PATH error in debugger tools

Debugging tools such as gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc must interact with the system WLM. The error Workload manager not found in PATH indicates that the debugger tools cannot locate the appropriate WLM launcher in the system environment. Proper configuration of the system environment variables ensures seamless operation of these tools.

Symptom

The debugger tools display an error message similar to:

Launcher name was not found in PATH (tried system / WLM)

This error indicates that the debugger was prevented from starting or controlling jobs on the system.

Cause

Debugger tools rely on WLM utilities (for example, srun, aprun, mpiexec) to start and manage jobs. This error can occur for two primary reasons:

  • The WLM detected by the debugging tools is incorrect.

  • The WLM launcher is not available in the PATH environment variable, making it inaccessible to the debugger tools.

Resolution

To resolve this issue, take the following steps:

  1. Verify the correct WLM is used. Determine the WLM used by your system (for example, Slurm, PALS, Flux, or ALPS). If the detected WLM is incorrect, manually specify the correct WLM type by setting the CTI_WLM_IMPL environment variable. For example:

    export CTI_WLM_IMPL=slurm

    Note: Replace slurm with the appropriate value for your WLM (slurm, pals, flux, or alps).

  2. Check the PATH environment variable. Ensure the WLM launcher (for example, srun for Slurm, aprun for ALPS, or mpiexec for Flux) is included in the system PATH environment variable. If it is missing, update the PATH to include the directory containing the launcher. For example:

    export PATH=/path/to/launcher:$PATH

    Note: Replace /path/to/launcher with the actual directory path where the launcher resides.

  3. After updating WLM settings or PATH, retry launching the debugger tools. They should now correctly detect and use WLM utilities.

Ensuring the correct workload manager is specified and its launcher is accessible in the PATH resolves the Workload manager not found in PATH error, enabling debugging tools like gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc to function properly.
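The PATH check in step 2 can be scripted. The following is a minimal sketch, assuming a hypothetical helper named check_launcher. It uses the standard command -v shell built-in to report where a launcher resolves, and srun is used only as an example launcher name.

```shell
#!/bin/sh
# Hypothetical helper: report whether a launcher is reachable via PATH.
check_launcher() {
    if launcher_path=$(command -v "$1" 2>/dev/null); then
        echo "$1 found at $launcher_path"
    else
        echo "$1 not found in PATH" >&2
        return 1
    fi
}

# Example: check the Slurm launcher and print a hint if it is missing.
check_launcher srun || echo "Add the launcher directory to PATH, then retry."
```

If the launcher is missing, prepend its directory to PATH as shown in step 2 and run the check again.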

Resolving a launcher/binary file error in debugger tools

When using tools like gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc, an error might occur indicating that the launcher found is not a binary file. This issue prevents debugger tools from properly interacting with the WLM utilities required to start and control jobs. Understanding the cause of this error and applying the correct solution ensures the debugger tools function seamlessly in your environment.

Symptom

The debugger tools display an error message similar to:

Launcher name was found at path, but it is not a binary file.

This error indicates that the debugger tools cannot directly access the launcher binary required for job management.

Cause

Debugger tools rely on direct access to the WLM launcher binary (for example, srun for Slurm, aprun for ALPS, or mpiexec for Flux). If the file found at the path of the launcher is a wrapper script instead of the binary, the debugger tools might fail to operate correctly. While tools natively support certain wrapper systems (XALT, Slurm), support for other custom wrapper scripts is limited, leading to this error.

Resolution

To resolve this issue, follow these steps:

  1. Verify that the correct WLM is detected. If the detected WLM is incorrect, manually specify the correct WLM type by setting the CTI_WLM_IMPL environment variable. For example:

    export CTI_WLM_IMPL=slurm

    Note: Replace slurm with the appropriate WLM type (pals, flux, or alps).

  2. Check the launcher binary path. Use the which command to locate the launcher binary. For example, on a Slurm system:

    which srun

  3. Verify that the file at the returned path is the actual launcher binary and not a wrapper script.

  4. Handle wrapper scripts. If the launcher is a wrapper script, check whether it has a loadable module that can be unloaded. For example:

    module unload <wrapper_module>

  5. If no module exists or unloading is not possible, update the PATH environment variable to prioritize the directory containing the direct launcher binary. For example:

    export PATH=/path/to/launcher_binary:$PATH

    Note: Replace /path/to/launcher_binary with the actual directory containing the binary.

  6. After ensuring the debugger has direct access to the launcher binary, retry launching the debugger tools.

Ensuring that the launcher binary is accessible and takes priority in PATH resolves the launcher is not a binary file error and enables tools like gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc to function properly with WLMs.
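Step 3 asks whether the file at the launcher path is a binary or a wrapper script. The following is a minimal sketch, assuming a hypothetical helper named launcher_kind. It inspects the first bytes of the file: ELF binaries on Linux start with the byte 0x7f followed by ELF, and scripts start with #!. It does not replicate the check the debugger tools perform internally.

```shell
#!/bin/sh
# Hypothetical helper: classify a file as an ELF binary, a script,
# or unknown, based on its first four bytes.
launcher_kind() {
    # od -c prints byte 0x7f as the octal value 177.
    magic=$(head -c 4 "$1" 2>/dev/null | od -An -c | tr -d ' \n')
    case "$magic" in
        177ELF) echo "binary" ;;
        '#!'*)  echo "script" ;;
        *)      echo "unknown" ;;
    esac
}

# Example: classify whatever srun resolves to on this system.
if launcher=$(command -v srun 2>/dev/null); then
    echo "srun at $launcher is a $(launcher_kind "$launcher")"
else
    echo "srun not found in PATH"
fi
```

A result of script suggests that a wrapper is shadowing the real launcher, in which case steps 4 and 5 apply.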

Debugging tools issue: Launcher lacks debug symbols

Debugging tools like gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc require launcher debug symbols to coordinate tool launches. These tools rely on system WLM utilities, such as Slurm, PALS, Flux, or ALPS, to start and manage jobs on the system. Issues can arise if the launcher does not contain the necessary debug symbols.

Symptom

The following error message appears:

launcher name was found at path, but it does not contain debug symbols

This indicates that debugging tools cannot proceed, as they depend on the presence of debug symbols in the launcher binary for proper operation.

Cause

The problem can occur for any of the following reasons:

  • The detected WLM is incorrect.

  • The file at the launcher path is a script rather than the direct launcher binary.

  • The launcher binary has been stripped of its debug symbols. Some installations of Slurm, for instance, strip debug symbols from the launcher, which makes it incompatible with debugging tools.

Resolution

To resolve this issue, follow these steps:

  1. Ensure that the correct WLM is detected. If the detected WLM is incorrect, manually set the appropriate WLM by using:

    export CTI_WLM_IMPL=<wlm>

    Note: Replace <wlm> with one of the supported options: slurm, pals, flux, or alps.

  2. Check the launcher file, and confirm that the file at the specified path is the actual launcher binary and not a script.

  3. Ensure that the launcher binary has not been stripped of debug symbols. If it has been stripped, reinstall or obtain an unstripped version of the launcher.

  4. Use an alternative debug tool launcher. If your system supports passwordless access to compute nodes, bypass the default WLM by setting:

    export CTI_WLM_IMPL=ssh

    This configuration enables the use of a generic SSH-based debug tool launcher.

This resolution addresses the issue and enables debugging tools to function correctly with the system WLM.

Supported systems

This publication supports installing CPE 26.03 on applicable HPE Cray Supercomputing EX systems. Supported architectures and operating system (OS) versions vary by system. This chapter provides information on supported systems for this release.

IMPORTANT: In releases before the COS 25.9 release, COS comprised three components: COS Base, HPE USS, and HPE SLES. Starting with the CPE 25.09 release, COS 25.9 (and later) comprises:

  • HPE Cray Supercomputing User Services Software (USS)

  • HPE SUSE Linux Enterprise Server

This release also supports v21.0.0 of the HPE Cray Compiler Environment (CCE). See the CPE 26.03 Release Announcements on the CPE Online Documentation website for other supported dependencies.

Supported systems for CPE on CSM

This publication supports the installation of CPE 25.09 on HPE Cray Supercomputing EX systems with specific configurations:

Management Software & Version  COS Version           Operating System  Architecture  GCC Version
CSM 1.7.X                      COS 25.9 (USS 1.4.X)  SLES 15 SP6       X86           14.0
CSM 1.7.X                      COS 25.9 (USS 1.4.X)  SLES 15 SP6       AArch64       14.0

This release also supports v20.0.0 of the HPE Cray Compiler Environment (CCE).

IMPORTANT: In releases before the COS 25.9 release, COS comprised three components: COS Base, HPE USS, and HPE SLES. With the COS 25.9 and CPE 25.09 releases, COS Base has been replaced with SLES 15 SP6. Starting with this CPE 25.09 release, COS 25.9 (and later) comprises:

  • HPE Cray Supercomputing User Services Software (USS)

  • HPE SUSE Linux Enterprise Server

See the CPE 25.09 Release Announcements on the CPE Online Documentation website for other supported dependencies.

Supported systems for CPE with HPCM

This publication supports installing CPE 25.09 on HPE Cray Supercomputing EX systems with specific configurations:

Management Software & Version  COS Version           Operating System  Architecture  GCC Version
HPCM 1.14                      COS 25.9 (USS 1.4.X)  SLES 15 SP7       X86           Not Applicable
HPCM 1.14                      COS 25.9 (USS 1.4.X)  SLES 15 SP6       X86           Not Applicable
HPCM 1.14                      COS 25.9 (USS 1.4.X)  SLES 15 SP7       AArch64       Not Applicable
HPCM 1.14                      COS 25.9 (USS 1.4.X)  SLES 15 SP6       AArch64       Not Applicable
HPCM 1.14                      Not Applicable        RHEL 9.6          X86           14.0
HPCM 1.14                      Not Applicable        RHEL 9.5          X86           14.0
HPCM 1.14                      Not Applicable        RHEL 8.10         X86           14.0
HPCM 1.14                      Not Applicable        RHEL 9.6          AArch64       14.0
HPCM 1.14                      Not Applicable        RHEL 9.5          AArch64       14.0

This release also supports v20.0.0 of the HPE Cray Compiler Environment (CCE).

IMPORTANT: In releases before the COS 25.9 release, COS comprised three components: COS Base, HPE USS, and HPE SLES. With the COS 25.9 and CPE 25.09 releases, COS Base has been removed. Starting with this CPE 25.09 release, COS 25.9 (and later) comprises:

  • HPE Cray Supercomputing User Services Software (USS)

  • HPE SUSE Linux Enterprise Server

See the CPE 25.09 Release Announcements on the CPE Online Documentation website for other supported dependencies.

Supported systems for CPE on the HPE Cray XD2000

For this release, CPE is supported on HPE Cray XD2000 systems with designated operating systems and architectures:

Management Software & Version  Operating System  Architecture
HPCM 1.14                      RHEL 8.10         X86

This release also supports v20.0.0 of the HPE Cray Compiler Environment (CCE).

IMPORTANT: CPE versions 25.03 (and earlier) previously supported MOFED versions 5.8 (or earlier) as directed in installation instructions. However, with the CPE 25.09 release, HPE recommends that MOFED/DOCAFED-dependent users with HPE Slingshot 10 (SS10) refrain from upgrading CPE beyond the 25.03 release. HPE observed a system bug in MOFED, the Extended Reliable Connection (XRC) bug, that adversely affects CPE and SS10 functionality. The bug was introduced by NVIDIA in early 2023, and HPE reported details of the bug to NVIDIA in April 2023. The bug is currently unresolved and is not expected to be fixed during the transition from MOFED to DOCA OFED. Until a resolution or workaround is introduced, CPE users should not upgrade past the CPE 25.03 release.

See the CPE 25.09 Release Announcements on the CPE Online Documentation website for other supported dependencies.

Support matrices for previous releases

This chapter lists CPE-supported components, third-party software, and modules supported for applicable and previous releases of the CPE software. This information is provided for reference purposes.

CPE release matrices for SLES

CPE supports various SLES-based software components, including SLES for AArch64 and x86 architectures. These components include compilers, libraries, debugging/profiling tools, programming models, and so forth. Supported versions of these components are updated with each release of CPE. This section lists which SLES-based component versions are supported for each CPE release.

SLES AArch64 support matrix

SLES on AArch64 systems is supported with CPE on HPE Cray Supercomputing EX systems with either CSM or HPCM. Below are the product components, modules, and third-party software versions supported with previous CPE releases with these configurations.

(D) represents the default version installed at installation.

* HPCM only

Release

CPE 25.09

CPE 25.09

CPE 25.03

CPE 25.03

24.11

24.11

24.07

Product

sles15sp7-aarch64 *

sles15sp6-aarch64

sles15sp6-aarch64

sles15sp5-aarch64

sles15sp5-aarch64

sles15sp6-aarch64

sles15sp5-aarch64

COS

25.9

25.9

25.3

24.7

25.1

24.7

24.7

COS Base

N/A

N/A

3.3.0

3.1.0

3.2.0

3.1.0

3.1.0

CSM

Not supported

1.7

1.6.1

1.6.1

1.6

1.6

1.5

HPCM

1.14

1.14

1.13

1.13

1.12

1.12

1.11

USS

1.4.0

1.4.0

1.3.0

1.1.0

1.2.0

1.1.0

1.1.0

amd

aocc

5.0

5.0

4.2

4.2

4.2**

atp

3.15.7 (D)

3.15.7 (D)

3.15.6 (D)

3.15.6 (D)

3.15.5 (D)

3.15.5 (D)

3.15.4 (D)

cce

20.0.0

20.0.0

19.0.0 (D)

19.0.0 (D)

18.0.1 (D)

18.0.1 (D)

18.0.0 (D)

cpe-gcc-mpfr

3.1.4 (D)

3.1.4 (D)

3.1.4 (D)

3.1.4 (D)

3.1.4 (D)

3.1.4 (D)

cpe-gcc-native

14.2 (D)

14.2 (D)

cpe-gcc-native

14 (D)

14 (D)

13.2

13.2

13.2 (D)

13.2 (D)

13.2 (D)

cpe-gcc-native

13

12.3

12.3

12.3

12.3

12.3

cpe-gcc-native

12

cpe-prgenv-amd

cpe-prgenv-aocc

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

cpe-prgenv-cray

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

cpe-prgenv-cray-amd

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

cpe-prgenv-gnu

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

cpe-prgenv-gnu-amd

cpe-prgenv-intel

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

cpe-prgenv-nvhpc

8.5.0 (D)

cpe-prgenv-nvidia

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

cray-R

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

cray-ccdb

5.0.7 (D)

5.0.7 (D)

5.0.6 (D)

5.0.6 (D)

5.0.5 (D)

5.0.5 (D)

5.0.4 (D)

cray-cdst-support

2.14.6 (D)

2.14.6 (D)

2.14.5 (D)

2.14.5 (D)

2.14.3 (D)

cray-cti

2.20.0 (D)

2.20.0 (D)

2.19.1 (D)

2.19.1 (D)

2.19.0 (D)

2.19.0 (D)

2.18.4 (D)

cray-dsmml

0.3.1 (D)

0.3.1 (D)

0.3.1 (D)

0.3.0 (D)

0.3.0 (D)

0.3.0 (D)

0.3.0 (D)

cray-dwarf

2.0.0 (D)

2.0.0 (D)

0.11.1 (D)

0.11.1 (D)

0.11.0 (D)

0.11.0 (D)

0.9.2 (D)

cray-dyninst

12.3.6 (D)

12.3.6 (D)

12.3.5 (D)

12.3.5 (D)

12.3.4 (D)

12.3.4 (D)

12.3.2 (D)

cray-fftw

3.3.10.11 (D)

3.3.10.11 (D)

3.3.10.10 (D)

3.3.10.10 (D)

3.3.10.9 (D)

3.3.10.9 (D)

3.3.10.8 (D)

cray-hdf5

1.14.3.7 (D)

1.14.3.7 (D)

1.14.3.5 (D)

1.14.3.5 (D)

1.14.3.3 (D)

1.14.3.3 (D)

1.14.3.1 (D)

cray-libsci

25.09.0 (D)

25.09.0 (D)

25.03.0 (D)

25.03.0 (D)

24.11.0 (D)

24.11.0 (D)

24.07.0 (D)

cray-libsci-acc

25.09.0 (D)

25.03.0 (D)

25.03.0 (D)

24.11.0 (D)

24.11.0 (D)

24.07.0 (D)

cray-lmod

8.7.60 (D)

8.7.60 (D)

8.7.55 (D)

8.7.55 (D)

8.7.37 (D)

8.7.37 (D)

8.7.37 (D)

cray-modules

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

cray-mpich

8.1.33

8.1.33 (D)

8.1.32 (D)

8.1.32 (D)

8.1.31 (D)

8.1.31 (D)

8.1.30 (D)

cray-mpich

9.0.1 (D)

9.0.1 (D)

9.0.0

9.0.0

cray-mpixlate

1.0.7 (D)

1.0.7 (D)

1.0.6 (D)

1.0.6 (D)

1.0.5 (D)

cray-mrnet

5.1.6 (D)

5.1.6 (D)

5.1.5 (D)

5.1.5 (D)

5.1.4 (D)

5.1.4 (D)

5.1.3 (D)

cray-netcdf

4.9.2.1 (D)

4.9.2.1 (D)

4.9.0.17 (D)

4.9.0.17 (D)

4.9.0.15 (D)

4.9.0.15 (D)

4.9.0.13 (D)

cray-openshmemx

11.7.5 (D)

11.7.5 (D)

11.7.4 (D)

11.7.3 (D)

11.7.3 (D)

11.7.3 (D)

11.7.2 (D)

cray-papi

7.2.0.2 (D)

7.2.0.2 (D)

7.2.0.1 (D)

7.2.0.1 (D)

7.1.0.4 (D)

7.1.0.4 (D)

7.1.0.2 (D)

cray-parallel-netcdf

1.12.3.19 (D)

1.12.3.19 (D)

1.12.3.17 (D)

1.12.3.17 (D)

1.12.3.15 (D)

1.12.3.15 (D)

1.12.3.13 (D)

cray-pe-set-default

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

cray-pmi

6.1.16 (D)

6.1.16 (D)

6.1.15 (D)

6.1.15 (D)

6.1.15 (D)

6.1.15 (D)

6.1.15 (D)

cray-pmi-devel

6.1.16

6.1.16

6.1.15

6.1.15

6.1.15

6.1.15

6.1.15

cray-pmi-doc

6.1.16

6.1.16

6.1.15

6.1.15

6.1.15

6.1.15

6.1.15

cray-python

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

cray-stat

3.11.7 (D)

4.12.6 (D)

4.12.5 (D)

4.12.5 (D)

4.12.4 (D)

4.12.4 (D)

4.12.3 (D)

cray-ucx

cray-zmqnet

1.3.2 (D)

1.3.2 (D)

1.3.0 (D)

1.3.0 (D)

1.0.0 (D)

1.0.0 (D)

craype

2.7.35 (D)

2.7.35 (D)

2.7.34 (D)

2.7.34 (D)

2.7.33 (D)

2.7.33 (D)

2.7.32 (D)

craype-dl-plugin-ftr

craype-dl-plugin-py3

24.07.1 (D)

24.07.1 (D)

24.07.1 (D)

24.07.1 (D)

24.07.1 (D)

craype-targets-ex

1.16.0 (D)

1.16.0 (D)

1.15.1 (D)

1.15.1 (D)

1.15.0 (D)

1.15.0 (D)

1.13.2 (D)

craypkg-gen

1.3.36 (D)

1.3.36 (D)

1.3.35 (D)

1.3.35 (D)

1.3.34 (D)

1.3.34 (D)

1.3.33 (D)

forgesup

24.1.1

24.1.1

23.1.2

23.1.2

23.1.2

gdb4hpc

4.16.5 (D)

4.16.5 (D)

4.16.4 (D)

4.16.4 (D)

4.16.3 (D)

4.16.3 (D)

4.16.2 (D)

intel

lmod_scripts

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

nvhpc

24.3 (D)

nvidia

25.5 (D)

25.5 (D)

24.3 (D)

24.3 (D)

24.3 (D)

24.3 (D)

24.3 (D)

perftools

25.09.0 (D)

25.09.0 (D)

25.03.0 (D)

25.03.0 (D)

24.11.0 (D)

24.11.0 (D)

24.07.0 (D)

rocm

6.3.0

6.3.0

6.2.1

6.2.1

6.1.0 (D)

sanitizers4hpc

1.1.6 (D)

1.1.6 (D)

1.1.5 (D)

1.1.5 (D)

1.1.4 (D)

1.1.4 (D)

1.1.3 (D)

totalviewsup

2024.4.0

2024.4.0

2024.1.21

2024.1.21

2024.1.21

valgrind4hpc

2.13.6 (D)

2.13.6 (D)

2.13.5 (D)

2.13.5 (D)

2.13.4 (D)

2.13.4 (D)

2.13.3 (D)

SLES X86 support matrix

SLES on X86 systems is supported with CPE on HPE Cray Supercomputing EX systems with either CSM or HPCM. Below are the product components, modules, and third-party software versions supported with previous CPE releases with these configurations.

(D) represents the default version installed at installation.

* HPCM only

Release

CPE 25.09

CPE 25.09

CPE 25.03

CPE 25.03

24.11

24.11

24.07

Product/Version

sles15sp7 *

sles15sp6

sles15sp6

sles15sp5

sles15sp6

sles15sp5

sles15sp5

COS

25.9

25.9

25.1

24.7

24.7

COS Base

N/A

N/A

3.3.0

3.1.0

3.2.0

3.1.0

3.1.0

CSM

Not supported

1.7

1.6.1

1.6.1

1.6

1.6

1.5

HPCM

1.14

1.14

1.13

1.13

1.12

1.12

1.11

USS

1.4.0

1.4.0

1.3.0

1.1.0

1.2.0

1.1.0

1.1.0

amd

6.4.1 (D)

6.4.1 (D)

6.2.1 (D)

6.2.1 (D)

6.2.1 (D)

6.2.1 (D)

aocc

5.0.0 (D)

5.0.0 (D)

5.0.0 (D)

5.0.0 (D)

4.2.0 (D)

4.2.0 (D)

4.2.0 (D)

atp

3.15.7 (D)

3.15.7 (D)

3.15.6 (D)

3.15.6 (D)

3.15.5 (D)

3.15.5 (D)

3.15.4 (D)

cce

20.0.0

20.0.0

19.0.0 (D)

19.0.0 (D)

18.0.1 (D)

18.0.1 (D)

18.0.0 (D)

cpe-gcc-mpfr

3.1.4 (D)

3.1.4 (D)

3.1.4 (D)

3.1.4 (D)

3.1.4 (D)

cpe-gcc-native

14.2 (D)

14.2 (D)

cpe-gcc-native

14 (D)

14

13.2

13.2

13.2 (D)

13.2 (D)

13.2 (D)

cpe-gcc-native

13

12.3

12.3

12.3

12.3

12.3

cpe-gcc-native

12

12.3

cpe-prgenv-amd

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

cpe-prgenv-aocc

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

cpe-prgenv-cray

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

cpe-prgenv-cray-amd

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

cpe-prgenv-gnu

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

cpe-prgenv-gnu-amd

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

cpe-prgenv-intel

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

cpe-prgenv-nvidia

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

4.4.0 (D)

4.4.0 (D)

8.5.0 (D)

cray-R

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

cray-ccdb

5.0.7 (D)

5.0.7 (D)

5.0.6 (D)

5.0.6 (D)

5.0.5 (D)

5.0.5 (D)

5.0.4 (D)

cray-cdst-support

2.14.6 (D)

2.14.6 (D)

2.14.5 (D)

2.14.5 (D)

2.14.3 (D)

cray-cti

2.20.0 (D)

2.20.0 (D)

2.19.1 (D)

2.19.1 (D)

2.19.0 (D)

2.19.0 (D)

2.18.4 (D)

cray-dsmml

0.3.1 (D)

0.3.1 (D)

0.3.1 (D)

0.3.1 (D)

0.3.0 (D)

0.3.0 (D)

0.3.0 (D)

cray-dwarf

2.0.0 (D)

2.0.0 (D)

0.11.1 (D)

0.11.1 (D)

0.11.0 (D)

0.11.0 (D)

0.9.2 (D)

cray-dyninst

12.3.6 (D)

12.3.6 (D)

12.3.5 (D)

12.3.5 (D)

12.3.4 (D)

12.3.4 (D)

12.3.2 (D)

cray-fftw

3.3.10.11 (D)

3.3.10.11 (D)

3.3.10.10 (D)

3.3.10.10 (D)

3.3.10.9 (D)

3.3.10.9 (D)

3.3.10.8 (D)

cray-hdf5

1.14.3.7 (D)

1.14.3.7 (D)

1.14.3.5 (D)

1.14.3.5 (D)

1.14.3.3 (D)

1.14.3.3 (D)

1.14.3.1 (D)

cray-libsci

25.09.0 (D)

25.09.0 (D)

25.03.0 (D)

25.03.0 (D)

24.11.0 (D)

24.11.0 (D)

24.07.0 (D)

cray-libsci-acc

25.09.0 (D)

25.09.0 (D)

25.03.0 (D)

25.03.0 (D)

24.11.0 (D)

24.11.0 (D)

24.07.0 (D)

cray-lmod

8.7.60 (D)

8.7.60 (D)

8.7.55 (D)

8.7.55 (D)

8.7.37 (D)

8.7.37 (D)

8.7.37 (D)

cray-modules

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

cray-mpich

8.1.33

8.1.33

8.1.32 (D)

8.1.32 (D)

8.1.31 (D)

8.1.31 (D)

8.1.30 (D)

cray-mpich

9.0.1

9.0.1

9.0.0

9.0.0

cray-mpixlate

1.0.7 (D)

1.0.7 (D)

1.0.6 (D)

1.0.6 (D)

1.0.5 (D)

cray-mrnet

5.1.6 (D)

5.1.6 (D)

5.1.5 (D)

5.1.5 (D)

5.1.4 (D)

5.1.4 (D)

5.1.3 (D)

cray-netcdf

4.9.2.1 (D)

4.9.2.1 (D)

4.9.0.17 (D)

4.9.0.17 (D)

4.9.0.15 (D)

4.9.0.15 (D)

4.9.0.13 (D)

cray-openshmemx

11.7.5 (D)

11.7.5 (D)

11.7.4 (D)

11.7.4 (D)

11.7.3 (D)

11.7.3 (D)

11.7.2 (D)

cray-pals

1.3.2

cray-papi

7.2.0.2 (D)

7.2.0.2 (D)

7.2.0.1 (D)

7.2.0.1 (D)

7.1.0.4 (D)

7.1.0.4 (D)

7.1.0.2 (D)

cray-parallel-netcdf

1.12.3.19 (D)

1.12.3.19 (D)

1.12.3.17 (D)

1.12.3.17 (D)

1.12.3.15 (D)

1.12.3.15 (D)

1.12.3.13 (D)

cray-pe-set-default

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

cray-pmi

6.1.16 (D)

6.1.16 (D)

6.1.15 (D)

6.1.15 (D)

6.1.15 (D)

6.1.15 (D)

6.1.15 (D)

cray-pmi-devel

6.1.16

6.1.16

6.1.15

6.1.15

6.1.15

6.1.15

6.1.15

cray-pmi-doc

6.1.16

6.1.16

6.1.15

6.1.15

6.1.15

6.1.15

6.1.15

cray-python

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

cray-stat

4.12.6 (D)

4.12.6 (D)

4.12.5 (D)

4.12.5 (D)

4.12.4 (D)

4.12.4 (D)

4.12.3 (D)

cray-ucx

2.12.0 (D)

2.12.0 (D)

2.12.0 (D)

2.12.0 (D)

2.12.0 (D)

cray-zmqnet

1.3.2 (D)

1.3.2 (D)

1.3.1 (D)

1.3.1 (D)

1.0.0 (D)

1.0.0 (D)

craype

2.7.35 (D)

2.7.35 (D)

2.7.34 (D)

2.7.34 (D)

2.7.33 (D)

2.7.33 (D)

2.7.32 (D)

craype-dl-plugin-ftr

22.06.1.2 (D)

22.06.1.2 (D)

22.06.1.2 (D)

22.06.1.2 (D)

22.06.1.2 (D)

craype-dl-plugin-py3

21.04.1

21.04.1

21.04.1

21.04.1

21.04.1

craype-dl-plugin-py3

22.06.1.2

22.06.1.2

22.06.1.2

22.06.1.2

22.06.1.2

craype-dl-plugin-py3

22.08.1

22.08.1

22.08.1

22.08.1

22.08.1

craype-dl-plugin-py3

22.09.1

22.09.1

22.09.1

22.09.1

22.09.1

craype-dl-plugin-py3

22.12.1

22.12.1

22.12.1

22.12.1

22.12.1

craype-dl-plugin-py3

23.09.1

23.09.1

23.09.1

23.09.1

23.09.1

craype-dl-plugin-py3

24.03.1 (D)

24.03.1 (D)

24.03.1 (D)

24.03.1 (D)

24.03.1 (D)

craype-targets-ex

1.16.0 (D)

1.16.0 (D)

1.15.1 (D)

1.15.1 (D)

1.15.0 (D)

1.15.0 (D)

1.13.2 (D)

craypkg-gen

1.3.36 (D)

1.3.36 (D)

1.3.35 (D)

1.3.35 (D)

1.3.34 (D)

1.3.34 (D)

1.3.33 (D)

forgesup

24.1.1

24.1.1

23.1.2

23.1.2

23.1.2

gdb4hpc

4.16.5 (D)

4.16.5 (D)

4.16.4 (D)

4.16.4 (D)

4.16.3 (D)

4.16.3 (D)

4.16.2 (D)

intel

2025.1 (D)

2025.1 (D)

2025.0 (D)

2025.0 (D)

2024.2 (D)

2024.2 (D)

2024.0 (D)

lmod_scripts

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

nvhpc

24.3 (D)

nvidia

25.5 (D)

25.5 (D)

24.3 (D)

24.3 (D)

24.3 (D)

24.3 (D)

24.3 (D)

perftools

25.09.0 (D)

25.09.0 (D)

25.03.0 (D)

25.03.0 (D)

24.11.0 (D)

24.11.0 (D)

24.07.0 (D)

rocm

6.4.1 (D)

6.4.1 (D)

6.3.0

6.3.0

6.2.1

6.2.1

6.1.0 (D)

sanitizers4hpc

1.1.6 (D)

1.1.6 (D)

1.1.5 (D)

1.1.5 (D)

1.1.4 (D)

1.1.4 (D)

1.1.3 (D)

totalviewsup

2024.4.0

2024.4.0

2024.1.21

2024.1.21

2024.1.21

valgrind4hpc

2.13.6 (D)

2.13.6 (D)

2.13.5 (D)

2.13.5 (D)

2.13.4 (D)

2.13.4 (D)

2.13.3 (D)

CPE release matrices for RHEL

CPE supports various RHEL-based software components, including RHEL for AArch64 and x86 architectures. These components include compilers, libraries, debugging/profiling tools, programming models, and so forth. Supported versions of these components are updated with each release of CPE. This section lists which RHEL-based component versions are supported for each CPE release.

RHEL AArch64 support matrix

RHEL on AArch64 systems is supported with CPE on HPE Cray Supercomputing EX systems with HPCM. Below are the product components, modules, and third-party software versions supported with previous CPE releases with these configurations.

(D) represents the default version installed at installation.

Release

CPE 25.09

CPE 25.09

CPE 25.03

CPE 25.03

24.11

24.07

rhel96

rhel95

rhel95

rhel94

rhel94

rhel94

Product

aarch64

aarch64

aarch64

aarch64

aarch64

aarch64

HPCM

1.13

1.13

1.12

1.11

amd

6.2.1 (D)

6.2.1 (D)

6.2.1 (D)

6.2.1 (D)

aocc

5.0.0 (D)

5.0.0 (D)

5.0.0 (D)

4.2.0 (D)

atp

3.15.7 (D)

3.15.7 (D)

3.15.6 (D)

3.15.6 (D)

3.15.5 (D)

3.15.4 (D)

cce

20.0.0

20.0.0

19.0.0 (D)

19.0.0 (D)

18.0.1 (D)

18.0.0 (D)

cpe-gcc-mpfr

cpe-gcc-native

cpe-gcc-native

14.2 (D)

cpe-gcc-native

14 (D)

14 (D)

13.3

13.2 (D)

13.2 (D)

13.2 (D)

cpe-gcc-native

13

12.2

12.2

12.2

12.2

cpe-gcc-native

12

cpe-prgenv-amd

cpe-prgenv-aocc

cpe-prgenv-cray

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

cpe-prgenv-cray-amd

cpe-prgenv-gnu

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

cpe-prgenv-gnu-amd

cpe-prgenv-intel

cpe-prgenv-nvidia

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

cray-R

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

cray-ccdb

5.0.7 (D)

5.0.7 (D)

5.0.6 (D)

5.0.6 (D)

5.0.5 (D)

5.0.4 (D)

cray-cdst-support

2.14.6 (D)

2.14.6 (D)

2.14.5 (D)

2.14.3 (D)

cray-cti

2.20.0 (D)

2.20.0 (D)

2.19.1 (D)

2.19.1 (D)

2.19.0 (D)

2.18.4 (D)

cray-cti

cray-dsmml

0.3.1 (D)

0.3.1 (D)

0.3.1 (D)

0.3.1 (D)

0.3.0 (D)

0.3.0 (D)

cray-dwarf

2.0.0 (D)

2.0.0 (D)

0.11.1 (D)

0.11.1 (D)

0.11.0 (D)

0.9.2 (D)

cray-dyninst

12.3.6 (D)

12.3.6 (D)

12.3.5 (D)

12.3.5 (D)

12.3.4 (D)

12.3.2 (D)

cray-fftw

3.3.10.11 (D)

3.3.10.11 (D)

3.3.10.10 (D)

3.3.10.10 (D)

3.3.10.9 (D)

3.3.10.8 (D)

cray-hdf5

1.14.3.7 (D)

1.14.3.7 (D)

1.14.3.5 (D)

1.14.3.5 (D)

1.14.3.3 (D)

1.14.3.1 (D)

cray-libsci

25.09.0 (D)

25.09.0 (D)

25.03.0 (D)

25.03.0 (D)

24.11.0 (D)

24.07.0 (D)

cray-libsci-acc

25.09.0 (D)

25.09.0 (D)

25.03.0 (D)

25.03.0 (D)

24.11.0 (D)

24.07.0 (D)

cray-lmod

8.7.60 (D)

8.7.60 (D)

8.7.55 (D)

8.7.55 (D)

8.7.37 (D)

8.7.37 (D)

cray-modules

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

cray-mpich

9.0.1 (D)

9.0.1 (D)

cray-mpich

9.0.0

9.0.0

cray-mpich

8.1.33

8.1.33

8.1.32 (D)

8.1.32 (D)

8.1.31 (D)

8.1.30 (D)

cray-mpixlate

1.0.7 (D)

1.0.7 (D)

1.0.6 (D)

1.0.5 (D)

cray-mrnet

5.1.6 (D)

5.1.6 (D)

5.1.5 (D)

5.1.5 (D)

5.1.4 (D)

5.1.3 (D)

cray-netcdf

4.9.2.1 (D)

4.9.2.1 (D)

4.9.0.17 (D)

4.9.0.17 (D)

4.9.0.15 (D)

4.9.0.13 (D)

cray-openshmemx

11.7.5 (D)

11.7.5 (D)

11.7.4 (D)

11.7.4 (D)

11.7.3 (D)

11.7.2 (D)

cray-papi

7.2.0.2 (D)

7.2.0.2 (D)

7.2.0.1 (D)

7.2.0.1 (D)

7.1.0.4 (D)

7.1.0.2 (D)

cray-parallel-netcdf

1.12.3.19 (D)

1.12.3.19 (D)

1.12.3.17 (D)

1.12.3.17 (D)

1.12.3.15 (D)

1.12.3.13 (D)

cray-pe-set-default

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

cray-pmi

6.1.16 (D)

6.1.16 (D)

6.1.15 (D)

6.1.15 (D)

6.1.15 (D)

6.1.15 (D)

cray-pmi-devel

6.1.16

6.1.16

6.1.15

6.1.15

6.1.15

6.1.15

cray-pmi-doc

6.1.16

6.1.16

6.1.15

6.1.15

6.1.15

6.1.15

cray-python

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

cray-stat

4.12.6 (D)

4.12.6 (D)

4.12.5 (D)

4.12.5 (D)

4.12.4 (D)

4.12.3 (D)

cray-ucx

cray-zmqnet

1.3.2 (D)

1.3.2 (D)

1.3.0 (D)

1.3.0 (D)

1.0.0 (D)

craype

2.7.35 (D)

2.7.35 (D)

2.7.34 (D)

2.7.34 (D)

2.7.33 (D)

2.7.32 (D)

craype-dl-plugin-ftr

craype-dl-plugin-py3

craype-targets-ex

craypkg-gen

1.3.36 (D)

1.3.36 (D)

1.3.35 (D)

1.3.35 (D)

forgesup

24.1.1

24.1.1

gdb4hpc

4.16.5 (D)

4.16.5 (D)

4.16.4 (D)

4.16.4 (D)

4.16.3 (D)

4.16.2 (D)

intel

2025.0 (D)

2025.0 (D)

2025.0 (D)

2024.2 (D)

lmod_scripts

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

nvhpc

24.3 (D)

nvidia

25.5 (D)

25.5 (D)

24.3 (D)

24.3 (D)

24.3 (D)

24.3 (D)

perftools

25.09.0 (D)

25.09.0 (D)

25.03.0 (D)

25.03.0 (D)

24.11.0 (D)

24.07.0 (D)

rocm

6.3.0

6.3.0

6.2.1

6.1.0 (D)

sanitizers4hpc

1.1.6 (D)

1.1.6 (D)

1.1.5 (D)

1.1.5 (D)

1.1.4 (D)

1.1.3 (D)

totalviewsup

2024.4.0

2024.4.0

2024.1.21

2024.1.21

valgrind4hpc

2.13.6 (D)

2.13.6 (D)

2.13.5 (D)

2.13.5 (D)

2.13.4 (D)

2.13.3 (D)

RHEL X86 support matrix

RHEL on X86 systems is supported with CPE on HPE Cray Supercomputing EX systems with HPCM. The following matrix lists the product components, modules, and third-party software versions supported with previous CPE releases in these configurations.

(D) represents the default version installed at installation.

Release

25.09

25.09

25.09

25.03

25.03

25.03

24.11

24.11

24.07

24.07

rhel96

rhel95

rhel810

rhel95

rhel94

rhel810

rhel94

rhel810

rhel94

rhel810

Product

(X86)

(X86)

(X86)

(X86)

(X86)

(X86)

(X86)

(X86)

(X86)

HPCM

1.14

1.14

1.14

1.13

1.13

1.13

1.12

1.12

1.12

1.11

amd

6.4.1 (D)

6.4.1 (D)

6.4.1 (D)

6.2.1 (D)

6.2.1 (D)

6.2.1 (D)

6.2.1 (D)

6.2.1 (D)

6.2.1 (D)

6.2.1 (D)

aocc

5.0.0 (D)

5.0.0 (D)

5.0.0 (D)

5.0.0 (D)

5.0.0 (D)

5.0.0 (D)

4.2.0 (D)

4.2.0 (D)

4.2.0 (D)

4.2.0 (D)

atp

3.15.7 (D)

3.15.7 (D)

3.15.7 (D)

3.15.6 (D)

3.15.6 (D)

3.15.6 (D)

3.15.5 (D)

3.15.5 (D)

3.15.4 (D)

3.15.4 (D)

cce

20.0.0

20.0.0

20.0.0

19.0.0 (D)

19.0.0 (D)

19.0.0 (D)

18.0.1 (D)

18.0.1 (D)

18.0.0 (D)

18.0.0 (D)

cpe-gcc-mpfr

cpe-gcc-native

12 (D)

13 (D)

cpe-gcc-native

14 (D)

12

10.3

cpe-gcc-native

14 (D)

13

cpe-gcc-native

12.2

12.2

10.3

12.2

10.3

12.2

10.3

cpe-gcc-native

13.3

13.2 (D)

11.2

13.2 (D)

11.2

13.2 (D)

11.2

cpe-gcc-native

14.2 (D)

12.2

cpe-prgenv-amd

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

8.5.0 (D)

cpe-prgenv-aocc

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

8.5.0 (D)

cpe-prgenv-cray

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

8.5.0 (D)

cpe-prgenv-cray-amd

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

8.5.0 (D)

cpe-prgenv-gnu

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

8.5.0 (D)

cpe-prgenv-gnu-amd

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

8.5.0 (D)

cpe-prgenv-intel

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

8.5.0 (D)

cpe-prgenv-nvhpc

8.5.0 (D)

8.5.0 (D)

cpe-prgenv-nvidia

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

8.5.0 (D)

8.5.0 (D)

cray-R

8.6.0 (D)

8.6.0 (D)

8.6.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

4.4.0 (D)

cray-ccdb

5.0.7 (D)

5.0.7 (D)

5.0.7 (D)

5.0.6 (D)

5.0.6 (D)

5.0.6 (D)

5.0.5 (D)

5.0.5 (D)

5.0.4 (D)

5.0.4 (D)

cray-cdst-support

2.14.6 (D)

2.14.6 (D)

2.14.6 (D)

2.14.5 (D)

2.14.5 (D)

2.14.3 (D)

2.14.3 (D)

cray-cti

2.20.0 (D)

2.20.0 (D)

2.20.0 (D)

2.19.1 (D)

2.19.1 (D)

2.19.1 (D)

2.19.0 (D)

2.19.0 (D)

2.18.4 (D)

2.18.4 (D)

cray-dsmml

0.3.1 (D)

0.3.1 (D)

0.3.1 (D)

0.3.1 (D)

0.3.1 (D)

0.3.1 (D)

0.3.0 (D)

0.3.0 (D)

0.3.0 (D)

0.3.0 (D)

cray-dwarf

2.0.0 (D)

2.0.0 (D)

2.0.0 (D)

0.11.1 (D)

0.11.1 (D)

0.11.1 (D)

0.11.0 (D)

0.11.0 (D)

0.9.2 (D)

0.9.2 (D)

cray-dyninst

12.3.6 (D)

12.3.6 (D)

12.3.6 (D)

12.3.5 (D)

12.3.5 (D)

12.3.5 (D)

12.3.4 (D)

12.3.4 (D)

12.3.2 (D)

12.3.2 (D)

cray-fftw

3.3.10.11 (D)

3.3.10.11 (D)

3.3.10.11 (D)

3.3.10.10 (D)

3.3.10.10 (D)

3.3.10.10 (D)

3.3.10.9 (D)

3.3.10.9 (D)

3.3.10.8 (D)

3.3.10.8 (D)

cray-hdf5

1.14.3.7 (D)

1.14.3.7 (D)

1.14.3.7 (D)

1.14.3.5 (D)

1.14.3.5 (D)

1.14.3.3 (D)

1.14.3.3 (D)

1.14.3.5 (D)

1.14.3.1 (D)

1.14.3.1 (D)

cray-libsci

25.09.0 (D)

25.09.0 (D)

25.09.0 (D)

25.03.0 (D)

25.03.0 (D)

25.03.0 (D)

24.11.0 (D)

24.11.0 (D)

24.07.0 (D)

24.07.0 (D)

cray-libsci-acc

25.09.0 (D)

25.09.0 (D)

25.09.0 (D)

25.03.0 (D)

25.03.0 (D)

25.03.0 (D)

24.11.0 (D)

24.11.0 (D)

24.07.0 (D)

24.07.0 (D)

cray-lmod

8.7.60 (D)

8.7.60 (D)

8.7.60 (D)

8.7.55 (D)

8.7.55 (D)

8.7.55 (D)

8.7.37 (D)

8.7.37 (D)

8.7.37 (D)

8.7.37 (D)

cray-modules

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

3.2.11.7 (D)

cray-mpich

8.1.33

8.1.33

8.1.33

8.1.32 (D)

8.1.32 (D)

8.1.32 (D)

8.1.31 (D)

8.1.31 (D)

8.1.30 (D)

8.1.30 (D)

cray-mpich

9.0.1 (D)

9.0.1 (D)

9.0.1 (D)

9.0.0

9.0.0

9.0.0

cray-mpixlate

1.0.7 (D)

1.0.7 (D)

1.0.7 (D)

1.0.6 (D)

1.0.6 (D)

1.0.5 (D)

1.0.5 (D)

cray-mrnet

5.1.6 (D)

5.1.6 (D)

5.1.6 (D)

5.1.5 (D)

5.1.5 (D)

5.1.5 (D)

5.1.4 (D)

5.1.4 (D)

5.1.3 (D)

5.1.3 (D)

cray-netcdf

4.9.2.1 (D)

4.9.2.1 (D)

4.9.2.1 (D)

4.9.0.17 (D)

4.9.0.17 (D)

4.9.0.17 (D)

4.9.0.15 (D)

4.9.0.15 (D)

4.9.0.13 (D)

4.9.0.13 (D)

cray-openshmemx

11.7.5 (D)

11.7.5 (D)

11.7.5 (D)

1.12.3.17 (D)

1.12.3.17 (D)

1.12.3.17 (D)

1.12.3.15 (D)

1.12.3.15 (D)

1.12.3.13 (D)

1.12.3.13 (D)

cray-papi

7.2.0.2 (D)

7.2.0.2 (D)

7.2.0.2 (D)

7.2.0.1 (D)

7.2.0.1 (D)

7.2.0.1 (D)

7.1.0.4 (D)

7.1.0.4 (D)

7.1.0.2 (D)

7.1.0.2 (D)

cray-parallel-netcdf

1.12.3.19 (D)

1.12.3.19 (D)

1.12.3.19 (D)

1.12.3.17 (D)

1.12.3.17 (D)

1.12.3.17 (D)

1.12.3.15 (D)

1.12.3.15 (D)

1.12.3.13 (D)

1.12.3.13 (D)

cray-pe-set-default

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

3.3 (D)

cray-pmi

6.1.16 (D)

6.1.16 (D)

6.1.16 (D)

6.1.15 (D)

6.1.15 (D)

6.1.15 (D)

6.1.15 (D)

6.1.15 (D)

6.1.15 (D)

6.1.15 (D)

cray-pmi-devel

6.1.16 (D)

6.1.16 (D)

6.1.16 (D)

6.1.15

6.1.15

6.1.15

6.1.15

6.1.15

6.1.15

6.1.15

cray-pmi-doc

6.1.16 (D)

6.1.16 (D)

6.1.16 (D)

6.1.15

6.1.15

6.1.15

6.1.15

6.1.15

6.1.15

6.1.15

cray-python

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

3.11.7 (D)

cray-stat

4.12.6 (D)

4.12.6 (D)

4.12.6 (D)

4.12.5 (D)

4.12.5 (D)

4.12.5 (D)

4.12.4 (D)

4.12.4 (D)

4.12.3 (D)

4.12.3 (D)

cray-zmqnet

1.3.2 (D)

1.3.2 (D)

1.3.2 (D)

1.3.0 (D)

1.3.0 (D)

1.3.0 (D)

1.0.0 (D)

1.0.0 (D)

craype

2.7.35 (D)

2.7.35 (D)

2.7.35 (D)

2.7.34 (D)

2.7.34 (D)

2.7.34 (D)

2.7.33 (D)

2.7.33 (D)

2.7.32 (D)

2.7.32 (D)

craype-dl-plugin-ftr

22.06.1.2 (D)

22.06.1.2 (D)

22.06.1.2 (D)

craype-dl-plugin-py3

21.02.1.3

21.02.1.3

21.02.1.3

craype-dl-plugin-py3

22.09.1

22.09.1

22.09.1

craype-dl-plugin-py3

22.12.1 (D)

22.12.1 (D)

22.12.1 (D)

craype-targets-ex

craypkg-gen

1.3.36 (D)

1.3.36 (D)

1.3.36 (D)

1.3.35 (D)

1.3.35 (D)

1.3.35 (D)

1.3.34 (D)

1.3.34 (D)

1.3.33 (D)

1.3.33 (D)

forgesup

24.1.1

24.1.1

24.1.1

gdb4hpc

4.16.5 (D)

4.16.5 (D)

4.16.5 (D)

4.16.4 (D)

4.16.4 (D)

4.16.4 (D)

4.16.3 (D)

4.16.3 (D)

4.16.2 (D)

4.16.2 (D)

intel

2025.1 (D)

2025.1 (D)

2025.1 (D)

2025.0 (D)

2025.0 (D)

2025.0 (D)

2024.2 (D)

2024.2 (D)

2024.0 (D)

2024.0 (D)

lmod_scripts

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

3.2.1 (D)

nvhpc

24.3 (D)

24.3 (D)

nvidia

25.5 (D)

25.5 (D)

25.5 (D)

24.3 (D)

24.3 (D)

24.3 (D)

24.3 (D)

24.3 (D)

24.3 (D)

24.3 (D)

perftools

25.09.0 (D)

25.09.0 (D)

25.09.0 (D)

25.03.0 (D)

25.03.0 (D)

25.03.0 (D)

24.11.0 (D)

24.11.0 (D)

24.07.0 (D)

24.07.0 (D)

rocm

6.4.1 (D)

6.4.1 (D)

6.4.1 (D)

6.3.0

6.3.0

6.2.1

6.1.0 (D)

6.1.0 (D)

6.1.0 (D)

6.1.0 (D)

sanitizers4hpc

1.1.6 (D)

1.1.6 (D)

1.1.6 (D)

1.1.5 (D)

1.1.5 (D)

1.1.5 (D)

1.1.4 (D)

1.1.4 (D)

1.1.3 (D)

1.1.3 (D)

totalviewsup

2024.4.0

2024.4.0

2024.4.0

2024.1.21

2024.1.21

2024.1.21

2024.1.21

valgrind4hpc

2.13.6 (D)

2.13.6 (D)

2.13.6 (D)

2.13.5 (D)

2.13.5 (D)

2.13.5 (D)

2.13.4 (D)

2.13.4 (D)

2.13.3 (D)

2.13.3 (D)

Documentation and support

Documentation is available as a resource for using and managing CPE. This chapter provides details for obtaining CPE support and accessing available resources.

CPE installation and getting started guides

HPE CPE documentation comprises user and installation guides:

| Title | Document Part Number |
| --- | --- |
| HPE Cray Supercomputing Programming Environment Installation Guide: CSM on HPE Cray Supercomputing EX Systems | S-8003 |
| HPE Cray Supercomputing Programming Environment Installation Guide: HPCM on HPE Cray Supercomputing EX and HPE Cray Supercomputing Systems | S-8022 |
| HPE Cray Supercomputing Programming Environment Installation Guide: HPE Cray XD2000 Systems | S-8012 |
| HPE Cray Supercomputing Programming Environment Getting Started User Guide: HPE Cray Supercomputing EX Systems | S-9934 |
| HPE Cray Supercomputing Programming Environment Getting Started Administrator Guide: HPE Cray Supercomputing EX Systems | S-9935 |

Other documentation resources

HPE provides CPE documentation and support through various online sources:

Glossary

This section provides a listing of CPE general terms and definitions.

A

  • Adaptive Routing (AR): A technology that dynamically selects the best path for data packets in a network to improve performance and fault tolerance.

  • Apprentice3: A performance analysis tool that provides a graphical interface for visualizing performance data collected by HPE CrayPAT.

    • Command: app3

    • Module: module load app3

B

  • Batch System: Software that manages and schedules jobs on a supercomputer, ensuring efficient use of computational resources.

C

  • Cache Optimization: Techniques for optimizing data structures and algorithms to take advantage of cache locality to improve performance.

  • CCE (Cray Compiling Environment): HPE Cray’s native compiler suite for C, C++, and Fortran, optimized for Cray hardware.

    • Commands:

      • cc for C

      • CC for C++

      • ftn for Fortran

  • CrayPAT (Cray Performance Analysis Tools): A suite of tools for collecting and analyzing performance data of parallel applications.

    • Commands:

      • pat_build to instrument an application

      • pat_report to generate a performance report

    • Module: module load perftools

D

  • DataWarp: A technology for accelerating I/O by using SSD-based storage to provide a high-speed buffer between compute nodes and the parallel file system.

  • Distributed Debugging Tool (DDT): A specialized debugger for parallel applications, including MPI and OpenMP programs. It allows developers to determine the state of processes running together across cluster nodes. CPE supports integration with parallel debuggers such as Perforce TotalView and Allinea DDT.

    • Command: ddt

    • Module: module load ddt

E

  • Environment Groups: Logical groupings of environment variables and module settings to simplify switching between different development environments.

    • Commands:

      • envmgr activate <group_name>

      • envmgr deactivate <group_name>

  • Environment Variables: Variables used to configure the runtime environment, such as PATH, LD_LIBRARY_PATH and MODULEPATH.
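As an illustration of how a colon-separated variable such as PATH is typically adjusted, the following sketch prepends a directory only if it is not already listed. The function name prepend_path and the directory /opt/example/bin are hypothetical and are not part of CPE.

```shell
#!/bin/bash
# Hypothetical helper (not part of CPE): print a colon-separated list
# with the directory added at the front unless it is already present.
prepend_path() {
    local current="$1"   # existing value, for example "$PATH"
    local dir="$2"       # directory to add
    case ":${current}:" in
        *":${dir}:"*) printf '%s\n' "${current}" ;;          # already listed
        "::")         printf '%s\n' "${dir}" ;;              # variable was empty
        *)            printf '%s\n' "${dir}:${current}" ;;   # prepend
    esac
}

# Example: /opt/example/bin is a placeholder directory.
PATH="$(prepend_path "${PATH}" /opt/example/bin)"
export PATH
```

Checking for an existing entry keeps the variable from growing each time a login script is sourced again.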

F

  • File Striping: A method of dividing a file into segments and distributing them across multiple disks to improve I/O performance.

    • Command: lfs setstripe -S 1M -c -1 <path>

  • Finite Element Analysis (FEA): A computational technique used to approximate solutions to complex structural engineering problems.

  • FFTW (cray-fftw): An optimized and scalable library for computing Fast Fourier Transforms (FFTs) on HPE Cray Supercomputing EX systems, for use in scientific and engineering applications.

    • Module: module load cray-fftw; gcc -o my_fft_program my_fft_program.c -lfftw3
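To make the File Striping entry above more concrete, the following sketch works out which stripe target holds a given byte offset. It assumes segments are placed round-robin across the targets, which is a simplification for illustration. The function name stripe_target is hypothetical, and the code performs arithmetic only; it does not query a file system.

```shell
#!/bin/bash
# Illustrative arithmetic only; it does not query a file system.
# Print the index (0-based) of the stripe target holding a byte offset,
# assuming segments are placed round-robin across the targets.
stripe_target() {
    local offset="$1"        # byte offset within the file
    local stripe_size="$2"   # segment size in bytes
    local stripe_count="$3"  # number of targets the file is spread over
    echo $(( (offset / stripe_size) % stripe_count ))
}

# With 1 MiB segments over 4 targets, offset 5 MiB falls in segment 5,
# which wraps around to target 1.
stripe_target $(( 5 * 1024 * 1024 )) $(( 1024 * 1024 )) 4
```

Because consecutive segments land on different targets, a large sequential read or write can be serviced by several disks at once.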

G

  • GCC (GNU Compiler Collection): A widely used alternative compiler suite that supports various programming languages.

    • Commands:

      • gcc for C

      • g++ for C++

      • gfortran for Fortran

  • gdb4hpc (HPE Cray gdb-based HPC Debugger): Advanced HPC debugger for complex applications at scale.

    • Command: gdb4hpc

    • Module: module load gdb4hpc

H

  • HDF5 (cray-hdf5 and cray-hdf5-parallel): A data model, library, and file format for storing and managing large amounts of data.

    • Module: module load cray-hdf5

  • Hybrid Parallel Programming: Combining MPI with OpenMP or other parallel programming models to leverage both inter-node and intra-node parallelism.

  • Huge pages: A Linux kernel feature that allows the operating system to manage memory in pages larger than the standard 4 KB size, which improves the efficiency of the virtual memory system.
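The effect of page size can be seen with simple arithmetic. The sketch below counts the pages needed to map a 1 GiB region with 4 KiB pages and with 2 MiB pages. The 2 MiB size is an assumption chosen for illustration (available huge page sizes depend on the system), and the function name pages_needed is hypothetical.

```shell
#!/bin/bash
# Illustrative arithmetic only: number of pages needed to map a region.
pages_needed() {
    local region_bytes="$1"
    local page_bytes="$2"
    # Round up so that a partially filled page is still counted.
    echo $(( (region_bytes + page_bytes - 1) / page_bytes ))
}

one_gib=$(( 1024 * 1024 * 1024 ))
echo "4 KiB pages: $(pages_needed "${one_gib}" 4096)"                      # 262144
echo "2 MiB pages: $(pages_needed "${one_gib}" $(( 2 * 1024 * 1024 )))"    # 512
```

Fewer pages means fewer address translations for the system to track for the same amount of memory.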

I

  • Intel Compiler: A suite of compilers optimized for Intel architectures.

    • Commands:

      • icc for C

      • icpc for C++

      • ifort for Fortran

J

  • Job Arrays: A method to submit multiple similar jobs using a single job script.

    • Slurm Command: sbatch --array=0-9 my_job_script.sh

    • PBS Command: qsub -t 0-9 my_job_script.sh

  • Job Scheduler: A system that manages and schedules jobs on a supercomputer, ensuring efficient use of computational resources.

    • Slurm: sbatch, squeue, scancel

    • PBS: qsub, qstat, qdel
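For the Job Arrays entry above, a job script usually selects its work from the array index. The following is a minimal sketch of such a script for Slurm, which sets SLURM_ARRAY_TASK_ID for each array task. The input_<N>.dat naming scheme and the function name input_for_task are hypothetical.

```shell
#!/bin/bash
#SBATCH --array=0-9
# Sketch of a job script body for a Slurm job array.
# Slurm sets SLURM_ARRAY_TASK_ID for each array task; the default of 0
# only lets the script run outside the scheduler for a quick check.
# The input_<N>.dat naming scheme is a hypothetical example.
input_for_task() {
    printf 'input_%s.dat\n' "$1"
}

task_id="${SLURM_ARRAY_TASK_ID:-0}"
echo "Array task ${task_id} would process $(input_for_task "${task_id}")"
```

With the directive in the script, the array range does not need to be repeated on the sbatch command line.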

L

  • Low Level Virtual Machine (LLVM): A compiler and toolchain technology from the LLVM Foundation, used to build compilers, debuggers, and other software development tools. In CPE, LLVM is used in conjunction with Clang to generate optimized code. HPE Clang C and C++ is based on Clang/LLVM. See the HPE Cray Clang C and C++ Quick Reference documentation for information on HPE Clang C and C++, the Clang documentation for more information on Clang, or the LLVM documentation for more information on LLVM.

  • Lustre: A type of parallel distributed file system, primarily used for large-scale cluster computing.

    • Command: lfs

  • LibSci (cray-libsci): A collection of scientific libraries optimized for Cray systems, including LAPACK, BLAS, and ScaLAPACK.

    • Module: module load cray-libsci

  • LibSci_ACC (cray-libsci_acc): An extension of HPE Cray LibSci that includes GPU-accelerated versions of mathematical routines, improving the performance of scientific computations on HPE Cray Supercomputing EX systems equipped with GPUs.

    • Module:

      module load cray-libsci_acc; nvcc -o my_gpu_program my_gpu_program.cu \
      -L${CRAY_LIBSCI_ACC_PREFIX_DIR}/lib -lsci_acc
      
  • Lmod: A Lua-based module management software tool.

M

  • Makefile: A file containing a set of directives used by the make build automation tool to compile and link programs.

    • Command: make

  • Modules: A system for dynamically modifying user environments through modulefiles. Modules can be loaded and unloaded to manage different software packages and versions.

    • Commands:

      • module load <module_name>

      • module unload <module_name>

      • module avail

  • MPI (Message Passing Interface): A standard for parallel programming that allows processes to communicate with each other by sending and receiving messages.

    • Common Functions: MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv

N

  • NetCDF (cray-netcdf and cray-netcdf-hdf5parallel): Libraries supporting the creation, access, and sharing of array-oriented scientific data in the Network Common Data Form (NetCDF), offering parallel I/O support to improve performance and scalability on large-scale HPE Cray Supercomputing EX systems.

    • Module: module load cray-netcdf; gcc -o my_netcdf_program my_netcdf_program.c -lnetcdf

  • NUMA (Non-Uniform Memory Access): An architecture where memory access time depends on the memory location relative to the processor.

O

  • OpenACC (for Fortran): A directive-based parallel programming model for offloading computations to GPUs.

    • Command: ftn -hacc -o my_program my_program.f90

    • Directives: !$acc parallel, !$acc kernels

  • OpenMP: An API for parallel programming that supports multi-platform shared memory and GPU parallel programming.

    • Common Directives: #pragma omp parallel, #pragma omp for, #pragma omp critical, #pragma omp barrier

P

  • Parallel NetCDF (cray-parallel-netcdf): A high-performance parallel I/O library for NetCDF files, enabling efficient handling and management of large, distributed data sets in scientific applications running on HPE Cray Supercomputing EX systems.

    • Module: module load cray-parallel-netcdf; gcc -o my_pnetcdf_program my_pnetcdf_program.c -lpnetcdf

  • PBS (Portable Batch System): A job scheduler used on some HPE Cray Supercomputing EX systems.

    • Commands: qsub, qstat, qdel

  • Performance-Guided Optimization (PGO): Using profiling data to guide optimizations. Involves:

    • Compiling with profiling enabled: cc -h profile_generate -o my_program my_program.c

    • Running the program to generate profile data.

    • Recompiling with profile data: cc -h profile_use -o my_program my_program.c

R

  • Resource Constraints: Specify memory, CPU, and other resource constraints for job scheduling.

    • Slurm Command: sbatch --mem=4G --cpus-per-task=8 my_job_script.sh

    • PBS Command: qsub -l mem=4G,ncpus=8 my_job_script.sh
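The Slurm limits in the Resource Constraints entry can also be written as directives inside the job script. The sketch below does that and then matches the OpenMP thread count to the CPUs granted to the task. Slurm exports SLURM_CPUS_PER_TASK when --cpus-per-task is given. The function name threads_for_task is hypothetical.

```shell
#!/bin/bash
#SBATCH --mem=4G
#SBATCH --cpus-per-task=8
# Sketch: memory and CPU limits written as directives in the job script.
# Slurm exports SLURM_CPUS_PER_TASK when --cpus-per-task is given; the
# fallback of 1 only covers running the script outside the scheduler.
threads_for_task() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Match the OpenMP thread count to the CPUs granted to the task.
OMP_NUM_THREADS="$(threads_for_task)"
export OMP_NUM_THREADS
echo "Running with ${OMP_NUM_THREADS} thread(s)"
```

Deriving the thread count from the scheduler variable keeps the script consistent if the requested CPU count is changed later.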

S

  • Slurm (Simple Linux Utility for Resource Management): A job scheduler used on many HPE Cray systems.

    • Commands:

      • sbatch: Submit a job script.

      • squeue: Check the status of jobs.

      • scancel: Cancel a job.
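The three Slurm commands are often used together: submit a script, note the job ID, then query or cancel that job. On success, sbatch prints a line of the form "Submitted batch job <id>". The following sketch extracts the ID from that line. The function name job_id_from_sbatch is hypothetical, and the commented commands assume a system where Slurm is installed.

```shell
#!/bin/bash
# Sketch: extract the job ID from the line that sbatch prints on success
# ("Submitted batch job <id>"), so it can be passed to squeue or scancel.
job_id_from_sbatch() {
    # Keep only the last space-separated field of the line.
    printf '%s\n' "${1##* }"
}

# Typical use on a system with Slurm (my_job_script.sh is a placeholder):
#   submit_output="$(sbatch my_job_script.sh)"
#   job_id="$(job_id_from_sbatch "${submit_output}")"
#   squeue -j "${job_id}"      # check the status of that job
#   scancel "${job_id}"        # cancel it if needed
job_id_from_sbatch "Submitted batch job 12345"
```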

T

  • TensorFlow: An open-source platform for machine learning.

    • Module: module load tensorflow

U

  • User Access Node (UAN): A critical component that acts as a “gateway” to the supercomputer. It is a dedicated server or node where you log in to interact with the system, submit jobs, manage files, and perform development tasks. High-performance compute nodes (the powerful “brain” of the supercomputer) are not directly accessed for these activities; instead, you use the UAN to prepare your work.

    UAN Key Features:

    • Development Environment: The UAN provides tools for coding, compiling, debugging, and optimizing your programs. It is where you set up applications before running them on the compute nodes.

    • Job Submission: From the UAN, submit workloads (such as simulation or analysis tasks) to the job scheduler, which then runs tasks on the compute nodes.

    • File Management: The UAN is where you can access and manage files stored in the system.

    • Access Point: Users connect to the UAN through protocols like SSH (Secure Shell) to securely log in and work on the supercomputer.

    The UAN serves as the central point for interaction with the larger computing system.

V

  • Vectorization: Techniques for optimizing code to take advantage of vector instructions.

    • Compiler Flags: -h vector3

    • Directives: #pragma ivdep

W

  • Workload Managers: Software that orchestrates the execution of jobs in a high-performance computing environment. Examples include Slurm and PBS.

Published: April 2026