HPE Cray Supercomputing Programming Environment Getting Started Administrator Guide: HPE Cray Supercomputing EX Systems (26.03-Rev. A) S-9935
Copyright and Version
© Copyright 2022-2026 Hewlett Packard Enterprise Development LP. All third-party marks are the property of their respective owners.
CPE: 26.03-LocalBuild
Doc git hash: 4ea96f38ce9da5ed25d9431aa314b356c4365077
Generated: Wed Apr 15 2026
Record of revision
This chapter provides a record of updates and revisions to this guide.
Release updates
New in the CPE 26.03 (Rev. A) publication
Updated the table in the SLES X86 support matrix section.
Incorporated minor updates.
New in this 26.03 release
Updated the Understanding the key CPE components section.
Added the Resolving the CCE PGAS error and dependency issue resulting in failed image builds section.
Updated the Support matrices for previous releases chapter.
Updated the Supported systems chapter.
Added a link for a listing of CPE-related knowledge articles available on the HPE Support Center website in the Other documentation resources section of the Documentation and support chapter.
Added the HPE Slingshot SHMEM Software Installation Guide link in the Other documentation resources section of the Documentation and support chapter.
Incorporated minor updates.
New in this 25.09 (Rev. A) release
Updated the tables in the Support matrices for previous releases chapter.
Incorporated minor updates.
New in this 25.09 release
Issued the first version of HPE Cray Supercomputing Programming Environment Getting Started Administrator Guide: HPE Cray Supercomputing EX Systems (25.09) S-9935.
Revision history
| Publication Title | Date |
|---|---|
| HPE Cray Supercomputing Programming Environment Getting Started Administrator Guide: HPE Cray Supercomputing EX Systems (26.03) S-9935 | April 2026 |
| HPE Cray Supercomputing Programming Environment Getting Started Administrator Guide: HPE Cray Supercomputing EX Systems (25.09-Rev. A) S-9935 | December 2025 |
| HPE Cray Supercomputing Programming Environment Getting Started Administrator Guide: HPE Cray Supercomputing EX Systems (25.09) S-9935 | September 2025 |
Document conventions
This section defines the documentation conventions used throughout the guide, including typographic styles for code, commands, paths, and the backslash as the shell line-continuation character. It explains command-prompt notation, showing how the host and account are indicated (root prompts end with #, non-root prompts use account@hostname>) and lists node abbreviations (CN, NCN, AN, UAN) with example prompts for specific node types and Kubernetes contexts. This section also provides a simple three-step workflow and a reminder to verify pasted commands.
Typographical and command prompt conventions
This section provides background information about the typographical and command prompt conventions used in this guide and describes how they are delineated.
Typographical conventions
| Type | Convention Description |
|---|---|
| This style | Indicates program code, reserved words, library functions, command-line prompts, screen output, file/path names, variables, and other software constructs. |
| \ (backslash) | When inserted at the end of a command line, indicates the Linux shell line continuation character (lines joined by a backslash are parsed as a single line). |
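As a brief illustration of the backslash convention, the following sketch splits one command across several lines, and the shell joins them before parsing. The printf command and its arguments are arbitrary placeholders chosen for this example and do not come from a CPE procedure.

```shell
# The trailing backslash must be the last character on the line
# (no spaces after it), or the shell will not join the lines.
joined=$(printf '%s %s %s\n' \
    one \
    two \
    three)
echo "$joined"
```

The three continued lines are parsed exactly as if the command had been typed on a single line.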
Command prompt conventions
Host name and account in command prompts: The host name in a command prompt indicates where the command must be run. The account that must run the command is also indicated in the prompt.
The root or super-user account always has the # character at the end of the prompt.
Any non-root account is indicated with account@hostname>. A user account that is not root or crayadm is shown as user.
| Command Prompt | Definition |
|---|---|
| user@login> | Run the command on any login node as any non-root user. |
| hostname# | Run the command on the specified system as root. |
| user@hostname> | Run the command on the specified system as any non-root user. |
Copying and pasting text from this document
Using the Copy and Paste functions from a PDF is unreliable. Although copying and pasting a command line typically works, copying and pasting formatted file content (for example, JSON, YAML) typically fails. To ensure that file content is copied and pasted correctly while performing the procedures in this guide:
1. Copy the content from the PDF.
2. Paste it into a neutral editor (for example, a plain-text editor) and add the necessary formatting.
3. Copy the content from the neutral editor and paste it into the console.
Tip: As a best practice, double-check copied/pasted commands for correctness, as some commands may not render correctly in the PDF.
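One way to double-check pasted content is to scan it for characters outside printable ASCII, because text copied from a PDF can carry typographic dashes or quotation marks in place of the plain characters a shell expects. The following is a minimal sketch under that assumption. The check_pasted helper and the sample file are hypothetical and are not part of CPE.

```shell
# Hypothetical helper (not part of CPE): print the line numbers in a file
# that contain bytes outside printable ASCII (tab is allowed).
# Returns 0 when the file is clean and 1 when suspicious lines exist.
check_pasted() {
    LC_ALL=C awk '/[^ -~\t]/ { print NR; bad = 1 } END { exit (bad ? 1 : 0) }' "$1"
}

# Example: line 2 contains an en dash (octal bytes 342 200 223)
# where an ASCII hyphen was intended.
sample=$(mktemp)
printf 'module list\nmodule avail \342\200\223t\n' > "$sample"
check_pasted "$sample" || echo "review the lines listed above before running"
rm -f "$sample"
```

A check like this only flags candidates for review. It does not replace reading the command before running it.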
About the HPE Cray Supercomputing Programming Environment
Welcome to the HPE Cray Supercomputing Programming Environment (CPE) Software suite, a complete software solution for application development and the application development lifecycle. Offered in an integrated, user-friendly environment, CPE provides a suite of programmer tools and libraries that support the development, optimization, and execution of high performance computing (HPC) applications for HPE Cray Supercomputing EX systems. These systems comprise multiple components, including compute nodes, high-speed interconnects, storage solutions, cooling and power infrastructure, comprehensive system management software, security features, and other integral components and tools. CPE enables scientists, researchers, engineers, and other users to effectively leverage the advanced capabilities of these systems. Together, CPE and its compatible systems meet the computational needs of developed applications and deliver the performance, scalability, and flexibility required for HPC applications.
This administrator guide provides details for installing, configuring, updating, maintaining, monitoring, and troubleshooting CPE. As an administrator, you are also responsible for managing and maintaining system security, licensing, system tuning, benchmarking, planning, and various user support tasks. This guide assists you with these and other related tasks.
For the latest version and revisions of this CPE guide, go to the HPE Support Center website and search for the part number of this document (S-9935). For additional information on how to use CPE or details regarding CPE components and modules, see the CPE Online Documentation website. See also the Documentation and support chapter for additional CPE resources and information.
About the CPE Software suite
CPE comprises a set of tools and toolkits that collectively provides a comprehensive environment for developing, optimizing, and running high performance applications on HPE Cray Supercomputing EX systems. The CPE Software suite includes:
HPE Cray Compiling Environment (CCE)
HPE Cray Debugging Support Tools
HPE Cray Environment (CENV) Setup and Compiling Support Tools
HPE Cray Message Passing Toolkit (MPT)
HPE Cray Performance, Measurement, and Analysis Tools (CPMAT)
HPE Cray Scientific and Math Libraries (CSML)
| Tool/Toolkit Name | What it is | Description |
|---|---|---|
| CCE | A suite of compilers optimized for HPE Cray Supercomputing EX systems, including support for languages such as C, C++, and Fortran. | Compiles your code into executable programs that take full advantage of the architecture and capabilities of HPE Cray Supercomputing EX systems. CCE is designed to generate highly optimized code, ensuring that your applications run efficiently. |
| Debugging Tools | A set of tools for diagnosing and troubleshooting issues in your code. | Helps you identify and fix bugs in your applications. These tools provide features, such as breakpoints, variable inspection, and call stack tracing, which are essential for debugging complex parallel applications. |
| CENV | Tools and utilities designed to help users configure their programming environment and manage the compilation of their applications. | Simplifies the setup and configuration of CPE on HPE Cray Supercomputing EX systems. These tools help ensure that the necessary libraries, compilers, and environment variables are correctly set up, making it easier for users to compile and run their applications efficiently. |
| MPT | A set of libraries and tools that assist in the development of parallel applications using the Message Passing Interface (MPI) standard. | Enables efficient communication between multiple processes running on different nodes of an HPE Cray Supercomputing EX system, which is crucial for HPC applications. This toolkit supports scalable and high performance data exchange, essential for tasks that require coordination and data sharing among numerous processors. |
| CPMAT | A collection of tools designed to help you measure, analyze, and optimize the performance of your applications running on HPE Cray systems. | Ensures that your applications are running efficiently by identifying performance bottlenecks and providing insights into how to improve computational performance. This suite includes tools for profiling, tracing, and in-depth performance analysis. |
| CSML | A collection of high performance mathematical and scientific libraries. | Provides pre-optimized routines for common mathematical and scientific computations, such as linear algebra, fast Fourier transforms, and more. These libraries help you achieve better performance and accuracy in your scientific applications without having to develop complex algorithms from scratch. |
After developing code in programming languages like Fortran, C, or C++, you can use HPE-optimized compilers to convert your code into executable programs. Additionally, you can use CPE tools to test the performance of, streamline, and debug your applications. With CPE, you manage your software environment by:
Using its various modules,
Submitting jobs to the job scheduler and running applications on HPE Cray EX supercomputers, and
Using debugging and performance analysis tools.
CPE components allow you to run applications efficiently and correctly.
Understanding the key CPE components
The CPE Software suite comprises specific components and tools designed to maximize developer productivity, application scalability, and code performance. It includes compilers, analyzers, optimized libraries, and debuggers.

CPE and third-party components
The CPE Software suite also provides a variety of parallel programming models that allow you to make appropriate choices based on the nature of existing and new applications. CPE uses build environment containers, which provide the ability to compile applications, launch jobs, and track job status. Containers enable you to store and retrieve files from both the local and shared system storage.
CPE components (by category) include:
Compilers
HPE Cray Compiling Environment (CCE): High-performance compilers for Fortran, C, and C++ that are optimized for HPE Cray Supercomputing EX system architectures. These compilers include advanced optimization features and support for parallel programming models, such as Open Multi-Processing (OpenMP), Open Accelerators (OpenACC), Heterogeneous-Compute Interface for Portability (HIP), and Partitioned Global Address Space (PGAS) languages (such as Coarray Fortran and Unified Parallel C).
Third-Party Compilers: Support for other industry-standard compilers, such as GNU Compiler Collection (GCC), Intel, NVIDIA, and AMD compilers.
Programming models
| Model Name | Description |
|---|---|
| HPE Cray Message Passing Toolkit (MPT) | Libraries and tools for parallel programming using the Message Passing Interface (MPI) standard, which is widely used for distributed memory parallelism. |
| OpenMP | Support for shared memory parallelism and GPU offloading using the OpenMP standard, which allows developers to parallelize and offload code using directives and APIs. |
| OpenACC | Support for GPU offloading using the OpenACC standard, which allows developers to parallelize and offload code using directives and APIs. |
| CUDA | Support for NVIDIA GPU offloading using the CUDA programming model. |
| HIP | Support for AMD GPU offloading using the HIP programming model. |
| Partitioned Global Address Space (PGAS) | Support for PGAS languages like Coarray Fortran and Unified Parallel C (UPC). |
| OpenSHMEM | As a programming library, simplifies and enhances the way you write parallel programs and allows you to manage data efficiently across multiple processors, ensuring that your high-performance applications run as fast and effectively as possible. |
Scientific and mathematical libraries
| Library Name | Library Description |
|---|---|
| HPE Cray LibSci (cray-libsci) | A library providing highly optimized and scalable mathematical routines, such as BLAS, LAPACK, and ScaLAPACK, aimed at enhancing the performance of linear algebra and other numerical computations on HPE Cray Supercomputing EX systems. |
| HPE Cray FFTW (cray-fftw) | Libraries for performing Fast Fourier Transforms (FFTs), based on FFTW3. |
| HPE Cray LibSci ACC (cray-libsci_acc) | An extension of HPE Cray LibSci that includes GPU-accelerated versions of mathematical routines, designed to leverage GPU hardware for improved performance in scientific computations on HPE Cray Supercomputing EX systems with GPUs. |
| HPE Cray HDF5 (cray-hdf5 and cray-hdf5-parallel) | Libraries for managing and storing large scientific data sets in Hierarchical Data Format (HDF5), with parallel I/O capabilities to enhance performance and scalability on distributed HPE Cray Supercomputing EX systems. |
| HPE Cray NetCDF (cray-netcdf and cray-netcdf-hdf5parallel) | Libraries supporting the creation, access, and sharing of array-oriented scientific data in the Network Common Data Form (NetCDF), with parallel I/O support to improve scalability and performance on large-scale HPE Cray Supercomputing EX systems. |
| HPE Cray Parallel NetCDF (cray-parallel-netcdf) | A high-performance parallel I/O library for NetCDF files, enabling efficient handling and management of large, distributed data sets in scientific applications on HPE Cray systems. |
Environment setup tools
HPE Cray Environment Setup and Compilation Support (CENV) is a CPE software package with tools and libraries specifically designed to support compilation and environment setup. It includes compiler drivers and CPE API (craype-api).
Performance Analysis Tools
HPE Cray Performance Measurement & Analysis Tools (CPMAT)/HPE Cray Performance Analysis Tools (CrayPAT): A suite of tools for profiling and analyzing the performance and behavior of applications and a Performance API (PAPI). This includes pat_build for instrumenting applications, pat_report for generating performance reports, and HPE Cray Apprentice3 for visualizing performance data.
HPE Cray Apprentice3: Provides performance analysis with event tracing and graphical data visualization. HPE Cray Apprentice3 provides enhanced scalability, an improved user interface, and advanced metrics for more detailed and efficient performance analysis.
Debugging Tools
HPE Cray Distributed Debugging Tool (DDT): An advanced debugging tool for parallel applications, supporting MPI, OpenMP, and hybrid applications.
gdb4hpc: GNU Debugger (GDB)-based HPC debugger with support for debugging serial and parallel applications.
Valgrind4hpc: A parallel debugging tool used to detect memory leaks and parallel application errors.
Sanitizers4hpc: A parallel debugging tool used to detect memory access or leak issues at runtime using information from LLVM sanitizers.
Stack Trace Analysis Tool (STAT): A single merged stack backtrace tool used to analyze application behavior at the function level. Helps trace down the cause of crashes.
Abnormal Termination Processing (ATP): A scalable core file generation and analysis tool for analyzing crashes, with a selection algorithm to determine which core files to dump. ATP helps to determine the cause of crashes.
Cray Comparative Debugger (CCDB): Not a traditional debugger, but rather a tool to run and step through two versions of the same application side-by-side to help determine where they diverge.
All CPE debugger tools support C/C++, Fortran, and Unified Parallel C (UPC).
Development Environment
Environment Modules: A system for managing and configuring the user environment, allowing you to easily load and switch between different software packages and versions.
Build and Configuration Tools: Tools for building and configuring applications, including support for makefiles and CMake.
Application Porting and Optimization
HPE Parallel Application Launch Service (PALS): An automation tool for starting, managing, and optimizing the placement of parallel applications on HPE Cray Supercomputing EX systems, ensuring efficient resource utilization.
CrayPAT-lite: A lightweight version of CrayPAT for quick performance assessments and application tuning.
Understanding CPE modules
CPE modules are used in conjunction with RHEL and SLES to streamline and manage the software development environment on HPE Cray Supercomputing EX systems. As part of the CPE environment, you can load, unload, and switch one or more modules to efficiently manage the software stack required for your specific applications and development tasks. Modules can comprise CPE base, library-related, or tools-related modules. Loading a module automatically sets environment variables, paths, and other settings, allowing you to focus on development rather than environment configuration. Modules allow you to easily switch between different versions of compilers, libraries, and tools, enabling you to test and validate your applications against multiple configurations. Compiler and library compatibility and dependencies are assured through the use of modules:
Library Compatibility: Many high-performance computing (HPC) applications depend on specific versions of libraries and tools. Understanding which modules to load ensures that all dependencies are compatible, reducing runtime errors and conflicts.
Compiler Consistency: Different modules may provide different versions of compilers. Ensuring you use consistent compilers across your development and production environments can prevent compatibility issues.
Debugging tools are also available for diagnosing and optimizing your applications and providing critical insights into your application’s performance and behavior. Performance analysis tools help identify bottlenecks and optimize code, which is crucial in high-performance computing environments where efficiency is paramount.
Modules are essential for several reasons. They:
Simplify environment management
HPE Cray Supercomputing EX systems often have complex software stacks with multiple compilers, libraries, and tools. Modules simplify the process of configuring the environment by allowing you to easily load and unload different software components without manual configuration of environment variables.
Allow for consistency
Modules ensure that all users on the system have a consistent environment. This consistency is crucial for reproducibility of results, especially in a research or scientific computing context.
Offer flexibility
Different applications and development tasks might require different versions of compilers, libraries, or tools. Modules provide a flexible way to switch between these versions without conflicts.
Provide optimization
CPE modules are optimized for the underlying hardware. Different programming environments and compilers are optimized for specific architectures and workloads. Loading appropriate modules helps to ensure that your code and applications are making the best use of system architecture and running efficiently. Properly loading and unloading modules helps manage system resources, ensuring that you are balancing the load and not overloading the system with unnecessary tools and libraries.
Are easy to use
Modules abstract away the complexity of setting up and managing the environment. You can focus on development rather than expending excess time on configuration issues.
Understanding CPE modules and module commands is crucial for maximizing performance, ensuring compatibility, simplifying development and debugging, maintaining reproducibility, staying current with technological advancements, and fostering effective collaboration in high-performance computing environments.
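The load, unload, and switch operations described in this section can be sketched as a short command sequence. The run_module wrapper and the DRY_RUN variable below are hypothetical conveniences for this example, not part of CPE. With DRY_RUN left at its default, the script only prints each command, so it can be read or tried on a machine that has no module system. The module names come from the tables in this guide, and the swap subcommand is assumed to be available in the installed module system.

```shell
# Hypothetical wrapper (not part of CPE). With DRY_RUN=1 (the default here)
# it only prints each module command; set DRY_RUN=0 on a system that
# provides the module command to execute them.
DRY_RUN=${DRY_RUN:-1}

run_module() {
    if [ "$DRY_RUN" = "0" ]; then
        module "$@"
    else
        echo "module $*"
    fi
}

run_module list                          # show what is loaded now
run_module swap PrgEnv-cray PrgEnv-gnu   # switch compiler environments
run_module load cray-fftw                # add a library module
run_module unload cray-fftw              # remove it again
```

Keeping the sequence in a script, rather than typing it interactively, is one way to give every user or job the same environment.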
The following subsections provide information on commonly-used CPE modules, libraries, and tools.
Commonly-used CPE modules, module command names, and module compiler commands
Commonly-used CPE modules and module commands include:
| Module name | Module command name | CPE driver commands |
|---|---|---|
| AMD compilers | PrgEnv-amd, rocm | ftn, cc, CC |
| AOCC | PrgEnv-aocc | ftn, cc, CC |
| CCE* | PrgEnv-cray | ftn, cc, CC |
| GCC** | PrgEnv-gnu | ftn, cc, CC |
| Intel compilers | PrgEnv-intel | ftn, cc, CC |
| NVIDIA | PrgEnv-nvhpc | ftn, cc, CC |
CPE driver commands are used in conjunction with module commands to construct build configurations.
Commonly-used CPE library commands include:
| Library name | Module command name | Compiler commands |
|---|---|---|
| DSMML | cray-dsmml | |
| Fast Fourier Transforms | cray-fftw | |
| HDF5 | cray-hdf5 | |
| HPE Cray LibSci*** | cray-libsci, cray-fftw3, cray-libsci_acc | |
| HPE Cray MPICH | cray-mpich | mpicc |
| Parallel NetCDF | cray-parallel-netcdf | gcc |
Commonly-used CPE tools and their commands include:
| Tool name | Module command name |
|---|---|
| Apprentice 3 | app3 |
| Debuggers | gdb4hpc, valgrind4hpc, sanitizers4hpc |
| Distributed Debugging Tool | ddt |
| HPE CrayPAT | perftools |
| HPE CrayPAT Base | perftools-base |
| Huge pages | craype-hugepages**** |
| Perforce TotalView | totalview |
Commonly-used CPE performance analysis commands include:
| Tool name | Module command name |
|---|---|
| ATP | atp |
| Clang/Low Level Virtual Machine (LLVM) | clang, llvm |
| CrayPAT | craypat |
| TensorFlow | tensorflow |
Commonly-used CPE specialized environment commands include:
| Specialized environment name | Environment command name |
|---|---|
| OpenMPI | openmpi |
| OpenSHMEMX | cray-openshmemx |
| ROCM | cray-rocm |
* - Compiler-specific manpages include crayftn(1), craycc(1), and crayCC(1). Available only when the compiler module is loaded.
** - Compiler-specific manpages include gcc(1), gfortran(1), and g++(1). Available only when the compiler module is loaded.
*** - Compiler-specific manpages include intro_libsci(3s) and intro_fftw3(3). Available only when the compiler module is loaded. When the module for a CSML package (such as cray-libsci or cray-fftw) is loaded, all relevant headers and libraries for these packages are added to the compile and link lines of the cc, ftn, and CC CPE drivers. You must load the cray-hdf5 module (a dependency) before loading the cray-netcdf module.
**** - In addition to the default module systems, CPE offers Lmod as an alternate module management system. Lmod, a Lua-based module system, can load and unload modulefiles, handle path variables, and manage library and header files. (If you are using another Linux distribution, use the huge pages implementation appropriate for that distribution.) To use huge pages, load the appropriate craype-hugepages module at link time. Possible values include:
craype-hugepages128K
craype-hugepages512K
craype-hugepages2M
craype-hugepages4M
craype-hugepages8M
craype-hugepages16M
craype-hugepages32M
craype-hugepages64M
craype-hugepages128M
craype-hugepages256M
craype-hugepages512M
craype-hugepages1G
craype-hugepages2G
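The two ordering rules in the footnotes above (cray-hdf5 before cray-netcdf, and a craype-hugepages module loaded at link time) can be sketched as a small dry-run script. The hugepages_module and print_link_steps helpers and the app.c file name are hypothetical. The script prints the commands rather than running them, and the set of accepted sizes is copied from the list above.

```shell
# Hypothetical helper (not part of CPE): map a page size to its module name,
# accepting only the sizes listed in this guide.
hugepages_module() {
    case "$1" in
        128K|512K|2M|4M|8M|16M|32M|64M|128M|256M|512M|1G|2G)
            echo "craype-hugepages$1" ;;
        *)
            echo "unsupported huge page size: $1" >&2
            return 1 ;;
    esac
}

# Print (rather than run) a link-time sequence. app.c is a placeholder
# source file name.
print_link_steps() {
    hp=$(hugepages_module "$1") || return 1
    echo "module load cray-hdf5"     # dependency of cray-netcdf, so it goes first
    echo "module load cray-netcdf"
    echo "module load $hp"           # huge pages module loaded before linking
    echo "cc -o app app.c"           # cc driver picks up headers and libraries from loaded modules
}

print_link_steps 2M
```

Rejecting sizes that are not in the list catches a mistyped module name before the link step rather than after it.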
Viewing loaded modules
For example, to view the loaded modules and their versions:
user@hostname> module list
Currently Loaded Modules:
1) craype-x86-rome 5) xpmem/2.6.2-2.5_2.27__gd067c3f.shasta 9) cray-mpich/8.1.28
2) libfabric/1.15.2.0 6) cce/17.0.0 10) cray-libsci/23.12.5
3) craype-network-ofi 7) craype/2.7.30 11) PrgEnv-cray/8.5.0
4) perftools-base/23.12.0 8) cray-dsmml/0.2.2
Module versions are for example purposes only and may vary from those on the system.
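When a script needs the version of one loaded module, the listing can be filtered instead of read by eye. The following is a minimal sketch. The loaded_version helper is hypothetical, and the sample text is a shortened copy of the example output above rather than output from a real system.

```shell
# Hypothetical helper (not part of CPE): print the version of a named module
# from `module list` text supplied on standard input.
loaded_version() {
    awk -v name="$1" '{
        for (i = 1; i <= NF; i++) {
            n = split($i, part, "/")
            if (n == 2 && part[1] == name) print part[2]
        }
    }'
}

# Sample text shortened from the example output above.
sample='Currently Loaded Modules:
1) craype-x86-rome 5) xpmem/2.6.2-2.5_2.27__gd067c3f.shasta 9) cray-mpich/8.1.28
2) libfabric/1.15.2.0 6) cce/17.0.0 10) cray-libsci/23.12.5'

printf '%s\n' "$sample" | loaded_version cce

# On a live system (assumption: the module command writes its listing to
# standard error, as is common), the equivalent would be:
#   module list 2>&1 | loaded_version cce
```

The helper matches the module name exactly, so a query for cray-libsci does not also return cray-libsci_acc.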
Viewing available modules
For example, to view the available modules and their versions:
user@hostname> module avail PrgEnv
------------------------------------ /opt/cray/pe/modulefiles ------------------------------------
PrgEnv-amd/8.3.3 PrgEnv-cray-amd/8.4.0 (D) PrgEnv-gnu/8.3.3 PrgEnv-nvhpc/8.4.0 (D)
PrgEnv-amd/8.4.0 (D) PrgEnv-cray/8.3.3 PrgEnv-gnu/8.4.0 (D) PrgEnv-nvidia/8.3.3
PrgEnv-aocc/8.3.3 PrgEnv-cray/8.4.0 (L,D) PrgEnv-intel/8.3.3 PrgEnv-nvidia/8.4.0 (D)
PrgEnv-aocc/8.4.0 (D) PrgEnv-gnu-amd/8.3.3 PrgEnv-intel/8.4.0 (D) PrgEnv-gnu-amd/8.4.0 (D)
Module versions are for example purposes only and may vary from those on the system.
Administrator responsibilities
The CPE administrator is responsible for continuously managing the emerging needs of the CPE software suite. This role encompasses a wide range of tasks throughout the lifecycle of the HPE Cray Supercomputing EX system, from initial installation to end-of-life. These tasks involve system setup, configuration, maintenance, performance optimization, user support, and decommissioning. As such, a key CPE administrator requirement is to thoroughly understand CPE-related product areas. An understanding of these areas ensures smooth operations, optimization, and support for the CPE users and their working supercomputing environment. Retaining CPE-related system knowledge equips the administrator to administer and optimize the HPE Cray Supercomputing EX environment effectively. Moreover, this knowledge base helps administrators to enable researchers and engineers to maximize their productivity and scientific output.
| Administrator Focus Areas | Description | Details |
|---|---|---|
| CPE Architecture | Understand how CPE integrates with hardware including interconnect (HPE Slingshot) and storage subsystems. | |
| | Remain well-informed of the components of the CPE software suite, including: | Compilers (HPE-specific Fortran, C, C++ compilers); Performance analysis tools (CrayPAT, Performance Tools); Debuggers (DDT, Perforce TotalView); Libraries (LibSci, MPI, Lustre); Environment management tools (module or Spack for environment variables/software versions) |
| Software Installation and Updates | Install, configure, and update the CPE software suite to match system hardware and user requirements. | |
| | Familiarize with HPE Cray EX package management and repositories. | |
| | Stay informed on patches, bug fixes, and HPE updates. | |
| | Set up and manage licensing for proprietary software. | |
| | Ensure compliance with licensing agreements. | |
| | Manage dependencies: | Resolve compatibility issues; Test and validate updates to avoid workflow disruption. |
| System Configuration and Customization | Configure compilers, libraries, and tools to optimize performance. | |
| | Customize environment modules for easy compiler/library access. | |
| | Manage compiler flags, optimizations, and linking for architectures (x86, ARM, GPUs). | |
| | Validation/testing: | Perform system validation and benchmarking; Run test jobs and verify performance. |
| Performance Tuning and Optimization | Use Cray performance tools (HPE CrayPAT) for analysis and optimization. | |
| | Identify bottlenecks in MPI, OpenMP, hybrid parallel apps. | |
| | Assist in optimizing apps for HPE Cray Supercomputing EX architecture (NUMA, memory hierarchy, interconnect). | |
| Parallel Programming Models and Best Practices | Familiarity with models: MPI, PGAS (Coarray Fortran), and GPU models (OpenMP, OpenACC, CUDA, HIP) | |
| | Learn best practices for writing/compiling parallel code on HPE Cray EX systems. | |
| | Reference: Implementing and supporting parallel application best practices. | |
| Monitoring and Logging | Monitor health, usage, performance using HPE/third-party tools. | |
| | Analyze logs/diagnostic outputs to resolve issues. | |
| | Track usage statistics for planning/upgrades. | |
| Security and Compliance | Manage accounts, permissions, authentication (LDAP, Kerberos). | |
| | Apply patches/updates to address vulnerabilities. | |
| | Implement data protection/compliance measures. | |
| Debugging and Troubleshooting | Proficiency with debugging tools (DDT, Perforce TotalView). | |
| | Troubleshoot job failures, compiler errors, runtime issues. | |
| | Resolve hardware-software integration issues and bugs. | |
| Job Scheduling and Resource Management | Understand scheduler integration (Slurm, PBS Pro). | |
| | Configure job submission scripts. | |
| | Manage user priorities/resource allocation. | |
| | Monitor load and optimize scheduling policies. | |
| Documentation and Reporting | Maintain documentation on configurations, software versions, customizations. | |
| | Create user guides/cheat sheets. | |
| | Generate reports on usage, performance, and maintenance activities. | |
| User Support and Training | Guide users on effective use of CPE tools. | |
| | Assist in debugging/performance analysis. | |
| | Organize/deliver training sessions and documentation. | |
| Vendor and Community Collaboration | Collaborate with HPE support. | |
| | Participate in HPE/community training, webinars, conferences. | |
| | Stay updated on HPC trends, practices, advancements. | |
| End-of-Life (EOL) Management | Assist migrating workflows/apps to new platforms. | |
| | Ensure compliance/disposal of hardware/software licenses. | |
| | Decommissioning: | Plan execution when system reaches EOL; Archive user data/system configurations. |
Training and resources
HPE Cray official documentation and user guides. See Documentation and support for more information.
Online HPE Cray Supercomputing EX system training courses and certifications.
HPC community forums and mailing lists. For example, the Cray User Group (CUG).
Vendor support and knowledge base (HPE customer portal). See Documentation and support for more information.
Implementing and supporting parallel application best practices
Deploying parallel coding best practices ensures that code is optimized for the unique architecture and capabilities of the HPE Cray Supercomputing EX system, enabling high performance and scalability for demanding computational workloads. Understanding these practices and sharing them among team members is integral to CPE administrator responsibilities. Best practices include:
Understanding the HPE Cray Supercomputing EX system architecture
Using the CPE tools
Writing efficient parallel code
Employing compiler optimization
Leveraging HPE Cray performance tools
Debugging parallel code
Scaling and testing code
Implementing hybrid parallelism
Planning efficient I/O functions
Documenting and controlling versions
Staying updated on HPE Cray-specific features
Understanding the HPE Cray Supercomputing EX system architecture
Know the hardware: Understand the architecture of the HPE Cray EX system, including:
Processor details (for example, AMD, Intel Xeon, or ARM-based processors).
GPU accelerators (if present, for example NVIDIA or AMD GPUs).
The high-speed HPE Slingshot interconnect.
NUMA (Non-Uniform Memory Access) characteristics.
Optimize for the interconnect: Take advantage of the low-latency, high-bandwidth Slingshot interconnect by optimizing communication patterns in your parallel code.
Using CPE tools
Compilers: Use the provided compilers optimized for HPE Cray systems:
HPE Cray Compilers: HPE Cray Fortran, C, C++.
Third-party compilers: GCC, Intel, AMD ROCm, NVIDIA HPC SDK (for GPU programming).
Libraries: Use pre-optimized libraries for scientific computing:
HPE Cray LibSci: Provides optimized BLAS, LAPACK, ScaLAPACK, FFT, and sparse solvers.
HPE Cray MPI: Optimized MPI implementation for inter-process communication.
Environment modules: Use the module command to load specific compiler versions, libraries, and tools:
module load PrgEnv-cray
module load cray-mpich
module load cray-libsci
Writing efficient parallel code
Programming Models
Choose the appropriate parallel programming model for the deployed workload:
MPI: For distributed-memory parallelism across nodes.
OpenMP: For shared-memory parallelism on a single node.
Hybrid MPI + OpenMP: To leverage both inter-node and intra-node parallelism.
CUDA / OpenACC / HIP: For GPU programming (if GPUs are present).
UPC: For PGAS programming if the workload benefits from one-sided communication.
Optimize Communication
Minimize communication overhead by reducing the frequency and size of MPI messages or other communication operations.
Use collectives (for example, MPI_Reduce, MPI_Bcast) instead of point-to-point communication wherever possible.
Overlap computation and communication using asynchronous communication (for example, MPI_Isend and MPI_Irecv).
Load Balancing
Ensure workloads are evenly distributed across processes and threads to minimize idle time.
Use domain decomposition or other problem-specific techniques to balance workloads.
Memory Usage
Optimize memory access patterns to minimize cache misses and NUMA penalties.
Use proper memory alignment and avoid false sharing in shared-memory programming.
Leverage the HPE Cray MEMKIND library for managing memory on nodes with High-Bandwidth Memory (HBM).
Employing compiler optimization
Best practices for compiler optimization involve, for example, employing compiler optimization flags, enabling auto-vectorization and manual vectorization where possible, and analyzing profiles and compiler-generated reports to identify missed optimizations and take corrective action. As an administrator, you should educate users on compiler flags and provide performance feedback using profiling tools. Options for compiler optimization include:
| Option | Description |
|---|---|
| Compiler Flags | - Always enable optimization flags to take advantage of compiler optimizations for Cray EX systems. |
| | - HPE Cray compilers: Use -O2 or -O3 for optimization, and -hfp3 for aggressive floating-point optimizations. For C/C++, use -O or -Ofast for optimization. Fortran defaults to -O2; use -hfp to control the floating-point optimization level. |
| | - Debugging: Use -g to enable debugging symbols. |
| | - Vectorization: Use -hvector to control CPU vectorization. |
| | - Example: ftn -O3 -hfp3 -hvector my_program.f90 -o my_program |
| GPU-Specific Flags | - For GPU-accelerated codes, use compiler directives and flags to offload loops or computations to GPUs. |
| | - Cray compilers: Use -fopenmp for OpenMP (C/C++/Fortran), -hacc for OpenACC (Fortran), -hcuda for CUDA, and CC -x hip for HIP. |
| | - NVIDIA compilers: Use -gpu flags with NVIDIA HPC SDK. |
| Profile-Driven Optimization | - Use CrayPAT to collect performance data and feed it back into the compiler for profile-guided optimization (PGO). |
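One way to help users apply these flags consistently is a small wrapper that prints the compile line for a named build type. The sketch below is illustrative only: the function name ftn_command and the build-type names debug and optimized are hypothetical, and the flags are those from the table above. It prints the command rather than running it, so it can be reviewed first.

```shell
# Hypothetical helper: print (do not run) an ftn command line for a build type.
ftn_command() {
    local build_type="$1" source_file="$2" output="$3" flags
    case "$build_type" in
        debug)     flags="-g" ;;                  # debugging symbols
        optimized) flags="-O3 -hfp3 -hvector" ;;  # flags from the table example
        *) echo "unknown build type: $build_type" >&2; return 1 ;;
    esac
    printf 'ftn %s %s -o %s\n' "$flags" "$source_file" "$output"
}
```

For example, `ftn_command optimized my_program.f90 my_program` prints the same command line as the example in the table.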
Leveraging HPE Cray performance tools
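CrayPAT: Profile and analyze your application to identify bottlenecks in computation, memory access, and communication. Example usage (substitute the launcher for your workload manager, for example srun, and pass pat_report the experiment data directory that the instrumented run creates):
module load perftools
pat_build -g mpi my_program
aprun -n 64 ./my_program+pat
pat_report <experiment_data_directory>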
Debugging parallel code
Use HPE Cray-supported debuggers (for example, Linaro DDT or Perforce TotalView) to debug MPI, OpenMP, or hybrid applications.
Debug runtime errors such as deadlocks, data races, and out-of-bounds memory accesses.
Use Cray’s statistical debugging tools to debug large-scale runs efficiently.
Scaling and testing code
Strong and Weak Scaling
Test your code for both strong scaling (fixed problem size, increasing cores) and weak scaling (problem size grows with core count).
Identify scaling limits and investigate bottlenecks (for example, communication or I/O).
Use Smaller Test Cases
Develop smaller test cases to validate correctness before scaling up to the full system.
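Scaling results are usually summarized as parallel efficiency. The sketch below uses the common definitions: strong-scaling efficiency is (T0 x C0) / (T x C) for a baseline run time T0 on C0 cores and run time T on C cores, and weak-scaling efficiency is T0 / T. The function names and the timings in the usage note are hypothetical.

```shell
# Strong scaling: fixed problem size. Arguments: base_cores base_time cores time
strong_scaling_efficiency() {
    awk -v c0="$1" -v t0="$2" -v c="$3" -v t="$4" \
        'BEGIN { printf "%.2f\n", (t0 * c0) / (t * c) }'
}

# Weak scaling: problem size grows with core count. Arguments: base_time time
weak_scaling_efficiency() {
    awk -v t0="$1" -v t="$2" 'BEGIN { printf "%.2f\n", t0 / t }'
}
```

For example, a job that takes 100 seconds on 64 cores and 60 seconds on 128 cores gives `strong_scaling_efficiency 64 100 128 60`, which prints 0.83. Efficiency that falls well below 1.00 as the core count grows points to a scaling limit worth investigating.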
Implementing hybrid parallelism
Take advantage of hybrid programming models (for example, MPI + OpenMP) to maximize the use of node-level shared memory and inter-node communication.
Use one MPI process per NUMA domain and multiple OpenMP threads per process to optimize performance.
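The one-rank-per-NUMA-domain guideline can be turned into concrete launch settings with simple arithmetic. This is a sketch under the assumption that cores divide evenly across NUMA domains; the function name hybrid_layout is hypothetical, and the real core and NUMA counts for a node can be read from tools such as lscpu.

```shell
# Hypothetical helper: derive MPI ranks per node and OpenMP threads per rank
# from the number of cores and NUMA domains on a node.
hybrid_layout() {
    local cores_per_node="$1" numa_domains="$2"
    if [ "$numa_domains" -le 0 ] ||
        [ $((cores_per_node % numa_domains)) -ne 0 ]; then
        echo "cores per node must divide evenly across NUMA domains" >&2
        return 1
    fi
    # One rank per NUMA domain; each rank gets that domain's cores as threads.
    printf 'ranks_per_node=%d omp_num_threads=%d\n' \
        "$numa_domains" $((cores_per_node / numa_domains))
}
```

For a node with 128 cores in 8 NUMA domains, `hybrid_layout 128 8` suggests 8 ranks per node with OMP_NUM_THREADS set to 16. Treat the result as a starting point and confirm it with measurements.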
Planning efficient I/O functions
Use parallel I/O libraries, such as HDF5, NetCDF, or MPI-IO to handle input/output efficiently at scale.
Avoid frequent small I/O operations; batch I/O to reduce overhead.
Optimize I/O patterns for the Lustre file system commonly used in Cray systems.
Documenting and controlling versions
Document compiler flags, runtime parameters, and environment settings for reproducibility.
Use version control systems (for example, Git) to track changes in your codebase.
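A build record can be captured automatically alongside each build. The following sketch writes the date, host, compile command, and loaded modules to a file. The function name record_build_environment and the key names are hypothetical; it assumes the module system exports the LOADEDMODULES environment variable and records "unknown" if it does not.

```shell
# Hypothetical helper: write a small reproducibility record.
# Usage: record_build_environment <outfile> <compile command...>
record_build_environment() {
    local outfile="$1"
    shift
    {
        printf 'date=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        printf 'host=%s\n' "$(uname -n)"
        printf 'compile_command=%s\n' "$*"
        # LOADEDMODULES is assumed to be set by the module system.
        printf 'loaded_modules=%s\n' "${LOADEDMODULES:-unknown}"
    } > "$outfile"
}
```

For example, `record_build_environment build-info.txt ftn -O3 my_program.f90` writes a record that can be committed to version control with the source.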
Staying updated on HPE Cray-specific features
Regularly check for updates to the Cray Programming Environment and learn about new optimizations and tools.
Attend HPE-hosted webinars or training sessions to stay current with best practices.
CPE software download and installation
As administrator, it is important to be aware of CPE updates and related systems, to understand how these updates impact your environment, and to plan their implementation. CPE software can be obtained from the My HPE Software Center.
Before you download and install CPE software or updates, plan carefully and ensure that all prerequisites are met to avoid disruptions and ensure a smooth update process. Also ensure that you install the appropriate CPE version only on supported systems. Supported systems for this CPE release are detailed in Supported systems.
Prerequisites
You must have an HPE Passport account to access software from the My HPE Software Center.
You must retain the appropriate administrator privileges to upload and install CPE software into your site system.
You should review information under the Prerequisites and Release Information tabs on the My HPE Software Center and ensure that you understand installation requirements and contents. Cited supporting software must be compatible with your HPC system environment.
Review release announcements/notes before installing new updates.
Ensure you have access to software repositories or the appropriate distribution channels for downloading updates. Also, ensure that the system’s network configuration allows access to the required repositories or download locations.
Verify that your credentials or entitlement keys are valid and properly configured.
Ensure the system meets the minimum hardware and software requirements for the new version of the CPE suite.
Verify that all dependencies (for example, specific versions of operating systems, compilers, or libraries) are in place before proceeding with the update. See Supported systems for CPE dependencies relative to systems with CSM or HPCM, or where CPE is installed on HPE Cray XD2000 systems.
Confirm that firewalls or security settings do not block access to update servers.
Verify that there is adequate disk space for the downloaded files and the installation process.
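The disk space prerequisite can be checked from a script before a download starts. This is a minimal sketch: the function name has_free_kb is hypothetical, and the required size must come from the release information for the package being installed.

```shell
# Hypothetical helper: succeed if the file system holding <path> has at
# least <required_kb> kilobytes available.
has_free_kb() {
    local path="$1" required_kb="$2" available_kb
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # the fourth field of the second line is the available space.
    available_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    [ -n "$available_kb" ] && [ "$available_kb" -ge "$required_kb" ]
}
```

For example, `has_free_kb /opt 20000000 || echo "not enough space under /opt"` warns when less than roughly 20 GB is available. Both the path and the size here are placeholders.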
Key considerations when downloading updates
Before executing the installation of any new update, consider:
Compatibility with System Configuration
Ensure the new version of the CPE software is compatible with the specific hardware, operating system version, and workload manager in use on your HPE Cray Supercomputing EX system. See Supported systems for CPE dependencies relative to systems with CSM or HPCM, or where CPE is installed on HPE Cray XD2000 systems.
Check the release notes or documentation for any hardware or software dependencies.
Carefully Review Release Announcements and Notes
Carefully examine the release notes for the new version to understand new features, bug fixes, and known issues.
Look for deprecated features or tools that might affect existing workflows.
Backup and System State
Create a backup of the current programming environment, including module files, configurations, and user applications. Document the current environment setup to facilitate rollback, if needed.
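A timestamped archive is one simple way to capture the current state before an update. The sketch below archives a single directory; the function name backup_directory is hypothetical, and the directories to back up depend on where your site keeps module files and configurations.

```shell
# Hypothetical helper: archive <source_dir> into <backup_dir> with a
# timestamped name and print the path of the archive.
backup_directory() {
    local source_dir="$1" backup_dir="$2" stamp archive
    [ -d "$source_dir" ] || { echo "no such directory: $source_dir" >&2; return 1; }
    mkdir -p "$backup_dir" || return 1
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$backup_dir/$(basename "$source_dir")-$stamp.tar.gz"
    # -C stores paths relative to the parent of the source directory.
    tar -czf "$archive" -C "$(dirname "$source_dir")" "$(basename "$source_dir")" || return 1
    printf '%s\n' "$archive"
}
```

Recording the printed archive path in the change ticket makes a later rollback easier to trace.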
Change Management
Communicate the planned update with users and stakeholders, as the update may introduce changes to compilers, libraries, or tools that could affect user workflows.
Schedule updates during a maintenance window to minimize the impact on users.
Test Updates
Test the updated software in a controlled environment or on a test system before deploying it to production.
Verify that critical applications and workflows operate as expected with the new version.
Network and Download Requirements
Ensure that the system has stable internet connectivity for downloading updates from HPE repositories.
Verify that you have sufficient storage space for the downloaded software and any temporary files created during the installation.
Modules and User Environment
Check for changes in module files or naming conventions, as they may affect user scripts or workflows.
Update documentation or user guides if there are changes in the way modules are loaded or used.
Licensing Compliance
Confirm that you have valid licenses for the updated software components.
Ensure that any license servers or keys required for the CPE suite are properly configured and up to date.
Documentation and Tools
Have the relevant installation and upgrade documentation readily available for reference.
Ensure that required tools (for example, package managers, installation scripts) are installed and functional.
System Downtime Planning
Prepare for system downtime during the update process.
Allow for downtime, especially if the update requires restarting services or rebooting the system.
Addressing these considerations ensures a smooth and reliable update process for the CPE suite software.
Downloading CPE from the My HPE Software Center website
Go to the HPE Support Center to access CPE software updates.
Enter the name of the software needed (for example, Cray Programming Environment).
Click Drivers and Software (either the tab near the top of right pane or from the left pane).
Locate the software needed in the listed results.
Click Obtain software. You are directed to the My HPE Software Center.
Downloading unofficial updates
HPE intermittently releases unofficial and unsupported pre-release updates. These unsupported releases occasionally address minor system bugs and can be downloaded from the CPE Online Documentation website under the How to Access our Token-Authenticated Package Repository page. Follow the instructions provided at the site to download this intermittent software.
CAUTION: Downloads from the CPE Online Documentation website are unofficial and unsupported by HPE. Use caution if downloading this pre-release software or software components.
Contact HPE support for additional details regarding software downloads from this site, as necessary. See Documentation and Support for information on contacting HPE support.
Determining system administrator status
Knowing who has administrator status and what privileges they hold in the HPE Cray Supercomputing Programming Environment is critical for safe, secure, and reliable operation. Use the procedures in this section to determine administrator status.
Prerequisite
You must have:
Root or an existing administrator account to make changes to user privileges,
Familiarity with the specific HPE Cray Supercomputing EX system, its architecture, and how administrative roles are managed,
An understanding of CPE and its components,
An understanding of differing administrator team roles, and
The ability to log into a system management workstation or HPE service node. All administrative tasks require the system management workstation or HPE service node.
Procedure
Ensure you have administrator access. Determining whether a new CPE user has administrator privileges for managing CPE involves checking their access to specific nodes, groups, commands, and files. This process varies slightly depending on the system architecture, terminals, and operating systems (for example, SLES or RHEL).
a. Acquire SSH access to the appropriate node using a terminal application (for example, ssh on Linux/macOS or PuTTY on Windows).
b. Log into the management node. For example:
ssh username@<hostname>
Replace <hostname> with the name of the management or login node (for example, smw01 for the SMW, or login01 for a login node).
c. Verify the node type:
hostname
Check group memberships, and interpret results:
groups
If your username is part of the wheel, sudo, or crayadmin group, you likely have administrator privileges. If none of these groups are listed, you likely do not have administrative access.
Test sudo access by running a test command:
sudo ls /root
If prompted, enter your user password. If the command succeeds, you have administrator privileges. If Permission denied or User is not in the sudoers file appears, you do not have administrative access.
Test access to specific directories. For example:
ls /opt/cray
ls /etc/opt/cray
If you can view the contents of these directories, you likely have administrative privileges.
Check permissions. If you encounter Permission denied errors, you likely do not have the necessary privileges.
Check module access:
module list
Administrative users should be able to load CPE modules.
Load administrative modules:
module load cpe
If the module loads successfully, you likely have access to administrative tools. If you encounter errors, you may lack administrative privileges or the proper configuration.
Verify access to the CPE tools:
xtstat
Administrative users generally have access to CPE-specific tools and commands. If the command works without errors, you likely have administrative privileges.
Examine the tool configuration, and check whether you can access Cray-specific configuration files under /etc/opt/cray:
ls /etc/opt/cray
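The group, sudo, and directory checks in this procedure can be gathered into one report. The following is a sketch, not an HPE-provided tool: the function names are hypothetical, the group names are the ones mentioned above, and sites may use different groups. The sudo test uses the non-interactive -n option, so a "no" result only means that sudo would prompt for a password.

```shell
# Succeed if the space-separated group list names a typical admin group.
in_admin_group() {
    local group_list="$1" g
    for g in $group_list; do
        case "$g" in
            wheel|sudo|crayadmin) return 0 ;;
        esac
    done
    return 1
}

# Print a short report for the current user.
admin_status_report() {
    local d
    if in_admin_group "$(id -Gn)"; then
        echo "admin group: yes"
    else
        echo "admin group: no"
    fi
    # sudo -n never prompts; it fails if a password would be required.
    if command -v sudo >/dev/null 2>&1 && sudo -n true 2>/dev/null; then
        echo "non-interactive sudo: yes"
    else
        echo "non-interactive sudo: no"
    fi
    for d in /opt/cray /etc/opt/cray; do
        if [ -r "$d" ]; then echo "readable: $d"; else echo "not readable: $d"; fi
    done
}
```

Running `admin_status_report` after logging in gives a quick first indication. As the steps above note, none of these checks is conclusive on its own.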
Setting up the initial administrator account
During initial installation of a new HPE Cray Supercomputing EX system with CPE, HPE sets up initial administrator privileges based on customer input. Assigning administrator privileges for CPE on an HPE Cray Supercomputing EX system is a key step during the installation process. CPE provides development tools and software for high-performance computing (HPC) workloads, and administrators need appropriate privileges to manage and configure this environment.
Note: The exact groups, permissions, and configuration steps may vary based on the specific version of CPE and the organization’s policies.
Prerequisite
Administrator access is required to initiate the initial procedures in this chapter.
Procedure
To set up an initial administrator account, HPE installers:
Access the management node or system management interface to be able to manage the HPE Cray Supercomputing EX system. The installer uses secure credentials to access the management environment for configuration purposes.
Identify or create a user account. This involves identifying the user account that will serve as the CPE Administrator. If an appropriate account does not already exist, the installer creates a new user account specifically for this role. This is typically done using standard Linux account management tools (useradd, passwd, and so forth) or through management scripts provided by HPE.
Assign privileges to the user. Privilege assignment ensures that the user account has access to the necessary tools, software modules, and configuration files. This process may include:
Adding the user to administrator groups. The installer adds the user to specific system groups required for managing CPE. Common groups may include:
pe-admin: A group often associated with administrative access to CPE.
root or other system-level groups if broader administrative access is required.
Commands to add a user to a group might include:
usermod -aG pe-admin <username>
Granting access to CPE tools. This step ensures that a user can access and configure the CPE tools, such as compilers, libraries, and debugging utilities. This step might involve modifying environment variables, module paths, or configuration files located in directories, such as /opt/cray/pe/ or similar.
Access to file systems. The administrator must have access to relevant file systems where CPE tools and modules are installed. This may involve setting appropriate permissions on directories like:
/opt/cray/pe/
/etc/opt/cray/pe/
Configure secure authentication for the administrator account to ensure that only authorized personnel can access CPE tools. This step can include setting up SSH key-based access, enforcing strong passwords, or enabling multi-factor authentication (MFA).
Validate privileges. The installer tests the administrator account to ensure it has the required access and functionality to manage CPE. This validation step involves:
a. Loading and unloading software modules (for example, using module load and module unload commands).
b. Configuring compiler settings and library paths.
c. Accessing debugging tools and performance analysis utilities.
After the privileges are validated, the installer documents the setup process and provides the administrator credentials and relevant instructions to the designated CPE administrator. This documentation typically includes:
Account details.
Steps for managing PE tools and modules.
Paths to configuration files and installed software.
The installer ensures that the privilege assignment aligns with HPE documentation, best practices, and security requirements. Specific details may depend on the software version and organizational policies.
Setting up, managing, and maintaining CPE users and user groups
Prerequisites
Retain root or existing administrator privileges to set up or make changes to user privileges.
Retain login access to the system where CPE is installed, typically through SSH or another secure method.
Maintain familiarity with:
The specific HPE Cray Supercomputing EX system, its architecture, and how administrative roles are managed.
System authentication.
Job scheduling tools.
Role-based access control (RBAC) through configuration files or centralized authentication systems (such as LDAP, Active Directory)
Linux/Unix groups on the system (for example, crayadmin or similar groups)
Important: Before making system modifications, be sure to back up any configuration files or settings, particularly if modifying system-level or application-specific configurations.
Adding, deleting, and modifying configurations for CPE users
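The following sections provide instructions for setting up a new CPE user, deleting an existing CPE user, and modifying configurations for an existing CPE user.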
Setting up a new CPE user
This procedure details how to set up a new CPE user. As you are completing this procedure, note that the exact commands and procedures may vary depending on the HPE Cray Supercomputing EX system configuration, authentication method (for example, LDAP, Kerberos), and job scheduler (for example, SLURM, PBS).
Gather new user information and access requirements:
Username
Full name
Email address
Group or project association
Home directory requirements
Shell preferences (for example, bash, zsh)
Use standard Linux commands to create the new user account. For example:
sudo useradd -m -s /bin/bash -G <group> <username>
Note:
-m: Creates a home directory for the user.
-s /bin/bash: Sets the user’s default shell.
-G <group>: Adds the user to a specific group (for example, a project group).
<username>: The system access name for the user.
Set the user’s password:
sudo passwd <username>
Ensure the user has appropriate permissions to access the necessary directories and files:
Verify access to their home directory (for example, /home/<username>).
Configure access to shared project directories, if applicable.
If the system uses resource allocation and limits (for example, through SLURM or PBS), assign the user to the correct groups for job scheduling and resource utilization.
For SLURM-based systems, update the SLURM configuration to include the user in the appropriate account or partition. For example:
sacctmgr add user <username> DefaultAccount=<account>
Validate environment setup by ensuring that the user has access to CPE tools and modules. For example:
Check if the user can load CPE modules (for example, PrgEnv-cray, gcc, and so forth) through module load.
Verify paths to compilers and libraries are set correctly in their environment.
Test and verify the account by logging in as the new user or asking them to log in to ensure:
Successful authentication.
Proper access to their home directory.
Ability to load necessary modules and submit jobs.
Communicate credentials and guidelines. Provide the user with their login credentials, initial password, and instructions for accessing the system. Include guidelines for:
Changing their password.
Using modules to load tools.
Submitting jobs via the scheduler.
After the account is active, monitor new user activity to ensure that:
They can run jobs successfully.
Their resource usage is within expected limits.
Update system documentation or user management records to include new user information.
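The account-creation commands in this procedure can be generated for review before anything is run as root. The sketch below validates the username against a conservative pattern, which is an assumption and not an HPE requirement, and then prints the useradd, passwd, and sacctmgr commands shown above. The function names are hypothetical.

```shell
# Succeed if the name starts with a lowercase letter or underscore and
# continues with up to 31 lowercase letters, digits, underscores, or hyphens.
valid_username() {
    printf '%s\n' "$1" | grep -Eq '^[a-z_][a-z0-9_-]{0,31}$'
}

# Print (do not run) the commands that create the account.
# Usage: new_user_commands <username> <group> <scheduler_account>
new_user_commands() {
    local username="$1" group="$2" account="$3"
    valid_username "$username" || { echo "invalid username: $username" >&2; return 1; }
    printf 'sudo useradd -m -s /bin/bash -G %s %s\n' "$group" "$username"
    printf 'sudo passwd %s\n' "$username"
    printf 'sacctmgr add user %s DefaultAccount=%s\n' "$username" "$account"
}
```

For example, `new_user_commands jdoe astro-research astro` prints three commands that an administrator can inspect and then run. Sites that use LDAP or another directory service would replace the useradd and passwd lines with their own tooling.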
Deleting an existing CPE user
This procedure details instructions for deleting an existing CPE user. Deleting an existing user from an HPE Cray Supercomputing EX system involves several steps to ensure that the user’s account is disabled, their files are handled appropriately, and any system records (for example, job scheduler configurations) are updated. To delete a CPE user:
Confirm the username of the user to be removed. Gather any additional information about their account, such as:
Home directory location
Group memberships
Active jobs or queued jobs in the job scheduler
If the system uses a job scheduler (for example, SLURM), check if the user has any active or pending jobs. For example, in SLURM:
squeue -u <username>
If active or queued jobs exist, coordinate with the user or relevant stakeholders to cancel them. To cancel jobs:
scancel -u <username>
Before permanently deleting the user, as a best practice, disable the account to prevent access while you handle their files and configurations. You can lock the account by running:
sudo usermod -L <username>
Alternatively, you can expire the account immediately:
sudo chage -E 0 <username>
If the user’s home directory or files need to be retained for archival or transfer purposes, back them up before deletion:
tar -czvf /backup/location/<username>.tar.gz /home/<username>
If you are ready to delete the account, remove the user and their home directory:
sudo userdel -r <username>
The -r option removes the user’s home directory and mail spool.
If you do not want to delete their home directory, omit the -r flag.
Remove the user from the job scheduler’s configuration. For example, in SLURM, you can remove the user from accounts or associations:
sacctmgr delete user name=<username>
If the user was part of specific groups (for example, project groups), remove their association from those groups:
sudo gpasswd -d <username> <group>
If the user was the only member of a specific group, consider deleting the group:
sudo groupdel <group>
Check for and remove any custom configurations or traces of the user, such as:
Entries in /etc/exports for NFS shares.
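SSH keys in the user's authorized_keys file (for example, ~/.ssh/authorized_keys, or a central location such as /etc/ssh/authorized_keys if your site configures one).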
Resource allocation or quota configurations.
Ensure that the user account and associated data have been removed:
Check for the username in the system:
getent passwd <username>
Verify that the home directory or other files are no longer present.
Update system documentation or user management records to reflect the removal of the user.
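The final verification can be scripted so that it is repeatable. This sketch uses id rather than getent to test for the account; both consult the system's configured user databases. The function names and the home directory path in the usage note are hypothetical.

```shell
# Succeed if the system still knows the user.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Report anything left behind; returns non-zero if a trace remains.
# Usage: removal_report <username> <home_dir>
removal_report() {
    local username="$1" home_dir="$2" status=0
    if user_exists "$username"; then
        echo "account still present: $username"
        status=1
    fi
    if [ -e "$home_dir" ]; then
        echo "home directory still present: $home_dir"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "no trace of $username found"
    fi
    return "$status"
}
```

For example, `removal_report jdoe /home/jdoe` confirms that both the account and the home directory are gone. Scheduler associations and quotas still need to be checked with the scheduler and file system tools.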
Modifying configurations for an existing CPE user
Modifying the configurations of an existing CPE user on an HPE Cray Supercomputing EX system involves several steps, depending on the specific changes required. As a CPE administrator, you can adjust user settings related to account details, group memberships, job scheduler configurations, permissions, or environment variables.
Note: Always follow your organization’s policies and the official CPE documentation when performing user management tasks.
To modify an existing CPE user account:
Determine the username of the user whose configurations need to be modified. Identify the specific changes required, such as:
Updating account information (for example, shell, home directory).
Modifying group memberships.
Adjusting job scheduling/resource allocations.
Changing environment variables or module configurations.
To update basic user details like shell or home directory, use the usermod command:
a. To change the user’s shell:
sudo usermod -s /bin/zsh <username>
b. To change the user’s home directory:
sudo usermod -d /new/home/directory <username>
c. To move the home directory, ensure the old files are transferred to the new location:
sudo mv /home/<username> /new/home/directory
sudo chown -R <username>:<group> /new/home/directory
Add or remove the user from specific groups:
a. To add the user to a group:
sudo usermod -aG <group> <username>
b. To remove the user from a group:
sudo gpasswd -d <username> <group>
If the system uses SLURM or another job scheduler, modify the user’s resource allocation, account, or partition access. For example, with SLURM:
a. Change the user’s default account:
sacctmgr modify user name=<username> set DefaultAccount=<new_account>
b. Add the user to a new account:
sacctmgr add user name=<username> Account=<new_account>
c. Update resource limits by modifying resource limits associated with the user, such as CPU hours or memory allocations, through the SLURM database or configuration files.
If the user needs access to new directories or files, adjust file system permissions using chmod or chown. For example, to grant access to a shared project directory:
sudo chown <username>:<group> /path/to/project
sudo chmod 770 /path/to/project
If the user needs changes to their environment setup (for example, custom paths, module loading behavior), update their shell configuration files:
For bash: Modify /home/<username>/.bashrc or /home/<username>/.bash_profile.
For zsh: Modify /home/<username>/.zshrc.
For example, add custom paths or default module loads:
echo 'export PATH=/custom/software/bin:$PATH' >> /home/<username>/.bashrc
echo 'module load PrgEnv-cray' >> /home/<username>/.bashrc
If the system uses centralized module configurations, adjust the relevant files or scripts that define user-specific module loading behavior.
Log in as the user or ask them to log in and verify that the changes are working as intended:
Check updated environment variables.
Confirm group memberships.
Test job submission and resource allocations.
Update system documentation or user management records to reflect the changes made to the user’s configuration.
Inform the user about the modifications and provide instructions if needed (for example, on new resource allocations or updated environment settings).
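Appending to a shell configuration file with echo adds a duplicate line each time the command is repeated. A small guard avoids that. The helper name append_once below is hypothetical; it adds a line only if the file does not already contain exactly that line.

```shell
# Hypothetical helper: append <line> to <file> unless it is already there.
append_once() {
    local line="$1" file="$2"
    touch "$file" || return 1
    # -F: fixed string, -x: match the whole line, -q: quiet.
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example, `append_once 'module load PrgEnv-cray' /home/<username>/.bashrc` can be run repeatedly and leaves a single copy of the line in the file.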
Setting up a user as an administrator
The following procedure details how to add administrative privileges to a new or existing CPE user. Before completing this procedure consider:
Security Impacts. Ensure that issuing administrative privileges is in line with your organization’s security policy.
CPE-specific Groups. Some HPE Cray Supercomputing EX systems may have specific administrative groups or roles. Be sure to check appropriate system documentation for any custom groups or access requirements. For documentation information, see Documentation and support.
Testing. After setup, test the user’s system capabilities to ensure that they have the required permissions without unnecessary access.
Determine which node the user is to use:
Management Node (for example, SMW or HPE Service Node): Used for administrative tasks, such as system configuration, HPE Cray Supercomputing EX system software installation, and system monitoring.
Login Node: Used for accessing the programming environment and running user-level development tasks.
Verify the hostname and confirm the node type:
hostname
Check if the user already exists:
id username
Do one of the following:
If the user does not exist, create the user by issuing:
useradd -m -s /bin/bash username
passwd username
If the user exists, skip user creation and continue with the next step.
Log into the appropriate node using an account with root or administrator privileges:
ssh root@<hostname>
where <hostname> is the name of the management or login node.
For example, to log in to the management node (SMW):
ssh root@smw01
To log in to the login node:
ssh root@login01
In these examples, smw01 is a management node and login01 is a login node.
Add the new or existing user to the appropriate administrative group. Do one of the following:
For SUSE Linux Enterprise Server (SLES):
sudo usermod -aG groupname username
In the above example, groupname is the name of the administrative group. SLES systems typically use the sudo or wheel group to grant administrative privileges. If the wheel group is not enabled for sudo access, edit the /etc/sudoers file with visudo to grant the group sudo access.
For Red Hat Enterprise Linux (RHEL):
a. Determine the appropriate administrative group. RHEL systems primarily use the wheel group to grant administrative privileges.
b. Add the user to the wheel group:
usermod -aG wheel username
c. Ensure that the /etc/sudoers file grants sudo access to the wheel group (edit the file with visudo).
Ensure that the user has access to CPE-specific administrative tools and configuration files. This action may involve granting permission to directories like /opt/cray, /etc/opt/cray, or other system directories where CPE is installed.
sudo chown -R username /path/to/cray/directory
sudo chmod -R 750 /path/to/cray/directory
In the above example, /path/to/cray/directory is a specific path to a directory on the system.
If additional HPE Cray-specific groups are defined (such as crayadmin), add the user to those groups:
sudo usermod -aG crayadmin username
Ensure that the user’s environment is set up to use CPE by updating shell configuration files (for example, .bashrc or .profile), including necessary module commands:
module load cpe
Check CPE-specific documentation for additional environment variables or modules that must be loaded.
Test the user’s administrative access by switching to their account:
su - username
Confirm that the user can execute administrative commands, such as:
sudo ls /root
Confirm that the user can access and use CPE-specific tools.
Document and record the user’s new administrative privileges for auditing and troubleshooting purposes.
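Whether the wheel group is enabled for sudo can be checked by looking for an active %wheel rule in the sudoers file. The sketch below assumes the conventional rule form `%wheel ALL=(ALL) ALL`; your site's rule may differ, and rules can also live in files under /etc/sudoers.d. The function name wheel_sudo_enabled is hypothetical, and reading /etc/sudoers normally requires root.

```shell
# Hypothetical helper: succeed if <sudoers_file> contains an uncommented
# rule that begins with %wheel, for example:
#   %wheel ALL=(ALL) ALL
wheel_sudo_enabled() {
    local sudoers_file="$1"
    grep -Eq '^[[:space:]]*%wheel[[:space:]]+ALL[[:space:]]*=' "$sudoers_file"
}
```

For example, `sudo sh -c 'grep -Eq "^[[:space:]]*%wheel[[:space:]]+ALL[[:space:]]*=" /etc/sudoers' && echo enabled` performs the same test on the live file. After any edit, `visudo -c` checks the sudoers syntax.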
Setting up and managing user groups
Setting up user groups in CPE involves creating and managing Linux groups, configuring access to shared resources, and integrating groups with the job scheduler (such as PBS or SLURM). User groups are essential for organizing users by projects, roles, or resource access requirements.
Note: If your system uses centralized authentication (for example, LDAP, Active Directory), group creation and management may need to be performed at the directory service level. Always refer to your organization’s policies and the official HPE Cray Supercomputing EX system and PBS documentation for best practices.
Planning for and setting up the user group with Slurm
Plan the user group structure beforehand by determining the purpose and structure of the groups. For instance:
Are groups organized by projects, departments, or roles?
Will groups control access to specific directories, files, or resources?
Are there job scheduler partitions or accounts linked to these groups?
Document the group names, their members, and their intended purpose.
Create a new group:
a. Use the groupadd command to create a new Linux group for the users:
sudo groupadd <groupname>
For example, to create a group for a project called astro-research, enter:
sudo groupadd astro-research
b. (Optional) Assign a specific Group ID (GID). You can specify a GID during group creation:
sudo groupadd -g <GID> <groupname>
Add users to the group:
a. Add users to the group using the usermod command:
sudo usermod -aG <groupname> <username>
For example, to add a user jdoe to the astro-research group:
sudo usermod -aG astro-research jdoe
b. Use the groups command to verify the user’s group memberships:
groups <username>
If you are using PBS, configure queues based on user groups to control access to resources. To create or modify a PBS queue:
a. Edit the PBS queue configuration to restrict access to a specific group:
qmgr -c "create queue climate_queue queue_type=execution"
qmgr -c "set queue climate_queue acl_group_enable=True"
qmgr -c "set queue climate_queue acl_groups=climate-research"
qmgr -c "set queue climate_queue enabled=True"
qmgr -c "set queue climate_queue started=True"
Note: In the above example, a queue named climate_queue is created, and access is restricted to users in the climate-research group by enabling group access control (acl_group_enable) and listing the group in the acl_groups attribute.
b. Set default queue resource limits for the queue:
qmgr -c "set queue climate_queue resources_max.walltime=48:00:00"
qmgr -c "set queue climate_queue resources_max.ncpus=64"
qmgr -c "set queue climate_queue default_chunk.ncpus=1"
c. Set global PBS server policies to control group-based resource access by enabling group access control at the server level:
qmgr -c "set server acl_group_enable=True"
d. Define global resource limits for groups. For example, to restrict the climate-research group to a maximum of 100 CPUs across the system, enter:
qmgr -c "set server resources_available.ncpus=100"
(Optional) If the group is to share a directory (for example, for project files), create the directory and configure permissions:
sudo mkdir /shared/projects/astro-research
sudo chown :astro-research /shared/projects/astro-research
sudo chmod 770 /shared/projects/astro-research
Note: In the above example, 770 grants full access to the group and the directory owner but denies access to others.
(Optional) Set the setgid bit on the directory so that files created in it inherit the group ownership:
sudo chmod g+s /shared/projects/astro-research
If you are using Slurm, configure accounts to control resource access and allocations:
a. Create or update an account in Slurm:
sacctmgr add account name=<accountname> description="Astro Research Project"
b. Associate the group with the Slurm account by adding the users in the group to the account:
sacctmgr add user name=<username> account=<accountname>
Set resource limits for the group either at the system level (for example, through ulimit or cgroups) or in the job scheduler. For example, set Slurm QoS for a group to limit resource usage:
sacctmgr add qos name=<qosname> maxtres=cpu=1000 maxtresperuser=cpu=100
sacctmgr modify account name=<accountname> set qos=<qosname>
Ensure that the group is functioning as intended:
Verify group memberships with groups <username>.
Check access to shared directories and resources.
Confirm users can submit jobs with the correct group/account settings in Slurm or PBS:
sbatch --account=<accountname> jobscript.slurm
or
qsub -q climate_queue jobscript.pbs
Search Slurm Workload Manager documentation for more information on sbatch and batch scripts.
Confirm that users not in the group are denied access to the queue or resources.
Document the configuration, and maintain a record of the groups, their members, and their configurations:
Group names and members.
Associated PBS queues and resource limits.
Shared directories and file permissions.
Where applicable, maintain a record of the groups, their members, and their purpose. Include:
Group name and GID.
Members of the group.
Associated Slurm accounts, QoS, or resource limits.
Shared directories and access permissions.
Inform users about their group memberships, shared directory locations, and PBS queue configurations. Provide instructions on how to submit jobs to the appropriate queue.
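The Linux group and shared directory steps above can be previewed before anything is changed. The following is a minimal sketch, not an HPE-provided tool: plan_group_setup is a hypothetical helper that only prints the commands for one group, and the name check is an assumed site policy rather than a CPE requirement. Mode 2770 combines the 770 permissions with the setgid bit that the procedure sets in two separate steps.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the commands from this procedure for one group
# without executing anything. plan_group_setup is a hypothetical helper name.
plan_group_setup() {
    local group="$1" dir="$2"
    local LC_ALL=C   # keep the a-z ranges below ASCII-only

    # Conservative name check (an assumed site policy, not a CPE requirement).
    case "$group" in
        ''|[!a-z_]*|*[!a-z0-9_-]*)
            echo "invalid group name: $group" >&2
            return 1
            ;;
    esac

    printf 'sudo groupadd %s\n' "$group"
    printf 'sudo mkdir -p %s\n' "$dir"
    printf 'sudo chown :%s %s\n' "$group" "$dir"
    # 2770 = rwx for owner and group, nothing for others, plus setgid
    printf 'sudo chmod 2770 %s\n' "$dir"
}

plan_group_setup astro-research /shared/projects/astro-research
```

Reviewing the printed commands first makes it easier to record them in the group documentation before running them by hand.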
Managing users for user groups
As you are managing user access within groups, note that:
If the system uses centralized authentication (such as LDAP, Active Directory), group membership changes may need to be made in the directory service instead of directly on the system. Consult your organization’s policies for managing groups in such environments.
You should always verify that users are added to the correct groups and that group permissions align with your organization’s security and resource access policies.
Adding users to a user group
Identify the group and users:
Determine the name of the existing group to which users will be added.
Identify the usernames of the users to be added to the group.
Confirm the purpose of the group (e.g., file sharing, job scheduler access) and ensure the users require access.
Use the usermod or gpasswd command to add users to the group.
For a single user, enter:
sudo usermod -aG <groupname> <username>
Note: In the above example:
-aG: Adds the user to the group without removing them from other groups.
<groupname>: The name of the group.
<username>: The name of the user.
Example:
sudo usermod -aG research-group jdoe
For multiple users, enter:
sudo gpasswd -M <user1>,<user2>,<user3> <groupname>
This replaces the group’s membership with the specified users. If you want to add users without overriding current members, add them individually using usermod.
Example:
sudo gpasswd -M jdoe,asmith,kwong research-group
After adding users, verify that they have been successfully added to the group.
a. Check a user’s group memberships:
groups <username>
Example:
groups jdoe
b. Check the members of a specific group. To see all members of a group as recorded in the /etc/group file, enter:
getent group <groupname>
Example:
getent group research-group
If the group controls access to resources, such as shared directories or job scheduler queues, ensure that new users can access those resources:
Shared Directories: Verify the users have the appropriate permissions for any shared directories associated with the group.
ls -ld /path/to/shared/directory
If needed, update directory permissions:
sudo chown :<groupname> /path/to/shared/directory
sudo chmod 770 /path/to/shared/directory
Job Scheduler (PBS): If the group is associated with a PBS queue, confirm that new users can access the queue. You may need to update PBS access controls:
qmgr -c "set queue <queue_name> acl_user_enable=True"
qmgr -c "set queue <queue_name> acl_users+=<username>"
Test access by asking users to confirm that they can reach the shared directories, PBS queues, or other resources associated with the group.
Document the changes. Update system documentation or user management records to reflect the changes to group membership. Record:
Group name.
Users added to the group.
Purpose of the group and associated resources.
Inform the users of their new group membership and provide instructions if necessary (for example, how to access shared directories, submit jobs to specific queues, and so forth).
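Because gpasswd -M replaces the whole member list, it can help to build the new list from the current members plus the additions. The following is a minimal sketch under that assumption: merge_members is a hypothetical helper name, and the commented lines show one way it might be combined with getent and gpasswd on a live system.

```shell
#!/usr/bin/env bash
# merge_members is a hypothetical helper: it prints the union of two
# comma-separated user lists, keeping first-seen order and dropping duplicates.
merge_members() {
    printf '%s,%s\n' "$1" "$2" \
        | tr ',' '\n' \
        | awk 'NF && !seen[$0]++' \
        | paste -sd, -
}

# Example with literal lists (no system changes are made here):
merge_members "asmith,kwong" "jdoe,asmith"   # prints: asmith,kwong,jdoe

# On a live system the current list could come from field 4 of the group entry:
#   current="$(getent group research-group | cut -d: -f4)"
#   sudo gpasswd -M "$(merge_members "$current" "jdoe")" research-group
```

For one or two users, usermod -aG remains the simpler choice; a merged list is mainly useful when many users are added at once.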
Removing users from groups
To remove a user from a group:
Understand the context in which the users are being removed. In Slurm, user groups are typically managed through Linux user groups on the system. Slurm uses these groups to control access to resources through accounts and associations. To remove a user from a group associated with Slurm, primarily interact with the system’s user and group management tools.
Identify the group and users:
Confirm the group name from which the users will be removed.
Identify the usernames of the users to be removed.
Verify that the users no longer require access to resources associated with the group (for example, shared directories, PBS queues).
Remove the user(s) from the Linux group. If the user group in Slurm is linked to a Linux group, remove users from that group using standard Linux commands.
To remove one user, enter:
sudo gpasswd -d <username> <groupname>
In the above example, replace <username> with the username of the user you want to remove, and <groupname> with the name of the group.
Example:
sudo gpasswd -d jdoe research-group
To replace the entire membership of a group while excluding certain users, use the gpasswd command with the -M option to redefine the group membership:
sudo gpasswd -M <remaining_user1>,<remaining_user2> <groupname>
Example:
sudo gpasswd -M asmith,kwong research-group
The above command overwrites the group membership with only the specified users.
Alternatively, to edit the /etc/group file directly, use:
sudo nano /etc/group
Locate the <groupname> entry, remove the user’s name from the list of group members, and then save and exit the file.
If Slurm is configured to use its own account and association system, ensure the user’s association with the group is removed in Slurm. Use the sacctmgr command to manage Slurm accounts and associations. To remove a user from a Slurm account or group, enter:
sacctmgr remove user where user=<username> account=<accountname>
In the above example, replace <username> with the username of the user and <accountname> with the name of the Slurm account or group.
Example:
sacctmgr remove user where user=johndoe account=research
For PBS, verify that the users are no longer part of the group.
a. Check the user’s group memberships to confirm that the user is no longer a member of the group:
groups <username>
Example:
groups jdoe
b. Check the members of a specific group to confirm the current membership of the group:
getent group <groupname>
Example:
getent group research-group
c. If the group is associated with specific resources, ensure that the user’s access to those resources is revoked, as applicable.
For shared directories, verify that the user can no longer access shared directories associated with the group. If necessary, update permissions:
sudo chmod 770 /shared/projects/research-group
sudo chown :research-group /shared/projects/research-group
For the PBS job scheduler, if the group is tied to a PBS queue, ensure that the user’s access to the queue is also revoked. For example:
qmgr -c "set queue <queue_name> acl_users-=<username>"
In the above example, replace <queue_name> with the name of the queue and <username> with the user being removed.
Confirm the changes when prompted.
Verify that the user has been removed from the group:
Check the Linux group membership by entering:
groups <username>
The above command should no longer list the specified <groupname>.
Check Slurm account associations, as applicable:
sacctmgr show associations where user=<username>
The system should no longer list the association with the specified <accountname> after issuing the above command.
If the user has been removed from a group or account that controls access to compute resources, notify them to prevent confusion.
Document updates. Include:
Group name.
Users removed.
Associated resources affected.
Inform the affected users of the changes, especially if their access to specific resources (for example, directories or job queues) has been revoked.
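The membership check in the verification steps can also be scripted. The following is a minimal sketch: group_has_member is a hypothetical helper that inspects one group entry in the name:password:GID:members layout printed by getent group, and the sample entry and GID are made up. Note that this member list does not include users whose primary group is the group in question.

```shell
#!/usr/bin/env bash
# group_has_member is a hypothetical helper. It takes a user name and one
# group entry in the name:password:GID:member,member format and succeeds
# only if the user appears in the member list.
group_has_member() {
    local user="$1" entry="$2"
    local members="${entry##*:}"      # text after the last colon
    case ",$members," in
        *",$user,"*) return 0 ;;
        *)           return 1 ;;
    esac
}

# The sample entry and GID 1005 are made up for illustration.
entry="research-group:x:1005:asmith,kwong"
if group_has_member jdoe "$entry"; then
    echo "jdoe is still a member"
else
    echo "jdoe is not listed"
fi
```

On a live system, the entry argument could be supplied as "$(getent group research-group)".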
Deleting unused groups
If a group is no longer needed, use:
sudo groupdel <groupname>
Common administrator customization and configuration tasks
CPE is designed to optimize and simplify the development, debugging, and execution of applications on HPE Cray supercomputers. As a system administrator, managing the CPE involves configuring, customizing, and maintaining the environment to meet the needs of users and workloads. Common administrative procedures include managing software modules, configuring compilers and libraries, customizing job environment settings, and optimizing system performance.
Prerequisites
To perform the procedures in this chapter, you must have:
Root or administrative access privileges to the HPE Cray Supercomputing EX system.
Access to the configuration files for Lmod (typically in /etc/modulefiles or /opt/modulefiles).
Access to HPE Cray Supercomputing EX system library directories.
Access to the module system configuration files (for example, /etc/profile.d/modules.sh or /etc/profile.d/lmod.sh).
Access to sample applications for profiling.
The HPE CrayPAT module installed.
Root access to the Slurm configuration files (/etc/slurm/slurm.conf).
Familiarity with the module system (module command), wrapper scripts (such as cc, CC, ftn), and Slurm commands (for example, sbatch, srun).
Managing software modules
The CPE uses the Lmod module system to manage software environments, allowing users to load and unload specific versions of tools, compilers, and libraries. To manage software modules:
Verify the module system configuration by checking the location of the modulefiles directory:
echo $MODULEPATH
Verify that the module system is operational:
module avail
Add a new module by copying or creating a modulefile for the software:
nano /opt/modulefiles/software_name/version
Example: gcc modulefile content:
#%Module1.0
proc ModulesHelp { } {
    puts stderr "This module loads GCC version x.y.z"
}
module-whatis "GCC Compiler x.y.z"
setenv GCC_HOME /opt/gcc/x.y.z
prepend-path PATH /opt/gcc/x.y.z/bin
prepend-path LD_LIBRARY_PATH /opt/gcc/x.y.z/lib64
Refresh the module cache, as applicable:
module --ignore_cache avail
Load the module, and verify that it works:
module load software_name/version
software_name --version
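A Tcl modulefile such as the gcc example is recognized by the #%Module magic cookie on its first line, so a quick check for that line can catch a file that was created or copied incorrectly. The following is a minimal sketch: has_module_cookie is a hypothetical helper, and the demonstration uses a temporary file rather than a real modulefile tree.

```shell
#!/usr/bin/env bash
# has_module_cookie is a hypothetical helper: it succeeds when the first
# line of the given file starts with the #%Module magic cookie.
has_module_cookie() {
    head -n 1 "$1" | grep -q '^#%Module'
}

# Demonstration on a throwaway file rather than a real modulefile tree.
tmp="$(mktemp)"
printf '#%%Module1.0\nmodule-whatis "demo"\n' > "$tmp"
if has_module_cookie "$tmp"; then
    echo "cookie present"
fi
rm -f "$tmp"
```

The same function could be pointed at /opt/modulefiles/software_name/version after the file is saved.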
Configuring default modules
Default modules can be configured to ensure that users have access to essential tools and libraries when they log in. To do so:
Edit the default module configuration by modifying the system-wide module initialization file:
nano /etc/profile.d/modules.sh
Add commands to load default modules:
module load cpe/22.10
module load gcc/11.2.0
Log in as a non-administrative user and verify that the modules are loaded by default:
module list
If issues arise, revert the changes by editing the file again or restoring the previous configuration.
Configuring compiler and library defaults
CPE includes compilers (such as the HPE Cray Compiler Environment (CCE), GCC, Intel, and so forth) and libraries (such as Cray LibSci, MPI, and so forth). Administrators can configure default versions or customize compiler options.
Set the default compiler version. For module-based configurations:
module unload cce gcc intel
module load gcc/11.2.0
Verify the default compiler:
cc --version
Customize compiler flags by editing the system-wide compiler wrapper configuration file (usually found in /opt/cray/pe):
nano /opt/cray/pe/compilers/default/compiler_flags
Add custom flags:
export CFLAGS="-O3 -march=native"
export LDFLAGS="-L/opt/cray/lib"
Compile a sample program to verify the configuration:
cc test_program.c -o test_program
./test_program
Customizing job scheduler integration
HPE Cray Supercomputing EX systems often use Slurm as the workload manager. To customize Slurm settings to optimize job submission and resource usage for the CPE:
Edit the Slurm configuration:
nano /etc/slurm/slurm.conf
Example: Adding constraints for high-bandwidth memory (HBM):
NodeName=cray[1-4] RealMemory=64000 Gres=hbm:16
Apply the changes by restarting Slurm:
systemctl restart slurmctld
systemctl restart slurmd
Submit a test job requesting HBM:
sbatch --gres=hbm:4 test_job.sh
Managing HPE Cray-specific libraries
HPE Cray LibSci is a key library for scientific computing. To configure or update it, for example:
Check installed versions by listing available versions of HPE Cray LibSci:
module avail cray-libsci
Set default LibSci version by loading the desired version. For example:
module load cray-libsci/25.03.1
Verify the configuration by compiling and linking a sample program using HPE Cray LibSci:
cc -o test_program test_program.c -lsci
Optimizing performance using HPE CrayPAT
HPE Cray Performance Measurement and Analysis Tools (CrayPAT) are used to profile and optimize applications.
Load the HPE CrayPAT module:
module load perftools
Compile the application with instrumentation:
cc -h profile_generate -o test_program test_program.c
Run and collect data by executing the program to generate performance data:
srun ./test_program
Use HPE CrayPAT tools to analyze the collected data:
pat_build -O test_program
pat_report test_program.xf
Common security protocols
Setting up common CPE-related security protocols is a critical task for administrators to ensure the system is secure, logs are properly analyzed, and unauthorized access or malicious activity is detected and mitigated. This chapter details some of the basic areas that an administrator should focus on, the steps to set up and analyze security protocols, and the tools required for each procedure. This process includes securing user authentication, configuring logging and auditing, setting up network security, and monitoring for suspicious activity.
Basic and key areas where security protocols need to be established include:
Area | Description |
|---|---|
User Authentication and Access Control | Ensure secure login mechanisms (for example, SSH with key-based authentication). Restrict user access using PAM (Pluggable Authentication Modules) and account policies. Enforce strong password policies. |
Logging and Auditing | Centralize system logs. Enable detailed auditing of system and user activities. Monitor logs for anomalies. |
Network Security | Restrict network access using firewalls or iptables. Configure secure communication protocols. Monitor network traffic for unauthorized access. |
Software and Environment Security | Protect CPE-related modules and libraries. Ensure system updates and patches are applied. |
Monitoring and Intrusion Detection | Set up intrusion detection systems (for example, fail2ban, auditd). Analyze logs for unusual activity. |
User authentication and access control
The procedure in this section provides steps for securing user authentication and limiting access to authorized users only.
Prerequisites
You must have:
Root or administrator access to the system.
SSH installed and configured.
Access to ssh-keygen, passwd, and the Pluggable Authentication Modules (PAM) configuration files.
Procedure
To secure user authentication and limit access to authorized users:
Log in to the management node:
ssh admin@<hostname>
Set up key-based SSH authentication:
a. Generate SSH keys on the client machine:
ssh-keygen -t rsa -b 4096
b. Copy the public key to the HPE Cray Supercomputing EX system:
ssh-copy-id user@<hostname>
c. Disable password-based authentication in /etc/ssh/sshd_config:
PasswordAuthentication no
d. Restart the SSH service:
sudo systemctl restart sshd
Restrict root login by editing /etc/ssh/sshd_config to disable root login:
PermitRootLogin no
Enforce strong password policies by editing /etc/security/pwquality.conf to set password requirements:
minlen = 12
dcredit = -1
ucredit = -1
lcredit = -1
ocredit = -1
Test the changes:
passwd <username>
Limit user access by restricting system access to specific users using /etc/security/access.conf:
-:ALL EXCEPT admin_user:ALL
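Before restarting sshd, it can be useful to confirm what the edited file actually says, because sshd uses the first value it reads for each keyword. The following is a minimal sketch: sshd_setting is a hypothetical helper that scans sshd_config text for the first uncommented occurrence of a keyword. It does not interpret Match blocks or Include directives, so it is only a quick check and not a substitute for the server's own validation.

```shell
#!/usr/bin/env bash
# sshd_setting is a hypothetical helper: it reads sshd_config text on stdin
# and prints the value of the first uncommented occurrence of a keyword.
# Keywords are matched case-insensitively. Match blocks and Include
# directives are NOT interpreted, so treat the result as a quick check only.
sshd_setting() {
    awk -v key="$1" 'tolower($1) == tolower(key) { print $2; exit }'
}

sample='#PasswordAuthentication yes
PasswordAuthentication no
PermitRootLogin no'

printf '%s\n' "$sample" | sshd_setting PasswordAuthentication   # prints: no
```

On a live system the input could be /etc/ssh/sshd_config itself. Running sudo sshd -t checks the file for syntax errors, and sudo sshd -T prints the effective configuration; keeping an existing session open while testing a new login helps avoid a lockout.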
Enabling logging and auditing
This procedure details steps for enabling logging and auditing of user and system activity.
Prerequisites
You must have:
Administrative access to the system.
Access to rsyslog, journalctl, and auditd tools.
Procedure
To enable logging and auditing:
Enable system logging:
sudo systemctl enable rsyslog
sudo systemctl start rsyslog
Check the configuration file /etc/rsyslog.conf to ensure log files are written to /var/log:
tail -f /var/log/messages
Enable persistent journal logs by configuring the journal to persist logs across reboots:
sudo mkdir -p /var/log/journal
sudo systemctl restart systemd-journald
Ensure auditd is installed:
sudo yum install audit
sudo systemctl enable auditd
sudo systemctl start auditd
Add audit rules to /etc/audit/rules.d/audit.rules:
-w /etc/passwd -p wa -k passwd_changes
-w /etc/shadow -p wa -k shadow_changes
-w /var/log/secure -p wa -k auth_logs
Analyze the logs for anomalies. Use journalctl to view recent logs:
journalctl -xe
Search for specific keywords (such as error or failed):
grep -i "error" /var/log/secure
Setting up network security
This CPE-related procedure provides instructions for setting up network security.
Prerequisites
You must have access to:
Network configuration files.
iptables, firewalld, iftop, and sar.
Procedure
Use firewalld to allow only specific ports:
sudo firewall-cmd --add-service=ssh --permanent
sudo firewall-cmd --add-service=slurm --permanent
sudo firewall-cmd --reload
Use iptables to limit SSH attempts:
sudo iptables -A INPUT -p tcp --dport 22 -m connlimit --connlimit-above 5 -j DROP
Use iftop to monitor real-time network activity:
sudo iftop -i eth0
The iftop command is a real-time network traffic monitoring tool commonly used to observe and analyze network activity on a specific interface. By running sudo iftop -i <interface>, you can monitor bandwidth usage, including inbound and outbound traffic between nodes or external systems.
Example Normal Report:
This example normal report is for a supercomputing environment where nodes are exchanging data for workloads like parallel computations or file transfers. No anomalies are present.
10.0.0.1 => 10.0.0.2    5.5Mb  5.6Mb  5.5Mb
         <=             4.0Mb  4.1Mb  4.0Mb
10.0.0.3 => 10.0.0.4    1.2Mb  1.0Mb  1.1Mb
         <=             0.8Mb  0.7Mb  0.8Mb
10.0.0.5 => 10.0.0.6    0.5Mb  0.5Mb  0.5Mb
         <=             0.4Mb  0.4Mb  0.4Mb
----------------------------------------------------------------------------
TX: 7.2Mb  RX: 5.2Mb  TOTAL: 12.4Mb
Report Explanation:
Traffic Patterns:
The source (10.0.0.1, 10.0.0.3, 10.0.0.5) and destination (10.0.0.2, 10.0.0.4, 10.0.0.6) nodes are communicating normally.
Bandwidth usage is proportional to expected workload, with no significant spikes.
Traffic Volume:
Outbound (TX) traffic is 7.2Mb, and inbound (RX) traffic is 5.2Mb.
The total bandwidth usage on the interface is 12.4Mb, which is reasonable for moderate workloads.
Steady Traffic: Bandwidth usage is consistent across time intervals (2s, 10s, 40s averages are similar).
If anomalies are detected, resolve:
Irregular traffic patterns by investigating the process or application on 10.0.0.1, or check job logs, network configurations, or system resource usage for anomalies.
Unusual traffic patterns by using tools like netstat, tcpdump, or ss to identify the processes generating traffic, or checking application logs or job scheduler activity for anomalies.
Idle or under-utilized network activity by verifying whether the interface is correctly configured and active with the ip link show eth0 command, or checking if jobs or applications are running that should generate traffic.
Use sar to sample network interface statistics (in this example, five samples at one-second intervals):
sar -n DEV 1 5
Example Normal Report:
In the example normal report, the system is handling steady network traffic with no apparent anomalies.
12:00:01 AM  IFACE  rxpck/s  txpck/s  rxkB/s  txkB/s  rxcmp/s  txcmp/s  rxmcst/s
12:00:02 AM  eth0   1200     1100     3200    3100    0.00     0.00     0.00
12:00:03 AM  eth0   1250     1150     3300    3200    0.00     0.00     0.00
12:00:04 AM  eth0   1300     1200     3400    3300    0.00     0.00     0.00
12:00:05 AM  eth0   1290     1190     3380    3250    0.00     0.00     0.00
Average:     eth0   1260     1160     3320    3210    0.00     0.00     0.00
Report Heading | Explanation |
|---|---|
rxpck/s, txpck/s | Packets received/transmitted per second. Normal values depend on the workload but should remain consistent during steady traffic. |
rxkB/s, txkB/s | Received and transmitted kilobytes per second. Normal values depend on the expected data transfer rates for the application. |
rxcmp/s, txcmp/s | Compressed packets. Typically 0.00 unless compression is enabled. |
Example Abnormal Report:
For the example abnormal report, investigate the source of excessive outbound traffic (for example, application logs, intrusion detection). Check for network congestion or malicious activity.
12:00:01 AM  IFACE  rxpck/s  txpck/s  rxkB/s  txkB/s  rxcmp/s  txcmp/s  rxmcst/s
12:00:02 AM  eth0   10       10000    50      10000   0.00     0.00     0.00
12:00:03 AM  eth0   15       12000    60      12000   0.00     0.00     0.00
12:00:04 AM  eth0   10       15000    50      15000   0.00     0.00     0.00
12:00:05 AM  eth0   12       14000    55      14000   0.00     0.00     0.00
Report Heading | Explanation |
|---|---|
High txpck/s and txkB/s | Extremely high transmission rates suggest excessive outbound traffic, possibly caused by a misconfigured application or a denial-of-service (DoS) attack. |
Low rxpck/s and rxkB/s | Very low inbound traffic may indicate a connectivity issue or an imbalance in communication. |
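The pattern in the abnormal report (transmit rates far above receive rates) can be flagged automatically. The following is a minimal sketch: flag_tx_heavy is a hypothetical helper that reads sar -n DEV style text and prints the samples whose txkB/s exceeds a chosen multiple of rxkB/s. The ratio of 10 is an arbitrary illustration, not a recommended threshold, and what counts as abnormal depends on the workload.

```shell
#!/usr/bin/env bash
# flag_tx_heavy is a hypothetical helper. It reads `sar -n DEV` style output
# on stdin and prints "<iface> <rxkB/s> <txkB/s>" for each sample whose
# transmit rate exceeds <ratio> times its receive rate.
flag_tx_heavy() {
    awk -v ratio="$1" '
        /IFACE/ {
            # Locate columns from the header so 12-hour and 24-hour
            # timestamps are both handled.
            for (i = 1; i <= NF; i++) {
                if ($i == "IFACE")  ifc = i
                if ($i == "rxkB/s") rx = i
                if ($i == "txkB/s") tx = i
            }
            next
        }
        # Skip summary rows: their first field shifts the columns.
        $1 == "Average:" { next }
        ifc && rx && tx && NF >= tx && ($tx + 0) > ratio * ($rx + 0) {
            print $ifc, $rx, $tx
        }
    '
}

sample='12:00:01 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
12:00:02 AM eth0 10 10000 50 10000 0.00 0.00 0.00
12:00:03 AM eth0 1250 1150 3300 3200 0.00 0.00 0.00'

# The threshold of 10 is an arbitrary illustration, not a recommended value.
printf '%s\n' "$sample" | flag_tx_heavy 10   # prints: eth0 50 10000
```

On a live system, the input could be piped directly, for example sar -n DEV 1 5 | flag_tx_heavy 10.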
Setting up software and environment security
The procedure in this section provides details on securing CPE-related modules and software, and ensuring that updates are applied.
Prerequisites
You must have:
Administrative access.
Access to the module command and the system package manager (yum or zypper).
Procedure
Restrict access to critical modules by modifying permissions for sensitive modulefiles:
chmod 750 /opt/modulefiles/cce
chgrp admin_group /opt/modulefiles/cce
Update HPE Cray Supercomputing EX software and dependencies:
sudo yum update
Check for missing or corrupted modules:
module avail
Monitoring and intrusion detection
The procedure in this section provides details for detecting and mitigating unauthorized access or activity using monitoring and intrusion detection tools.
Prerequisites
You must have:
Administrative access.
Access to fail2ban and auditd.
Procedure
Install fail2ban:
sudo yum install fail2ban
sudo systemctl enable fail2ban
sudo systemctl start fail2ban
Configure jail rules in /etc/fail2ban/jail.local:
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/secure
maxretry = 5
Monitor intrusion attempts, and view the fail2ban log for blocked IPs:
sudo fail2ban-client status sshd
Use ausearch to inspect suspicious activity recorded under an audit rule key:
sudo ausearch -k auth_logs
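To see which addresses are approaching the maxretry limit before fail2ban acts, the failed logins in the authentication log can be tallied per source address. The following is a minimal sketch: failed_ssh_by_ip is a hypothetical helper, and it assumes the common sshd message layout in which the address follows the word "from". The sample lines are made up and use documentation-only address ranges.

```shell
#!/usr/bin/env bash
# failed_ssh_by_ip is a hypothetical helper. It reads authentication log
# lines on stdin and prints "<count> <address>" for sshd "Failed password"
# messages, highest count first.
failed_ssh_by_ip() {
    awk '
        /Failed password/ {
            for (i = 1; i < NF; i++) {
                if ($i == "from") { count[$(i + 1)]++; break }
            }
        }
        END { for (ip in count) print count[ip], ip }
    ' | sort -rn
}

# Made-up sample lines; the addresses come from documentation-only ranges.
sample='Oct 30 12:00:01 login1 sshd[100]: Failed password for root from 203.0.113.5 port 4000 ssh2
Oct 30 12:00:03 login1 sshd[101]: Failed password for invalid user test from 203.0.113.5 port 4001 ssh2
Oct 30 12:00:09 login1 sshd[102]: Failed password for jdoe from 198.51.100.7 port 4002 ssh2
Oct 30 12:00:12 login1 sshd[103]: Accepted publickey for jdoe from 198.51.100.7 port 4003 ssh2'

printf '%s\n' "$sample" | failed_ssh_by_ip
```

On a live system, the input could be the log named in the jail configuration, for example sudo cat /var/log/secure | failed_ssh_by_ip.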
Common CPE monitoring tasks
Maintaining the health of CPE involves monitoring system health, analyzing log files for diagnostics, and tracking software usage to ensure optimal functionality and usage by users. This chapter details the most common administrative procedures for these tasks.
Analyzing log files and diagnostic outputs
Analyzing system and application logs is critical for diagnosing CPE issues, such as module failures, job errors, or hardware malfunctions.
Prerequisites
Administrative access to system logs.
Familiarity with log locations and tools (less, grep, journalctl).
Required tools/systems
System logs (/var/log), Slurm logs (/var/log/slurm/slurmctld.log), and CPE-specific logs (such as craype.log).
Log analysis tools: grep, less, journalctl.
Reviewing system logs
To review system logs:
Locate general system logs:
ls /var/log
List CPE-specific logs (for example, craype.log):
ls /opt/cray/logs
Use grep to filter errors or warnings:
grep -i error /var/log/messages
Analyze recent system events:
journalctl -xe
Example: Problematic Output
Oct 30 12:47:10 smw01 kernel: eth0: Link is Down
Oct 30 12:47:12 smw01 kernel: eth0: Link is Up - 1Gbps/Full - flow control off
Oct 30 12:47:12 smw01 systemd-networkd[112]: eth0: Lost carrier
Oct 30 12:47:12 smw01 systemd-networkd[112]: eth0: Configured
Oct 30 12:48:30 smw01 kernel: eth0: Link is Down
Oct 30 12:48:45 smw01 kernel: eth0: Transmit queue timeout
Oct 30 12:48:45 smw01 kernel: eth0: Reset adapter
Oct 30 12:49:00 smw01 systemd-networkd[112]: eth0: Could not configure: Network unreachable
The above example suggests intermittent connectivity issues:
Transmit Queue Timeout: Indicates that packets are queued for transmission but the system is unable to send them:
Oct 30 12:48:45 smw01 kernel: eth0: Transmit queue timeout
This issue could be caused by hardware issues (such as faulty NIC or cable) or excessive network congestion.
Reset Adapter: The kernel resets the network adapter to recover from the timeout:
Oct 30 12:48:45 smw01 kernel: eth0: Reset adapter
Network Unreachable: Indicates that the system could not configure the network interface due to a lack of connectivity:
Oct 30 12:49:00 smw01 systemd-networkd[112]: eth0: Could not configure: Network unreachable
To troubleshoot problematic output:
Investigate hardware issues:
a. Check the physical connection (for example, cables, switches, NICs).
b. Use ip link to check the status of the interface:
ip link show eth0
Verify that the network interface is configured correctly:
ip addr show eth0
Restart the network service:
sudo systemctl restart systemd-networkd
Monitor for intermittent issues by continuously logging network-related events:
journalctl -f -u systemd-networkd
Investigate packet loss by using tools like ping or iperf to test connectivity and bandwidth.
If the issue persists, replace the network adapter, cable, or switch connected to the affected interface.
Analyzing Slurm logs
Check and inspect Slurm controller logs for job errors:
less /var/log/slurm/slurmctld.log
Filter for job errors, and search for specific job IDs:
grep <JobID> /var/log/slurm/slurmctld.log
Check logs from compute nodes for hardware or software issues.
Analyzing CPE-specific logs
To analyze logs:
Locate and review HPE Cray-specific logs. CPE logs are typically located in /opt/cray/logs or /var/log/cray.
ls /var/log/cray
Search for issues in craype.log or craypat.log:
grep -i error /var/log/craype.log
Record errors and determine potential causes for troubleshooting.
Example: Problematic Output #1
Oct 30 12:45:01 smw01 craype[1234]: ERROR: Failed to load module 'cray-mpich': Module not found
To resolve Module not found issues (see issue directly above):
Verify the availability of the missing module:
module avail
Check the modulefiles directory for the cray-mpich module:
ls /opt/modulefiles/cray-mpich
If the module is missing, reinstall the HPE Cray MPI library or restore its modulefile.
Example: Problematic Output #2
Oct 30 13:00:15 smw01 craype[5678]: ERROR: Compiler 'cc' failed with error code 127
Oct 30 13:00:15 smw01 craype[5678]: ERROR: Unable to compile test program for compatibility check
To resolve compiler error issues (see issue directly above):
Ensure the compiler module is loaded:
module load cce
Verify the compiler version:
cc --version
Check if the cc binary is installed and accessible in the PATH:
which cc
If the compiler module is broken, consider reinstalling the HPE Cray compiler suite (CCE).
Tracking CPE usage
Tracking how users interact with CPE modules, compilers, and libraries is important for resource planning and identifying underutilized or problematic software.
Prerequisites
Administrative access to the system.
Familiarity with Slurm accounting and module usage tracking mechanisms.
Required tools/systems
Slurm accounting (sacct).
Environment module usage logs (if enabled).
Performance tools (such as HPE CrayPAT).
Tracking module usage
Enable module logging by adding the following line to the module initialization file (for example, /etc/profile.d/modules.sh):
export MODULE_LOGFILE=/var/log/module_usage.log
Review the log file for module usage:
less /var/log/module_usage.log
Search for a specific module:
grep "cce" /var/log/module_usage.log
Example: Module Usage Log Output
Oct 30 12:00:01 user1 module: load cce/14.0.0
Oct 30 12:00:05 user1 module: load cray-mpich/8.1.9
Oct 30 12:00:10 user1 module: unload cce/14.0.0
Oct 30 12:05:20 user2 module: load gcc/11.2.0
Oct 30 12:10:15 user3 module: load cray-libsci/21.03.1
Oct 30 12:15:00 user1 module: load perftools/21.08.0
The above output reports:
Timestamp: The date and time when the module operation occurred.
Username: The user who executed the module command.
Operation: The module operation performed (for example, load, unload, swap).
Module Name and Version: The full name of the module (including version) being loaded, unloaded, or swapped.
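A log in the layout described above can be summarized to show which modules are loaded most often. The following is a minimal sketch: module_load_counts is a hypothetical helper, and it assumes exactly the field order of the example log (month, day, time, user, "module:", operation, name/version). A site whose log uses a different layout would need different field numbers.

```shell
#!/usr/bin/env bash
# module_load_counts is a hypothetical helper. It assumes the log layout in
# the example above: month, day, time, user, "module:", operation, name/version.
module_load_counts() {
    awk '$5 == "module:" && $6 == "load" {
             split($7, part, "/")      # part[1] is the module name
             loads[part[1]]++
         }
         END { for (m in loads) print loads[m], m }' | sort -k1,1nr -k2,2
}

sample='Oct 30 12:00:01 user1 module: load cce/14.0.0
Oct 30 12:00:10 user1 module: unload cce/14.0.0
Oct 30 12:05:20 user2 module: load gcc/11.2.0
Oct 30 12:07:00 user3 module: load cce/14.0.0'

printf '%s\n' "$sample" | module_load_counts
```

Modules that never appear in such a summary over a long period are candidates for the underutilized software that this section aims to identify.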
Analyzing job resource usage
Use Slurm accounting to view job resource usage:
sacct --format=JobID,User,Partition,AllocCPUs,Elapsed
Create a usage summary:
sacct -S 2023-10-01 -E 2023-10-31 --format=User,JobName,AllocCPUs -P
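The pipe-delimited output produced by the -P option lends itself to a per-user total. The following is a minimal sketch, assuming the three columns User, JobName, and AllocCPUs in that order, with a single header row. Rows with an empty User field (typically job steps) are skipped so that CPUs are not counted twice. The function name sum_cpus_by_user is hypothetical.

```shell
# Sum allocated CPUs per user from pipe-delimited sacct output on stdin.
# Assumes the columns User|JobName|AllocCPUs and one header row.
sum_cpus_by_user() {
    awk -F'|' 'NR > 1 && $1 != "" { cpus[$1] += $3 }
               END { for (u in cpus) print u, cpus[u] }' | sort
}
```

The function is intended to be used at the end of a pipeline, with the sacct command shown above as its input.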
Profiling application performance
Enable HPE CrayPAT:
module load perftools
To instrument and run an application, recompile the application with profiling:
cc -h profile_generate -o app app.c
srun ./app
Access and analyze the performance report:
pat_report app.xf
Troubleshooting CPE
This section provides information on resolving common CPE issues. Should you encounter issues not included in this section, see Documentation and Support for additional resources and information on contacting HPE support.
Resolving the CCE PGAS error and dependency issue resulting in failed image builds
CPE releases previous to the CPE 26.03 release supported OpenSHMEM libraries on HPCM systems. However, with CPE 26.03 (and later), OpenSHMEM libraries are no longer supported on HPCM-based systems. This support limitation results in a potential issue during the installation procedure on HPCM-based systems.
Symptom
While building the CPE image on a system with HPCM, the image fails to build and an error similar to the following appears:
Problem: conflicting requests - nothing provides libsma.so.0()(64bit) needed by cce-21.0.0-pgas-ofi
Cause
The cce-21.0.0-pgas-ofi package requires libsma.so.0 which was previously provided by cray-dsmml, but cray-dsmml is no longer a part of CPE 26.03 (or later).
Resolution
To remediate the issue, do one of the following:
To resolve the package dependency for cce-21.0.0-pgas-ofi from the new OpenSHMEM package, acquire the new OpenSHMEM release media, install it, and enable the repository. Also ensure that the new OpenSHMEM library has been installed into the base compute image.
To resolve the package dependency for cce-21.0.0-pgas-ofi from an older OpenSHMEM package, acquire the CPE 25.09 release media, install it, and enable the repository.
Do not install cce-21.0.0-pgas-ofi.
Resolving an issue where MPICH generates an OFI failure
An attempt to register a memory buffer for off-node MPI communication results in an MPICH error message.
Symptom
If this issue occurs, the following error message appears:
MPICH ERROR
...
OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Bad address)
Cause
This error can occur for any of the following reasons:
An invalid address was issued. The validity of the memory region passed to MPI communication should be verified.
An attempt to use a GPU buffer in an MPI call was made without setting MPICH_GPU_SUPPORT_ENABLED to 1.
An application error occurred, particularly, when using GPU-aware MPI.
Too short of a buffer was passed to an MPI call.
Resolution
To resolve this error, perform one of the following:
Ensure the code is valid. The MPICH backtrace may often include the memory address in the call signature, and obvious issues may appear.
Verify the validity of the memory region passed to MPI communication,
Set MPICH_GPU_SUPPORT_ENABLED to 1 (MPICH_GPU_SUPPORT_ENABLED=1) if you are using GPU memory, or
Establish the validity of the location and length of the buffer passed to MPI, if you are not using GPU memory.
Resolving an issue where an incorrect MPICH version is linked in an application
The incorrect HPE Cray MPICH version is dynamically linked by the application.
Symptom
The output of module list shows one version of cray-mpich, but a different version is used when the program is executed.
Cause
The CPE module environment reflects the programming libraries that are used at build time, not necessarily the libraries that are resolved at run time.
Resolution
To make the runtime environment reflect the modules that are currently loaded, do one of the following:
Set LD_LIBRARY_PATH to:
LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
“Hard-code” the CPE library version into the executable. Note that the compiler driver (cc/ftn/CC) -add-rpath and -add-runpath options can be used.
Change the HPE Cray MPICH default version. An administrator can execute the /opt/cray/pe/admin-pe/set_default_files/set_default_mpicj_<VERSION> script in the appropriate CPE image.
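The LD_LIBRARY_PATH option above can be wrapped in a small helper for job scripts. The following is a minimal sketch, not an HPE-provided utility. The function name prepend_cray_libs is hypothetical, and the helper avoids leaving a stray leading or trailing colon when either variable is empty, because an empty path entry is interpreted as the current directory.

```shell
# Prepend CRAY_LD_LIBRARY_PATH to LD_LIBRARY_PATH.
# Does nothing if CRAY_LD_LIBRARY_PATH is unset or empty, and adds the
# separating colon only when LD_LIBRARY_PATH already has a value.
prepend_cray_libs() {
    [ -n "${CRAY_LD_LIBRARY_PATH:-}" ] || return 0
    LD_LIBRARY_PATH="${CRAY_LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}
```

Calling prepend_cray_libs after the module load commands and before the launch command is one way to apply it. A tool such as ldd can then be used to check which cray-mpich library the executable resolves.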
Resolving a bad address or bus error during MPI operations
An application encounters errors during an on-node MPI operation.
Symptom
An application encounters one of the following errors during an MPI call:
process_vm_readv: Bad address
Assertion failed in file ../src/mpid/ch4/shm/cray_common/cray_common_memops.c at line 461: 0
or
Bus error
Cause
This issue can occur if:
A bad memory address is encountered during an on-node MPI operation.
A GPU buffer is used in an MPI call without setting MPICH_GPU_SUPPORT_ENABLED to 1.
A buffer that is too short is passed to an MPI call.
MPICH_SMP_SINGLE_COPY_MODE=CMA is used, which is the default on RHEL.
MPICH_SMP_SINGLE_COPY_MODE=XPMEM is used, which is the default in USS.
Because bus errors can occur for other reasons, a debugger or core file may be necessary to confirm that the error occurs in an MPI call.
Resolution
If you are using GPU memory, set MPICH_GPU_SUPPORT_ENABLED to 1:
MPICH_GPU_SUPPORT_ENABLED=1
Handling an MPICH MPIDI_OFI_handle_cq_error error
Although the message appears to be an MPICH error, it is not. Rather, it most likely indicates a fatal system error.
Symptom
The following error message appears:
MPIDI_OFI_handle_cq_error(1059): OFI poll failed
(ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - CANCELED)
Cause
This issue occurs if any of the following happens:
A node failure occurs.
A link failure occurs.
An invalid routing parameter (empty_route) is issued.
The Retry Handler (RH), unable to resend a message, cancels a job. The CXI provider RH process running on each node monitors traffic in and out of the Network Interface Cards (NICs). The RH process resends messages that were dropped or discarded for various reasons. If the RH cannot resend a message, it eventually cancels the job and issues the error message.
Resolution
Contact HPE support for additional assistance.
Resolving MPICH MPIDI OFI error
Symptom
The following error message appears:
MPICH ERROR
....
MPIDI_OFI_handle_cq_error(1062).....: OFI poll failed (ofi_events.c:1064:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)
Cause
These are typically secondary errors that occur if one or more ranks die (for example, because of a segfault or resource exhaustion). If the dying rank is communicating with other ranks at that time, errors occur.
Resolution
Debugging this issue requires you to locate and investigate the initial error signature and ignore secondary error signatures. Contact HPE support for additional assistance.
MPICH MPIDI OFI with a PKTBUF_ERROR error occurs
An MPICH MPIDI OFI/PKTBUF_ERROR error occurs as a result of a system configuration issue.
Symptom
The following error message appears:
MPICH ERROR
....
MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - PKTBUF_ERROR)
Cause
PKTBUF_ERROR usually signifies a system configuration issue, such as when a Rosetta switch is not programmed correctly or a mismatch between the Cassini and Rosetta device settings exists after an upgrade. If a node hits a PKTBUF_ERROR, it is generally not safe to run a job on that node again without rebooting it, as this error may leave the NIC in an unstable state.
Resolution
Contact HPE support for additional assistance.
Handling an issue where the Cassini Event Queue Overflows into CXI provider
In HPC environments utilizing the HPE Cray MPI and CXI provider, users might encounter a critical error related to the Cassini Event Queue overflow. This error is tied to the configuration of the CXI event queue, which plays a vital role in handling hardware-level communication events.
Symptom
The following error message appears:
libfabric:88194:cxi:core:cxip_cq_eq_progress():544<warn> Cassini Event Queue overflow detected.
Cause
This fatal error indicates that the job exceeded the capacity of the CXI provider’s event queue during execution. The Cassini Event Queue is directly connected to hardware and has a fixed maximum size specified at job launch. By default, the HPE Cray MPI sets this maximum size to 32,768 events. If the job generates more events than the queue can handle, an overflow occurs, leading to this error. Resizing the event queue dynamically is not feasible due to its hardware-level integration, making proper configuration essential during job initialization.
Resolution
To resolve the Cassini Event Queue overflow error, users can increase the maximum queue size at job launch by setting the FI_CXI_DEFAULT_CQ_SIZE environment variable to a higher value. For example, doubling the queue size to 65,536 can help accommodate larger workloads that exceed the default capacity. Use the following command to set the environment variable:
export FI_CXI_DEFAULT_CQ_SIZE=65536
Ensure this adjustment is made before launching the job to avoid encountering the overflow error.
Resolving an issue where the CXI provider flow control is triggered due to an LE depletion
Users might encounter a fatal error stemming from the depletion of a hardware resource known as List Entries (LE). This issue forces the NIC into Software Endpoint (SE) mode, changing how tag-matching and rendezvous processing are handled.
Symptom
The error is indicated by the following warning:
libfabric:44928:1640991101:cxi:core:cxip_recv_pending_ptlte_disable():1135<warn> RXC (0x8b0:30:0): Flow control triggered due to failure to append LE. Software endpoint mode required.
This error signifies that the job has exhausted the available LEs, triggering flow control and forcing the CXI provider to transition the NIC into SE mode.
Cause
Each CXI endpoint is allocated approximately 16,000 LEs, a hardware resource used to manage communication events. The depletion of LEs can occur due to:
A flood of unexpected messages.
A large number of pre-posted receives.
When the pool of LEs is depleted, the CXI provider automatically transitions the NIC into SE mode. While rendezvous processing remains in hardware, tag-matching is moved to software, which can impact performance.
Resolution
The CXI provider offers environment variables that allow users to manage how the system transitions into SE mode and optimize resource usage. The process has recently been updated. To address the issue, do the following:
Transition Mode Configuration:
Use the updated FI_CXI_RX_MATCH_MODE environment variable to specify how tag-matching should be handled. Options include:
Hardware: Tag-matching is done entirely in hardware.
Software: Tag-matching is done entirely in software.
Hybrid: A combination of hardware and software is used.
To set the mode, for example:
export FI_CXI_RX_MATCH_MODE=[hardware | software | hybrid]
Optimize Buffer Resources:
Configure supporting environment variables to ensure efficient allocation of hardware resources:
FI_CXI_REQ_BUF_SIZE: Defines the size of the request buffer.
FI_CXI_REQ_BUF_MIN_POSTED: Specifies the minimum number of pre-posted receives.
FI_CXI_REQ_BUF_MAX_COUNT: Limits the total number of buffers that can be allocated.
Note: The older FI_CXI_MSG_OFFLOAD=0 environment variable used to switch to SE mode has been deprecated and should no longer be used.
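Because a mistyped mode name is easy to miss in a job script, the mode can be validated before it is exported. The following is a minimal sketch that accepts only the three values listed above (hardware, software, hybrid). The function name set_rx_match_mode is hypothetical and not part of the CXI provider.

```shell
# Export FI_CXI_RX_MATCH_MODE only if the requested value is one of the
# three modes named in this section; otherwise report an error.
set_rx_match_mode() {
    case "$1" in
        hardware|software|hybrid)
            export FI_CXI_RX_MATCH_MODE="$1"
            ;;
        *)
            echo "unsupported FI_CXI_RX_MATCH_MODE value: $1" >&2
            return 1
            ;;
    esac
}
```

For example, set_rx_match_mode hybrid would be called in the job script before the launch command.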
Resolving MPICH error with MPICH_SINGLE_HOST_ENABLED=0 on HPE Slingshot-11 networks
When running MPI applications configured with MPICH_SINGLE_HOST_ENABLED=0 on systems using the HPE Slingshot-11 network, users may encounter an error during MPI_Init. This issue arises due to the network security token requirement. These tokens are managed by the workload manager (WLM). Single-node jobs often lack these tokens by default, as they typically do not require access to the NIC.
Symptom
When MPICH_SINGLE_HOST_ENABLED=0 is set, MPI_Init fails with the following error message:
OFI fi_open domain failed (ofi_init.c:616:MPIDI_OFI_mpi_init_hook:Function not implemented)
Cause
The HPE Slingshot-11 network enforces secure access to network resources through security tokens, which are distributed by WLM. These tokens are generally not allocated by default for single-node jobs, as such jobs do not require NIC access. Note that:
If the error occurs only on single-node jobs, the likely cause is the lack of Virtual Network Interface (VNI) allocation by the workload manager.
If the error occurs on jobs spanning two or more nodes, it may indicate a system configuration problem that requires administrative intervention.
Applications typically set MPICH_SINGLE_HOST_ENABLED=0 for specific reasons, such as enabling communication between MPI processes (for example, using MPI_Comm_accept) or for debugging purposes. Understanding application intent is essential for determining the correct resolution.
Resolution
To resolve the issue, take the following steps based on the workload manager and job configuration:
Confirm the error scope:
Ensure the error does not appear on multi-node jobs. If it only occurs on single-node jobs, proceed with the steps below.
If the error persists in multi-node jobs, contact the system administrator to investigate possible configuration problems.
Request VNI allocation:
For jobs requiring communication between MPI processes or across job steps, request VNI allocation from the workload manager. This ensures secure access to the NIC and resolves the issue during MPI_Init.
For Slurm:
a. For single-node jobs, add the --network single_node_vni option to the salloc, srun, or sbatch command.
b. For communicating between job steps, add the --network single_node_vni,job_vni option.
c. Ensure the system administrator has configured the Slurm Slingshot plugin correctly to support these options.
For example:
salloc --network single_node_vni
srun --network single_node_vni,job_vni
sbatch --network single_node_vni
For PBS/PALS:
Use the --single-node-vni option with aprun or mpiexec commands. For example:
aprun --single-node-vni
mpiexec --single-node-vni
For Flux:
The Flux workload manager does not currently support VNI allocation or enforcement. In this case, MPICH_SINGLE_HOST_ENABLED=0 should work without additional WLM options.
Verify application intent:
If the application sets MPICH_SINGLE_HOST_ENABLED=0 intentionally (for example, for MPI_Comm_accept), confirm its requirements. Users requesting communication between job steps must ensure consistent VNI allocation across all job steps in the allocation.
Addressing Slingshot network timeouts on HPE Cray MPI systems
Slingshot systems, integral to HPC environments, are designed to facilitate efficient communication for applications running across distributed nodes. However, certain applications may encounter network timeouts, which can impact communication performance.
Symptom
Applications running on Slingshot systems may experience network timeouts during execution. If such events occur, HPE Cray MPI tracks these timeouts and summarizes Cassini hardware counters for each job. If timeouts are detected, the following message appears during the MPI_Finalize phase:
[ MPICH Slingshot Network Summary: N network timeouts ]
These events could lead to lower-than-expected MPI communication performance, depending on application communication patterns.
Cause
Network timeouts are typically caused by “flapping links” within the Slingshot network. Flapping links are intermittent disruptions in network connections, which can lead to dropped packets and delays in communication. Applications that rely heavily on specific communication patterns may be more vulnerable to the performance impacts caused by these network issues.
Resolution
The HPE Slingshot-11 network is equipped to manage timeout events by automatically re-issuing affected network packets. While this mechanism helps mitigate immediate disruptions, applications may still experience reduced communication performance. To provide additional insight into network behavior and performance, users can collect Cassini hardware counters using the MPICH_OFI_CXI_COUNTER_REPORT variable. This feature is documented in the HPE Cray MPI man pages and allows administrators and users to monitor critical hardware metrics related to network activity.
Contact HPE support for additional assistance.
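When many job output files need to be checked, the summary line shown in the Symptom section can be extracted automatically. The following is a minimal sketch that prints the timeout count from each matching line. It assumes the message format is exactly as shown above, and the function name network_timeouts is hypothetical.

```shell
# Print the timeout count from each "MPICH Slingshot Network Summary" line
# read from stdin. Lines that do not match the summary format are ignored.
network_timeouts() {
    sed -n 's/.*MPICH Slingshot Network Summary: \([0-9][0-9]*\) network timeout.*/\1/p'
}
```

Feeding a job output file to the function on standard input prints one number per summary line, and prints nothing if the job reported no timeouts.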
Addressing issues with fork() on HPE Slingshot-11 systems
The fork() system call is commonly used by applications to create child processes. However, on HPE Slingshot-11 systems, applications that rely on fork() may encounter issues under specific circumstances. These challenges arise whenever a child process attempts to access memory regions owned by its parent process after a fork() operation.
Symptom
Applications running on HPE Slingshot-11 systems may experience unexpected behavior or errors if using the fork() system call. These issues occur if the child process attempts to access memory regions that are allocated and owned by the parent process.
Cause
The root cause of these issues lies in how memory regions are handled during the fork() operation. On Slingshot-11 systems, the child process may encounter conflicts or access violations whenever it is interacting with memory regions managed by the parent process.
Resolution
To address this issue, configure specific runtime variables that ensure compatibility with fork() on Slingshot-11 systems. The following variables should be set in the runtime environment:
export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=0
export FI_CXI_DISABLE_CQ_HUGETLB=1
These settings help mitigate memory access problems between the parent and child processes following a fork() operation.
For systems running SLES15 SP4 or newer Linux kernels, some of the fork()-related issues have been addressed directly in the Linux kernel. As a result, users with updated Linux environments no longer need to set the CXI_FORK_SAFE runtime variables for applications that rely on fork(). This improvement simplifies application compatibility and eliminates the need for manual configuration in many cases.
Resolving GPU application hangs on HPE Slingshot-11 systems
HPE Slingshot-11 systems are designed to support high-performance computing applications, including GPU-enabled workloads. However, some GPU-enabled applications may encounter hangs or errors during execution. These issues are often accompanied by specific error messages in the system logs (dmesg) and are typically caused by configuration or runtime environment issues.
Symptom
GPU-enabled applications running on HPE Slingshot-11 systems may experience execution hangs, and the following error message is observed in the system logs (dmesg):
cxi core:cass vma write flag:22 VMA does not have write permissions
This error indicates a problem with memory permissions or GPU-related configuration that prevents the application from functioning correctly.
Cause
The error is generally attributed to one or more user configuration mistakes, including:
GPU-Aware logic not enabled:
The HPE Cray MPI GPU-aware logic was not enabled because the required runtime variable was missing in the job submission script. The variable that needs to be specified is:
MPICH_GPU_SUPPORT_ENABLED=1
Managed memory support disabled:
The application uses GPU Managed Memory regions, but HPE Cray MPI Managed Memory support was not properly enabled. By default, HPE Cray MPI supports Managed Memory regions, so this issue might arise if the default settings were altered.
Incorrect linking of GPU runtime library:
The application executable was not correctly linked against the GPU runtime library. On systems with NVIDIA GPUs, this issue often occurs if the following command was excluded from the environment or job submission script:
module load cudatoolkit
Resolution
To resolve GPU application hangs and related errors:
Ensure GPU-aware logic is enabled by setting the required runtime variable in the job submission script:
export MPICH_GPU_SUPPORT_ENABLED=1
Verify that HPE Cray MPI Managed Memory support is enabled. Since Managed Memory regions are supported by default, users should check for any modifications that may have disabled this feature.
Verify the GPU runtime library. For systems with NVIDIA GPUs, ensure the environment includes the following command:
module load cudatoolkit
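A job script can check the first item before launching the application. The following is a minimal sketch of such a check, not an HPE-provided tool. The function name check_gpu_mpi_env is hypothetical, and it only inspects the MPICH_GPU_SUPPORT_ENABLED variable; it does not verify Managed Memory settings or library linking.

```shell
# Succeed if GPU-aware MPI is enabled in the current environment.
# Otherwise print a warning to stderr and return a non-zero status.
check_gpu_mpi_env() {
    if [ "${MPICH_GPU_SUPPORT_ENABLED:-0}" = "1" ]; then
        echo "MPICH_GPU_SUPPORT_ENABLED=1: GPU-aware MPI is enabled"
    else
        echo "MPICH_GPU_SUPPORT_ENABLED is not set to 1" >&2
        return 1
    fi
}
```

For example, placing check_gpu_mpi_env || exit 1 before the launch command stops the job early instead of letting it hang.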
Resolving pat_build errors: Missing required ELF section .note.link
The pat_build utility is used to instrument programs for performance analysis with HPE Cray Performance Tools (perftools). However, users may encounter an error indicating that a required ELF section, .note.link, is missing from the program being instrumented. This issue is typically related to build or linking problems during the application compilation process.
Symptom
When running the pat_build command to instrument an executable, the following error might appear:
% pat_build ./a.out
ERROR: Missing required ELF section '.note.link' from the program './a.out'. The program was built without the perftools module loaded or the program may already be instrumented.
This error indicates that the executable was not properly built or linked with the required perftools module or compiler drivers, preventing pat_build from detecting the necessary metadata.
Cause
The error is most often caused by the following issues:
Perftools module not loaded:
The program was built without loading the perftools module, which is required for instrumentation.
Incorrect linking process:
The program was linked using generic compiler commands (for example, gcc or hipcc) instead of the compiler drivers provided by the environment, such as CC, cc, and ftn. These drivers ensure proper integration with perftools.
Program already instrumented:
The executable might already contain instrumentation, resulting in a conflict if attempting to re-instrument it using pat_build.
Resolution
To resolve the error:
Load the perftools module. Ensure the perftools module is loaded in the environment before building the program. For example:
module load perftools
Use compiler drivers for linking. Modify the linking step in the build process to use environment-specific compiler drivers (for example, CC, cc, or ftn) instead of generic compiler commands like gcc or hipcc. For example, for a C source file:
cc -o a.out source_file.c
These drivers ensure the correct metadata, including the .note.link ELF section, is included in the executable.
Verify instrumentation. If the program is already instrumented, remove the existing instrumentation and rebuild the executable from scratch with the proper perftools module loaded and compiler drivers used.
If issues persist, contact HPE support for additional assistance.
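Before running pat_build, the presence of the .note.link section can be checked with a standard ELF tool. The following is a minimal sketch that reads a section listing (for example, the output of readelf -S) from standard input. The function name has_note_link is hypothetical, and the check cannot tell whether a program is already instrumented.

```shell
# Succeed if the section listing on stdin contains a section named
# exactly ".note.link" (other .note.* sections do not match).
has_note_link() {
    grep -Eq '(^|[[:space:]])\.note\.link([[:space:]]|$)'
}
```

For example, readelf -S ./a.out | has_note_link returns success only when the executable carries the section that pat_build requires.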
Resolving gdb4hpc CTI launch errors: Issues with mpiexec and PBS
The gdb4hpc tool is used for debugging HPC applications in distributed environments. However, during application launch, users may encounter errors related to the mpiexec binary and its compatibility with the CTI (Cray Tools Interface) framework.
Symptom
If attempting to launch an application with gdb4hpc, the following error message appears:
gdb4hpc: launch ...
Starting application, please wait...
Failed to launch CTI app.
CTI error: cti_launchAppBarrier: mpiexec was found at /opt/pbs/bin/mpiexec, but it is not a binary file. Tool launch requires direct access to the mpiexec binary. Ensure that the mpiexec binary is not wrapped by a script (tried HPCM / PALS).
This error prevents the application from launching successfully and indicates that the mpiexec binary is improperly wrapped or incompatible with CTI tool requirements.
Cause
The error is caused by the following factors:
Incorrect mpiexec configuration:
The mpiexec executable found in the PBS environment is not a binary file but a wrapper script. CTI tools require direct access to the mpiexec binary to properly launch applications.
PBS compatibility limitation:
PBS does not support the MPIR protocol required for CTI application launches. Instead, HPE Cray Supercomputing EX environments require PALS (Process Management and Launch Services) as the launcher for tools like gdb4hpc.
Resolution
To resolve the error and enable successful application launches with gdb4hpc, users should take the following steps:
Load the cray-pals module. Replace PBS’s default launcher with PALS by loading the cray-pals module. This can be done with the following command:
module load cray-pals
Verify mpiexec location. Ensure that the mpiexec binary provided by PALS is being used, rather than the wrapper script provided by PBS. This step guarantees compatibility with the CTI framework.
Re-launch the application. After loading the cray-pals module, reattempt to launch the application using gdb4hpc. The tool should now have direct access to the compatible mpiexec binary, resolving the error.
If the issue persists, contact HPE support for additional assistance.
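The mpiexec check in the steps above can be automated. The following is a minimal sketch that treats any file beginning with the two characters #! as a wrapper script. The function name is_wrapper_script is hypothetical, and the check is a heuristic: it does not confirm that a non-script file is the PALS launcher.

```shell
# Succeed if the given file starts with "#!", meaning it is an
# interpreter script rather than a compiled binary.
is_wrapper_script() {
    [ "$(head -c 2 "$1" 2>/dev/null)" = '#!' ]
}
```

For example, is_wrapper_script "$(command -v mpiexec)" succeeds when the first mpiexec in PATH is a script, which suggests that the cray-pals module is not loaded or is not first in PATH.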
Resolving gdb4hpc launch timeout issues
The gdb4hpc debugger is a powerful tool for debugging HPC applications. However, users could encounter launch timeout issues where the debugger fails to connect to all ranks of a distributed job. This issue can stem from several configuration or system-level problems.
Symptom
If attempting to launch a job with gdb4hpc, the debugger times out while attempting to connect to application ranks, and the following messages appear:
Creating network... (timeout in 300 seconds)
.............................
0/100 ranks connected (timeout in 270 seconds)
The debugger fails to connect to the specified ranks, preventing the application from being debugged successfully.
Cause
Several steps must work correctly for gdb4hpc to launch a job inside the debugger. Timeout issues can arise because:
MPIR hooks were not enabled:
The application must be launched with special MPIR hooks to stop the program on entry. Failure to enable these hooks can prevent the debugger from connecting.
Debugger processes were not started:
A debugger process (dbgsrv) must be started for each application rank on each node. If these processes fail to initialize, the debugger cannot establish communication.
Communication network issues exist:
A communication network must be built between the backend debugging processes (dbgsrv) and the gdb4hpc front-end. Network configuration issues can block this connection.
Resolution
To resolve gdb4hpc launch timeout issues, users can follow these steps:
Verify environment configuration:
a. For PBS Systems, ensure the cray-pals module is loaded. The PALS launcher is required for proper integration with gdb4hpc.
module load cray-pals
b. Attempt to launch the application outside of gdb4hpc, such as with the srun command, to confirm the issue is specific to the debugger.
Enable debug logging:
a. Launch the application with the --debug option to troubleshoot the issue further.
b. For newer versions of gdb4hpc, set a logging directory directly in the launch command.
c. For older versions, use environment variables to enable additional logging:
export CTI_DEBUG=1
export CTI_LOG_DIR=<path_to_cross_mounted_directory>
Run network and launch diagnostics. Check whether the network configuration is preventing the debugger from connecting. If network issues are suspected, consult system administration or HPE support to ensure proper connectivity between nodes.
Work around the issue by attaching to a running job. If launch issues persist, bypass the launch process, and attach to a running job using the attach command in gdb4hpc.
a. Launch the application as usual and retrieve the job ID.
b. Use the attach command to connect to the running job. Refer to the help attach documentation within gdb4hpc for detailed instructions.
Address timing issues. If the problem being debugged occurs faster than you can attach to the job, add a sleep command at the beginning of your application to delay execution and allow time for attachment.
Resolving running process issue in gdb4hpc with multi-threaded code
If debugging multi-threaded applications using gdb4hpc, developers might encounter a situation where the debugger reports process is running even after encountering a stop. This behavior can be confusing and slows down the debugging process. Understanding the cause and applying a resolution can help streamline debugging in such scenarios.
Symptom
When encountering a breakpoint or stop while debugging with gdb4hpc, the debugger outputs, process is running, even though execution should have paused. This issue is specific to multi-threaded code.
Cause
The root of the problem lies in the default gdb4hpc configuration. This configuration uses the gdb4hpc non-stop mode. In non-stop mode, the debugger does not automatically switch focus to the thread that has stopped. As a result, the stopped thread is not selected, and debugging commands behave as though the process is still running.
Resolution
To resolve the issue:
Use the command info threads to list all threads and identify the one that has stopped.
Select the stopped thread manually using the command t <thread-no>, where <thread-no> corresponds to the thread number shown in the output of info threads. Manually selecting the appropriate thread ensures that the debugger focuses on the stopped thread, allowing you to proceed with debugging effectively.
Resolving a breakpoint error in gdb4hpc debugging
While using gdb4hpc to debug applications, an error stating cannot get to initial breakpoint occurs. This issue prevents debugging from starting as expected and might be difficult to address. Understanding the underlying cause and applying the appropriate resolution can help ensure smooth debugging.
Symptom
The debugger fails to reach the initial breakpoint during program launch, resulting in the cannot get to initial breakpoint error. This prevents the user from effectively interacting with the program during debugging.
Cause
This issue typically arises due to either:
The debugging information is incomplete or incorrect, making it difficult for the debugger to locate valid breakpoints.
The program being debugged is not an MPI program, which may lead to incompatibilities with gdb4hpc debugging mechanisms.
Resolution
To resolve the issue:
Before launching the debugger, execute:
maint set earlyentry on
The maint set earlyentry on command instructs gdb4hpc to enable an early entry mode, allowing the debugger to function even when the initial breakpoint cannot be reached.
Launch the program as usual.
After the launch completes, manually set a breakpoint within your program at a desired location.
Continue program execution, and proceed with debugging from the manually set breakpoint.
Enabling early entry mode and manually setting a breakpoint post-launch bypasses the issue.
Resolving unrecognized job ID error during a gdb4hpc attach with Slurm
While using gdb4hpc to debug applications in environments managed by Slurm, an unrecognized job id error might occur while attempting to attach to a running job. This issue can prevent the debugger from properly connecting to the desired process.
Symptom
During the process of attaching gdb4hpc to a job running on Slurm, the debugger outputs an unrecognized job id error. This issue prevents the successful attachment to the target job for debugging.
Cause
The error is typically related to how Slurm formats job IDs. In Slurm, job IDs include both a job identifier and a step identifier, formatted as <jobid>.<stepid>. If the step ID (often .0 for the first step) is not included in the attach command, gdb4hpc cannot recognize the job ID, resulting in the error.
Resolution
To resolve this issue:
Find the correct job ID with its step ID by using the Slurm command:
squeue -s
This command lists the jobs and their associated step IDs.
Update the attach command in gdb4hpc to include both the job ID and step ID. The format should be:
attach $a{n} <jobid>.<stepid>
Note: Replace <jobid> and <stepid> with the actual values from the squeue -s output. For additional information about the attach command, type help attach within gdb4hpc.
Including the step ID when specifying the job ID resolves the unrecognized job id error and successfully attaches gdb4hpc to the target Slurm job for debugging.
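A script that builds the attach argument can verify that a step ID is present before passing it to gdb4hpc. The following is a minimal sketch that assumes numeric job and step IDs in the <jobid>.<stepid> form described above. The function name has_step_id is hypothetical.

```shell
# Succeed only if the argument has the form <jobid>.<stepid>,
# where both parts consist of digits (for example, 12345.0).
has_step_id() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+$'
}
```

For example, has_step_id 12345 fails, signaling that the step ID reported by squeue -s still needs to be appended.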
Resolving an undetected WLM error in debugger tools
When using debugging tools such as gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc, an error might occur indicating that the WLM was not detected. This issue arises when the debugger tools cannot identify the WLM in use, which is essential for proper interaction with system utilities. Providing the required configuration resolves the error and enables the tools to function correctly.
Symptom
The debugger tools output an error similar to:
Launcher name was not found in PATH (tried system / WLM)
This error indicates the failure to detect an active WLM on the system, preventing the debugger from proceeding.
Cause
Debugger tools rely on a common library to automatically detect the WLM (for example, Slurm) running on the system. This detection process uses system paths and environment settings to identify the WLM. If the system is not configured with a recognized WLM or the detection process fails, the debugger cannot determine which utilities to use, resulting in the error.
Resolution
To resolve this issue, manually specify the WLM by setting the CTI_WLM_IMPL environment variable:
Identify the WLM in use on the system (for example, Slurm, PALS, Flux, or ALPS).
Set the CTI_WLM_IMPL environment variable to the corresponding WLM type. For example:
export CTI_WLM_IMPL=slurm
Note: Replace slurm with the appropriate value for your WLM (Slurm, PALS, Flux, or ALPS).
Retry launching the debugger tool. It should correctly identify the workload manager and proceed without errors.
Manually specifying the WLM type using the CTI_WLM_IMPL environment variable bypasses the detection issue and ensures that debugger tools like gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc operate as intended.
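To guard against typos when setting the variable, the step above could be wrapped in a small function. This is a sketch only: set_cti_wlm is a hypothetical name, and the accepted values are the four WLM types listed in this section, written in lowercase as in the export example.

```shell
# set_cti_wlm is a hypothetical helper; only CTI_WLM_IMPL comes from this guide.
# It lowercases the argument and exports it if it is one of the listed WLM types.
set_cti_wlm() {
    wlm=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$wlm" in
        slurm|pals|flux|alps)
            export CTI_WLM_IMPL="$wlm"
            ;;
        *)
            echo "unsupported WLM type: $1" >&2
            return 1
            ;;
    esac
}

# Example for a Slurm system.
set_cti_wlm Slurm && echo "CTI_WLM_IMPL=$CTI_WLM_IMPL"
```

Define the function in the shell that launches the debugger tool (for example, by sourcing a file that contains it) so that the exported variable is visible to the tool.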
Resolving an unfound WLM PATH error in debugger tools
Debugging tools such as gdb4hpc, ccdb, atp, sanitizers4hpc, and valgrind4hpc must be able to interact with the system WLM. The error Workload manager not found in PATH indicates that the debugger tools cannot locate the appropriate WLM launcher in the system environment. Proper configuration of the system environment variables ensures seamless operation of these tools.
Symptom
The debugger tools display an error message similar to:
Launcher name was not found in PATH (tried system / WLM)
This error indicates that the debugger was prevented from starting or controlling jobs on the system.
Cause
Debugger tools rely on WLM utilities (for example, srun, aprun, mpiexec) to start and manage jobs. This error can occur for two primary reasons:
The WLM detected by the debugging tools is incorrect.
The WLM launcher is not available in the PATH environment variable, making it inaccessible to the debugger tools.
Resolution
To resolve this issue, take the following steps:
Verify the correct WLM is used. Determine the WLM used by your system (for example, Slurm, PALS, Flux, or ALPS). If the detected WLM is incorrect, manually specify the correct WLM type by setting the CTI_WLM_IMPL environment variable. For example:
export CTI_WLM_IMPL=slurm
Note: Replace slurm with the appropriate value for your WLM (Slurm, PALS, Flux, or ALPS).
Check the PATH environment variable. Ensure the WLM launcher (for example, srun for Slurm, aprun for ALPS, or mpiexec for Flux) is included in the system PATH environment variable. If it is missing, update the PATH to include the directory containing the launcher. For example:
export PATH=/path/to/launcher:$PATH
Note: Replace /path/to/launcher with the actual directory path where the launcher resides.
After updating WLM settings or PATH, retry launching the debugger tools. They should now correctly detect and use WLM utilities.
Ensuring the correct workload manager is specified and its launcher is accessible in the PATH resolves the Workload manager not found in PATH error, enabling debugging tools like gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc to function properly.
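The PATH check described above can be scripted. The following sketch uses the standard command -v shell builtin. launcher_dir is a hypothetical helper name, and srun is only an example launcher.

```shell
# launcher_dir is a hypothetical helper built on the "command -v" builtin.
# It prints the directory that holds the launcher, or fails if the launcher
# cannot be found through PATH.
launcher_dir() {
    launcher_path=$(command -v "$1") || return 1
    dirname "$launcher_path"
}

# Example for a Slurm system; substitute the launcher for your WLM.
if dir=$(launcher_dir srun); then
    echo "srun found in $dir"
else
    echo "srun is not on PATH; add its directory with: export PATH=/path/to/launcher:\$PATH" >&2
fi
```

Note that command -v also reports shell functions and aliases, so the result is meaningful only when the launcher name resolves to a file on disk.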
Resolving a launcher/binary file error in debugger tools
When using tools such as gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc, an error might occur indicating that the launcher found is not a binary file. This issue prevents debugger tools from properly interacting with the WLM utilities required to start and control jobs. Understanding the cause of this error and applying the correct solution ensures the debugger tools function seamlessly in your environment.
Symptom
The debugger tools display an error message similar to:
Launcher name was found at path, but it is not a binary file.
This error indicates that the debugger tools cannot directly access the launcher binary required for job management.
Cause
Debugger tools rely on direct access to the WLM launcher binary (for example, srun for Slurm, aprun for ALPS, or mpiexec for Flux). If the file found at the path of the launcher is a wrapper script instead of the binary, the debugger tools might fail to operate correctly. While tools natively support certain wrapper systems (XALT, Slurm), support for other custom wrapper scripts is limited, leading to this error.
Resolution
To resolve this issue, follow these steps:
Verify that the correct WLM is detected. If the detected WLM is incorrect, manually specify the correct WLM type by setting the CTI_WLM_IMPL environment variable. For example:
export CTI_WLM_IMPL=slurm
Note: Replace slurm with the appropriate WLM type (pals, flux, or alps).
Check the launcher binary path. Use the which command to locate the launcher binary. For example, on a Slurm system:
which srun
Verify that the file at the returned path is the actual launcher binary and not a wrapper script.
Handle wrapper scripts. If the launcher is a wrapper script, check whether it has a loadable module that can be unloaded. For example:
module unload <wrapper_module>
If no module exists or unloading is not possible, update the PATH environment variable to prioritize the directory containing the direct launcher binary. For example:
export PATH=/path/to/launcher_binary:$PATH
Note: Replace /path/to/launcher_binary with the actual directory containing the binary.
After ensuring the debugger has direct access to the launcher binary, retry launching the debugger tools.
By ensuring that the launcher binary is accessible and correctly prioritized in the system’s environment, you can resolve the Launcher is Not a Binary File error and enable tools like gdb4hpc, ccdb, atp, sanitizers4hpc, or valgrind4hpc to function properly with WLMs.
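One way to check whether the file returned by which is a wrapper script is to look for a leading #! line. This is a sketch under that assumption: is_wrapper_script is a hypothetical helper, and the heuristic does not necessarily match how the debugger tools themselves inspect the launcher.

```shell
# is_wrapper_script is a hypothetical helper. It treats any regular file whose
# first two bytes are "#!" as a script. This heuristic is an assumption.
is_wrapper_script() {
    [ -f "$1" ] || return 2                 # not a regular file
    [ "$(head -c 2 "$1")" = "#!" ]
}

# Example for a Slurm system; the message is printed only when srun is a script.
launcher=$(command -v srun || true)
if [ -n "$launcher" ] && is_wrapper_script "$launcher"; then
    echo "$launcher is a script; unload its module or put the real binary first in PATH"
fi
```

Wrappers that are themselves compiled programs are not caught by this check and need to be identified by other means.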
Debugging tools issue: Launcher lacks debug symbols
Debugging tools such as gdb4hpc, ccdb, atp, sanitizers4hpc, and valgrind4hpc require launcher debug symbols to coordinate tool launches. These tools rely on system WLM utilities, such as Slurm, PALS, Flux, or ALPS, to start and manage jobs on the system. However, issues can arise if the launcher does not contain the necessary debug symbols.
Symptom
The following error message appears:
launcher name was found at path, but it does not contain debug symbols
This indicates that debugging tools cannot proceed, as they depend on the presence of debug symbols in the launcher binary for proper operation.
Cause
The problem may occur because:
The detected WLM is incorrect,
The file at the launcher path is a script rather than the direct launcher binary, or
The launcher binary has been stripped of its debug symbols. Some installations of Slurm, for instance, strip debug symbols from the launcher, rendering it incompatible with debugging tools.
Resolution
To resolve this issue, follow these steps:
Ensure that the correct WLM is detected. If the detected WLM is incorrect, manually set the appropriate WLM by using:
export CTI_WLM_IMPL=<wlm>
Note: Replace <wlm> with one of the supported options: slurm, pals, flux, or alps.
Check the launcher file, and confirm that the file at the specified path is the actual launcher binary and not a script.
Ensure that the launcher binary has not been stripped of debug symbols. If it has been stripped, reinstall or obtain an unstripped version of the launcher.
Use an alternative debug tool launcher. If your system supports passwordless access to compute nodes, bypass the default WLM by setting:
export CTI_WLM_IMPL=ssh
This configuration enables the use of a generic SSH-based debug tool launcher.
This resolution addresses the issue and enables debugging tools to function correctly with the system WLM.
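Whether a launcher has been stripped can be estimated with the standard file utility, which describes ELF binaries as either "stripped" or "not stripped". The sketch below assumes that output format. describes_unstripped is a hypothetical helper, and a "not stripped" result only shows that the symbol table is present, not that every symbol the tools need is available.

```shell
# describes_unstripped is a hypothetical helper. It inspects a description in
# the format printed by the file utility and succeeds only when the text
# contains "not stripped".
describes_unstripped() {
    case "$1" in
        *"not stripped"*) return 0 ;;
        *)                return 1 ;;
    esac
}

# Example for a Slurm system; skipped when srun or file is unavailable.
launcher=$(command -v srun || true)
if [ -n "$launcher" ] && command -v file >/dev/null 2>&1; then
    if describes_unstripped "$(file -L "$launcher")"; then
        echo "$launcher still has its symbol table"
    else
        echo "$launcher looks stripped, or is not an ELF binary"
    fi
fi
```

The -L option makes file follow symbolic links, so a launcher installed as a symlink is checked at its real location.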
Supported systems
This publication supports installing CPE 26.03 on supported HPE Cray Supercomputing EX systems. Supported architectures and operating system (OS) versions vary by system. This chapter provides information on supported systems for this release.
IMPORTANT: In releases before the COS 25.9 release, COS comprised three components: COS Base, HPE USS, and HPE SLES. Starting with the CPE 25.09 release, COS 25.9 (and later) comprises:
HPE Cray Supercomputing User Services Software (USS)
HPE SUSE Linux Enterprise Server
This release also supports v21.0.0 of the HPE Cray Compiler Environment (CCE). See the CPE 26.03 Release Announcements on the CPE Online Documentation website for other supported dependencies.
Supported systems for CPE on CSM
This publication supports the installation of CPE 25.09 on HPE Cray Supercomputing EX systems with specific configurations:
| Management Software & Version | COS Version | Operating System | Architecture | GCC Version |
|---|---|---|---|---|
| CSM 1.7.X | COS 25.9 (USS 1.4.X) | SLES 15 SP6 | X86 | 14.0 |
| CSM 1.7.X | COS 25.9 (USS 1.4.X) | SLES 15 SP6 | AArch64 | 14.0 |
This release also supports v20.0.0 of the HPE Cray Compiler Environment (CCE).
IMPORTANT: In releases before the COS 25.9 release, COS comprised three components: COS Base, HPE USS, and HPE SLES. With the COS 25.9 and CPE 25.09 releases, COS Base has been replaced with SLES 15 SP6. Starting with this CPE 25.09 release, COS 25.9 (and later) comprises:
HPE Cray Supercomputing User Services Software (USS)
HPE SUSE Linux Enterprise Server
See the CPE 25.09 Release Announcements on the CPE Online Documentation website for other supported dependencies.
Supported systems for CPE with HPCM
This publication supports installing CPE 25.09 on HPE Cray Supercomputing EX systems with specific configurations:
| Management Software & Version | COS Version | Operating System | Architecture | GCC Version |
|---|---|---|---|---|
| HPCM 1.14 | COS 25.9 (USS 1.4.X) | SLES 15 SP7 | X86 | Not Applicable |
| HPCM 1.14 | COS 25.9 (USS 1.4.X) | SLES 15 SP6 | X86 | Not Applicable |
| HPCM 1.14 | COS 25.9 (USS 1.4.X) | SLES 15 SP7 | AArch64 | Not Applicable |
| HPCM 1.14 | COS 25.9 (USS 1.4.X) | SLES 15 SP6 | AArch64 | Not Applicable |
| HPCM 1.14 | Not Applicable | RHEL 9.6 | X86 | 14.0 |
| HPCM 1.14 | Not Applicable | RHEL 9.5 | X86 | 14.0 |
| HPCM 1.14 | Not Applicable | RHEL 8.10 | X86 | 14.0 |
| HPCM 1.14 | Not Applicable | RHEL 9.6 | AArch64 | 14.0 |
| HPCM 1.14 | Not Applicable | RHEL 9.5 | AArch64 | 14.0 |
This release also supports v20.0.0 of the HPE Cray Compiler Environment (CCE).
IMPORTANT: In releases before the COS 25.9 release, COS comprised three components: COS Base, HPE USS, and HPE SLES. With the COS 25.9 and CPE 25.09 releases, COS Base has been removed. Starting with this CPE 25.09 release, COS 25.9 (and later) comprises:
HPE Cray Supercomputing User Services Software (USS)
HPE SUSE Linux Enterprise Server
See the CPE 25.09 Release Announcements on the CPE Online Documentation website for other supported dependencies.
Supported systems for CPE on the HPE Cray XD2000
For this release, CPE is supported on HPE Cray XD2000 systems with designated operating systems and architectures:
| Management Software & Version | Operating System | Architecture |
|---|---|---|
| HPCM 1.14 | RHEL 8.10 | X86 |
This release also supports v20.0.0 of the HPE Cray Compiler Environment (CCE).
IMPORTANT: CPE versions 25.03 (and earlier) previously supported MOFED versions 5.8 (or earlier) as directed in installation instructions. However, with the CPE 25.09 release, HPE recommends that MOFED/DOCA OFED-dependent users with HPE Slingshot 10 (SS10) refrain from upgrading CPE beyond the 25.03 CPE release. HPE observed a system bug in MOFED, the Extended Reliable Connection (XRC) bug, which adversely affects CPE and SS10 functionality. The bug was introduced by NVIDIA in early 2023, and HPE reported details of the bug to NVIDIA in April 2023. The bug is currently unresolved and is not expected to be fixed during the transition from MOFED to DOCA OFED. Until a resolution or workaround is introduced, CPE users should not upgrade past the CPE 25.03 release.
See the CPE 25.09 Release Announcements on the CPE Online Documentation website for other supported dependencies.
Support matrices for previous releases
This chapter lists CPE-supported components, third-party software, and modules supported for applicable and previous releases of the CPE software. This information is provided for reference purposes.
CPE release matrices for SLES
CPE supports various SLES-based software components for AArch64 and x86 architectures. These components include compilers, libraries, debugging and profiling tools, and programming models. Supported versions of these components are updated with each release of CPE. This section lists which SLES-based component versions are supported for each CPE release.
SLES AArch64 support matrix
SLES on AArch64 systems is supported with CPE on HPE Cray Supercomputing EX systems with either CSM or HPCM. Below are the product components, modules, and third-party software versions supported with previous CPE releases with these configurations.
(D) represents the default version installed at installation.
* HPCM only
Release |
CPE 25.09 |
CPE 25.09 |
CPE 25.03 |
CPE 25.03 |
24.11 |
24.11 |
24.07 |
|---|---|---|---|---|---|---|---|
Product |
sles15sp7-aarch64 * |
sles15sp6-aarch64 |
sles15sp6-aarch64 |
sles15sp5-aarch64 |
sles15sp5-aarch64 |
sles15sp6-aarch64 |
sles15sp5-aarch64 |
COS |
25.9 |
25.9 |
25.3 |
24.7 |
25.1 |
24.7 |
24.7 |
COS Base |
N/A |
N/A |
3.3.0 |
3.1.0 |
3.2.0 |
3.1.0 |
3.1.0 |
CSM |
Not supported |
1.7 |
1.6.1 |
1.6.1 |
1.6 |
1.6 |
1.5 |
HPCM |
1.14 |
1.14 |
1.13 |
1.13 |
1.12 |
1.12 |
1.11 |
USS |
1.4.0 |
1.4.0 |
1.3.0 |
1.1.0 |
1.2.0 |
1.1.0 |
1.1.0 |
amd |
|||||||
aocc |
5.0 |
5.0 |
4.2 |
4.2 |
4.2** |
||
atp |
3.15.7 (D) |
3.15.7 (D) |
3.15.6 (D) |
3.15.6 (D) |
3.15.5 (D) |
3.15.5 (D) |
3.15.4 (D) |
cce |
20.0.0 |
20.0.0 |
19.0.0 (D) |
19.0.0 (D) |
18.0.1 (D) |
18.0.1 (D) |
18.0.0 (D) |
cpe-gcc-mpfr |
3.1.4 (D) |
3.1.4 (D) |
3.1.4 (D) |
3.1.4 (D) |
3.1.4 (D) |
3.1.4 (D) |
|
cpe-gcc-native |
14.2 (D) |
14.2 (D) |
|||||
cpe-gcc-native |
14 (D) |
14 (D) |
13.2 |
13.2 |
13.2 (D) |
13.2 (D) |
13.2 (D) |
cpe-gcc-native |
13 |
12.3 |
12.3 |
12.3 |
12.3 |
12.3 |
|
cpe-gcc-native |
12 |
||||||
cpe-prgenv-amd |
|||||||
cpe-prgenv-aocc |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
||||
cpe-prgenv-cray |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
|||
cpe-prgenv-cray-amd |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
||||
cpe-prgenv-gnu |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
|||
cpe-prgenv-gnu-amd |
|||||||
cpe-prgenv-intel |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
||||
cpe-prgenv-nvhpc |
8.5.0 (D) |
||||||
cpe-prgenv-nvidia |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
|||
cray-R |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
cray-ccdb |
5.0.7 (D) |
5.0.7 (D) |
5.0.6 (D) |
5.0.6 (D) |
5.0.5 (D) |
5.0.5 (D) |
5.0.4 (D) |
cray-cdst-support |
2.14.6 (D) |
2.14.6 (D) |
2.14.5 (D) |
2.14.5 (D) |
2.14.3 (D) |
||
cray-cti |
2.20.0 (D) |
2.20.0 (D) |
2.19.1 (D) |
2.19.1 (D) |
2.19.0 (D) |
2.19.0 (D) |
2.18.4 (D) |
cray-dsmml |
0.3.1 (D) |
0.3.1 (D) |
0.3.1 (D) |
0.3.0 (D) |
0.3.0 (D) |
0.3.0 (D) |
0.3.0 (D) |
cray-dwarf |
2.0.0 (D) |
2.0.0 (D) |
0.11.1 (D) |
0.11.1 (D) |
0.11.0 (D) |
0.11.0 (D) |
0.9.2 (D) |
cray-dyninst |
12.3.6 (D) |
12.3.6 (D) |
12.3.5 (D) |
12.3.5 (D) |
12.3.4 (D) |
12.3.4 (D) |
12.3.2 (D) |
cray-fftw |
3.3.10.11 (D) |
3.3.10.11 (D) |
3.3.10.10 (D) |
3.3.10.10 (D) |
3.3.10.9 (D) |
3.3.10.9 (D) |
3.3.10.8 (D) |
cray-hdf5 |
1.14.3.7 (D) |
1.14.3.7 (D) |
1.14.3.5 (D) |
1.14.3.5 (D) |
1.14.3.3 (D) |
1.14.3.3 (D) |
1.14.3.1 (D) |
cray-libsci |
25.09.0 (D) |
25.09.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
24.11.0 (D) |
24.11.0 (D) |
24.07.0 (D) |
cray-libsci-acc |
25.09.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
24.11.0 (D) |
24.11.0 (D) |
24.07.0 (D) |
|
cray-lmod |
8.7.60 (D) |
8.7.60 (D) |
8.7.55 (D) |
8.7.55 (D) |
8.7.37 (D) |
8.7.37 (D) |
8.7.37 (D) |
cray-modules |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
cray-mpich |
8.1.33 |
8.1.33 (D) |
8.1.32 (D) |
8.1.32 (D) |
8.1.31 (D) |
8.1.31 (D) |
8.1.30 (D) |
cray-mpich |
9.0.1 (D) |
9.0.1 (D) |
9.0.0 |
9.0.0 |
|||
cray-mpixlate |
1.0.7 (D) |
1.0.7 (D) |
1.0.6 (D) |
1.0.6 (D) |
1.0.5 (D) |
||
cray-mrnet |
5.1.6 (D) |
5.1.6 (D) |
5.1.5 (D) |
5.1.5 (D) |
5.1.4 (D) |
5.1.4 (D) |
5.1.3 (D) |
cray-netcdf |
4.9.2.1 (D) |
4.9.2.1 (D) |
4.9.0.17 (D) |
4.9.0.17 (D) |
4.9.0.15 (D) |
4.9.0.15 (D) |
4.9.0.13 (D) |
cray-openshmemx |
11.7.5 (D) |
11.7.5 (D) |
11.7.4 (D) |
11.7.3 (D) |
11.7.3 (D) |
11.7.3 (D) |
11.7.2 (D) |
cray-papi |
7.2.0.2 (D) |
7.2.0.2 (D) |
7.2.0.1 (D) |
7.2.0.1 (D) |
7.1.0.4 (D) |
7.1.0.4 (D) |
7.1.0.2 (D) |
cray-parallel-netcdf |
1.12.3.19 (D) |
1.12.3.19 (D) |
1.12.3.17 (D) |
1.12.3.17 (D) |
1.12.3.15 (D) |
1.12.3.15 (D) |
1.12.3.13 (D) |
cray-pe-set-default |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
cray-pmi |
6.1.16 (D) |
6.1.16 (D) |
6.1.15 (D) |
6.1.15 (D) |
6.1.15 (D) |
6.1.15 (D) |
6.1.15 (D) |
cray-pmi-devel |
6.1.16 |
6.1.16 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
cray-pmi-doc |
6.1.16 |
6.1.16 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
cray-python |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
cray-stat |
4.12.6 (D) |
4.12.6 (D) |
4.12.5 (D) |
4.12.5 (D) |
4.12.4 (D) |
4.12.4 (D) |
4.12.3 (D) |
cray-ucx |
|||||||
cray-zmqnet |
1.3.2 (D) |
1.3.2 (D) |
1.3.0 (D) |
1.3.0 (D) |
1.0.0 (D) |
1.0.0 (D) |
|
craype |
2.7.35 (D) |
2.7.35 (D) |
2.7.34 (D) |
2.7.34 (D) |
2.7.33 (D) |
2.7.33 (D) |
2.7.32 (D) |
craype-dl-plugin-ftr |
|||||||
craype-dl-plugin-py3 |
24.07.1 (D) |
24.07.1 (D) |
24.07.1 (D) |
24.07.1 (D) |
24.07.1 (D) |
||
craype-targets-ex |
1.16.0 (D) |
1.16.0 (D) |
1.15.1 (D) |
1.15.1 (D) |
1.15.0 (D) |
1.15.0 (D) |
1.13.2 (D) |
craypkg-gen |
1.3.36 (D) |
1.3.36 (D) |
1.3.35 (D) |
1.3.35 (D) |
1.3.34 (D) |
1.3.34 (D) |
1.3.33 (D) |
forgesup |
24.1.1 |
24.1.1 |
23.1.2 |
23.1.2 |
23.1.2 |
||
gdb4hpc |
4.16.5 (D) |
4.16.5 (D) |
4.16.4 (D) |
4.16.4 (D) |
4.16.3 (D) |
4.16.3 (D) |
4.16.2 (D) |
intel |
|||||||
lmod_scripts |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
nvhpc |
24.3 (D) |
||||||
nvidia |
25.5 (D) |
25.5 (D) |
24.3 (D) |
24.3 (D) |
24.3 (D) |
24.3 (D) |
24.3 (D) |
perftools |
25.09.0 (D) |
25.09.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
24.11.0 (D) |
24.11.0 (D) |
24.07.0 (D) |
rocm |
6.3.0 |
6.3.0 |
6.2.1 |
6.2.1 |
6.1.0 (D) |
||
sanitizers4hpc |
1.1.6 (D) |
1.1.6 (D) |
1.1.5 (D) |
1.1.5 (D) |
1.1.4 (D) |
1.1.4 (D) |
1.1.3 (D) |
totalviewsup |
2024.4.0 |
2024.4.0 |
2024.1.21 |
2024.1.21 |
2024.1.21 |
||
valgrind4hpc |
2.13.6 (D) |
2.13.6 (D) |
2.13.5 (D) |
2.13.5 (D) |
2.13.4 (D) |
2.13.4 (D) |
2.13.3 (D) |
SLES X86 support matrix
SLES on X86 systems is supported with CPE on HPE Cray Supercomputing EX systems with either CSM or HPCM. Below are the product components, modules, and third-party software versions supported with previous CPE releases with these configurations.
(D) represents the default version installed at installation.
* HPCM only
Release |
CPE 25.09 |
CPE 25.09 |
CPE 25.03 |
CPE 25.03 |
24.11 |
24.11 |
24.07 |
|---|---|---|---|---|---|---|---|
Product/Version |
sles15sp7 * |
sles15sp6 |
sles15sp6 |
sles15sp5 |
sles15sp6 |
sles15sp5 |
sles15sp5 |
COS |
25.9 |
25.9 |
25.1 |
24.7 |
24.7 |
||
COS Base |
N/A |
N/A |
3.3.0 |
3.1.0 |
3.2.0 |
3.1.0 |
3.1.0 |
CSM |
Not supported |
1.7 |
1.6.1 |
1.6.1 |
1.6 |
1.6 |
1.5 |
HPCM |
1.14 |
1.14 |
1.13 |
1.13 |
1.12 |
1.12 |
1.11 |
USS |
1.4.0 |
1.4.0 |
1.3.0 |
1.1.0 |
1.2.0 |
1.1.0 |
1.1.0 |
amd |
6.4.1 (D) |
6.4.1 (D) |
6.2.1 (D) |
6.2.1 (D) |
6.2.1 (D) |
6.2.1 (D) |
|
aocc |
5.0.0 (D) |
5.0.0 (D) |
5.0.0 (D) |
5.0.0 (D) |
4.2.0 (D) |
4.2.0 (D) |
4.2.0 (D) |
atp |
3.15.7 (D) |
3.15.7 (D) |
3.15.6 (D) |
3.15.6 (D) |
3.15.5 (D) |
3.15.5 (D) |
3.15.4 (D) |
cce |
20.0.0 |
20.0.0 |
19.0.0 (D) |
19.0.0 (D) |
18.0.1 (D) |
18.0.1 (D) |
18.0.0 (D) |
cpe-gcc-mpfr |
3.1.4 (D) |
3.1.4 (D) |
3.1.4 (D) |
3.1.4 (D) |
3.1.4 (D) |
||
cpe-gcc-native |
14.2 (D) |
14.2 (D) |
|||||
cpe-gcc-native |
14 (D) |
14 |
13.2 |
13.2 |
13.2 (D) |
13.2 (D) |
13.2 (D) |
cpe-gcc-native |
13 |
12.3 |
12.3 |
12.3 |
12.3 |
12.3 |
|
cpe-gcc-native |
12 |
12.3 |
|||||
cpe-prgenv-amd |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
cpe-prgenv-aocc |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
cpe-prgenv-cray |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
cpe-prgenv-cray-amd |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
cpe-prgenv-gnu |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
cpe-prgenv-gnu-amd |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
cpe-prgenv-intel |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
cpe-prgenv-nvidia |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
cray-R |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
cray-ccdb |
5.0.7 (D) |
5.0.7 (D) |
5.0.6 (D) |
5.0.6 (D) |
5.0.5 (D) |
5.0.5 (D) |
5.0.4 (D) |
cray-cdst-support |
2.14.6 (D) |
2.14.6 (D) |
2.14.5 (D) |
2.14.5 (D) |
2.14.3 (D) |
||
cray-cti |
2.20.0 (D) |
2.20.0 (D) |
2.19.1 (D) |
2.19.1 (D) |
2.19.0 (D) |
2.19.0 (D) |
2.18.4 (D) |
cray-dsmml |
0.3.1 (D) |
0.3.1 (D) |
0.3.1 (D) |
0.3.1 (D) |
0.3.0 (D) |
0.3.0 (D) |
0.3.0 (D) |
cray-dwarf |
2.0.0 (D) |
2.0.0 (D) |
0.11.1 (D) |
0.11.1 (D) |
0.11.0 (D) |
0.11.0 (D) |
0.9.2 (D) |
cray-dyninst |
12.3.6 (D) |
12.3.6 (D) |
12.3.5 (D) |
12.3.5 (D) |
12.3.4 (D) |
12.3.4 (D) |
12.3.2 (D) |
cray-fftw |
3.3.10.11 (D) |
3.3.10.11 (D) |
3.3.10.10 (D) |
3.3.10.10 (D) |
3.3.10.9 (D) |
3.3.10.9 (D) |
3.3.10.8 (D) |
cray-hdf5 |
1.14.3.7 (D) |
1.14.3.7 (D) |
1.14.3.5 (D) |
1.14.3.5 (D) |
1.14.3.3 (D) |
1.14.3.3 (D) |
1.14.3.1 (D) |
cray-libsci |
25.09.0 (D) |
25.09.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
24.11.0 (D) |
24.11.0 (D) |
24.07.0 (D) |
cray-libsci-acc |
25.09.0 (D) |
25.09.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
24.11.0 (D) |
24.11.0 (D) |
24.07.0 (D) |
cray-lmod |
8.7.60 (D) |
8.7.60 (D) |
8.7.55 (D) |
8.7.55 (D) |
8.7.37 (D) |
8.7.37 (D) |
8.7.37 (D) |
cray-modules |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
cray-mpich |
8.1.33 |
8.1.33 |
8.1.32 (D) |
8.1.32 (D) |
8.1.31 (D) |
8.1.31 (D) |
8.1.30 (D) |
cray-mpich |
9.0.1 |
9.0.1 |
9.0.0 |
9.0.0 |
|||
cray-mpixlate |
1.0.7 (D) |
1.0.7 (D) |
1.0.6 (D) |
1.0.6 (D) |
1.0.5 (D) |
||
cray-mrnet |
5.1.6 (D) |
5.1.6 (D) |
5.1.5 (D) |
5.1.5 (D) |
5.1.4 (D) |
5.1.4 (D) |
5.1.3 (D) |
cray-netcdf |
4.9.2.1 (D) |
4.9.2.1 (D) |
4.9.0.17 (D) |
4.9.0.17 (D) |
4.9.0.15 (D) |
4.9.0.15 (D) |
4.9.0.13 (D) |
cray-openshmemx |
11.7.5 (D) |
11.7.5 (D) |
11.7.4 (D) |
11.7.4 (D) |
11.7.3 (D) |
11.7.3 (D) |
11.7.2 (D) |
cray-pals |
1.3.2 |
||||||
cray-papi |
7.2.0.2 (D) |
7.2.0.2 (D) |
7.2.0.1 (D) |
7.2.0.1 (D) |
7.1.0.4 (D) |
7.1.0.4 (D) |
7.1.0.2 (D) |
cray-parallel-netcdf |
1.12.3.19 (D) |
1.12.3.19 (D) |
1.12.3.17 (D) |
1.12.3.17 (D) |
1.12.3.15 (D) |
1.12.3.15 (D) |
1.12.3.13 (D) |
cray-pe-set-default |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
cray-pmi |
6.1.16 (D) |
6.1.16 (D) |
6.1.15 (D) |
6.1.15 (D) |
6.1.15 (D) |
6.1.15 (D) |
6.1.15 (D) |
cray-pmi-devel |
6.1.16 |
6.1.16 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
cray-pmi-doc |
6.1.16 |
6.1.16 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
cray-python |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
cray-stat |
4.12.6 (D) |
4.12.6 (D) |
4.12.5 (D) |
4.12.5 (D) |
4.12.4 (D) |
4.12.4 (D) |
4.12.3 (D) |
cray-ucx |
2.12.0 (D) |
2.12.0 (D) |
2.12.0 (D) |
2.12.0 (D) |
2.12.0 (D) |
||
cray-zmqnet |
1.3.2 (D) |
1.3.2 (D) |
1.3.1 (D) |
1.3.1 (D) |
1.0.0 (D) |
1.0.0 (D) |
|
craype |
2.7.35 (D) |
2.7.35 (D) |
2.7.34 (D) |
2.7.34 (D) |
2.7.33 (D) |
2.7.33 (D) |
2.7.32 (D) |
craype-dl-plugin-ftr |
22.06.1.2 (D) |
22.06.1.2 (D) |
22.06.1.2 (D) |
22.06.1.2 (D) |
22.06.1.2 (D) |
||
craype-dl-plugin-py3 |
21.04.1 |
21.04.1 |
21.04.1 |
21.04.1 |
21.04.1 |
||
craype-dl-plugin-py3 |
22.06.1.2 |
22.06.1.2 |
22.06.1.2 |
22.06.1.2 |
22.06.1.2 |
||
craype-dl-plugin-py3 |
22.08.1 |
22.08.1 |
22.08.1 |
22.08.1 |
22.08.1 |
||
craype-dl-plugin-py3 |
22.09.1 |
22.09.1 |
22.09.1 |
22.09.1 |
22.09.1 |
||
craype-dl-plugin-py3 |
22.12.1 |
22.12.1 |
22.12.1 |
22.12.1 |
22.12.1 |
||
craype-dl-plugin-py3 |
23.09.1 |
23.09.1 |
23.09.1 |
23.09.1 |
23.09.1 |
||
craype-dl-plugin-py3 |
24.03.1 (D) |
24.03.1 (D) |
24.03.1 (D) |
24.03.1 (D) |
24.03.1 (D) |
||
craype-targets-ex |
1.16.0 (D) |
1.16.0 (D) |
1.15.1 (D) |
1.15.1 (D) |
1.15.0 (D) |
1.15.0 (D) |
1.13.2 (D) |
craypkg-gen |
1.3.36 (D) |
1.3.36 (D) |
1.3.35 (D) |
1.3.35 (D) |
1.3.34 (D) |
1.3.34 (D) |
1.3.33 (D) |
forgesup |
24.1.1 |
24.1.1 |
23.1.2 |
23.1.2 |
23.1.2 |
||
gdb4hpc |
4.16.5 (D) |
4.16.5 (D) |
4.16.4 (D) |
4.16.4 (D) |
4.16.3 (D) |
4.16.3 (D) |
4.16.2 (D) |
intel |
2025.1 (D) |
2025.1 (D) |
2025.0 (D) |
2025.0 (D) |
2024.2 (D) |
2024.2 (D) |
2024.0 (D) |
lmod_scripts |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
nvhpc |
24.3 (D) |
||||||
nvidia |
25.5 (D) |
25.5 (D) |
24.3 (D) |
24.3 (D) |
24.3 (D) |
24.3 (D) |
24.3 (D) |
perftools |
25.09.0 (D) |
25.09.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
24.11.0 (D) |
24.11.0 (D) |
24.07.0 (D) |
rocm |
6.4.1 (D) |
6.4.1 (D) |
6.3.0 |
6.3.0 |
6.2.1 |
6.2.1 |
6.1.0 (D) |
sanitizers4hpc |
1.1.6 (D) |
1.1.6 (D) |
1.1.5 (D) |
1.1.5 (D) |
1.1.4 (D) |
1.1.4 (D) |
1.1.3 (D) |
totalviewsup |
2024.4.0 |
2024.4.0 |
2024.1.21 |
2024.1.21 |
2024.1.21 |
||
valgrind4hpc |
2.13.6 (D) |
2.13.6 (D) |
2.13.5 (D) |
2.13.5 (D) |
2.13.4 (D) |
2.13.4 (D) |
2.13.3 (D) |
CPE release matrices for RHEL
CPE supports various RHEL-based software components for AArch64 and x86 architectures. These components include compilers, libraries, debugging and profiling tools, and programming models. Supported versions of these components are updated with each release of CPE. This section lists which RHEL-based component versions are supported for each CPE release.
RHEL AArch64 support matrix
RHEL on AArch64 systems is supported with CPE on HPE Cray Supercomputing EX systems with HPCM. Below are the product components, modules, and third-party software versions supported with previous CPE releases with these configurations.
(D) represents the default version installed at installation.
Release |
CPE 25.09 |
CPE 25.09 |
CPE 25.03 |
CPE 25.03 |
24.11 |
24.07 |
|---|---|---|---|---|---|---|
rhel96 |
rhel95 |
rhel95 |
rhel94 |
rhel94 |
rhel94 |
|
Product |
aarch64 |
aarch64 |
aarch64 |
aarch64 |
aarch64 |
aarch64 |
HPCM |
1.13 |
1.13 |
1.12 |
1.11 |
||
amd |
6.2.1 (D) |
6.2.1 (D) |
6.2.1 (D) |
6.2.1 (D) |
||
aocc |
5.0.0 (D) |
5.0.0 (D) |
5.0.0 (D) |
4.2.0 (D) |
||
atp |
3.15.7 (D) |
3.15.7 (D) |
3.15.6 (D) |
3.15.6 (D) |
3.15.5 (D) |
3.15.4 (D) |
cce |
20.0.0 |
20.0.0 |
19.0.0 (D) |
19.0.0 (D) |
18.0.1 (D) |
18.0.0 (D) |
cpe-gcc-mpfr |
||||||
cpe-gcc-native |
||||||
cpe-gcc-native |
14.2 (D) |
|||||
cpe-gcc-native |
14 (D) |
14 (D) |
13.3 |
13.2 (D) |
13.2 (D) |
13.2 (D) |
cpe-gcc-native |
13 |
12.2 |
12.2 |
12.2 |
12.2 |
|
cpe-gcc-native |
12 |
|||||
cpe-prgenv-amd |
||||||
cpe-prgenv-aocc |
||||||
cpe-prgenv-cray |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
||
cpe-prgenv-cray-amd |
||||||
cpe-prgenv-gnu |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
cpe-prgenv-gnu-amd |
||||||
cpe-prgenv-intel |
||||||
cpe-prgenv-nvidia |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
cray-R |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
cray-ccdb |
5.0.7 (D) |
5.0.7 (D) |
5.0.6 (D) |
5.0.6 (D) |
5.0.5 (D) |
5.0.4 (D) |
cray-cdst-support |
2.14.6 (D) |
2.14.6 (D) |
2.14.5 (D) |
2.14.3 (D) |
||
cray-cti |
2.20.0 (D) |
2.20.0 (D) |
2.19.1 (D) |
2.19.1 (D) |
2.19.0 (D) |
2.18.4 (D) |
cray-cti |
||||||
cray-dsmml |
0.3.1 (D) |
0.3.1 (D) |
0.3.1 (D) |
0.3.1 (D) |
0.3.0 (D) |
0.3.0 (D) |
cray-dwarf |
2.0.0 (D) |
2.0.0 (D) |
0.11.1 (D) |
0.11.1 (D) |
0.11.0 (D) |
0.9.2 (D) |
cray-dyninst |
12.3.6 (D) |
12.3.6 (D) |
12.3.5 (D) |
12.3.5 (D) |
12.3.4 (D) |
12.3.2 (D) |
cray-fftw |
3.3.10.11 (D) |
3.3.10.11 (D) |
3.3.10.10 (D) |
3.3.10.10 (D) |
3.3.10.9 (D) |
3.3.10.8 (D) |
cray-hdf5 |
1.14.3.7 (D) |
1.14.3.7 (D) |
1.14.3.5 (D) |
1.14.3.5 (D) |
1.14.3.3 (D) |
1.14.3.1 (D) |
cray-libsci |
25.09.0 (D) |
25.09.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
24.11.0 (D) |
24.07.0 (D) |
cray-libsci-acc |
25.09.0 (D) |
25.09.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
24.11.0 (D) |
24.07.0 (D) |
cray-lmod |
8.7.60 (D) |
8.7.60 (D) |
8.7.55 (D) |
8.7.55 (D) |
8.7.37 (D) |
8.7.37 (D) |
cray-modules |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
cray-mpich |
9.0.1 (D) |
9.0.1 (D) |
||||
cray-mpich |
9.0.0 |
9.0.0 |
||||
cray-mpich |
8.1.33 |
8.1.33 |
8.1.32 (D) |
8.1.32 (D) |
8.1.31 (D) |
8.1.30 (D) |
cray-mpixlate |
1.0.7 (D) |
1.0.7 (D) |
1.0.6 (D) |
1.0.5 (D) |
||
cray-mrnet |
5.1.6 (D) |
5.1.6 (D) |
5.1.5 (D) |
5.1.5 (D) |
5.1.4 (D) |
5.1.3 (D) |
cray-netcdf |
4.9.2.1 (D) |
4.9.2.1 (D) |
4.9.0.17 (D) |
4.9.0.17 (D) |
4.9.0.15 (D) |
4.9.0.13 (D) |
cray-openshmemx |
11.7.5 (D) |
11.7.5 (D) |
11.7.4 (D) |
11.7.4 (D) |
11.7.3 (D) |
11.7.2 (D) |
cray-papi |
7.2.0.2 (D) |
7.2.0.2 (D) |
7.2.0.1 (D) |
7.2.0.1 (D) |
7.1.0.4 (D) |
7.1.0.2 (D) |
cray-parallel-netcdf |
1.12.3.19 (D) |
1.12.3.19 (D) |
1.12.3.17 (D) |
1.12.3.17 (D) |
1.12.3.15 (D) |
1.12.3.13 (D) |
cray-pe-set-default |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
cray-pmi |
6.1.16 (D) |
6.1.16 (D) |
6.1.15 (D) |
6.1.15 (D) |
6.1.15 (D) |
6.1.15 (D) |
cray-pmi-devel |
6.1.16 |
6.1.16 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
cray-pmi-doc |
6.1.16 |
6.1.16 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
cray-python |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
cray-stat |
4.12.6 (D) |
4.12.6 (D) |
4.12.5 (D) |
4.12.5 (D) |
4.12.4 (D) |
4.12.3 (D) |
cray-ucx |
||||||
cray-zmqnet |
1.3.2 (D) |
1.3.2 (D) |
1.3.0 (D) |
1.3.0 (D) |
1.0.0 (D) |
|
craype |
2.7.35 (D) |
2.7.35 (D) |
2.7.34 (D) |
2.7.34 (D) |
2.7.33 (D) |
2.7.32 (D) |
craype-dl-plugin-ftr |
||||||
craype-dl-plugin-py3 |
||||||
craype-targets-ex |
||||||
craypkg-gen |
1.3.36 (D) |
1.3.36 (D) |
1.3.35 (D) |
1.3.35 (D) |
||
forgesup |
24.1.1 |
24.1.1 |
||||
gdb4hpc |
4.16.5 (D) |
4.16.5 (D) |
4.16.4 (D) |
4.16.4 (D) |
4.16.3 (D) |
4.16.2 (D) |
intel |
2025.0 (D) |
2025.0 (D) |
2025.0 (D) |
2024.2 (D) |
||
lmod_scripts |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
nvhpc |
24.3 (D) |
|||||
nvidia |
25.5 (D) |
25.5 (D) |
24.3 (D) |
24.3 (D) |
24.3 (D) |
24.3 (D) |
perftools |
25.09.0 (D) |
25.09.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
24.11.0 (D) |
24.07.0 (D) |
rocm |
6.3.0 |
6.3.0 |
6.2.1 |
6.1.0 (D) |
||
sanitizers4hpc |
1.1.6 (D) |
1.1.6 (D) |
1.1.5 (D) |
1.1.5 (D) |
1.1.4 (D) |
1.1.3 (D) |
totalviewsup |
2024.4.0 |
2024.4.0 |
2024.1.21 |
2024.1.21 |
||
valgrind4hpc |
2.13.6 (D) |
2.13.6 (D) |
2.13.5 (D) |
2.13.5 (D) |
2.13.4 (D) |
2.13.3 (D) |
RHEL X86 support matrix
RHEL on X86 systems is supported with CPE on HPE Cray Supercomputing EX systems with HPCM. The following table lists the product components, modules, and third-party software versions supported with previous CPE releases for these configurations.
(D) indicates the version set as the default at installation.
Release |
25.09 |
25.09 |
25.09 |
25.03 |
25.03 |
25.03 |
24.11 |
24.11 |
24.07 |
24.07 |
|---|---|---|---|---|---|---|---|---|---|---|
rhel96 |
rhel95 |
rhel810 |
rhel95 |
rhel94 |
rhel810 |
rhel94 |
rhel810 |
rhel94 |
rhel810 |
|
Product |
(X86) |
(X86) |
(X86) |
(X86) |
(X86) |
(X86) |
(X86) |
(X86) |
(X86) |
|
HPCM |
1.14 |
1.14 |
1.14 |
1.13 |
1.13 |
1.13 |
1.12 |
1.12 |
1.12 |
1.11 |
amd |
6.4.1 (D) |
6.4.1 (D) |
6.4.1 (D) |
6.2.1 (D) |
6.2.1 (D) |
6.2.1 (D) |
6.2.1 (D) |
6.2.1 (D) |
6.2.1 (D) |
6.2.1 (D) |
aocc |
5.0.0 (D) |
5.0.0 (D) |
5.0.0 (D) |
5.0.0 (D) |
5.0.0 (D) |
5.0.0 (D) |
4.2.0 (D) |
4.2.0 (D) |
4.2.0 (D) |
4.2.0 (D) |
atp |
3.15.7 (D) |
3.15.7 (D) |
3.15.7 (D) |
3.15.6 (D) |
3.15.6 (D) |
3.15.6 (D) |
3.15.5 (D) |
3.15.5 (D) |
3.15.4 (D) |
3.15.4 (D) |
cce |
20.0.0 |
20.0.0 |
20.0.0 |
19.0.0 (D) |
19.0.0 (D) |
19.0.0 (D) |
18.0.1 (D) |
18.0.1 (D) |
18.0.0 (D) |
18.0.0 (D) |
cpe-gcc-mpfr |
||||||||||
cpe-gcc-native |
12 (D) |
13 (D) |
||||||||
cpe-gcc-native |
14 (D) |
12 |
10.3 |
|||||||
cpe-gcc-native |
14 (D) |
13 |
||||||||
cpe-gcc-native |
12.2 |
12.2 |
10.3 |
12.2 |
10.3 |
12.2 |
10.3 |
|||
cpe-gcc-native |
13.3 |
13.2 (D) |
11.2 |
13.2 (D) |
11.2 |
13.2 (D) |
11.2 |
|||
cpe-gcc-native |
14.2 (D) |
12.2 |
||||||||
cpe-prgenv-amd |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
8.5.0 (D) |
cpe-prgenv-aocc |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
8.5.0 (D) |
cpe-prgenv-cray |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
8.5.0 (D) |
cpe-prgenv-cray-amd |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
8.5.0 (D) |
cpe-prgenv-gnu |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
8.5.0 (D) |
cpe-prgenv-gnu-amd |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
8.5.0 (D) |
cpe-prgenv-intel |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
8.5.0 (D) |
cpe-prgenv-nvhpc |
8.5.0 (D) |
8.5.0 (D) |
||||||||
cpe-prgenv-nvidia |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
8.5.0 (D) |
8.5.0 (D) |
cray-R |
8.6.0 (D) |
8.6.0 (D) |
8.6.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
4.4.0 (D) |
cray-ccdb |
5.0.7 (D) |
5.0.7 (D) |
5.0.7 (D) |
5.0.6 (D) |
5.0.6 (D) |
5.0.6 (D) |
5.0.5 (D) |
5.0.5 (D) |
5.0.4 (D) |
5.0.4 (D) |
cray-cdst-support |
2.14.6 (D) |
2.14.6 (D) |
2.14.6 (D) |
2.14.5 (D) |
2.14.5 (D) |
2.14.3 (D) |
2.14.3 (D) |
|||
cray-cti |
2.20.0 (D) |
2.20.0 (D) |
2.20.0 (D) |
2.19.1 (D) |
2.19.1 (D) |
2.19.1 (D) |
2.19.0 (D) |
2.19.0 (D) |
2.18.4 (D) |
2.18.4 (D) |
cray-dsmml |
0.3.1 (D) |
0.3.1 (D) |
0.3.1 (D) |
0.3.1 (D) |
0.3.1 (D) |
0.3.1 (D) |
0.3.0 (D) |
0.3.0 (D) |
0.3.0 (D) |
0.3.0 (D) |
cray-dwarf |
2.0.0 (D) |
2.0.0 (D) |
2.0.0 (D) |
0.11.1 (D) |
0.11.1 (D) |
0.11.1 (D) |
0.11.0 (D) |
0.11.0 (D) |
0.9.2 (D) |
0.9.2 (D) |
cray-dyninst |
12.3.6 (D) |
12.3.6 (D) |
12.3.6 (D) |
12.3.5 (D) |
12.3.5 (D) |
12.3.5 (D) |
12.3.4 (D) |
12.3.4 (D) |
12.3.2 (D) |
12.3.2 (D) |
cray-fftw |
3.3.10.11 (D) |
3.3.10.11 (D) |
3.3.10.11 (D) |
3.3.10.10 (D) |
3.3.10.10 (D) |
3.3.10.10 (D) |
3.3.10.9 (D) |
3.3.10.9 (D) |
3.3.10.8 (D) |
3.3.10.8 (D) |
cray-hdf5 |
1.14.3.7 (D) |
1.14.3.7 (D) |
1.14.3.7 (D) |
1.14.3.5 (D) |
1.14.3.5 (D) |
1.14.3.3 (D) |
1.14.3.3 (D) |
1.14.3.5 (D) |
1.14.3.1 (D) |
1.14.3.1 (D) |
cray-libsci |
25.09.0 (D) |
25.09.0 (D) |
25.09.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
24.11.0 (D) |
24.11.0 (D) |
24.07.0 (D) |
24.07.0 (D) |
cray-libsci-acc |
25.09.0 (D) |
25.09.0 (D) |
25.09.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
24.11.0 (D) |
24.11.0 (D) |
24.07.0 (D) |
24.07.0 (D) |
cray-lmod |
8.7.60 (D) |
8.7.60 (D) |
8.7.60 (D) |
8.7.55 (D) |
8.7.55 (D) |
8.7.55 (D) |
8.7.37 (D) |
8.7.37 (D) |
8.7.37 (D) |
8.7.37 (D) |
cray-modules |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
3.2.11.7 (D) |
cray-mpich |
8.1.33 |
8.1.33 |
8.1.33 |
8.1.32 (D) |
8.1.32 (D) |
8.1.32 (D) |
8.1.31 (D) |
8.1.31 (D) |
8.1.30 (D) |
8.1.30 (D) |
cray-mpich |
9.0.1 (D) |
9.0.1 (D) |
9.0.1 (D) |
9.0.0 |
9.0.0 |
9.0.0 |
||||
cray-mpixlate |
1.0.7 (D) |
1.0.7 (D) |
1.0.7 (D) |
1.0.6 (D) |
1.0.6 (D) |
1.0.5 (D) |
1.0.5 (D) |
|||
cray-mrnet |
5.1.6 (D) |
5.1.6 (D) |
5.1.6 (D) |
5.1.5 (D) |
5.1.5 (D) |
5.1.5 (D) |
5.1.4 (D) |
5.1.4 (D) |
5.1.3 (D) |
5.1.3 (D) |
cray-netcdf |
4.9.2.1 (D) |
4.9.2.1 (D) |
4.9.2.1 (D) |
4.9.0.17 (D) |
4.9.0.17 (D) |
4.9.0.17 (D) |
4.9.0.15 (D) |
4.9.0.15 (D) |
4.9.0.13 (D) |
4.9.0.13 (D) |
cray-openshmemx |
11.7.5 (D) |
11.7.5 (D) |
11.7.5 (D) |
1.12.3.17 (D) |
1.12.3.17 (D) |
1.12.3.17 (D) |
1.12.3.15 (D) |
1.12.3.15 (D) |
1.12.3.13 (D) |
1.12.3.13 (D) |
cray-papi |
7.2.0.2 (D) |
7.2.0.2 (D) |
7.2.0.2 (D) |
7.2.0.1 (D) |
7.2.0.1 (D) |
7.2.0.1 (D) |
7.1.0.4 (D) |
7.1.0.4 (D) |
7.1.0.2 (D) |
7.1.0.2 (D) |
cray-parallel-netcdf |
1.12.3.19 (D) |
1.12.3.19 (D) |
1.12.3.19 (D) |
1.12.3.17 (D) |
1.12.3.17 (D) |
1.12.3.17 (D) |
1.12.3.15 (D) |
1.12.3.15 (D) |
1.12.3.13 (D) |
1.12.3.13 (D) |
cray-pe-set-default |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
3.3 (D) |
cray-pmi |
6.1.16 (D) |
6.1.16 (D) |
6.1.16 (D) |
6.1.15 (D) |
6.1.15 (D) |
6.1.15 (D) |
6.1.15 (D) |
6.1.15 (D) |
6.1.15 (D) |
6.1.15 (D) |
cray-pmi-devel |
6.1.16 (D) |
6.1.16 (D) |
6.1.16 (D) |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
cray-pmi-doc |
6.1.16 (D) |
6.1.16 (D) |
6.1.16 (D) |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
6.1.15 |
cray-python |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
3.11.7 (D) |
cray-stat |
4.12.6 (D) |
4.12.6 (D) |
4.12.6 (D) |
4.12.5 (D) |
4.12.5 (D) |
4.12.5 (D) |
4.12.4 (D) |
4.12.4 (D) |
4.12.3 (D) |
4.12.3 (D) |
cray-zmqnet |
1.3.2 (D) |
1.3.2 (D) |
1.3.2 (D) |
1.3.0 (D) |
1.3.0 (D) |
1.3.0 (D) |
1.0.0 (D) |
1.0.0 (D) |
||
craype |
2.7.35 (D) |
2.7.35 (D) |
2.7.35 (D) |
2.7.34 (D) |
2.7.34 (D) |
2.7.34 (D) |
2.7.33 (D) |
2.7.33 (D) |
2.7.32 (D) |
2.7.32 (D) |
craype-dl-plugin-ftr |
22.06.1.2 (D) |
22.06.1.2 (D) |
22.06.1.2 (D) |
|||||||
craype-dl-plugin-py3 |
21.02.1.3 |
21.02.1.3 |
21.02.1.3 |
|||||||
craype-dl-plugin-py3 |
22.09.1 |
22.09.1 |
22.09.1 |
|||||||
craype-dl-plugin-py3 |
22.12.1 (D) |
22.12.1 (D) |
22.12.1 (D) |
|||||||
craype-targets-ex |
||||||||||
craypkg-gen |
1.3.36 (D) |
1.3.36 (D) |
1.3.36 (D) |
1.3.35 (D) |
1.3.35 (D) |
1.3.35 (D) |
1.3.34 (D) |
1.3.34 (D) |
1.3.33 (D) |
1.3.33 (D) |
forgesup |
24.1.1 |
24.1.1 |
24.1.1 |
|||||||
gdb4hpc |
4.16.5 (D) |
4.16.5 (D) |
4.16.5 (D) |
4.16.4 (D) |
4.16.4 (D) |
4.16.4 (D) |
4.16.3 (D) |
4.16.3 (D) |
4.16.2 (D) |
4.16.2 (D) |
intel |
2025.1 (D) |
2025.1 (D) |
2025.1 (D) |
2025.0 (D) |
2025.0 (D) |
2025.0 (D) |
2024.2 (D) |
2024.2 (D) |
2024.0 (D) |
2024.0 (D) |
lmod_scripts |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
3.2.1 (D) |
nvhpc |
24.3 (D) |
24.3 (D) |
||||||||
nvidia |
25.5 (D) |
25.5 (D) |
25.5 (D) |
24.3 (D) |
24.3 (D) |
24.3 (D) |
24.3 (D) |
24.3 (D) |
24.3 (D) |
24.3 (D) |
perftools |
25.09.0 (D) |
25.09.0 (D) |
25.09.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
25.03.0 (D) |
24.11.0 (D) |
24.11.0 (D) |
24.07.0 (D) |
24.07.0 (D) |
rocm |
6.4.1 (D) |
6.4.1 (D) |
6.4.1 (D) |
6.3.0 |
6.3.0 |
6.2.1 |
6.1.0 (D) |
6.1.0 (D) |
6.1.0 (D) |
6.1.0 (D) |
sanitizers4hpc |
1.1.6 (D) |
1.1.6 (D) |
1.1.6 (D) |
1.1.5 (D) |
1.1.5 (D) |
1.1.5 (D) |
1.1.4 (D) |
1.1.4 (D) |
1.1.3 (D) |
1.1.3 (D) |
totalviewsup |
2024.4.0 |
2024.4.0 |
2024.4.0 |
2024.1.21 |
2024.1.21 |
2024.1.21 |
2024.1.21 |
|||
valgrind4hpc |
2.13.6 (D) |
2.13.6 (D) |
2.13.6 (D) |
2.13.5 (D) |
2.13.5 (D) |
2.13.5 (D) |
2.13.4 (D) |
2.13.4 (D) |
2.13.3 (D) |
2.13.3 (D) |
Documentation and support
Documentation is available as a resource for using and managing CPE. This chapter provides details for obtaining CPE support and accessing available resources.
CPE installation and getting started guides
HPE CPE documentation comprises user and installation guides:
Title |
Document Part Number |
|---|---|
HPE Cray Supercomputing Programming Environment Installation Guide: CSM on HPE Cray Supercomputing EX Systems |
S-8003 |
HPE Cray Supercomputing Programming Environment Installation Guide: HPCM on HPE Cray Supercomputing EX and HPE Cray Supercomputing Systems |
S-8022 |
HPE Cray Supercomputing Programming Environment Installation Guide: HPE Cray XD2000 Systems |
S-8012 |
HPE Cray Supercomputing Programming Environment Getting Started User Guide: HPE Cray Supercomputing EX Systems |
S-9934 |
HPE Cray Supercomputing Programming Environment Getting Started Administrator Guide: HPE Cray Supercomputing EX Systems |
S-9935 |
Other documentation resources
HPE provides CPE documentation and support through various online sources:
Retrieve a range of HPE resources through the HPE Support Center, including access to support issues; the latest guides (as listed in CPE installation and getting started guides), including guide revisions; software download information; the HPE knowledge base; product information; and other resources.
To get the most out of CPE, access the CPE Online Documentation website for the initially released installation and Getting Started guides, as well as general user procedures, release announcements, and best practice manuals.
Important: Be sure to regularly check for guide revisions on the HPE Support Center. Revisions of installation and Getting Started guides that are posted to the HPE Support Center are presumed more current than those posted on the CPE Online Documentation website.
To search CPE articles, see the HPE Support Center listing of CPE-related Knowledge Articles.
For HPE Slingshot SHMEM download and installation information, refer to the HPE Slingshot SHMEM Software Installation Guide.
Join the CPE #hpe-cray-programming-environment Slack channel through the HPE Developer Community Slack web page for interactive, collaborative CPE discussions.
Access CCE help using CCE module commands:
man craycc or man crayCC - Returns HPE Cray C and C++ compiler man pages. (Alias for man clang.)
craycc --help - Returns a summary of the command line options and arguments.
man crayftn - Returns HPE Cray Fortran compiler man pages.
crayftn --help - Returns a summary of the command line options and arguments.
The complete Clang reference manual is included in HTML format in the /opt/cray/pe/cce/<version>.0.0/doc/html/index.html file system location. Note that the man page is presumed to be more current if content differences exist.
For CPE software installation and update information, as well as general CPE information, see My HPE Software Center.
Access the HPE Cray Supercomputing Programming Environment Software QuickSpecs online.
Access third-party documentation resources online, including:
Glossary
This section provides a listing of CPE general terms and definitions.
A
Adaptive Routing (AR): A technology that dynamically selects the best path for data packets in a network to improve performance and fault tolerance.
Apprentice3: A performance analysis tool that provides a graphical interface for visualizing performance data collected by HPE CrayPAT.
Command: app3
Module: module load app3
B
Batch System: Software that manages and schedules jobs on a supercomputer, ensuring efficient use of computational resources.
C
Cache Optimization: Techniques for optimizing data structures and algorithms to take advantage of cache locality to improve performance.
CCE (Cray Compiling Environment): HPE Cray’s native compiler suite for C, C++, and Fortran, optimized for Cray hardware.
Commands:
cc for C
CC for C++
ftn for Fortran
CrayPAT (Cray Performance Analysis Tools): A suite of tools for collecting and analyzing performance data of parallel applications.
Commands:
pat_build to instrument an application
pat_report to generate a performance report
Module: module load perftools
D
DataWarp: A technology for accelerating I/O by using SSD-based storage to provide a high-speed buffer between compute nodes and the parallel file system.
Distributed Debugging Tool (DDT): A specialized debugger for debugging parallel applications, including MPI and OpenMP programs. Allows developers to determine the performance state of processes running together across cluster nodes. CPE supports the integration of DDTs, such as Perforce TotalView and Allinea DDT.
Command: ddt
Module: module load ddt
E
Environment Groups: Logical groupings of environment variables and module settings to simplify switching between different development environments.
Commands:
envmgr activate <group_name>
envmgr deactivate <group_name>
Environment Variables: Variables used to configure the runtime environment, such as PATH, LD_LIBRARY_PATH and MODULEPATH.
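As a rough illustration of how a PATH-style variable is typically extended, the following Python sketch prepends a directory to a separator-delimited value and drops any duplicate entry. The function name `prepend_path` and the directory `/opt/example/tool/1.0/bin` are hypothetical and chosen only for this example; this is not how the modules system is implemented.

```python
import os


def prepend_path(value, new_dir, sep=os.pathsep):
    """Return a PATH-style string with new_dir first and no duplicate of it."""
    # Drop empty entries and any earlier occurrence of new_dir.
    entries = [e for e in value.split(sep) if e and e != new_dir]
    return sep.join([new_dir] + entries)


if __name__ == "__main__":
    # Hypothetical install directory, used only for illustration.
    demo_dir = "/opt/example/tool/1.0/bin"
    print(prepend_path(os.environ.get("PATH", ""), demo_dir))
```

The same helper applies to LD_LIBRARY_PATH or MODULEPATH, because all three hold lists of directories joined by the same separator.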
F
File Striping: A method of dividing a file into segments and distributing them across multiple disks to improve I/O performance.
Command: lfs setstripe -S 1M -c -1 <path>
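The idea behind striping can be sketched with simple arithmetic. The Python example below assumes a generic round-robin layout in which stripe-sized chunks of a file are dealt out in turn across a fixed number of storage objects. The function name `stripe_location` is hypothetical, and the sketch is a conceptual model only, not the file system's actual implementation.

```python
def stripe_location(offset, stripe_size, stripe_count):
    """Map a file byte offset to (object_index, offset_within_object).

    Assumes a plain round-robin layout: chunk 0 goes to object 0,
    chunk 1 to object 1, and so on, wrapping after stripe_count objects.
    """
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; sizes and counts must be > 0")
    chunk = offset // stripe_size      # which stripe-sized chunk of the file
    obj = chunk % stripe_count         # object that receives this chunk
    row = chunk // stripe_count        # full rounds completed before this chunk
    return obj, row * stripe_size + offset % stripe_size


if __name__ == "__main__":
    MIB = 1 << 20
    # With a 1 MiB stripe size over 4 objects, show where a few offsets land.
    for off in (0, MIB, 4 * MIB + 5):
        print(off, stripe_location(off, MIB, 4))
```

Under this model, consecutive 1 MiB chunks land on different objects, so a large sequential read or write can be serviced by several storage targets at once.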
Finite Element Analysis (FEA): A computational technique used to approximate solutions to complex structural engineering problems.
FFTW (cray-fftw): An optimized and scalable library for computing Fast Fourier Transforms (FFTs) on HPE Cray EX Supercomputing systems, facilitating efficient FFT computations for various scientific and engineering applications.
Module: module load cray-fftw; gcc -o my_fft_program my_fft_program.c -lfftw3
G
GCC (GNU Compiler Collection): A widely used alternative compiler suite that supports various programming languages.
Commands:
gcc for C
g++ for C++
gfortran for Fortran
gdb4hpc (HPE Cray gdb-based HPC Debugger): Advanced HPC debugger for complex applications at scale.
Command: gdb4hpc
Module: module load gdb4hpc
H
HDF5 (cray-hdf5 and cray-hdf5-parallel): A data model, library, and file format for storing and managing large amounts of data.
Module: module load cray-hdf5
Hybrid Parallel Programming: Combining MPI with OpenMP or other parallel programming models to leverage both inter-node and intra-node parallelism.
Huge pages: A Linux kernel feature that lets the operating system manage memory in chunks larger than the default 4 KB pages. Used to improve the efficiency of virtual memory systems.
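A small calculation shows why larger pages reduce bookkeeping. The Python sketch below counts the pages needed to back a memory region at two page sizes. The 2 MiB huge page size is an assumption (a common size on x86-64 Linux); the sizes available on a given system may differ.

```python
def pages_needed(nbytes, page_size):
    """Return the number of pages required to cover nbytes (rounded up)."""
    if nbytes < 0 or page_size <= 0:
        raise ValueError("nbytes must be >= 0 and page_size must be > 0")
    return -(-nbytes // page_size)  # ceiling division on integers


if __name__ == "__main__":
    region = 1 << 30                 # a 1 GiB region
    base_page = 4 * 1024             # default 4 KB page
    huge_page = 2 * 1024 * 1024      # assumed 2 MiB huge page
    print("4 KB pages:", pages_needed(region, base_page))
    print("2 MiB pages:", pages_needed(region, huge_page))
```

Fewer pages for the same region means fewer address translations to track, which is the efficiency gain the definition refers to.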
I
Intel Compiler: A suite of compilers optimized for Intel architectures.
Commands:
icc for C
icpc for C++
ifort for Fortran
J
Job Arrays: A method to submit multiple similar jobs using a single job script.
Slurm Command: sbatch --array=0-9 my_job_script.sh
PBS Command: qsub -J 0-9 my_job_script.sh
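Each task in a job array usually selects its own work item from the index the scheduler passes in through the environment. The Python sketch below assumes the index arrives in `SLURM_ARRAY_TASK_ID` (Slurm) or `PBS_ARRAY_INDEX` (PBS Professional); other PBS variants may use a different variable name. The helper names and the `input_N.dat` file names are hypothetical.

```python
import os


def array_task_index(env=None, default=0):
    """Return this task's array index, or default when not run in an array."""
    env = os.environ if env is None else env
    # Assumed variable names: Slurm first, then PBS Professional.
    for name in ("SLURM_ARRAY_TASK_ID", "PBS_ARRAY_INDEX"):
        if name in env:
            return int(env[name])
    return default


def input_for_task(index, inputs):
    """Pick the work item that belongs to the given array index."""
    return inputs[index]


if __name__ == "__main__":
    # Hypothetical input files, one per array task (indices 0-9).
    inputs = [f"input_{i}.dat" for i in range(10)]
    idx = array_task_index()
    print(f"task {idx} processes {input_for_task(idx, inputs)}")
```

A job script submitted as an array with indices 0-9 could invoke this program once per task, and each task would then process a different input file.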
Job Scheduler: A system that manages and schedules jobs on a supercomputer, ensuring efficient use of computational resources.
Slurm: sbatch, squeue, scancel
PBS: qsub, qstat, qdel
L
Low Level Virtual Machine (LLVM): An LLVM Foundation compiler and toolchain technology used to build compilers, debuggers, and other software development tools. In CPE, LLVM is specialized and used together with Clang to generate optimized code for improved performance. HPE Clang C and C++ is based on Clang/LLVM. See the HPE Cray Clang C and C++ Quick Reference documentation for information on HPE Clang C and C++, the Clang documentation for more information on Clang, or the LLVM documentation for more information on LLVM.
Lustre: A type of parallel distributed file system, primarily used for large-scale cluster computing.
Command: lfs
LibSci (cray-libsci): A collection of scientific libraries optimized for Cray systems, including LAPACK, BLAS, and ScaLAPACK.
Module: module load cray-libsci
LibSci_ACC (cray-libsci_acc): An extension of HPE Cray LibSci that includes GPU-accelerated versions of mathematical routines, leveraging GPU hardware to improve performance in scientific computations on HPE Cray EX Supercomputing systems equipped with GPUs.
Module: module load cray-libsci_acc; nvcc -o my_gpu_program my_gpu_program.cu -L${CRAY_LIBSCI_ACC_PREFIX_DIR}/lib -lsci_acc
Lmod: A Lua-based module management software tool.
M
Makefile: A file containing a set of directives used by the make build automation tool to compile and link programs.
Command: make
Modules: A system for dynamically modifying user environments through modulefiles. Modules can be loaded and unloaded to manage different software packages and versions.
Commands:
module load <module_name>
module unload <module_name>
module avail
MPI (Message Passing Interface): A standard for parallel programming that allows processes to communicate with each other by sending and receiving messages.
Common Functions: MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv
N
NetCDF (cray-netcdf and cray-netcdf-hdf5parallel): Libraries supporting the creation, access, and sharing of array-oriented scientific data in the Network Common Data Form (NetCDF), offering parallel I/O support to improve performance and scalability on large-scale HPE Cray EX Supercomputing systems.
Module: module load cray-netcdf; gcc -o my_netcdf_program my_netcdf_program.c -lnetcdf
NUMA (Non-Uniform Memory Access): An architecture where memory access time depends on the memory location relative to the processor.
O
OpenACC (for Fortran): A directive-based parallel programming model for offloading computations to GPUs.
Command: ftn -hacc -o my_program my_program.f90
Directives: !$acc parallel, !$acc kernels
OpenMP: An API for parallel programming that supports multi-platform shared memory and GPU parallel programming.
Common Directives: #pragma omp parallel, #pragma omp for, #pragma omp critical, #pragma omp barrier
P
Parallel NetCDF (cray-parallel-netcdf): A high-performance parallel I/O library for NetCDF files, enabling efficient handling and management of large, distributed data sets in scientific applications running on HPE Cray EX Supercomputing systems.
Module: module load cray-parallel-netcdf; gcc -o my_pnetcdf_program my_pnetcdf_program.c -lpnetcdf
PBS (Portable Batch System): A job scheduler used on some HPE Cray EX Supercomputing systems.
Commands: qsub, qstat, qdel
Profile-Guided Optimization (PGO): Using profiling data to guide optimizations. Involves:
Compiling with profiling enabled: cc -h profile_generate -o my_program my_program.c
Running the program to generate profile data.
Recompiling with profile data: cc -h profile_use -o my_program my_program.c
R
Resource Constraints: Specify memory, CPU, and other resource constraints for job scheduling.
Slurm Command: sbatch --mem=4G --cpus-per-task=8 my_job_script.sh
PBS Command: qsub -l mem=4G,ncpus=8 my_job_script.sh
S
Slurm (Simple Linux Utility for Resource Management): A job scheduler used on many HPE Cray systems.
Commands:
sbatch: Submit a job script.
squeue: Check the status of jobs.
scancel: Cancel a job.
T
TensorFlow: An open-source platform for machine learning.
Module: module load tensorflow
U
User Access Node (UAN): A critical component that acts as a “gateway” to the supercomputer. It is a dedicated server or node where you log in to interact with the system, submit jobs, manage files, and perform development tasks. High-performance compute nodes (the powerful “brain” of the supercomputer) are not directly accessed for these activities; instead, you use the UAN to prepare your work.
UAN Key Features:
Development Environment: The UAN provides tools for coding, compiling, debugging, and optimizing your programs. It is where you set up applications before running them on the compute nodes.
Job Submission: From the UAN, submit workloads (such as simulation or analysis tasks) to the job scheduler, which then runs tasks on the compute nodes.
File Management: The UAN is where you can access and manage files stored in the system.
Access Point: Users connect to the UAN through protocols like SSH (Secure Shell) to securely log in and work on the supercomputer.
The UAN serves as the central point for interaction with the larger computing system.
V
Vectorization: Techniques for optimizing code to take advantage of vector instructions.
Compiler Flags: -h vector3
Directives: #pragma ivdep
W
Workload Managers: Software that orchestrates the execution of jobs in a high-performance computing environment. Examples include Slurm and PBS.
Published: April 2026