Copyright and Version
© Copyright 2022-2025 Hewlett Packard Enterprise Development LP. All third-party marks are the property of their respective owners.
CPE: 25.09-LocalBuild
Doc git hash: 9bc237b81fef60fa958b671c3018c2a3bfd223f0
Generated: Mon Sep 29 2025
Record of revision
This chapter provides a record of updates and revisions to this guide.
Release updates
New in this 25.09 release
Added the Example: Running and compiling a deep learning job using Horovod section.
Updated the Performance profiling and debugging commands section.
Updated the Running, compiling, analyzing, and debugging the job procedure in the Example: Running a parallel MPI automotive-based CFD job chapter.
Updated the Debugging and reviewing the performance of your job procedure in the Example: Running, compiling, and debugging a sample Fortran health industry job chapter.
Removed references to Apprentice2 in the:
Example: Choosing a library for a job and accessing Apprentice3 reporting functions chapter.
Getting Started key CPE terms section.
Environment setup tools section.
Running sample jobs introduction.
Reviewing reports introduction.
Updated the title of this guide to HPE Cray Supercomputing Programming Environment Getting Started User Guide: HPE Cray Supercomputing EX Systems (25.09) S-9934.
Updated the Documentation and support chapter.
Added the Supported systems chapter.
Incorporated minor updates.
New in this 25.03 release
Issued the first version of HPE Cray Programming Environment Getting Started User Guide: HPE Cray Supercomputing EX Systems (25.03) S-9934.
Revision history
| Publication Title | Date |
|---|---|
| HPE Cray Supercomputing Programming Environment Getting Started User Guide: HPE Cray Supercomputing EX Systems (25.09) S-9934 | September 2025 |
| HPE Cray Programming Environment Getting Started User Guide: HPE Cray Supercomputing EX Systems (25.03) S-9934 | March 2025 |
Conventions
Typographic conventions
This style indicates program code, reserved words, library functions, command-line prompts, screen output, file/path names, variables, and other software constructs.
\ (backslash): At the end of a command line, indicates the Linux shell line continuation character (lines joined by a backslash are parsed as a single line).
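For example, the following command, split across two lines with a backslash (the source and output file names are placeholders), is parsed as a single line:
cc -O2 -o my_app \
   my_app.c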
Command prompt conventions
Host name and account in command prompts: The host name in a command prompt indicates where the command must be run. The account that must run the command is also indicated in the prompt.
The root or super-user account always has the # character at the end of the prompt. Any non-root account is indicated with account@hostname>. A nonprivileged account is referred to as user.
Node abbreviations: The following list contains abbreviations for nodes used in command prompts.
CN - Compute Nodes
NCN - Non-Compute Nodes
AN - Application Node (special type of NCN)
UAN - User Access Node (special type of AN)
Command prompts: The following list contains command prompts used in this guide.
ncn-m001# - Run the command as root on the specific NCN-M (NCN that is a Kubernetes master node) with hostname ncn-m001.
ncn-w001# - Run the command as root on the specific NCN-W (NCN that is a Kubernetes worker node) with hostname ncn-w001.
uan01# - Run the command on a specific UAN.
cn# - Run the command as root on any CN. Note that a CN has a hostname of the form nid123456 (that is, "nid" and a six-digit, zero-padded number).
pod# - Run the command as root within a Kubernetes pod.
Copying and pasting text from this document
Using the Copy and Paste functions from a PDF is unreliable. Although copying and pasting a command line typically works, copying and pasting formatted file content (for example, JSON, YAML) typically fails. To ensure that file content is copied and pasted correctly while performing the procedures in this guide:
Copy the content from the PDF.
Paste it to a neutral editing form and add the necessary formatting.
Copy the content from the neutral form and paste it into the console.
Tip: It is always a good idea to double-check copied/pasted commands for correctness, as some commands may not render correctly in the PDF.
About the HPE Cray Supercomputing Programming Environment
Welcome to the HPE Cray Supercomputing Programming Environment (CPE) Software suite, a complete application development and lifecycle software solution. CPE, offered in an integrated and user-friendly environment, provides a suite of programmer tools and libraries that support the development, optimization, and execution of high performance computing (HPC) applications for HPE Cray Supercomputing EX systems. These systems comprise multiple components, including compute nodes, high-speed interconnects, storage solutions, cooling and power infrastructure, comprehensive system management software, security features, and other integral components and tools. CPE enables scientists, researchers, engineers, and other users to effectively leverage the advanced capabilities of these systems. Combined, CPE and its compatible systems provide for the computational needs of developed applications. Furthermore, these solutions deliver the performance, scalability, and flexibility required for HPC applications.
This guide provides background information about CPE. It provides details for logging into the system, and introductory information for running and debugging jobs using CPE and its tools and toolkits. Also included are example jobs, information about additional CPE documentation, and details for accessing support information through HPE. This guide is intended for users who are initially using CPE and want to become familiar with its basic functions following its successful installation.
For the latest version and revisions of this CPE guide, go to the HPE Support Center website, and perform a search on the part number of this document (S-9934). For additional information on how to use CPE or details regarding CPE components and modules, see the CPE Online Documentation website. See also the Documentation and support chapter for additional CPE resources and information.
About the CPE software suite
CPE comprises a set of tools and toolkits that collectively provide a comprehensive environment for developing, optimizing, and running high performance applications on HPE Cray Supercomputing EX systems. The CPE Software suite includes:
HPE Cray Compiling Environment (CCE)
HPE Cray Debugging Support Tools
HPE Cray Environment (CENV) Setup and Compiling Support Tools
HPE Cray Message Passing Toolkit (MPT)
HPE Cray Performance, Measurement, and Analysis Tools (CPMAT)
HPE Cray Scientific and Math Libraries (CSML)

The suite of CPE tools and libraries, in conjunction with third-party software, allows you to write, develop, compile, run, debug, port, and optimize high performance applications on HPE Cray EX supercomputers.
| Tool/Toolkit Name | What it is | Description |
|---|---|---|
| CCE | A suite of compilers optimized for HPE Cray Supercomputing EX systems, including support for languages such as C, C++, and Fortran. | Compiles your code into executable programs that take full advantage of the architecture and capabilities of HPE Cray Supercomputing EX systems. CCE is designed to generate highly optimized code, ensuring that your applications run efficiently. |
| Debugging Tools | A set of tools for diagnosing and troubleshooting issues in your code. | Helps you identify and fix bugs in your applications. These tools provide features, such as breakpoints, variable inspection, and call stack tracing, which are essential for debugging complex parallel applications. |
| CENV | Tools and utilities designed to help users configure their programming environment and manage the compilation of their applications. | Simplifies the setup and configuration of CPE on HPE Cray Supercomputing EX systems. These tools help ensure that the necessary libraries, compilers, and environment variables are correctly set up, making it easier for users to compile and run their applications efficiently. |
| MPT | A set of libraries and tools that assist in the development of parallel applications using the Message Passing Interface (MPI) standard. | Enables efficient communication between multiple processes running on different nodes of an HPE Cray Supercomputing EX system, which is crucial for HPC applications. This toolkit supports scalable and high performance data exchange, essential for tasks that require coordination and data sharing among numerous processors. |
| CPMAT | A collection of tools designed to help you measure, analyze, and optimize the performance of your applications running on HPE Cray systems. | Ensures that your applications are running efficiently by identifying performance bottlenecks and providing insights into how to improve computational performance. This suite includes tools for profiling, tracing, and in-depth performance analysis. |
| CSML | A collection of high performance mathematical and scientific libraries. | Provides pre-optimized routines for common mathematical and scientific computations, such as linear algebra, fast Fourier transforms, and more. These libraries help you achieve better performance and accuracy in your scientific applications without having to develop complex algorithms from scratch. |
After developing code in programming languages like Fortran, C, or C++, you can then use HPE-optimized compilers to convert your code into executable programs. Additionally, you can use CPE tools to test the performance of, streamline, and debug your applications. With CPE, you manage your software environment by:
Using its various modules,
Submitting jobs to the job scheduler and running applications on HPE Cray EX supercomputers, and
Using debugging and performance analysis tools.
CPE components allow you to run applications efficiently and correctly.
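With CPE, a typical session might resemble the following sketch; the module, file, and job names shown are placeholders, and the job submission commands assume a SLURM-managed system:
module load PrgEnv-cray
cc -o my_app my_app.c
sbatch my_job.sh
squeue -u your_username
Here, the module command prepares the environment, cc compiles the application with the CPE compiler driver, and sbatch and squeue submit and monitor the batch job.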
CPE usage
CPE is used by a diverse range of users who develop a wide array of applications, particularly in HPC environments. While CPE administrators provide systems oversight and operational support, developers use CPE to develop applications that require significant computational power and are often involved in scientific research, engineering, and data analysis.
| User Category | Application Areas | Use Cases | Example |
|---|---|---|---|
| Scientists and Researchers | Computational physics, chemistry, biology, climate science, and astrophysics. | Running large-scale simulations, modeling complex phenomena, processing large datasets, and performing data-intensive calculations. | Simulating molecular dynamics, climate modeling, and astrophysical simulations. |
| Engineers | Aerospace, automotive, civil engineering, and materials science. | Performing finite element analysis (FEA), computational fluid dynamics (CFD), structural analysis, and optimizing engineering designs. | Simulating airflow over aircraft wings, stress testing of materials, and optimizing engine designs. |
| Data Scientists | Big data analytics, machine learning, and artificial intelligence. | Analyzing large datasets, training machine learning models, and performing predictive analytics. | Analyzing genomic data, training learning models for image recognition, and predicting financial market trends. |
| HPC Application Developers | Development of HPC-specific software and libraries. | Creating and optimizing parallel algorithms, developing scientific libraries, and enhancing performance of existing applications. | Developing parallel versions of linear algebra libraries, optimizing MPI-based applications, and creating scalable software for HPC systems. |
| Academic Users | Various academic disciplines that require computational support. | Conducting research projects, running educational simulations, and teaching HPC concepts. | Running simulations for academic research, conducting computational experiments, and providing hands-on HPC training for students. |
| Government and Defense Researchers | National security, defense simulations, and policy modeling. | Conducting simulations related to defense, modeling scenarios for policy analysis, and running classified computations. | Simulating defense systems, analyzing security scenarios, and modeling the impact of policy decisions. |
| Healthcare and Bioinformatics Professionals | Genomics, proteomics, and healthcare analytics. | Analyzing genetic data, simulating biological processes, and developing healthcare models. | Genome sequencing, protein folding simulations, and analyzing patient data for healthcare insights. |
| Weather and Climate Scientists | Meteorology, climate modeling, and environmental science. | Running weather prediction models, studying climate change effects, and analyzing environmental data. | Forecasting weather patterns, simulating climate scenarios, and analyzing environmental impact data. |
| Financial Analysts | Quantitative finance, risk modeling, and financial simulations. | Running financial risk models, developing trading algorithms, and simulating market scenarios. | Analyzing market data for risk assessment, developing high-frequency trading algorithms, and simulating financial crises. |
| Industrial and Energy Sector Engineers | Oil and gas exploration, renewable energy, and manufacturing. | Simulating reservoir models, optimizing energy production, and improving manufacturing processes. | Simulating oil reservoir behavior, optimizing wind turbine designs, and modeling manufacturing workflows. |
CPE key features
Key CPE benefits and features include:
Performance Optimization: CPE is designed to exploit the full potential of HPE Cray Supercomputing EX hardware, providing high performance compilers, libraries, and tools that optimize application performance.
Ease of Use: The integrated environment and comprehensive documentation make it easier for you to develop, debug, and optimize applications.
Scalability: CPE supports scalable programming models and tools that allow you to develop applications that can run efficiently on large-scale HPE Cray Supercomputing EX systems.
Flexibility: Support for multiple programming languages, compilers, and parallel programming models provides flexibility for you to choose the best tools for your specific needs.
Advanced Analysis and Debugging: Sophisticated performance analysis and debugging tools help you identify bottlenecks, optimize code, and ensure accurate parallel applications.
Efficiency and Productivity: CPE provides a comprehensive suite of optimized tools and libraries. These components reduce the time and effort needed to develop, debug, and optimize HPC applications.
Interoperability: CPE allows for multiple programming languages and standards, making it easier to integrate existing code bases and workflows.
Regardless of the application, CPE supports HPE Cray Supercomputing EX systems and user applications, and its components can be used in multiple ways. For example:
| For: | CPE Provides: |
|---|---|
| Researchers and Scientists | - Optimization: Compilers, libraries, and tools that are designed to maximize computational efficiency on HPE Cray Supercomputing EX hardware, speeding up simulations and calculations. - Debugging: Tools like the CDST help identify and fix bugs in complex scientific codes, ensuring accurate and reliable results. - Performance Tuning: Profiling and tracing tools (for example, CPMAT) provide detailed insights into performance bottlenecks, allowing researchers to fine-tune their applications for better efficiency. |
| Data Scientists | - Accelerated Data Processing: High performance libraries for numerical computations and data analytics streamline data processing tasks, making it easier to handle large datasets. - Deep Learning Capabilities: The ability to integrate specialized tools for developing, training, and deploying deep learning models significantly reduces training times and improves model performance. - Scalability: Support for scaling data-intensive tasks across multiple nodes, enabling the processing of large volumes of data more effectively. |
| Application Developers | - Comprehensive Development Tools: Access to optimized compilers, debugging tools, and performance analyzers simplifies the development lifecycle, from writing code to optimizing it for performance. - Parallel Programming: Libraries and tools for MPI and OpenMPI support the development of parallel applications that are essential for leveraging the computational power of supercomputers. - Environment Management: Utilities for setting up and managing the development environment ensure that all dependencies and configurations are handled efficiently. |
| Academia | - Educational Resources: Extensive documentation, tutorials, and examples make it easier for students and researchers to learn and use HPC tools. - Multi-Disciplinary Applications: Support for a broad range of scientific and engineering disciplines, making it versatile for various academic research projects. - Student Projects: Accessible tools for developing and running large-scale simulations, enabling students to tackle real-world computational problems. |
| Government and Defense Researchers | - Security and Compliance: Features that ensure the secure handling of sensitive data and adherence to compliance requirements. - Simulation and Modeling: Optimization capabilities for running complex simulations critical for defense and policy planning, from cryptography to strategic simulations. - Performance and Reliability: High reliability and performance tools to ensure that critical applications run smoothly and efficiently. |
| Weather and Climate Scientists | - High-Resolution Modeling: Tools and libraries optimized for running high-resolution weather models and climate simulations, enabling more accurate predictions. - Data Analysis and Visualization: Analysis support for large datasets generated by climate models, including tools for visualizing complex data patterns. - Scalable Solutions: The ability to scale computations across thousands of nodes, essential for simulating intricate weather patterns and long-term climate changes. |
CPE tools and libraries enable users to extend research and development initiatives and, in turn, advance innovations.
About the CPE hardware, HPCM, and CSM computing environment
Using the CPE software suite with HPE Cray Supercomputing EX hardware and either HPE Performance Cluster Management (HPCM) software or HPE Cray System Management (CSM) software provides a robust and efficient high performance computing environment. First, HPE Cray Supercomputing EX hardware comprises:
Powerful compute nodes
High-speed interconnects
Advanced storage solutions
Next, HPCM manages and surveils a large network of servers designed to work efficiently together. Alternatively, CSM manages the extreme scale and unique needs of HPE Cray Supercomputing EX systems. Combined, these components, working in their respective environments, offer a well-rounded HPC solution.
HPCM allows for efficient resource allocation and job scheduling, ensuring that computational tasks are distributed and executed optimally across HPE Cray EX supercomputing nodes. Using CPE with HPCM offers:
Flexibility: For environments with a variety of computing resources (not just supercomputers), HPCM can manage all these different systems efficiently.
Versatility: HPCM is good for a broad range of HPC tasks, making it versatile for different kinds of computational jobs.
Integrated Management Functions: It provides a unified way to manage both your high performance clusters and other computing resources, making it easier to keep everything running smoothly.
CSM manages the overall system health, monitoring hardware status, managing firmware and software updates, and providing diagnostic tools to maintain system stability and performance. Using HPE CPE with CSM offers:
Supercomputer Optimization: CSM is finely tuned for HPE Cray Supercomputing EX systems. This capability allows it to efficiently manage the special needs and large-scale composition of these powerful machines.
Advanced Features: CSM offers advanced and specialized features for HPE Cray Supercomputing EX systems (for example, better job scheduling, resource management, and system monitoring).
Optimal Performance: If you need the best possible performance from HPE Cray Supercomputing EX systems, CSM provides more specialized optimization tools.
Together, these CPE components create a cohesive environment that allows you to develop and run high performance applications, use hardware resources efficiently, and maintain system reliability and performance. Hardware components deliver the computational power needed for demanding scientific and engineering applications. The CPE software suite offers an integrated set of tools and libraries for developing, compiling, debugging, and optimizing applications tailored to leverage the HPE Cray Supercomputing EX architecture. HPCM and CSM are uniquely designed to address, manage, and monitor specific system requirements.
Getting Started key CPE terms
Key terms to know relative to this guide include:
| Term | Definition |
|---|---|
| Compiler | A tool that translates source code written in a programming language (for example, C, C++, Fortran) into machine code that can be executed by the computer. Commands: |
| Compute Node | A single server within the supercomputer that performs computation. Executables run on compute nodes. Each node typically has multiple CPUs, memory, and often one or more GPUs. |
| Core | A single processing unit within a CPU or GPU. Modern CPUs/GPUs have multiple cores, allowing them to perform multiple tasks simultaneously. |
| Debugging | The process of identifying and fixing errors or bugs in a program. Tools: HPE Cray DDT, gdb4hpc. |
| Environment Variable | A variable that stores information about the operating system environment and is used to configure software behavior. Example: |
| Job Scheduler | A software tool that manages the allocation of computational resources and schedules jobs to run on the supercomputer. Example: Simple Linux Utility for Resource Management (SLURM). |
| Job Script | A script that contains instructions for the job scheduler on how to run a specific job, including resource requests and application execution commands. Example: A shell script with |
| Library | A collection of precompiled routines that programs can use. Libraries provide functionality to applications without the need to write code from scratch. Example: HPE Cray LibSci (scientific and mathematical libraries). |
| Message Passing Interface (MPI) | A standardized and portable messaging system designed to allow processes to communicate with each other in parallel computing environments. |
| Module System | A system for managing and configuring the user environment by loading, unloading, and switching between different software packages and versions. With CPE, use either the Tool Command Language (Tcl) or Lua Module System (Lmod). Commands: |
| MPI Compiler Wrappers | Wrapper scripts that simplify the compilation of MPI programs by automatically including the necessary MPI libraries and include paths. Commands: |
| OpenMP | An API that supports multi-platform shared memory and GPU parallel programming in C, C++, and Fortran. |
| Parallel Computing | A type of computation in which many calculations or processes are carried out simultaneously, leveraging multiple compute resources. |
| Profiling | The process of measuring the performance of an application to identify bottlenecks and optimize code. Tools: HPE CrayPAT, HPE Cray Apprentice3. |
| Resource Allocation | The process of assigning computational resources (such as CPUs, GPUs, memory, and nodes) to a job. |
|  | A command used to submit a job script to the job scheduler. Example: |
|  | A command used to cancel a specific job in the job scheduler. Example: |
|  | A command used to view the status of jobs in the queue. |
|  | A command used to run a parallel job interactively or within a job script. Example: |
| User Access Node | A critical component in CPE that serves as the primary interface between users and the supercomputer system. The compiler runs on the UAN. |
Logging into the CPE system
After you have been provided access credentials to the CPE Software suite, you can then log in to HPE systems to use CPE. This chapter provides information on how to log in to HPE systems. Specifically, you can log in to the system through:
A login node,
A User Access Instance (UAI),
A User Access Node (UAN), or
Remote desktop access.
Each login method has its own use cases and benefits. For example:
Direct Login Node access provides:
Direct access to the HPE Cray Supercomputing EX system.
Standard environment for job preparation and submission.
The best scenario for direct access to resources.
User Access Instance (UAI) access provides:
Customizable and isolated environment.
Flexibility for specific software and workflows.
An ideal scenario for tailored user needs.
User Access Node (UAN) access provides:
Secure and user-friendly interface.
Handles multiple user sessions.
A suitable environment for general development and job management.
Remote desktop access provides:
A graphical user interface (GUI).
Useful environment for data visualization and interactive development.
An ideal scenario for users preferring a GUI.
Consult with your CPE administrator to determine which method is appropriate for your environment.
Logging into CPE through a login node
Logging into the HPE Cray Supercomputing EX system involves using the secure shell (SSH) protocol to connect to the system login nodes. To do so:
If you have not already done so, contact your CPE administrator to obtain your login credentials, which typically include a username and password.
If the system uses SSH key-based authentication, obtain an SSH key pair (public and private keys).
If using SSH keys, ensure that your CPE administrator adds your public key to the ~/.ssh/authorized_keys file on the HPE Cray Supercomputing EX system. This step is usually performed by the system administrator.
Open a terminal or command prompt on your local machine. This is where you enter the SSH command to connect to the HPE Cray Supercomputing EX system.
Use the SSH command to connect to the login node of the HPE Cray Supercomputing EX system. The general syntax for the SSH command is:
ssh username@login-node-address
Replace username with your actual username and login-node-address with the hostname or IP address of the login node. Examples:
Password-based authentication:
ssh your_username@cray-ex-login.example.com
After entering this command, you are prompted to enter your password.
Key-based authentication:
ssh -i /path/to/your/private_key your_username@cray-ex-login.example.com
Replace /path/to/your/private_key with the path to your private SSH key. If your public key has been added to ~/.ssh/authorized_keys on the login node, you are granted access without a password prompt.
If required, complete the two-factor authentication (2FA). Follow the prompts to enter your second factor, which might be a code from a mobile app, a hardware token, or a text message.
After successful login, you will have access to the login node of the HPE Cray Supercomputing EX system. You should see a command prompt indicating that you are now on the remote system.
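Optionally, you can store these connection details in your local ~/.ssh/config file so that a short alias is enough to connect; the host alias, host name, user name, and key path below are placeholder values:
Host cray-ex
    HostName cray-ex-login.example.com
    User your_username
    IdentityFile ~/.ssh/id_ed25519
With this entry in place, ssh cray-ex connects to the login node.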
Logging into CPE through a User Access Instance (UAI)
Logging into the HPE Cray Supercomputing EX system through a UAI involves using an intermediate node that provides a more user-friendly and secure interface for accessing the supercomputer. To log in:
If you have not already done so, contact your CPE administrator to obtain your login credentials, which typically include a username and password.
If the system uses SSH key-based authentication, obtain an SSH key pair (public and private keys).
Obtain the hostname or IP address of the UAI from your CPE administrator and access the UAI. The UAI acts as a gateway to the CPE Supercomputing EX system.
Set up SSH keys (if applicable). If you use SSH keys, ensure that your CPE administrator has added your public key to the ~/.ssh/authorized_keys file on the UAI.
Open a terminal or command prompt on your local machine, and then enter the SSH command to connect to the UAI.
Use the SSH command to connect to the UAI. The general syntax for the SSH command is:
ssh username@uai-address
Replace username with your actual username and uai-address with the hostname or IP address of the UAI. For example:
ssh your_username@uai.example.com
When prompted, enter your password. For key-based authentication, enter:
ssh -i /path/to/your/private_key your_username@uai.example.com
Replace /path/to/your/private_key with the path to your private SSH key. If your public key has been added to ~/.ssh/authorized_keys on the UAI, you are granted access without a password prompt.
If required, complete the two-factor authentication (2FA). Follow the prompts to enter your second factor, which might be a code from a mobile app, a hardware token, or a text message.
After successful login, you will have access to the UAI. You should see a command prompt indicating that you are now on the UAI.
Logging into CPE through a User Access Node (UAN)
Logging into the HPE Cray Supercomputing EX system through a UAN involves using an intermediate node that provides a more user-friendly and secure interface for accessing the supercomputer. To log in:
If you have not already done so, contact your CPE administrator to obtain your login credentials, which typically include a username and password.
If the system uses SSH key-based authentication, obtain an SSH key pair (public and private keys).
Set up SSH keys (if applicable). If using SSH keys, ensure that your CPE administrator has added your public key to the ~/.ssh/authorized_keys file on the UAN.
Open a terminal or command prompt on your local machine. This is where you enter the SSH command to connect to the UAN.
Use the SSH command to connect to the UAN. The general syntax for the SSH command is:
ssh username@uan-address
Replace username with your actual username and uan-address with the hostname or IP address of the UAN. For example:
ssh your_username@uan.example.com
When prompted, enter your password. For key-based authentication, enter:
ssh -i /path/to/your/private_key your_username@uan.example.com
Replace /path/to/your/private_key with the path to your private SSH key. If your public key has been added to ~/.ssh/authorized_keys on the UAN, you are granted access without a password prompt.
If required, complete the two-factor authentication (2FA). Follow the prompts to enter your second factor, which might be a code from a mobile app, a hardware token, or a text message.
After successful login, you should see a command prompt indicating that you have access to the UAN.
Logging into CPE through remote desktop access
Logging into CPE on the HPE Cray Supercomputing EX system through remote desktop access provides a GUI for users to interact with the system. This scenario can be particularly useful for tasks that benefit from a GUI, such as data visualization, interactive development, or using applications that require a GUI. To log in to the system through remote desktop access:
Obtain credentials and setup information:
If you have not already done so, contact your CPE administrator to obtain your login credentials, which typically include a username and password.
Obtain the hostname or IP address of the remote desktop server or User Access Node (UAN) that supports remote desktop access.
If required, get the port number used for the remote desktop connection.
Install the remote desktop client using, for example:
Microsoft Remote Desktop (for Windows, macOS, iOS, and Android)
Remmina (for Linux)
Vinagre (for Linux)
VNC Viewer (cross-platform, if using VNC)
Configure the remote desktop client, defining the new connection:
Hostname/IP Address: The address of the remote desktop server or UAN.
Port: The port number for the remote desktop service (commonly 3389 for RDP, or specific port for VNC if applicable).
Username: Your username for the HPE Cray system.
Password: Your password for the HPE Cray system (you may be prompted to enter this later).
For example, to configure Microsoft Remote Desktop (RDP):
Launch the Microsoft Remote Desktop client on your local machine.
Click Add PC or New Desktop to configure a new connection.
Enter connection details:
PC name: Enter the hostname or IP address of the remote desktop server.
User Account: Select Ask when required or enter your username and password.
(Optional) Configure any other additional settings. For example, you can configure display settings, local resources (such as printers, clipboard), and other options, as needed.
Save the configuration and click Connect to initiate the remote desktop session.
As another example, to set up VNC Viewer configuration details:
Launch the VNC Viewer client on your local machine.
To add a new connection, click File > New Connection.
Enter connection details:
VNC Server: Enter the hostname or IP address of the VNC server followed by the port number (for example, uan.example.com:5901).
Name: Enter a name for the connection (optional).
Save and connect:
Save the configuration and double-click the new connection to initiate the VNC session.
Enter your username and password when prompted.
After successful login, you should see the desktop environment of the remote server. You can now interact with the graphical interface, open terminal windows, run applications, and perform various tasks as needed.
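If the VNC port is not directly reachable from your workstation, a common approach is to tunnel it through SSH; the user name, host name, and port below are placeholder values, and your site policy may differ:
ssh -L 5901:localhost:5901 your_username@uan.example.com
You can then point the VNC client at localhost:5901 to reach the remote desktop session.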
Before you begin running and debugging CPE jobs
This chapter provides preliminary information and necessary steps you should consider and know before running a CPE job. Before you run, compile, debug, and analyze a job, review the CPE Release Announcements for operating constraints, component versioning details, and compatibility information. See the links in Documentation and Support for additional information regarding release announcements.
Understanding the key CPE components
The CPE Software suite comprises specific components and tools designed to maximize developer productivity, application scalability, and code performance. It includes compilers, analyzers, optimized libraries, and debuggers.

The CPE Software suite also provides a variety of parallel programming models that allow you to make appropriate choices based on the nature of existing and new applications. CPE uses build environment containers, providing the ability to compile applications and to launch and track job status. Containers enable you to store and retrieve files from both the local and shared system storage.
CPE components (by category) include:
Compilers
HPE Cray Compiling Environment (CCE): High-performance compilers for Fortran, C, and C++ that are optimized for HPE Cray Supercomputing EX system architectures. These compilers include advanced optimization features and support for parallel programming models, such as Open Multi-Processing (OpenMP), Open Accelerators (OpenACC), Heterogeneous-Compute Interface for Portability (HIP), and Partitioned Global Address Space (PGAS) languages (such as Coarray Fortran, Unified Parallel C).
Third-Party Compilers: Support for other industry-standard compilers, such as GNU Compiler Collection (GCC), Intel, NVIDIA, and AMD compilers.
Programming models
| Model Name | Description |
|---|---|
| HPE Cray Message Passing Toolkit (MPT) | Libraries and tools for parallel programming using the Message Passing Interface (MPI) standard, which is widely used for distributed memory parallelism. |
| OpenMP | Support for shared memory parallelism and GPU offloading using the OpenMP standard, which allows developers to parallelize and offload code using directives and APIs. |
| OpenACC | Support for GPU offloading using the OpenACC standard, which allows developers to parallelize and offload code using directives and APIs. |
| CUDA | Support for NVIDIA GPU offloading using the CUDA programming model. |
| HIP | Support for AMD GPU offloading using the HIP programming model. |
| Partitioned Global Address Space (PGAS) | Support for PGAS languages like Coarray Fortran and Unified Parallel C (UPC). |
| OpenSHMEM | As a programming library, simplifies and enhances the way you write parallel programs and allows you to manage data efficiently across multiple processors, ensuring that your high-performance applications run as fast and effectively as possible. |
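For example, an MPI program that also uses OpenMP can usually be built with the CPE compiler drivers, which add the MPI headers and libraries automatically; the source file name is a placeholder, and the OpenMP flag shown is typical for CCE and GCC but may differ for other compilers:
cc -fopenmp -o hybrid_app hybrid_app.c
The cc driver invokes whichever compiler the loaded programming environment selects, so the build line generally stays the same when you switch compilers (apart from compiler-specific flags).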
Scientific and mathematical libraries
| Library Name | Library Description |
|---|---|
| HPE Cray LibSci (cray-libsci) | A library providing highly optimized and scalable mathematical routines, such as BLAS, LAPACK, and ScaLAPACK, aimed at enhancing the performance of linear algebra and other numerical computations on HPE Cray Supercomputing EX systems. |
| HPE Cray FFTW (cray-fftw) | Libraries for performing Fast Fourier Transforms (FFTs), based on FFTW3. |
| HPE Cray LibSci ACC (cray-libsci_acc) | An extension of HPE Cray LibSci that includes GPU-accelerated versions of mathematical routines, designed to leverage GPU hardware for improved performance in scientific computations on HPE Cray EX supercomputing systems with GPUs. |
| HPE Cray HDF5 (cray-hdf5 and cray-hdf5-parallel) | Libraries for managing and storing large scientific data sets in Hierarchical Data Format (HDF5), with parallel I/O capabilities to enhance performance and scalability on distributed HPE Cray Supercomputing EX systems. |
| HPE Cray NetCDF (cray-netcdf and cray-netcdf-hdf5parallel) | Libraries supporting the creation, access, and sharing of array-oriented scientific data in the Network Common Data Form (NetCDF), with parallel I/O support to improve scalability and performance on large-scale HPE Cray Supercomputing EX systems. |
| HPE Cray Parallel NetCDF (cray-parallel-netcdf) | A high-performance parallel I/O library for NetCDF files, enabling efficient handling and management of large, distributed data sets in scientific applications on Cray systems. |
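As a sketch, when the relevant CSML modules are loaded, the compiler drivers add the library headers and link options for you, so explicit -l flags are normally not needed; the source file name is a placeholder:
module load cray-libsci cray-fftw
ftn -o solver solver.f90
Here the LibSci (BLAS/LAPACK) and FFTW interfaces are resolved automatically by the ftn driver.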
Environment setup tools
HPE Cray Environment Setup and Compilation Support (CENV) is a CPE
software package with tools and libraries specifically designed to
support compilation and environment setup. It includes compiler drivers
and CPE API (craype-api).
Performance Analysis Tools
HPE Cray Performance Measurement & Analysis Tools (CPMAT)/HPE Cray Performance Analysis Tools (CrayPAT): A suite of tools for profiling and analyzing the performance and behavior of applications, plus a Performance API (PAPI). This suite includes pat_build for instrumenting applications, pat_report for generating performance reports, and HPE Cray Apprentice3 for visualizing performance data.
HPE Cray Apprentice3: Provides performance analysis with event tracing and graphical data visualization. HPE Cray Apprentice3 provides enhanced scalability, an improved user interface, and advanced metrics for more detailed and efficient performance analysis.
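A minimal CrayPAT workflow might resemble the following sketch; the application name is a placeholder, the perftools modules must be loaded before building, and the exact name of the experiment data produced at run time varies by configuration:
module load perftools-base perftools
cc -o my_app my_app.c
pat_build my_app
srun -n 4 ./my_app+pat
pat_report <experiment-data-directory>
pat_build produces the instrumented binary my_app+pat, and pat_report summarizes the performance data collected when that binary runs.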
Debugging Tools
HPE Cray Distributed Debugging Tool (DDT): An advanced debugging tool for parallel applications, supporting MPI, OpenMP, and hybrid applications.
gdb4hpc: GNU Debugger (GDB)-based HPC debugger with support for debugging serial and parallel applications.
Valgrind4hpc - A parallel debugging tool used to detect memory leaks and parallel application errors.
Sanitizers4hpc - A parallel debugging tool used to detect memory access or leak issues at runtime using information from LLVM sanitizers.
Stack Trace Analysis Tool (STAT) - A single merged stack backtrace tool used to analyze application behavior at the function level. Helps trace down the cause of crashes.
Abnormal Termination Processing (ATP) - A scalable core file generation and analysis tool for analyzing crashes, with a selection algorithm to determine which core files to dump. ATP helps to determine the cause of crashes.
Cray Comparative Debugger (CCDB) - Not a traditional debugger, but rather a tool to run and step through two versions of the same application side-by-side to help determine where they diverge.
All CPE debugger tools support C/C++, Fortran, and Universal Parallel C (UPC).
Development Environment
Environment Modules: A system for managing and configuring the user environment, allowing you to easily load and switch between different software packages and versions.
Build and Configuration Tools: Tools for building and configuring applications, including support for makefiles and CMake.
Application Porting and Optimization
HPE Parallel Application Launch Service (PALS): An automation tool for starting, managing, and optimizing the placement of parallel applications on HPE Cray Supercomputing EX systems, ensuring efficient resource utilization.
CrayPAT-lite: A lightweight version of CrayPAT for quick performance assessments and application tuning.
About huge pages
CPE, relying on HPE Cray Operating System (COS) Base as its operating system, uses the COS huge pages implementation as part of its operation. Because HPC applications require extensive memory operations, massive computations, large data sets, and expansive scalability, huge pages satisfy multiple HPC needs required by CPE. As a virtual memory solution for pages larger than the default base page size of 4 KB, huge pages offer multiple benefits.
| Huge page Benefit | Area | Explanation |
|---|---|---|
| Reduces overhead | Translation Lookaside Buffer (TLB) | The TLB is a small, fast cache that holds the most recent translations of virtual memory addresses to physical addresses. If the system uses regular 4 KB pages, it needs to keep many entries in the TLB. Huge pages reduce the number of entries required because each huge page covers a larger portion of memory. The result is fewer TLB misses and less overhead. |
| Improves performance | Memory Access Speed | By using huge pages, the system reduces the number of page table entries, which means fewer lookups are required when accessing memory. This facet improves the performance of memory-intensive applications, a common trait in HPC environments. |
|  | Reduced Page Table Size | With fewer, larger pages, the size of the page table (a data structure used to keep track of the mapping between virtual and physical memory) is reduced. This structure leads to less memory overhead and faster memory operations. |
| Uses cache more efficiently | Cache Efficiency | Huge pages can help improve the efficiency of the CPU cache. Since each page is larger, the likelihood of accessing data within the same page increases, reducing the number of times the cache needs to be refreshed or replaced with new data. |
CPE usage of huge pages
As you use CPE, you can use huge pages in a variety of ways.
| Use huge pages for: | To: |
|---|---|
| SHMEM applications | Map the static data or a private heap onto huge pages. |
| Applications written in Unified Parallel C (UPC) and other languages based on the PGAS programming model | Map the static data and/or private heap onto huge pages. |
| MPI applications | Map the static data or heap onto huge pages. |
Additionally, use huge pages for:
Applications using shared memory that are concurrently registered with high-speed network drivers for remote communication.
Applications doing heavy I/O.
Improving memory performance for common access patterns on large data sets.
Accessing huge pages in CPE
Access to huge pages is provided through a virtual file system called hugetlbfs.
Every file on this file system is backed by huge pages and is directly
accessed with mmap() or read().
The libhugetlbfs library allows an application to use huge pages more
easily by directly accessing the hugetlbfs file system. A user may
enable libhugetlbfs to back application text and data segments.
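For example, to link an application against 2 MB huge pages, a typical sketch is to load the corresponding module before linking and keep it loaded in the job environment so the setting takes effect at run time; the application name and launcher options are placeholders:
module load craype-hugepages2M
cc -o my_app my_app.c
srun -n 4 ./my_app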
Note: On x86 processors, the only page sizes supported by the processor are 4K, 2M, and 1G.
Note: 512M huge pages are not supported on Aarch64 nodes running COS
Base 3.3. If you run module load craype-hugepages512M on Aarch64
nodes, subsequent commands hang. To resolve a hung state, kill the
process. This issue does not occur on x86_64 nodes, and it does not
affect transparent huge pages.
Key preliminary CPE job factors
It is important to ensure that the jobs you create run efficiently, effectively, and make the best use of the available resources. By considering key preliminary factors, you can optimize job submissions, ensure efficient use of resources, and increase the likelihood of successful and productive runs on the HPE Cray Supercomputing EX system. Key considerations to think about before running jobs include:
Understanding the System
System Architecture: Familiarize yourself with the architecture of the HPE Cray Supercomputing EX system, including the type of processors, number of cores per node, memory hierarchy, and network interconnect. This knowledge will help in optimizing your job for the system.
Available Resources: Understand the available resources, such as the total number of nodes, memory per node, and any specialized hardware like GPUs.
Application Requirements
Resource Needs: Determine the computational resources required by your application, including the number of cores, memory, and any special hardware (for example, GPUs).
Runtime Estimation: Estimate the runtime of your job to set appropriate wall time limits and avoid job termination due to exceeding these limits.
Job Submission
Job Scheduler: Learn how to use the job scheduler (for example, SLURM) to submit jobs. Understand how to specify resource requests, such as the number of nodes, cores, memory, and wall time.
Queue Policies: Be aware of the queue policies, including priority levels, job limits, and fair share policies. Choose the most appropriate queue for your job.
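For example, a minimal SLURM batch script might look like the following sketch; the resource values and the application name are placeholders that depend on your site configuration and allocation:
#!/bin/bash
#SBATCH --job-name=my_app
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --time=00:30:00
#SBATCH --output=my_app.%j.out
srun ./my_app
Submit the script with sbatch (for example, sbatch my_job.sh) and monitor it with squeue.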
Code Optimization
Compiler and Flags: Use the appropriate compilers and optimization flags for your application to ensure it runs efficiently on HPE Cray Supercomputing EX systems. HPE provides specific compilers for C, C++, and Fortran that are optimized for their systems.
Parallelism: Optimize your application for parallel execution, using MPI, OpenMP, or hybrid parallelism, as appropriate. Ensure that your code scales well across multiple nodes and cores.
Environment Setup
Modules: Use the module command to load the necessary software modules and libraries for your application. Ensure that all dependencies are correctly loaded.
Environment Variables: Set any required environment variables that your application or job script needs to run correctly.
Performance Monitoring and Debugging
Profiling Tools: Use profiling tools, like Cray Perftools, to gather performance data and identify bottlenecks in your application.
Debugging Tools: Be prepared to use debugging tools to troubleshoot any issues that arise during job execution.
Data Management
Input and Output Data: Plan the location of input and output data. Ensure that input data is staged appropriately and that there is sufficient storage space for output data.
I/O Optimization: Optimize I/O operations to minimize bottlenecks, using parallel I/O techniques if necessary.
Scalability and Efficiency
Scalability Testing: Conduct scalability tests to determine the most efficient number of nodes and cores for your application.
Load Balancing: Ensure that the workload is balanced across the resources to avoid idle cores and maximize efficiency.
Job Dependencies and Workflow
Workflow Management: If your job is part of a larger workflow, ensure that job dependencies are correctly specified and managed. Use job arrays or workflow management systems if necessary.
Checkpointing: Implement checkpointing if possible to save the state of your job at intervals. Checkpointing helps by restarting a failed job from its last checkpoint.
Error Handling and Recovery
Error Checking: Include error-checking mechanisms in your job script to handle potential issues gracefully and provide meaningful error messages.
Recovery Plan: Have a plan for recovering from job failures, including strategies for resubmitting jobs or debugging issues.
Documentation and Best Practices
Read Documentation: Review the system and software documentation to understand any specific guidelines or best practices for running jobs on the HPE Cray system.
Consult Examples: Look for example job scripts and application notes that provide insights into best practices for similar applications.
Account and Usage Policies
Resource Allocation: Understand the allocation of resources and any usage policies or quotas associated with your user account.
Billing and Accounting: Be aware of any billing or accounting practices in place, especially if the system usage is charged based on resource consumption.
Understanding CPE modules
CPE modules are used in conjunction with RHEL and SLES to streamline and manage the software development environment on HPE Cray Supercomputing EX systems. As part of the CPE environment, you can load, unload, and switch one or more modules to efficiently manage the software stack required for your specific applications and development tasks. Modules can comprise CPE base, library-related, or tools-related modules. Loading a module automatically sets environment variables, paths, and other settings, allowing you to focus on development rather than environment configuration. Modules allow you to easily switch between different versions of compilers, libraries, and tools, enabling you to test and validate your applications against multiple configurations. Compiler and library compatibility and dependencies are assured through the use of modules:
Library Compatibility: Many high-performance computing (HPC) applications depend on specific versions of libraries and tools. Understanding which modules to load ensures that all dependencies are compatible, reducing runtime errors and conflicts.
Compiler Consistency: Different modules may provide different versions of compilers. Ensuring you use consistent compilers across your development and production environments can prevent compatibility issues.
Debugging tools are also available for diagnosing and optimizing your applications and providing critical insights into your application’s performance and behavior. Performance analysis tools help identify bottlenecks and optimize code, which is crucial in high-performance computing environments where efficiency is paramount.
Modules are essential for several reasons and are used to:
Simplify Environment Management
HPE Cray Supercomputing EX systems often have complex software stacks with multiple compilers, libraries, and tools. Modules simplify the process of configuring the environment by allowing you to easily load and unload different software components without manual configuration of environment variables.
Allow for consistency
Modules ensure that all users on the system have a consistent environment. This consistency is crucial for reproducibility of results, especially in a research or scientific computing context.
Offer flexibility
Different applications and development tasks might require different versions of compilers, libraries, or tools. Modules provide a flexible way to switch between these versions without conflicts.
Provide optimization
CPE modules are optimized for the underlying hardware. Different programming environments and compilers are optimized for specific architectures and workloads. Loading appropriate modules helps to ensure that your code and applications are making the best use of system architecture and running efficiently. Properly loading and unloading modules helps manage system resources, ensuring that you are balancing the load and not overloading the system with unnecessary tools and libraries.
Are easy to use
Modules abstract away the complexity of setting up and managing the environment. You can focus on development rather than expending excess time on configuration issues.
As you use CPE modules, keep in mind that many high-performance computing applications depend on specific versions of libraries and tools. Understanding which modules to load ensures that all dependencies are compatible, reducing runtime errors and conflicts. Understanding CPE modules and module commands is crucial for maximizing performance, ensuring compatibility, simplifying development and debugging, maintaining reproducibility, staying current with technological advancements, and fostering effective collaboration in high-performance computing environments.
The following subsections provide information on commonly-used CPE modules, libraries, and tools. See CPE Commonly Used Commands and Running Sample Jobs for more details on using CPE commands.
Commonly-used CPE modules, module command names, and module compiler commands
Commonly-used CPE modules and module commands include:
| Module name | Module command name | CPE driver commands |
|---|---|---|
| AMD compilers |  |  |
| AOCC |  |  |
| CCE* |  |  |
| GCC** |  |  |
| Intel compilers |  |  |
| NVIDIA |  |  |
CPE driver commands are used in conjunction with module commands to construct build configurations.
Commonly-used CPE library commands include:
| Library name | Module command name | Compiler commands |
|---|---|---|
| DSMML |  |  |
| Fast Fourier Transforms |  |  |
| HDF5 |  |  |
| HPE Cray LibSci*** |  |  |
| HPE Cray MPICH |  |  |
| Parallel NetCDF |  |  |
Commonly-used CPE tools and their commands include:
| Tool name | Module command name |
|---|---|
| Apprentice 3 |  |
| Debuggers |  |
| Distributed Debugging Tool |  |
| HPE CrayPAT |  |
| HPE CrayPAT Base |  |
| Huge pages |  |
| TotalView |  |
Commonly-used CPE performance analysis commands include:
| Tool name | Module command name |
|---|---|
| ATP |  |
| Clang/Low Level Virtual Machine (LLVM) |  |
| CrayPAT |  |
| TensorFlow |  |
Commonly-used CPE specialized environment commands include:
| Specialized environment name | Environment command name | Associated commands |
|---|---|---|
| OpenMPI |  |  |
| OpenSHMEMX |  |  |
| ROCM |  |  |
* - Compiler-specific manpages include crayftn(1), craycc(1), and crayCC(1). Available only when the compiler module is loaded.
** - Compiler-specific manpages include gcc(1), gfortran(1), and g++(1). Available only when the compiler module is loaded.
*** - Compiler-specific manpages include intro_libsci(3s) and intro_fftw3(3). Available only when the compiler module is loaded.
When the module for a CSML package (such as cray-libsci or
cray-fftw) is loaded, all relevant headers and libraries for these
packages are added to the compile and link lines of the cc, ftn, and
CC CPE drivers. You must load the cray-hdf5 module (a dependency)
before loading the cray-netcdf module.
**** - In addition to the default module systems, CPE offers Lmod as an alternate module management system. Lmod, a Lua-based module system, can load and unload modulefiles, handle path variables, and manage library and header files. (If you are using another Linux distribution, use the huge pages implementation appropriate for that distribution.) To use huge pages, load the appropriate craype-hugepages module at link time. Possible values include:
craype-hugepages128K, craype-hugepages512K, craype-hugepages2M, craype-hugepages4M, craype-hugepages8M, craype-hugepages16M, craype-hugepages32M, craype-hugepages64M, craype-hugepages128M, craype-hugepages256M, craype-hugepages512M, craype-hugepages1G, and craype-hugepages2G
Viewing loaded modules
To view, for example, loaded modules and their versions:
user@hostname> module list
Currently Loaded Modules:
1) craype-x86-rome 5) xpmem/2.6.2-2.5_2.27__gd067c3f.shasta 9) cray-mpich/8.1.28
2) libfabric/1.15.2.0 6) cce/17.0.0 10) cray-libsci/23.12.5
3) craype-network-ofi 7) craype/2.7.30 11) PrgEnv-cray/8.5.0
4) perftools-base/23.12.0 8) cray-dsmml/0.2.2
Module versions are for example purposes only and may vary from those on the system.
Viewing available modules
To view, for example, available modules and their versions:
user@hostname> module avail PrgEnv
------------------------------------ /opt/cray/pe/modulefiles ------------------------------------
PrgEnv-amd/8.3.3 PrgEnv-cray-amd/8.4.0 (D) PrgEnv-gnu/8.3.3 PrgEnv-nvhpc/8.4.0 (D)
PrgEnv-amd/8.4.0 (D) PrgEnv-cray/8.3.3 PrgEnv-gnu/8.4.0 (D) PrgEnv-nvidia/8.3.3
PrgEnv-aocc/8.3.3 PrgEnv-cray/8.4.0 (L,D) PrgEnv-intel/8.3.3 PrgEnv-nvidia/8.4.0 (D)
PrgEnv-aocc/8.4.0 (D) PrgEnv-gnu-amd/8.3.3 PrgEnv-intel/8.4.0 (D) PrgEnv-gnu-amd/8.4.0 (D)
Module versions are for example purposes only and may vary from those on the system.
About Lmod and modules
Lmod, an environment module system that helps you to manage and customize software environments, can be used as a module management tool (as opposed to the default module tools on supported Red Hat Enterprise Linux or SUSE Linux Enterprise Server systems) for some operations. Lmod allows you to easily load and unload different software packages, manage environment variables, and maintain a consistent and organized computational environment. At times, Lmod can offer advantages over other allowable module management tools. Generally, Lmod helps to:
Simplify Environment Management: Lmod streamlines the process of setting up the software environment. Instead of manually setting environment variables and paths for different software tools, you can load pre-configured modules with a simple command.
Handle Software Dependencies: Complex software often has dependencies on specific versions of libraries and other tools. Lmod ensures that the correct versions are loaded automatically, reducing the risk of conflicts and errors.
Switch Easily Between Environments: Users often need to switch between different versions of software for different projects. Lmod allows you to load and unload modules quickly, making it easy to switch environments without manually adjusting settings.
Support Consistency Across Sessions: Using modules ensures that the environment is consistent across different sessions. If you load the same module, you acquire the same environment every time, which helps in maintaining reproducibility in research.
Customize Environments: Users can create their own module files to customize their environment according to their specific needs. This is especially useful for setting up complex environments that are tailored to particular workflows or projects.
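A hedged sketch of typical Lmod usage (module versions and the collection name are illustrative):
```screen
module avail                   # list the modules Lmod can see
module load PrgEnv-cray cce    # load an environment and compiler
module list                    # show what is currently loaded
module swap cce gcc            # switch compilers for another project
module unload cce              # or unload a single module
module save my_default_env     # save the current set as a named collection
module restore my_default_env  # restore that collection in a later session
```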
See links for online support in Documentation and support or the User Guide for Lmod for more information on Lmod. See also Lmod details in the HPE Cray Supercomputing Programming Environment Installation Guide: HPCM on HPE Cray Supercomputing EX and HPE Cray Supercomputing Systems S-8022 or HPE Cray Supercomputing Programming Environment Installation Guide: CSM on HPE Cray Supercomputing EX Systems S-8003.
CPE and commonly used commands
This section lists and provides examples of commonly used CPE commands. These commands allow you to effectively manage your software environments, submit and control computational jobs, compile and run applications, transfer and manage data, and perform profiling and debugging tasks. These commands are essential for maximizing the efficiency and productivity of high-performance computing tasks on HPE Cray EX systems. See Understanding CPE Modules and Running Sample Jobs for more details on using CPE commands.
Environment module management commands
Command |
Description |
Example |
|---|---|---|
|
Loads a specific software module. |
|
|
||
|
||
|
||
|
Unloads a specific software module. |
|
|
Lists all currently loaded modules. |
|
|
Displays all available modules. |
|
|
Shows information about a specific module. |
|
Compilation and Execution commands
cc/CC/ftn: Compiles C, C++, and Fortran programs using HPE Cray compilers.
cc -o my_app my_app.c
CC -o my_app my_app.cpp
ftn -o my_app my_app.f90
make: Builds applications using a Makefile.
make
mpicc/mpicxx/mpif90: Compiles MPI programs.
mpicc -o mpi_app mpi_app.c
mpicxx -o mpi_app mpi_app.cpp
mpif90 -o mpi_app mpi_app.f90
mpirun: Launches parallel applications on Cray systems (if applicable).
mpirun -n 16 ./my_app
Data Management commands
scp: Securely copies files between hosts.
scp localfile user@remotehost:/path/to/destination/
rsync: Synchronizes files and directories between two locations.
rsync -avz /source/directory/ /destination/directory/
tar: Archives and compresses files.
tar -czvf archive.tar.gz /path/to/directory/
h5dump: Inspects HDF5 files (common in scientific data).
h5dump data.h5
Performance profiling and debugging commands
pat_build: Instruments an application for performance profiling.
pat_build -O my_app
srun: Runs an instrumented application.
srun -n 16 ./my_app+pat
pat_report: Generates a performance report from the collected data.
pat_report my_app+pat+*.xf > performance_report.txt
app3: Launches HPE Cray Apprentice3 for visual performance analysis on Apple® Mac® or Microsoft® Windows® computer systems.
app3
gdb4hpc: Debugs an HPC application.
gdb4hpc dbg all> launch $app{16} ./my_app
Common Linux commands
ls: Lists directory contents.
ls -l
cd: Changes directory.
cd /path/to/directory/
cp: Copy files and directories.
cp source_file destination_file
mv: Move or rename files and directories.
mv old_name new_name
rm: Remove files or directories.
rm file_name
grep: Search text using patterns.
grep "search_term" filename
awk: Pattern scanning and processing.
awk '{print $1}' filename
git clone: Clone a GitHub repository.
git clone https://github.com/username/your_repository.git
huge pages: Accesses huge pages. To use 2 megabyte huge pages:
user@hostname> module load craype-hugepages2M
Running sample jobs
This chapter provides step-by-step instructions for using CPE and its toolset to run, compile, debug, and analyze example jobs. These examples are for reference only and can assist you in getting started with the CPE tools. They include:
Example: Running, compiling, debugging, and analyzing an MPI banking-based job
Example: Choosing a library for a job and accessing Apprentice3 reporting functions
Example job: Compiling, debugging, and analyzing a job using the C Clang compiler, LLVM, and HIP
Example: Running, compiling, and debugging a sample Fortran health industry job
For more detailed information on using CPE tools, refer to the HPE Cray Supercomputing Programming Environment online documentation website.
Example: Running and debugging a simple job
Getting started with CPE in conjunction with the HPE Cray System Management (CSM) or HPE Performance Cluster Manager (HPCM) software system involves a series of steps to ensure you have the right environment and tools for your development needs.

To begin:
Access the CPE system using your login credentials. Ensure that you have a user account and the necessary permissions to log in to the system. Contact your CPE administrator for more information about logging in.
Use an SSH client to log in to the CSM or HPCM system. For example:
ssh your_username@csm_system_address
Load CPE modules, as needed. CPE uses modules to manage different software packages. To list available modules, from the prompt, enter:
module avail
Load the CPE environment module. For example:
module load PrgEnv-cray
If you need a specific version, specify it like this:
module load PrgEnv-cray/version_number
Set up your development environment. Depending on your requirements, you may want to load additional modules for compilers, libraries, and tools. Common modules include:
Compilers: cce, gcc, intel, and so forth.
Libraries: cray-mpich, cray-libsci, and so forth.
Tools: craype, craypat, and so forth.
For example, to load the Cray compiler and MPI library, enter:
module load cce
module load cray-mpich
Write a parallel program. For example:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    printf("Hello from processor %d out of %d processors\n", world_rank, world_size);
    MPI_Finalize();
    return 0;
}
With the necessary modules loaded, compile your code. For example, to compile a C program using the CPE compiler, enter:
cc -o my_program my_program.c
For Fortran programs, enter:
ftn -o my_program my_program.f90
For C++ programs, enter:
CC -o my_program my_program.cpp
Submit your job to the job scheduler, which could be SLURM, PBS, or another scheduler. For example, if you are using SLURM, you can create a job script (for example,
job_script.sh):
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=output.txt
#SBATCH --error=error.txt
#SBATCH --ntasks=4
#SBATCH --time=01:00:00
srun ./my_program
Submit the job with:
sbatch job_script.sh
Analyze the output by checking the output.txt file:
cat output.txt
See Reviewing an example output file for more information on how to review the output.txt file.
Perform scalability testing:
Modify the job_script.sh script to use more nodes or tasks per node:
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
Submit the modified job and observe how the program scales with the increased number of processors.
Demonstrate portability by compiling and running the same program on another system with an HPE Cray Supercomputing EX environment. The HPE Cray compiler ensures that your code remains portable across different HPE Cray Supercomputing EX systems.
Perform debugging and analysis tasks using the HPE Cray Perftools using included tools for performance analysis and debugging:
HPE Cray Performance Analysis Tool (CrayPAT): For performance analysis.
HPE Cray gdb4hpc: For debugging.
To perform debugging with gdb4hpc:
From the prompt, enter:
module load gdb4hpc
Run the program with the debugger:
gdb4hpc
dbg all> launch $app{4} ./mpi_example
Use gdb4hpc to set breakpoints, inspect variables, and step through the code (a session sketch follows this procedure).
To perform analysis:
Load the necessary modules:
module load perftools
module load ddt
Recompile your program with instrumentation:
cc -o mpi_example_instrumented mpi_example.c -h profile_generate
Run the instrumented program:
srun ./mpi_example_instrumented
After the run, generate a performance report:
pat_report
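Returning to the debugging step above, the following is a hedged sketch of a gdb4hpc session; the launch and break syntax follow the examples in this guide, while the breakpoint location, variable name, and remaining commands assume gdb4hpc accepts the usual gdb-style commands:
```screen
dbg all> launch $app{4} ./mpi_example   # start 4 ranks under the debugger
dbg all> break mpi_example.c:12         # set a breakpoint (file:line illustrative)
dbg all> continue                       # run until the breakpoint is hit
dbg all> print world_rank               # inspect a variable across the ranks
dbg all> step                           # step through the code
dbg all> quit
```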
Example: Running, compiling, debugging, and analyzing an MPI banking-based job
This section demonstrates how to use mpicc to compile and run an MPI program in a banking context with CPE. This example simulates a simple MPI program that calculates the total balance across multiple bank accounts distributed across different processors.
MPI-based Banking Program Simulation: Calculate Balances Across Multiple Accounts Across Different Processors
Code: banking_mpi.c:
This example demonstrates a C program where each MPI process represents a bank branch with a set of account balances. The program computes the total balance across all branches.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
MPI_Init(&argc, &argv);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
// Each branch has 5 accounts
int num_accounts = 5;
float balances[5] = {1000.0f, 1500.0f, 2000.0f, 2500.0f, 3000.0f};
// Calculate the total balance for this branch
float local_total = 0.0f;
for (int i = 0; i < num_accounts; i++) {
local_total += balances[i];
}
printf("Branch %d local total: %.2f\n", world_rank, local_total);
// Reduce all local totals to a global total on rank 0
float global_total = 0.0f;
MPI_Reduce(&local_total, &global_total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
if (world_rank == 0) {
printf("Global total balance: %.2f\n", global_total);
}
MPI_Finalize();
return 0;
}
Job Script:
#!/bin/bash
#PBS -N BankingMPIJob
#PBS -l select=2:ncpus=32:mpiprocs=32
#PBS -l walltime=00:10:00
#PBS -j oe
cd $PBS_O_WORKDIR
module load cray-mpich
mpirun -n 64 ./banking_mpi
To run the job:
Save the MPI program:
Log in to the HPE Cray system using your provided credentials.
Create the banking_mpi.c file using a text editor like nano or vi.
nano banking_mpi.c
Write the code:
Copy the banking_mpi.c code (shown previously) into the file. Save the file and exit the editor.
Save the Job Script:
Create the
banking_job.pbs file. This file defines how the job is submitted to the scheduler.
nano banking_job.pbs
Copy and paste the job script into the banking_job.pbs file:
#!/bin/bash
#PBS -N BankingMPIJob
#PBS -l select=2:ncpus=32:mpiprocs=32
#PBS -l walltime=00:10:00
#PBS -j oe
cd $PBS_O_WORKDIR
module load cray-mpich
mpirun -n 64 ./banking_mpi
Compile the MPI program:
Ensure you are in the directory where
banking_mpi.c is saved.
Load the HPE Cray MPI module, and use mpicc to compile the program:
module load cray-mpich   # Load the Cray MPI module if it is not already loaded
mpicc -o banking_mpi banking_mpi.c
This command compiles banking_mpi.c and creates the banking_mpi executable.
Analyze application performance:
Load the HPE Perftools module to analyze the performance of the application.
module load perftools
Instrument the program for performance analysis.
pat_build banking_mpi
The pat_build banking_mpi command creates the banking_mpi+pat instrumented executable.
Modify the job script to run the instrumented executable:
nano banking_job.pbs
Replace the existing mpirun line with:
mpirun -n 64 ./banking_mpi+pat
Submit the job script to the job scheduler:
qsub banking_job.pbs
After the job completes, process the performance data:
pat_report *.xf
The system generates a performance report. Use the report to analyze and understand application behavior and performance.
Submit the job script to the job scheduler using the
qsub command:
qsub banking_job.pbs
Check the status of your job using the qstat command:
qstat -u $USER
After the job completes, the output is written to a file (for example, BankingMPIJob.o<job_id>) in the directory where you submitted the job.
Use the cat or less command to view the output:
cat BankingMPIJob.o<job_id>
The output appears, with each branch reporting its local total and the global total reported by rank 0:
Output:
Branch 0 local total: 10000.00
Branch 1 local total: 10000.00
Branch 2 local total: 10000.00
...
Global total balance: 640000.00
Generate the performance data. The pat_report *.xf command generates a performance report that allows you to analyze and understand the application's behavior and performance.
pat_report *.xf
Use the HPE Cray debugging tools (for example, gdb4hpc) to debug the MPI program:
Recompile the program with debugging information:
mpicc -g -o banking_mpi banking_mpi.c
Run the program under the gdb4hpc debugger:
module load gdb4hpc
gdb4hpc
dbg all> launch $app{64} ./banking_mpi
Example: Running a parallel MPI automotive-based CFD job
This section demonstrates an HPCM system Computational Fluid Dynamics
(CFD) simulation. This example demonstrates a parallel CFD simulation
using MPI, which is a common task in aerodynamic simulations for the
auto industry. The job script loads the necessary modules, compiles the
code using mpicc, and then runs the program using srun.
Setting up the parallel CFD simulation using MPI
Code: cfd_simulation.c
This example demonstrates a parallel CFD simulation using MPI.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#define NX 100
#define NY 100
#define STEPS 1000
void initialize(double grid[NX][NY]) {
for (int i = 0; i < NX; i++) {
for (int j = 0; j < NY; j++) {
grid[i][j] = (i == 0 || j == 0 || i == NX-1 || j == NY-1) ? 0.0 : 1.0;
}
}
}
void compute(double grid[NX][NY]) {
for (int step = 0; step < STEPS; step++) {
for (int i = 1; i < NX-1; i++) {
for (int j = 1; j < NY-1; j++) {
grid[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1]);
}
}
}
}
int main(int argc, char** argv) {
MPI_Init(&argc, &argv);
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
double grid[NX][NY];
initialize(grid);
compute(grid);
if (rank == 0) {
printf("Simulation completed.\n");
}
MPI_Finalize();
return 0;
}
Job Script: cfd_simulation_job.sh
#!/bin/bash
#SBATCH --job-name=cfd_simulation
#SBATCH --output=cfd_simulation_output.txt
#SBATCH --error=cfd_simulation_error.txt
#SBATCH --ntasks=4
#SBATCH --time=01:00:00
module load PrgEnv-cray
module load cray-mpich
mpicc -o cfd_simulation cfd_simulation.c
srun ./cfd_simulation
Running, compiling, analyzing, and debugging the job
To run the provided code and job script within CPE and utilize its various performance and debugging tools:
Ensure that you have the MPI code and job script saved in the appropriate files:
Save the MPI code in a
cfd_simulation.c file.
Save the job script in a cfd_simulation_job.sh file.
Compile and run the code. The job script provided compiles the code and runs the simulation using SLURM.
Before running the job script, make sure it has the necessary permissions:
chmod +x cfd_simulation_job.sh
Submit the job script to the scheduler:
sbatch cfd_simulation_job.sh
Deploy HPE Cray Perftools to analyze the performance of your application. You can use HPE Cray Performance Analysis Tools (CrayPAT) to instrument your application:
Modify the job script to include performance instrumentation using
pat_build. Job Script with Instrumentation:
#!/bin/bash
#SBATCH --job-name=cfd_simulation
#SBATCH --output=cfd_simulation_output.txt
#SBATCH --error=cfd_simulation_error.txt
#SBATCH --ntasks=4
#SBATCH --time=01:00:00
module load PrgEnv-cray
module load cray-mpich
module load perftools-base
module load perftools
mpicc -o cfd_simulation cfd_simulation.c
pat_build -O cfd_simulation
srun ./cfd_simulation+pat
Submit the modified job script:
sbatch cfd_simulation_job.sh
Generate and analyze the performance report.
After the job completes, use pat_report to generate a performance report from the generated .xf files:
pat_report cfd_simulation+pat+*.xf > performance_report.txt
Visualize the performance data using HPE Cray Apprentice3:
app3 cfd_simulation+pat+*.ap3
Using HPE Cray Debugging tools, debug your application. You can use tools like HPE Cray DDT or GDB. For gdb4hpc debugging:
Compile the code with debugging information:
mpicc -g -o cfd_simulation cfd_simulation.c
Run the application under gdb4hpc:
module load gdb4hpc
gdb4hpc
dbg all> launch $app{4} ./cfd_simulation
After the job starts, a prompt appears allowing you to set breakpoints, run the application, and inspect variables.
If you have access to HPE Cray DDT, optionally launch it, and use the GUI to debug your application:
Load the HPE Cray DDT module:
module load ddt
Launch the HPE Cray DDT:
ddt &
In the HPE Cray DDT interface, open your application (cfd_simulation) and configure the run parameters, as needed.
Example: Running, setting up, and using a makefile
This chapter:
Illustrates how to set up and use a Makefile to compile a program in CPE, and
Demonstrates building a simple MPI-based application using the
makecommand.
Note that the:
Makefile defines how to compile and link the program. It includes rules for compiling source files into object files and linking them into an executable.
Command |
Description |
|---|---|
CC = cc |
Specifies the HPE Cray compiler for C. |
CFLAGS = -O2 -Wall |
Compiler flags for optimization and warnings. |
LDFLAGS = -lm |
Linker flags (for example, linking the math library). |
SRC_DIR, OBJ_DIR, BIN_DIR |
Directories for source files, object files, and the binary executable. |
SRCS |
Lists all .c source files. |
OBJS |
Translates source files to object files. |
TARGET |
Specifies the name and location of the executable. |
all |
The default rule for building the executable. |
clean |
A rule for cleaning up the build files. |
Job Script (run_job.sh) submits a job to the SLURM Workload Manager scheduler. It specifies job parameters, loads the required modules, and runs the compiled program using srun.
For this example, assume that you have a project with the following directory structure and script:

src/main.c:
#include <mpi.h>
#include <stdio.h>
#include "compute.h"
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    double result = compute(rank, size);
    if (rank == 0) {
        printf("Computed result: %f\n", result);
    }
    MPI_Finalize();
    return 0;
}
src/compute.c
The compute.c file contains:
#include "compute.h"
double compute(int rank, int size) {
    return rank * 1.0 / size;
}
src/compute.h
The compute.h file contains:
#ifndef COMPUTE_H
#define COMPUTE_H
double compute(int rank, int size);
#endif // COMPUTE_H
Makefile
The makefile defines the rules for building the project and contains:
# Compiler and flags
CC = cc
CFLAGS = -O2 -Wall
LDFLAGS = -lm
# Directories
SRC_DIR = src
OBJ_DIR = obj
BIN_DIR = bin
# Source files
SRCS = $(wildcard $(SRC_DIR)/*.c)
# Object files
OBJS = $(patsubst $(SRC_DIR)/%.c, $(OBJ_DIR)/%.o, $(SRCS))
# Executable name
TARGET = $(BIN_DIR)/my_program
# Default rule
all: $(TARGET)
# Rule to build the executable
$(TARGET): $(OBJS)
@mkdir -p $(BIN_DIR)
$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
# Rule to build object files
$(OBJ_DIR)/%.o: $(SRC_DIR)/%.c
@mkdir -p $(OBJ_DIR)
$(CC) $(CFLAGS) -c -o $@ $<
# Clean rule to remove generated files
clean:
rm -rf $(OBJ_DIR) $(BIN_DIR)
.PHONY: all clean
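As a usage sketch, the all and clean rules defined in this Makefile are invoked as follows:
```screen
make         # default 'all' rule: compiles src/*.c into obj/ and links bin/my_program
make clean   # removes the obj/ and bin/ directories
```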
Building and running the project
To build and run the project:
Log in to the HPE Cray Supercomputing EX system:
ssh your_username@cray_system_address
From the OS prompt (for example, Linux), load the CPE environment and required modules:
module load PrgEnv-cray
module load cray-mpich
Navigate to the project directory:
cd my_project
Build the project using make:
make
The make command compiles the source files and creates the my_program executable in the bin directory.
Create a job script to run the program. For example, create a job script file named run_job.sh:
#!/bin/bash
#SBATCH --job-name=my_program
#SBATCH --output=output.txt
#SBATCH --error=error.txt
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
srun ./bin/my_program
Submit the job script:
sbatch run_job.sh
Check the job status:
squeue
View the output:
cat output.txt
After the job completes, you can view the output in the output.txt file.
Example: Choosing a library for a job and accessing Apprentice3 reporting functions
CPE-included libraries provide optimized, reusable, and scalable implementations of common computational tasks, enabling developers to write efficient and high-performance applications. By leveraging these libraries, you can focus on the core logic of your applications while benefiting from the performance and productivity gains offered by the libraries. Libraries available for CPE jobs include:
Library Type |
Description |
|---|---|
Numerical/Scientific Libraries |
- HPE Cray LibSci (cray-libsci): Contains optimized scientific and engineering mathematical subroutines, including BLAS, LAPACK, and ScaLAPACK.
- FFTW (cray-fftw): A library for computing discrete Fourier transforms, based on FFTW3.
- HPE Cray LibSci ACC (cray-libsci_acc): An extension of HPE Cray LibSci that provides scientific and mathematical routines that leverage GPU acceleration. |
I/O Libraries |
- HDF5 (cray-hdf5 and cray-hdf5-parallel): For managing large amounts of data.
- NetCDF (cray-netcdf and cray-parallel-netcdf): For array-oriented scientific data. |
This section provides an example CPE job that integrates:
A CPE-included scientific library (
libsci),A sample C++ program using OpenMPI for parallel processing,
HPE Cray debugging (DDT),
Performance tools (HPE Perftools and CrayPAT), and
Apprentice3 to further analyze performance.
For more information on CPE-included libraries, see Understanding the key CPE components.
Running, compiling, and debugging a C++ OpenMPI program
Log in to the HPE Cray Supercomputing EX system:
ssh your_username@cray_system_address
From the OS prompt (for example, Linux), load the CPE environment and required modules:
module load PrgEnv-cray
module load cray-mpich
module load cray-libsci
module load perftools-base
module load perftools
module load ddt
Create a sample C++ program that uses OpenMPI and HPE Cray LibSci, and name the source file
mpi_example.cpp:
#include <iostream>
#include <mpi.h>
#include <cmath>
#include <cray/lapack.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    // Print a hello world message
    std::cout << "Hello world from processor " << world_rank << " of " << world_size << std::endl;
    // Example: Use a Cray LibSci function (LAPACK) for computing eigenvalues
    if (world_rank == 0) {
        int n = 3;
        double a[9] = {4.0, 1.0, 1.0, 1.0, 4.0, 1.0, 1.0, 1.0, 4.0};
        double w[3];
        int info;
        int lwork = 26;
        double work[26];
        // Diagonalize the matrix a
        dsyev_("V", "U", &n, a, &n, w, work, &lwork, &info);
        std::cout << "Eigenvalues:" << std::endl;
        for (int i = 0; i < n; i++) {
            std::cout << w[i] << std::endl;
        }
    }
    MPI_Finalize();
    return 0;
}
Compile the mpi_example.cpp program. Use the HPE Cray compiler, and link it with the HPE Cray LibSci library.
CC -o mpi_example mpi_example.cpp -lm -lsci_cray
Create a job script (job_script.slurm) needed to submit the job to the batch scheduling system of the cluster (for example, SLURM or PBS):
#!/bin/bash
#SBATCH --job-name=mpi_example
#SBATCH --output=mpi_example.out
#SBATCH --error=mpi_example.err
#SBATCH --time=00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
# Load the necessary modules
module load PrgEnv-cray
module load cray-mpich
module load cray-libsci
# Run the program using OpenMPI
srun --ntasks=32 ./mpi_example
Submit the job script to the scheduler:
sbatch job_script.slurm
Create a debugging script with the HPE Cray Distributed Debugging Tool (DDT):
#!/bin/bash
#SBATCH --job-name=debug_mpi_example
#SBATCH --output=debug_mpi_example.out
#SBATCH --error=debug_mpi_example.err
#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --partition=debug
# Load the necessary modules
module load PrgEnv-cray
module load cray-mpich
module load cray-libsci
module load ddt
# Run the program with ddt
ddt --connect srun --ntasks=32 ./mpi_example
Submit the debugging script:
sbatch debug_script.slurm
Use HPE Perftools with CrayPAT to set up a job script:
Recompile the program with performance analysis instrumentation:
pat_build -u -O apa ./mpi_example
This command creates a new executable, typically with a suffix like +pat.
Create a
perf_script.slurm file:
#!/bin/bash
#SBATCH --job-name=perf_mpi_example
#SBATCH --output=perf_mpi_example.out
#SBATCH --error=perf_mpi_example.err
#SBATCH --time=00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
# Load the necessary modules
module load PrgEnv-cray
module load cray-mpich
module load cray-libsci
module load perftools-base
module load perftools
# Run the program with performance analysis
srun --ntasks=32 ./mpi_example+pat
Submit the performance analysis script:
sbatch perf_script.slurm
Allow the job to complete.
Generate the performance report:
pat_report -o report.ap2 ./mpi_example+pat.xf
pat_report -o report.ap3 ./mpi_example+pat.xf
Wait for the system to generate the Apprentice3 report.
Transfer the
.ap2 file to your local machine:
scp your_username@cray_system_address:/path/to/report.ap2
Open the .ap2 file on your local machine to view it:
apprentice2 report.ap2
Transfer the .ap3 file to your local machine, and open it:
apprentice3 report.ap3
Refer to Reviewing an example performance report and Reviewing example Apprentice3 reports for additional information on Perftools and Apprentice3.
Example job: Compiling, debugging, and analyzing a job using the C Clang compiler, LLVM, and HIP
C Clang, LLVM, and HIP are essential tools in CPE for developing high-performance, portable, and scalable applications. Clang and LLVM provide optimized compilation and code generation, while HIP enables GPU-accelerated computing across multiple architectures. You can integrate these tools into CPE and leverage them for science-based industry applications, such as computational chemistry simulations. The provided sample end-user report gives insight into the performance of the application and helps identify areas for optimization.
Example: Science-based computational chemistry simulation
For this simulation, assume that you are performing a simple computation: summing two large vectors that represent molecular properties. This computation uses HIP (hipcc) for GPU acceleration. This example uses the GPUs on the HPE Cray Supercomputing EX system to perform the vector addition efficiently.
Setting up a HIP program
The following HIP program performs vector addition functions. The
provided vector_addition.hip code performs element-wise addition of
two vectors on the GPU.
// vector_addition.hip
#include <hip/hip_runtime.h>
#include <iostream>
__global__ void vector_add(const float* A, const float* B, float* C, int N) {
int i = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
if (i < N) {
C[i] = A[i] + B[i];
}
}
int main() {
int N = 10000;
size_t size = N * sizeof(float);
float *h_A = (float*)malloc(size);
float *h_B = (float*)malloc(size);
float *h_C = (float*)malloc(size);
for (int i = 0; i < N; ++i) {
h_A[i] = static_cast<float>(i);
h_B[i] = static_cast<float>(i);
}
float *d_A, *d_B, *d_C;
hipMalloc(&d_A, size);
hipMalloc(&d_B, size);
hipMalloc(&d_C, size);
hipMemcpy(d_A, h_A, size, hipMemcpyHostToDevice);
hipMemcpy(d_B, h_B, size, hipMemcpyHostToDevice);
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
hipLaunchKernelGGL(vector_add, dim3(blocksPerGrid), dim3(threadsPerBlock), 0, 0, d_A, d_B, d_C, N);
hipMemcpy(h_C, d_C, size, hipMemcpyDeviceToHost);
// Verify the result
bool success = true;
for (int i = 0; i < N; ++i) {
if (h_C[i] != h_A[i] + h_B[i]) {
success = false;
break;
}
}
if (success) {
std::cout << "Vector addition successful!" << std::endl;
} else {
std::cout << "Vector addition failed!" << std::endl;
}
hipFree(d_A);
hipFree(d_B);
hipFree(d_C);
free(h_A);
free(h_B);
free(h_C);
return 0;
}
Logging in, loading modules, and executing the program
Log in to the HPE Cray Supercomputing EX system:
ssh your_username@cray_system_address
Load the CPE module:
module load PrgEnv-cray
Load Clang, LLVM, and HIP modules:
module load clang
module load llvm
module load rocm   # ROCm is AMD's software stack that includes HIP
Run the compiled HIP program, allocating the appropriate resources (for example, GPUs) using a job scheduler, such as SLURM (a compile sketch follows this procedure):
srun --gres=gpu:4 ./vector_addition
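The steps above assume that vector_addition.hip has already been compiled into the vector_addition executable. A minimal sketch of that compile step, assuming the hipcc driver provided by the rocm module, might look like:
```screen
hipcc -o vector_addition vector_addition.hip   # compile the HIP source for the GPU
```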
Generating the performance report
Instrument the program with HPE CrayPAT:
module load perftools
pat_build -w ./vector_addition
srun -n 1 --gres=gpu:4 ./vector_addition+pat
Generate the performance report:
pat_report
Review the report.
Example program runtime performance report:
Table 1: Program Runtime
Time% | Time  | Imb. Time | Imb. Time% | Calls | Group
------|-------|-----------|------------|-------|----------------
100.0 | 1.23s | 0.00s     | 0.0%       | 1     | Total
 80.0 | 0.98s | 0.00s     | 0.0%       | 1     | GPU Computation
 20.0 | 0.25s | 0.00s     | 0.0%       | 3     | Memory Transfer
Report Explanation:
Report section |
Section Description |
|---|---|
Time% |
The percentage of total runtime spent in each activity. |
Time |
The total time spent in each activity. |
Imb. Time |
The imbalance time, indicating how much time different processes spent waiting for each other. |
Imb. Time% |
The percentage of imbalance time relative to the total time. |
Calls |
The number of calls made to each group of routines. |
Group |
The group of routines (for example, GPU Computation, Memory Transfer). |
Total |
Represents the overall execution time of the program. |
GPU Computation |
Indicates the time spent on GPU computations. In this example, it accounts for 80% of the total runtime. |
Memory Transfer |
Represents the time spent on transferring data between host and device memory, accounting for 20% of the total runtime. |
The example performance report shows that the majority of the time (80%) is spent on GPU computations, which is expected for GPU-accelerated workloads. Memory transfer accounts for 20% of the total runtime, indicating that some overhead is associated with data movement between the CPU and GPU. No significant imbalance time is reported, suggesting that the workload is well-distributed across the available resources.
Example: Running, compiling, and debugging a sample Fortran health industry job
Integrating Fortran code with CPE involves several steps, from writing and compiling the code to running it and using debugging and performance tools. This section provides details for setting up, running, and compiling a sample health industry-related program.
Setting up the Fortran health industry job
For this example, the Fortran code simulates a SIR model Fortran program. In the SIR model:
S - represents the number of susceptible individuals.
I - represents the number of infected individuals.
R - represents the number of recovered (and immune) individuals.
The model is governed by differential equations:
dS/dt = -B * S * I
dI/dt = B * S * I - Y * I
dR/dt = Y * I
where:
B is the infection rate (where B=beta).
Y is the recovery rate (where Y=gamma).
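For reference, the model and the forward-Euler update that the Fortran code below applies at each time step can be written as follows (the code sets beta = 0.3, gamma = 0.1, and dt = 0.1):
```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta S I, &
\frac{dI}{dt} &= \beta S I - \gamma I, &
\frac{dR}{dt} &= \gamma I,\\
S_{n+1} &= S_n - \beta S_n I_n \,\Delta t, &
I_{n+1} &= I_n + \left(\beta S_n I_n - \gamma I_n\right)\Delta t, &
R_{n+1} &= R_n + \gamma I_n \,\Delta t.
\end{aligned}
```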
The Fortran code comprises:
! sir_model.f90
program sir_model
implicit none
! Parameters
real(8), parameter :: beta = 0.3d0 ! Infection rate
real(8), parameter :: gamma = 0.1d0 ! Recovery rate
real(8), parameter :: dt = 0.1d0 ! Time step
integer, parameter :: num_steps = 160 ! Number of time steps
! Variables
real(8) :: S, I, R ! S, I, R compartments
real(8) :: dS, dI, dR ! Derivatives of S, I, R
real(8) :: t ! Time variable
integer :: step ! Loop counter
! Initial conditions
S = 0.99d0 ! 99% of the population is susceptible
I = 0.01d0 ! 1% of the population is infected
R = 0.0d0 ! 0% of the population is recovered
! Time loop
print *, 'Time', ' ', 'Susceptible', ' ', 'Infected', ' ', 'Recovered'
do step = 0, num_steps
t = step * dt
! Calculate derivatives
dS = -beta * S * I
dI = beta * S * I - gamma * I
dR = gamma * I
! Update compartments
S = S + dS * dt
I = I + dI * dt
R = R + dR * dt
! Print results
print '(F6.1, 3F12.6)', t, S, I, R
end do
end program sir_model
The Fortran code is saved to a file named sir_model.f90.
Load CPE, run, and compile the job
To load CPE and then run and compile the job:
Load CPE modules:
module load PrgEnv-cray
Compile the Fortran code:
ftn -O2 -o sir_model sir_model.f90
Run the program:
./sir_model
Review the output:
Time   Susceptible   Infected   Recovered
 0.0   0.990000      0.010000   0.000000
 0.1   0.987030      0.012870   0.000100
 0.2   0.983059      0.016711   0.000230
 0.3   0.977941      0.021536   0.000523
 ...
Continue to Debugging and reviewing the performance of your job to debug your job.
Debugging and reviewing the performance of your job
For this example, use gdb4hpc and DDT to debug your job:
module load gdb4hpc
gdb4hpc ./sir_model
For the gdb4hpc session, enter:
(gdb) break 20   ! Set a breakpoint at line 20
(gdb) run        ! Run the program
To use DDT, enter:
module load ddt
ddt ./sir_model
The graphical debugger opens, and you can then set breakpoints, inspect variables, and step through the code.
To review the performance of your application using HPE Perftools and HPE Apprentice3:
Instrument the program with HPE CrayPAT:
module load perftools
pat_build -w ./sir_model
Run the instrumented executable:
./sir_model+pat
Generate a performance report:
pat_report
Visualize the performance data with HPE Apprentice3:
app3 sir_model+pat+*.ap3
Example: Running and compiling a deep learning job using Horovod
CPE seamlessly integrates and deploys deep learning capabilities through third-party tools, such as Horovod and DeepSpeed. Horovod is a distributed deep learning framework designed to scale training efficiently across multiple GPUs and nodes. Horovod's scalability and efficient communication make it well suited for large-scale deep learning workloads on HPE Cray Supercomputing EX systems. By leveraging Horovod and HPE's high-speed interconnects, you can achieve optimal performance for deep learning tasks. Furthermore, Horovod integrates with TensorFlow, PyTorch, and other frameworks, leveraging advanced communication algorithms and the high-speed interconnects found in HPE Cray Supercomputing EX systems. This procedure demonstrates how to load modules, set up the environment, write and run a deep learning script, and analyze the results.
Example: Predicting Retail Sales with Horovod
This example demonstrates how to use Horovod for distributed training of a deep learning model that predicts retail sales based on historical data.
Example Code: Predicting Sales with TensorFlow and Horovod
```screen
# sales_prediction_horovod.py
import horovod.tensorflow.keras as hvd
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Initialize Horovod
hvd.init()
# Pin GPU to local rank to avoid cross-talk
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
# Load the dataset
data = pd.read_csv('sales_data.csv')
# Preprocess the dataset
data = pd.get_dummies(data, columns=['store', 'product'])
X = data.drop('sales', axis=1).values
y = data['sales'].values
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build the model
model = tf.keras.Sequential([
tf.keras.layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(1)
])
# Adjust the learning rate based on the number of workers
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size())
# Wrap the optimizer with Horovod's distributed optimizer
optimizer = hvd.DistributedOptimizer(optimizer)
# Compile the model
model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])
# Broadcast initial variable states from rank 0 to all workers
hvd.broadcast_variables(model.variables, root_rank=0)
hvd.broadcast_variables(optimizer.variables, root_rank=0)
# Define callbacks for distributed training
callbacks = [
hvd.callbacks.BroadcastGlobalVariablesCallback(0),
hvd.callbacks.MetricAverageCallback(),
]
# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2, callbacks=callbacks)
# Evaluate the model (only rank 0 prints results)
loss, mae = model.evaluate(X_test, y_test)
if hvd.rank() == 0:
print(f'Mean Absolute Error: {mae}')
# Save the model (only rank 0 saves the model)
if hvd.rank() == 0:
model.save('sales_prediction_model.h5')
```
Explanation of the Code
Horovod Initialization: Horovod is initialized to enable distributed training across nodes.
GPU Pinning: Each worker is assigned a specific GPU to avoid resource contention.
Data Loading and Preprocessing: Categorical features are one-hot encoded, and the dataset is split into training and testing sets.
Distributed Optimizer: The Horovod
DistributedOptimizeris used for efficient gradient communication across workers.Callbacks: Horovod callbacks ensure proper synchronization of variables and metrics across all workers.
Model Saving: The model is saved only by rank 0 to avoid redundant writes.
Setting up CPE with Horovod
To set up CPE with Horovod:
Load the programming environment and additional modules required for distributed computing:
module load PrgEnv-cray
module load cray-python
module load cray-mpich
Set up a Python virtual environment to manage dependencies for your deep learning project:
python -m venv horovod_env
source horovod_env/bin/activate
Install Horovod, TensorFlow, or PyTorch, and any additional libraries your project requires:
pip install horovod tensorflow pandas numpy matplotlib scikit-learn
Running the code
To run the script, use srun for distributed execution on HPE Cray
Supercomputing EX systems:
module load PrgEnv-cray
module load cray-python
module load cray-mpich
source horovod_env/bin/activate
srun --gres=gpu:4 -n 8 python sales_prediction_horovod.py
Note that --gres=gpu:4 allocates four GPUs per node, and -n 8
specifies eight processes for distributed training (adjust based on
available resources).
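As a hedged alternative to the interactive commands above, the same run can be collected into a SLURM batch script (the file name and resource directives are illustrative and depend on your site's configuration):
```screen
#!/bin/bash
#SBATCH --job-name=sales_prediction_horovod
#SBATCH --output=horovod_output.txt
#SBATCH --error=horovod_error.txt
#SBATCH --ntasks=8
#SBATCH --gres=gpu:4
#SBATCH --time=00:30:00

# Load the programming environment and the MPI/Python modules
module load PrgEnv-cray
module load cray-python
module load cray-mpich

# Activate the Python virtual environment that contains Horovod and TensorFlow
source horovod_env/bin/activate

# Launch the distributed training processes across the allocated GPUs
srun python sales_prediction_horovod.py
```
Submit the script with sbatch and monitor it with squeue, as in the earlier SLURM examples.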
Debugging (TensorFlow)
To debug the program:
Enable TensorFlow debugging hooks:
tf.debugging.set_log_device_placement(True)
Set the Horovod log level to debug for more detailed output:
export HOROVOD_LOG_LEVEL=DEBUG
Load and use debugging tools, such as gdb4hpc or DDT:
module load gdb4hpc
gdb4hpc --batch -ex "run" -ex "bt" ./sales_prediction_horovod.py
Analyzing the Performance
Use CPE performance tools to analyze the performance of distributed training.
Instrument the script with CrayPAT:
module load perftools
pat_build -w python sales_prediction_horovod.py
Run the instrumented application by executing the instrumented script:
srun --gres=gpu:4 -n 8 ./sales_prediction_horovod.py+pat
A new directory is created to contain the performance data that is collected. The default name of that directory begins with the name of the instrumented executable, followed by a string that guarantees that the directory name is unique, but that is not predictable. The directory created in this step will be designated
$data_directory in the succeeding steps.
Generate and examine the default text report with pat_report:
pat_report $data_directory
Load the performance data into Apprentice for visualization:
app3 $data_directory
Review the performance report:
Time%   Time(s)   Imbalance Time(s)   Imbalance Time%   Calls   Group
100.0   5.00      0.00                0.0%              1       Total
 80.0   4.00      0.00                0.0%              1       GPU Computation
 15.0   0.75      0.00                0.0%              3       Memory Transfer
  5.0   0.25      0.00                0.0%              2       Model Evaluation
Explanation
GPU Computation: Accounts for 80% of the runtime, indicating that the workload is effectively utilizing GPU resources. This scenario is typical for GPU-accelerated deep learning tasks.
Memory Transfer: Memory transfer between the CPU and GPU accounts for 15% of the runtime. While this is acceptable, optimizing data movement could further improve performance.
Model Evaluation: Indicates that the evaluation step accounts for 5% of the runtime, which involves testing the model on the test dataset. This result is expected in scenarios where inference is lightweight compared to training.
Imbalance Time: Shows no imbalance time, meaning all workers are efficiently synchronized during distributed training.
Reviewing reports
This chapter provides, for reference, a sample of the reporting capabilities available through CPE and its toolset. Specifically, these reports include an example output file, an example debugging report, an example performance report, and example Apprentice3 reports, which are described in the following sections.
For additional support information, see Documentation and support.
Reviewing an example output file
This chapter provides an example of an output.txt file after running, for
example, an mpi_example.c program in CPE. This example is followed by
an explanation of what the content indicates.
Example
Hello from processor 0 out of 8 processors
Hello from processor 1 out of 8 processors
Hello from processor 2 out of 8 processors
Hello from processor 3 out of 8 processors
Hello from processor 4 out of 8 processors
Hello from processor 5 out of 8 processors
Hello from processor 6 out of 8 processors
Hello from processor 7 out of 8 processors
Output explanation
Parallel Execution Confirmation:
The output indicates that the program has successfully executed in parallel across multiple processors. Each line represents a message from a different processor (rank).
The presence of messages from different ranks (0 to 7) confirms that the MPI environment is properly set up and functioning.
Processor Count:
The message
Hello from processor X out of 8 processors confirms that the program was run using 8 processors.
This suggests that the job was correctly configured to use multiple processors/nodes as specified in the job script (
submit_job.sh).
MPI Initialization and Finalization:
The consistent format of the output from all processors suggests that the MPI initialization (
MPI_Init) and finalization (MPI_Finalize) calls are correctly placed and functioning.
It indicates that each processor is able to communicate within the MPI world, as evidenced by the correct rank and size information.
Scalability:
If the output shows messages from more or fewer processors than expected, it may indicate a need to adjust the
--nodes and --ntasks-per-node parameters in the job script to better match the desired level of parallelism.
Portability:
If the program produces similar output on different HPE Cray Supercomputing EX systems, it suggests that the code is portable across different environments.
Any discrepancies in the output when run on different systems may indicate issues related to portability or system-specific configurations.
Debugging:
If there are missing or out-of-order messages, it may suggest issues with how the MPI ranks are being managed or potential bugs in the code.
Consistent and correct output helps in verifying that there are no immediate runtime errors.
Efficiency:
While this simple output does not provide detailed performance metrics, consistent and quick completion of the job indicates efficient execution.
For more detailed efficiency analysis, additional profiling tools like HPE CrayPat should be used.
By analyzing the content of the output.txt file, you can gain insights
into the successful execution, scalability, portability, and potential
areas for debugging and optimization of your parallel MPI program in
CPE. Further actions that can be taken include:
Verifying the MPI setup: You should ensure that the MPI environment is correctly set up and that all processors are being utilized as expected.
Testing scalability: You can experiment with different numbers of nodes and tasks to observe how the application scales and to identify optimal configurations.
Debugging: If the output is not as expected, use debugging tools like the HPE Cray gdb4hpc Debugger to step through the code and diagnose issues.
Analyzing performance: Use profiling tools to analyze and improve the efficiency of the program, especially for more complex applications.
Checking portability: Run the program on different systems to ensure portability and to identify any system-specific issues.
Reviewing an example debugging report
This chapter demonstrates an example HPE Cray Debugging Tool report. The DDT tool is a powerful graphical debugger used to find and fix bugs in parallel applications. While DDT itself is an interactive tool, it allows you to generate reports that summarize the debugging session.
For this example, assume that you have run the debugging session and generated a report.
DDT example report
Detailed Explanation of the Performance Report
==========================================================
DDT Debugging Report
==========================================================
Session Information
--------------------
Start Time: 2023-10-15 10:00:00
End Time: 2023-10-15 10:30:00
Duration: 30 minutes
User: username
Application: /home/user/my_application
Number of Processes: 4
MPI Implementation: MPICH
Hostname: cray-login.example.com
Breakpoints
-----------
Filename: cfd_simulation.c
Line Number: 24
Condition: None
Hit Count: 4
Variable Watch
--------------
Variable Name: grid
Type: double[NX][NY]
Current Value: Displayed in GUI
Watched Expressions: grid[50][50]
Call Stack
----------
Process: 0
Stack Frames:
- compute (cfd_simulation.c:24)
- main (cfd_simulation.c:36)
Memory Usage
------------
Process: 0
Total Memory: 120 MB
Used Memory: 75 MB
Free Memory: 45 MB
Messages
--------
Warnings: None
Errors: None
Debug Messages: 5
==========================================================
End of Report
==========================================================
Report explanation
The DDT report sections include:
Session Information
Breakpoints
Variable Watch
Call Stack
Memory Usage
Messages
This section details information about each section of the DDT report.
Session Information
This section provides an overview of the debugging session.
Section |
Indicates the: |
|---|---|
Start Time |
Start time of the debugging session. |
End Time |
End time of the debugging session. |
Duration |
Total duration of the debugging session. |
User |
Username of the person running the debugging session. |
Application |
Path to the application being debugged. |
Number of Processes |
Number of MPI processes used in the session. |
MPI Implementation |
MPI library being used (e.g., MPICH, OpenMPI). |
Hostname |
Hostname of the machine where the debugging session took place. |
Breakpoints
This section lists the breakpoints set during the debugging session.
Section |
Indicates: |
|---|---|
Filename |
The source file where the breakpoint is set. |
Line Number |
The line number in the source file where the breakpoint is set. |
Condition |
Any condition associated with the breakpoint (for example, break when a certain variable equals a specific value). |
Hit Count |
The number of times the breakpoint was hit during the session. |
Variable Watch
This section provides information about variables being watched during the debugging session.
Section |
Indicates: |
|---|---|
Variable Name |
The name of the variable being watched. |
Type |
The data type of the variable. |
Current Value |
The current value of the variable (this is typically displayed in the DDT GUI). |
Watched Expressions |
Specific expressions or array indices being watched (for example, grid[50][50]). |
Call Stack
This section provides the call stack information for a specific process at a breakpoint or when an error occurs.
Section |
Indicates the: |
|---|---|
Process |
Process ID for which the call stack is displayed. |
Stack Frames |
List of function calls leading to the current point of execution, including the source file and line number. |
Example: |
|
compute (cfd_simulation.c:24) |
The compute function at line 24 in cfd_simulation.c. |
main (cfd_simulation.c:36) |
The main function at line 36 in cfd_simulation.c. |
Memory Usage
This section provides information about memory usage for a specific process.
Section |
Indicates the: |
|---|---|
Process |
Process ID for which the memory usage is displayed. |
Total Memory |
Total amount of memory allocated to the process. |
Used Memory |
Amount of memory currently used by the process. |
Free Memory |
Amount of memory currently free. |
Messages
This section provides a summary of messages generated during the debugging session.
Section |
Indicates: |
|---|---|
Warnings |
Any warnings generated during the session. |
Errors |
Any errors encountered during the session. |
Debug Messages |
Any debug messages generated by the application or DDT. |
Reviewing an example performance report
This chapter demonstrates an example performance report generated using HPE Cray Perftools. Perftools is a CPE tool that allows you to determine how efficiently your application performs. It provides details about execution time, load balance, function profiles, MPI communication, and I/O operations. Understanding this report helps identify performance bottlenecks and optimize code for better efficiency on Cray supercomputers.
In this example, assume that you have run your instrumented application
and generated performance data files with the extension .xf. To
generate a performance report, you enter:
pat_report my_application+pat+*.xf > performance_report.txt
After issuing the pat_report command, an example
performance_report.txt could include:
==================================================================
Table of Contents
==================================================================
1. Summary
2. Loop Work Estimates
3. Load Balance
4. Time Histogram
5. Text Profile
6. Functions Profile
7. Message Profile
8. I/O Profile
==================================================================
1. Summary
==================================================================
Experiment: pat_build -O my_application
Executable: my_application
Path: /home/user/my_application
Number of PEs: 4
Sampling Rate: 10 ms
Elapsed Time: 120.45 seconds
Total CPU Time: 480.90 seconds
I/O Time: 15.30 seconds
MPI Time: 30.60 seconds
==================================================================
2. Loop Work Estimates
==================================================================
Loop at line 18 in compute():
- Total work: 300000
- Max work: 75000
- Min work: 70000
- Imbalance: 5000
==================================================================
3. Load Balance
==================================================================
PE 0: 120.00 seconds
PE 1: 120.00 seconds
PE 2: 120.00 seconds
PE 3: 120.00 seconds
Maximum Load Imbalance: 0.00 seconds
==================================================================
4. Time Histogram
==================================================================
Time (seconds) PEs
0 - 30 4
30 - 60 4
60 - 90 4
90 - 120 4
120 - 150 4
==================================================================
5. Text Profile
==================================================================
Function Time (seconds) Time (%)
-----------------------------------------------------------------
initialize 10.00 8.3%
compute 90.00 75.0%
MPI_Init 5.00 4.2%
MPI_Finalize 5.00 4.2%
MPI_Comm_rank 2.00 1.7%
MPI_Comm_size 2.00 1.7%
Other 6.45 5.4%
==================================================================
6. Functions Profile
==================================================================
PE 0:
Function Time (seconds) Time (%)
-----------------------------------------------------------------
initialize 2.50 2.1%
compute 22.50 18.8%
MPI_Init 1.25 1.0%
MPI_Finalize 1.25 1.0%
MPI_Comm_rank 0.50 0.4%
MPI_Comm_size 0.50 0.4%
Other 1.61 1.3%
PE 1:
Function Time (seconds) Time (%)
-----------------------------------------------------------------
initialize 2.50 2.1%
compute 22.50 18.8%
MPI_Init 1.25 1.0%
MPI_Finalize 1.25 1.0%
MPI_Comm_rank 0.50 0.4%
MPI_Comm_size 0.50 0.4%
Other 1.61 1.3%
PE 2:
Function Time (seconds) Time (%)
-----------------------------------------------------------------
initialize 2.50 2.1%
compute 22.50 18.8%
MPI_Init 1.25 1.0%
MPI_Finalize 1.25 1.0%
MPI_Comm_rank 0.50 0.4%
MPI_Comm_size 0.50 0.4%
Other 1.61 1.3%
PE 3:
Function Time (seconds) Time (%)
-----------------------------------------------------------------
initialize 2.50 2.1%
compute 22.50 18.8%
MPI_Init 1.25 1.0%
MPI_Finalize 1.25 1.0%
MPI_Comm_rank 0.50 0.4%
MPI_Comm_size 0.50 0.4%
Other 1.61 1.3%
==================================================================
7. Message Profile
==================================================================
MPI Function Count Time (seconds) Time (%)
-----------------------------------------------------------------
MPI_Init 4 5.00 4.2%
MPI_Finalize 4 5.00 4.2%
MPI_Comm_rank 4 2.00 1.7%
MPI_Comm_size 4 2.00 1.7%
==================================================================
8. I/O Profile
==================================================================
I/O Operation Count Time (seconds) Time (%)
-----------------------------------------------------------------
read 100 5.00 4.2%
write 100 10.30 8.6%
Report explanation
The performance report comprises eight sections:
Summary
Loop Work Estimates
Load Balance
Time Histogram
Text Profile
Functions Profile
Message Profile
I/O Profile
This section provides an overview on how to interpret each of the report sections.
Summary
This section provides an overview of the performance data collected during the execution of your application.
Section |
Indicates the: |
|---|---|
Experiment |
Instrumentation method used (pat_build -O my_application). |
Executable |
The name of the executable (my_application). |
Path |
The directory path of the executable. |
Number of PEs |
The number of Processing Elements (PEs) used (4 in this case). |
Sampling Rate |
The rate at which performance data was sampled (10 ms). |
Elapsed Time |
Total wall-clock time for the application to run (120.45 seconds). |
Total CPU Time |
Aggregate CPU time over all PEs (480.90 seconds). |
I/O Time |
Time spent on I/O operations (15.30 seconds). |
MPI Time |
Time spent in MPI communication (30.60 seconds). |
Loop Work Estimates
This section provides estimates of the work done in loops within your application.
Section |
Indicates the: |
|---|---|
Loop at line 18 in compute() |
Information about the loop at line 18 in the compute function. |
Total work |
Total amount of work done. |
Max work |
Maximum work done by any PE. |
Min work |
Minimum work done by any PE. |
Imbalance |
Difference between maximum and minimum work. |
Load Balance
This section shows the load balance across different PEs.
Section |
Indicates the: |
|---|---|
PE 0-3 |
Time spent by each PE. |
Maximum Load Imbalance |
The maximum difference in time spent between any two PEs. |
Time Histogram
This section shows a histogram of time spent by PEs in different time intervals.
| Section | Indicates the: |
|---|---|
| Time (seconds) | Time intervals. |
| PEs | Number of PEs spending time in each interval. |
Text Profile
This section provides a summary of time spent in different functions.
| Section | Indicates the: |
|---|---|
| Function | Name of the function. |
| Time (seconds) | Time spent in the function. |
| Time (%) | Percentage of total time spent in the function. |
Functions Profile
This section provides detailed function profiles for each PE.
| Section | Indicates the: |
|---|---|
| PE 0-3 | Time spent by each PE in different functions. |
| Function | Name of the function. |
| Time (seconds) | Time spent in the function. |
| Time (%) | Percentage of total time spent in the function. |
Message Profile
This section provides information about MPI communication.
| Section | Indicates the: |
|---|---|
| MPI Function | Name of the MPI function. |
| Count | Number of times the function was called. |
| Time (seconds) | Time spent in the function. |
| Time (%) | Percentage of total time spent in the function. |
I/O Profile
This section provides information about I/O operations.
| Section | Indicates the: |
|---|---|
| I/O Operation | Type of I/O operation (for example, read, write). |
| Count | Number of times the operation was performed. |
| Time (seconds) | Time spent on the operation. |
| Time (%) | Percentage of total time spent on the operation. |
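If you want to revisit only one of these sections rather than the full report, the pat_report command can typically regenerate individual views from the collected experiment data. The sketch below is illustrative: the experiment directory name my_app+pat+12345 is a hypothetical placeholder, and the exact -O keywords available depend on your perftools release (see man pat_report).
# Check which predefined report options your perftools release supports
man pat_report
# Regenerate a function profile and a load-balance view from an existing experiment
pat_report -O profile my_app+pat+12345
pat_report -O load_balance my_app+pat+12345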
Reviewing example Apprentice3 reports
The Apprentice3 tools included as part of CPE generate reports and provide comprehensive insights into the performance of applications running on HPE Cray Supercomputing EX systems. The significance of these reports lies in their ability to help developers understand, analyze, and optimize their parallel applications. Specifically, these tools provide:
Performance Profiling: Detailed profiling of application execution, allowing developers to see how time is being spent across different parts of the code. This includes:
- Function-level profiling: Time spent in each function, number of calls, and average time per call.
- Call graphs: Visual representation of calling relationships between functions and the time spent in each function.
- Timeline views: Detailed time-based view of program execution across different processes and threads.

MPI and Communication Analysis: Insights into communication patterns and performance, including:
- MPI function profiling: Time spent in MPI communication functions.
- Communication matrices: Visualization of communication patterns between different MPI ranks.
- Message sizes and counts: Analysis of the sizes and frequency of messages exchanged between processes.

Load Imbalance Detection: Identifies load imbalances in parallel applications. Load imbalance can lead to inefficient utilization of resources, and addressing it can significantly improve performance. The reports provide:
- Imbalance metrics: Quantitative measures of load imbalance across processes.
- Visualization of imbalance: Graphical representations that highlight which parts of the code and which processes are experiencing imbalance.

Scalability Analysis: Information on how an application scales with the number of processors for optimizing performance on large systems. Apprentice reports include:
- Strong and weak scaling analysis: Metrics and visualizations that show how performance changes as the number of processors varies.
- Scalability bottlenecks: Identification of parts of the code that do not scale well, helping developers focus their optimization efforts.

Detailed Metrics and Event Tracing: Detailed metrics and event tracing information, including:
- Hardware counters: Metrics such as CPU cycles, cache misses, and FLOP counts.
- I/O performance: Analysis of input/output operations and their impact on overall performance.
- Synchronization events: Detection and analysis of synchronization primitives and their influence on performance.

User-friendly Visualization: User-friendly and visually intuitive reporting. For example, the reports include:
- Interactive graphs and charts: Allowing users to explore performance data interactively.
- Heat maps: Visual representation of various metrics across application execution.
- Filters and drill-down capabilities: Enabling users to zoom in on specific functions, processes, or time periods for more detailed analysis.

Guidance for Optimization: Detailed and actionable insights that guide developers in optimizing their applications. This includes:
- Identifying hotspots: Functions or code regions where most time is spent.
- Reducing communication overhead: Optimizing MPI communication patterns.
- Balancing load: Ensuring that work is evenly distributed across all processes.
- Improving scalability: Making the application more efficient as the number of processors increases.
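Producing data that Apprentice3 can display generally follows the CrayPAT instrument, run, and report cycle. The following sketch is illustrative only; the program name my_app, the mpi trace group, the PE count, and the experiment directory name are assumptions, and your site may use different module or launcher names.
# Load the CrayPAT tools and the Apprentice3 viewer
module load perftools-base perftools
module load app3
# Instrument the executable with MPI tracing and run it to collect data
pat_build -g mpi my_app
srun -n 4 ./my_app+pat
# Summarize the collected data, then open it in the Apprentice3 GUI
pat_report my_app+pat+12345
app3 my_app+pat+12345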
Reviewing the Apprentice3 report
Performance Overview
--------------------
Total Execution Time: 100.0 seconds
Total MPI Time: 10.0 seconds (10%)
Total Computation Time: 90.0 seconds (90%)
Function Metrics
----------------
Function | Time (s) | Calls | Avg Time/Call (ms)
---------------------------------------------------
main | 90.0 | 10 | 9000
MPI_Barrier | 5.0 | 10 | 500
MPI_Send | 3.0 | 20 | 150
MPI_Recv | 2.0 | 20 | 100
| Report Section | Section Description |
|---|---|
| Performance Overview | Summary of the program's performance, including total execution time and key metrics. |
| Function Metrics | Detailed breakdown of performance metrics by function, including time spent, number of calls, and average time per call. |
Supported systems
This publication supports installing CPE 25.09 on supported HPE Cray Supercomputing EX systems. Depending on the HPE Cray Supercomputing EX system, the supported architectures and operating system (OS) versions vary. This chapter provides information on supported systems for this release.
IMPORTANT: In releases before the COS 25.9 release, COS comprised three components: COS Base, HPE USS, and HPE SLES. With the COS 25.9 and CPE 25.09 releases, COS Base has been replaced with SLES 15 SP6. Starting with this CPE 25.09 release, COS 25.9 (and later) comprises:
HPE Cray Supercomputing User Services Software (USS)
HPE SUSE Linux Enterprise Server
This release also supports v20.0.0 of the HPE Cray Compiler Environment (CCE). See the CPE 25.09 Release Announcements on the CPE Online Documentation website for other supported dependencies.
Supported systems for CPE on CSM
This publication supports the installation of CPE 25.09 on HPE Cray Supercomputing EX systems with specific configurations:
| Management Software & Version | COS Version | Operating System | Architecture | GCC Version |
|---|---|---|---|---|
| CSM 1.7.X | COS 25.9 (USS 1.4.X) | SLES 15 SP6 | X86 | 14.0 |
| CSM 1.7.X | COS 25.9 (USS 1.4.X) | SLES 15 SP6 | AArch64 | 14.0 |
This release also supports v20.0.0 of the HPE Cray Compiler Environment (CCE).
IMPORTANT: In releases before the COS 25.9 release, COS comprised three components: COS Base, HPE USS, and HPE SLES. With the COS 25.9 and CPE 25.09 releases, COS Base has been replaced with SLES 15 SP6. Starting with this CPE 25.09 release, COS 25.9 (and later) comprises:
HPE Cray Supercomputing User Services Software (USS)
HPE SUSE Linux Enterprise Server
See the CPE 25.09 Release Announcements on the CPE Online Documentation website for other supported dependencies.
Supported systems for CPE with HPCM
This publication supports installing CPE 25.09 on HPE Cray Supercomputing EX systems with specific configurations:
| Management Software & Version | COS Version | Operating System | Architecture | GCC Version |
|---|---|---|---|---|
| HPCM 1.14 | COS 25.9 (USS 1.4.X) | SLES 15 SP7 | X86 | Not Applicable |
| HPCM 1.14 | COS 25.9 (USS 1.4.X) | SLES 15 SP6 | X86 | Not Applicable |
| HPCM 1.14 | COS 25.9 (USS 1.4.X) | SLES 15 SP7 | AArch64 | Not Applicable |
| HPCM 1.14 | COS 25.9 (USS 1.4.X) | SLES 15 SP6 | AArch64 | Not Applicable |
| HPCM 1.14 | Not Applicable | RHEL 9.6 | X86 | 14.0 |
| HPCM 1.14 | Not Applicable | RHEL 9.5 | X86 | 14.0 |
| HPCM 1.14 | Not Applicable | RHEL 8.10 | X86 | 14.0 |
| HPCM 1.14 | Not Applicable | RHEL 9.6 | AArch64 | 14.0 |
| HPCM 1.14 | Not Applicable | RHEL 9.5 | AArch64 | 14.0 |
This release also supports v20.0.0 of the HPE Cray Compiler Environment (CCE).
IMPORTANT: In releases before the COS 25.9 release, COS comprised three components: COS Base, HPE USS, and HPE SLES. With the COS 25.9 and CPE 25.09 releases, COS Base has been removed. Starting with this CPE 25.09 release, COS 25.9 (and later) comprises:
HPE Cray Supercomputing User Services Software (USS)
HPE SUSE Linux Enterprise Server
See the CPE 25.09 Release Announcements on the CPE Online Documentation website for other supported dependencies.
Supported systems for CPE on the HPE Cray XD2000
For this release, CPE is supported on HPE Cray XD2000 systems with designated operating systems and architectures:
| Management Software & Version | Operating System | Architecture |
|---|---|---|
| HPCM 1.14 | RHEL 8.10 | X86 |
This release also supports v20.0.0 of the HPE Cray Compiler Environment (CCE).
IMPORTANT: CPE versions 25.03 (and earlier) previously supported MOFED versions 5.8 (or earlier) as directed in installation instructions. However, with the CPE 25.09 release, HPE recommends that MOFED/DOCA OFED-dependent users with HPE Slingshot 10 (SS10) refrain from upgrading CPE beyond the CPE 25.03 release. HPE observed a system bug in MOFED, the Extended Reliable Connection (XRC) bug, which adversely affects CPE and SS10 functionality. The bug was introduced by NVIDIA in early 2023, and HPE reported details of the bug to NVIDIA in April 2023. The bug is currently unresolved and is not expected to be fixed during the transition from MOFED to DOCA OFED. Until a resolution or workaround is introduced, CPE users should not upgrade past the CPE 25.03 release.
See the CPE 25.09 Release Announcements on the CPE Online Documentation website for other supported dependencies.
Documentation and support
CPE installation and getting started guides
HPE CPE documentation comprises user and installation guides:
| Title | Document Part Number |
|---|---|
| HPE Cray Supercomputing Programming Environment Installation Guide: CSM on HPE Cray Supercomputing EX Systems | S-8003 |
| HPE Cray Supercomputing Programming Environment Installation Guide: HPCM on HPE Cray Supercomputing EX and HPE Cray Supercomputing Systems | S-8022 |
| HPE Cray Supercomputing Programming Environment Installation Guide: HPE Cray XD2000 Systems | S-8012 |
| HPE Cray Supercomputing Programming Environment Getting Started User Guide: HPE Cray Supercomputing EX Systems | S-9934 |
| HPE Cray Supercomputing Programming Environment Getting Started Administrator Guide: HPE Cray Supercomputing EX Systems | S-9935 |
Other resources
HPE additionally provides CPE documentation and support through various online sources:
Retrieve a range of HPE resources through the HPE Support Center, including access to support issues; the latest guides (as listed in CPE installation and getting started guides), including guide revisions; software download information; the HPE knowledge base; product information; and other resources.
To get the most out of CPE online, access the CPE Online Documentation website for initially released installation and Getting Started guides, in addition to general user procedures, release announcements, and best practice manuals.
Important: Be sure to regularly check for guide revisions on the HPE Support Center. Revisions of installation and Getting Started guides that are posted to the HPE Support Center are presumed more current than those posted on the CPE Online Documentation website.
Join the CPE #hpe-cray-programming-environment Slack channel through the HPE Developer Community Slack web page for interactive and collaborative CPE interactions.
Access CCE help using CCE module commands; a quick check of the active CCE compilers is sketched after this list:
man craycc or man crayCC - Returns HPE Cray C and C++ compiler man pages. (Alias for man clang.)
craycc --help - Returns a summary of the command line options and arguments.
man crayftn - Returns HPE Cray Fortran compiler man pages.
crayftn --help - Returns a summary of the command line options and arguments.
The complete Clang reference manual is included in HTML format in the /opt/cray/pe/cce/<version>.0.0/doc/html/index.html file system location. Note that the man page is presumed to be more current if content differences exist.
For CPE software installation and update information, as well as general CPE information, see My HPE Software Center.
Access the HPE Cray Supercomputing Programming Environment Software QuickSpecs online.
Access third-party documentation resources online.
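As referenced in the CCE help item earlier in this list, a quick way to confirm which CCE compilers are active is to load the Cray programming environment and query the compiler drivers. This is a minimal sketch under the assumption that the PrgEnv-cray module is available at your site; older CCE releases may use -V instead of --version for the Fortran driver.
# Switch to the Cray programming environment (CCE) if it is not already the default
module load PrgEnv-cray
# Confirm the compiler drivers resolve to CCE and report their versions
cc --version
CC --version
ftn --version
# Open the CCE man pages described earlier
man craycc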
Glossary
This section provides a listing of CPE general terms and definitions.
A
Adaptive Routing (AR): A technology that dynamically selects the best path for data packets in a network to improve performance and fault tolerance.
Apprentice3: A performance analysis tool that provides a graphical interface for visualizing performance data collected by HPE CrayPAT.
Command:
app3
Module:
module load app3
B
Batch System: Software that manages and schedules jobs on a supercomputer, ensuring efficient use of computational resources.
C
Cache Optimization: Techniques for optimizing data structures and algorithms to take advantage of cache locality to improve performance.
CCE (Cray Compiling Environment): HPE Cray’s native compiler suite for C, C++, and Fortran, optimized for Cray hardware.
Commands:
cc for C
CC for C++
ftn for Fortran
CrayPAT (Cray Performance Analysis Tools): A suite of tools for collecting and analyzing performance data of parallel applications.
Commands:
pat_build to instrument an application
pat_report to generate a performance report
Module:
module load perftools
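For example, a basic CrayPAT cycle might look like the following sketch; the program name my_app is an illustrative assumption, and srun is used here as the launcher.
module load perftools-base perftools
pat_build my_app            # default instrumentation; produces my_app+pat
srun -n 2 ./my_app+pat      # run the instrumented binary to collect data
pat_report my_app+pat+*     # generate a text performance report from the results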
D
DataWarp: A technology for accelerating I/O by using SSD-based storage to provide a high-speed buffer between compute nodes and the parallel file system.
Distributed Debugging Tool (DDT): A specialized debugger for debugging parallel applications, including MPI and OpenMP programs. Allows developers to determine the performance state of processes running together across cluster nodes. CPE supports the integration of DDTs, such as Perforce TotalView and Allinea DDT.
Command:
ddt
Module:
module load ddt
E
Environment Groups: Logical groupings of environment variables and module settings to simplify switching between different development environments.
Commands:
envmgr activate <group_name>
envmgr deactivate <group_name>
Environment Variables: Variables used to configure the runtime environment, such as PATH, LD_LIBRARY_PATH, and MODULEPATH.
F
File Striping: A method of dividing a file into segments and distributing them across multiple disks to improve I/O performance.
Command:
lfs setstripe -s 1M -c -1 <path>
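For example, to confirm the layout that was applied, you can query it afterward; the path /lus/scratch/myrun is a hypothetical placeholder.
lfs setstripe -s 1M -c -1 /lus/scratch/myrun   # stripe new files in this directory across all available OSTs
lfs getstripe /lus/scratch/myrun               # verify the stripe size and count in effect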
Finite Element Analysis (FEA): A computational technique used to approximate solutions to complex structural engineering problems.
FFTW (cray-fftw): An optimized and scalable library for computing Fast Fourier Transforms (FFTs) on HPE Cray EX Supercomputing systems, facilitating efficient FFT computations for various scientific and engineering applications.
Module:
module load cray-fftw; gcc -o my_fft_program my_fft_program.c -lfftw3
G
GCC (GNU Compiler Collection): A widely-used alternative compiler suite that supports various programming languages.
Commands:
gcc for C
g++ for C++
gfortran for Fortran
gdb4hpc (HPE Cray gdb-based HPC Debugger): Advanced HPC debugger for complex applications at scale.
Command:
gdb4hpc
Module:
module load gdb4hpc
H
HDF5 (cray-hdf5 and cray-hdf5-parallel): A data model, library, and file format for storing and managing large amounts of data.
Module:
module load cray-hdf5
Hybrid Parallel Programming: Combining MPI with OpenMP or other parallel programming models to leverage both inter-node and intra-node parallelism.
Huge pages: A Linux kernel feature that allows operating systems to manage memory in larger chunks as opposed to 4KB pages. Used to improve the efficiency of virtual memory systems. See About huge pages.
I
Intel Compiler: A suite of compilers optimized for Intel architectures.
Commands:
icc for C
icpc for C++
ifort for Fortran
J
Job Arrays: A method to submit multiple similar jobs using a single job script.
SLURM Command:
sbatch --array=0-9 my_job_script.sh
PBS Command:
qsub -t 0-9 my_job_script.sh
Job Scheduler: A system that manages and schedules jobs on a supercomputer, ensuring efficient use of computational resources.
SLURM:
sbatch,squeue,scancelPBS:
qsub,qstat,qdel
L
Low Level Virtual Machine (LLVM): An LLVM Foundation compiler and toolchain technology used to build compilers, debuggers, and other software development tools. In CPE, LLVM is specialized and used in conjunction with Clang to generate optimized code for improved performance. HPE Clang C and C++ is based on Clang/LLVM. See the HPE Cray Clang C and C++ Quick Reference documentation for information on HPE Clang C and C++, Clang documentation for more information on Clang, or LLVM documentation for more information on LLVM.
Lustre: A type of parallel distributed file system, primarily used for large-scale cluster computing.
Command:
lfs
LibSci (cray-libSci): A collection of scientific libraries optimized for Cray systems, including LAPACK, BLAS, and ScaLAPACK.
Module:
module load cray-libsci
LibSci_ACC (cray-libsci_acc): An extension of HPE Cray LibSci that includes GPU-accelerated versions of mathematical routines, leveraging GPU hardware to improve performance in scientific computations on HPE Cray EX Supercomputing systems equipped with GPUs.
Module:
module load cray-libsci_acc; nvcc -o my_gpu_program my_gpu_program.cu \ -L${CRAY_LIBSCI_ACC_PREFIX_DIR}/lib -lsci_acc
Lmod: A Lua-based module management software tool.
M
Makefile: A file containing a set of directives used by the make build automation tool to compile and link programs.
Command:
make
Modules: A system for dynamically modifying user environments through modulefiles. Modules can be loaded and unloaded to manage different software packages and versions.
Commands:
module load <module_name>
module unload <module_name>
module avail
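A typical module session might look like the following; cray-fftw is used only as an example package from this glossary.
module avail cray-fftw     # list the installed versions
module load cray-fftw      # add the package to the current environment
module list                # confirm which modules are loaded
module unload cray-fftw    # remove the package when finished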
MPI (Message Passing Interface): A standard for parallel programming that allows processes to communicate with each other by sending and receiving messages.
Common Functions:
MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv
N
NetCDF (cray-netcdf and cray-netcdf-hdf5parallel): Libraries supporting the creation, access, and sharing of array-oriented scientific data in the Network Common Data Form (NetCDF), offering parallel I/O support to improve performance and scalability on large-scale HPE Cray EX Supercomputing systems.
Module:
module load cray-netcdf; gcc -o my_netcdf_program my_netcdf_program.c -lnetcdf
NUMA (Non-Uniform Memory Access): An architecture where memory access time depends on the memory location relative to the processor.
O
OpenACC (for Fortran): A directive-based parallel programming model for offloading computations to GPUs.
Command:
ftn -hacc -o my_program my_program.f90
Directives:
!$acc parallel, !$acc kernels
OpenMP: An API for parallel programming that supports multi-platform shared memory and GPU parallel programming.
Common Directives:
#pragma omp parallel, #pragma omp for, #pragma omp critical, #pragma omp barrier
P
Parallel NetCDF (cray-parallel-netcdf): A high-performance parallel I/O library for NetCDF files, enabling efficient handling and management of large, distributed data sets in scientific applications running on HPE Cray EX Supercomputing systems.
Module:
module load cray-parallel-netcdf; gcc -o my_pnetcdf_program my_pnetcdf_program.c -lpnetcdf
PBS (Portable Batch System): A job scheduler used on some HPE Cray EX Supercomputing systems.
Commands:
qsub, qstat, qdel
Performance-Guided Optimization (PGO): Using profiling data to guide optimizations. PGO involves:
1. Compiling with profiling enabled: cc -h profile_generate -o my_program my_program.c
2. Running the program to generate profile data.
3. Recompiling with the profile data: cc -h profile_use -o my_program my_program.c
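Putting the three steps together, one complete PGO pass might look like the sketch below; the source file my_program.c and the srun launch line are illustrative assumptions.
# 1. Build with profiling instrumentation
cc -h profile_generate -o my_program my_program.c
# 2. Run a representative workload to collect profile data
srun -n 1 ./my_program
# 3. Rebuild, letting the compiler use the collected profile
cc -h profile_use -o my_program my_program.c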
R
Resource Constraints: Specify memory, CPU, and other resource constraints for job scheduling.
SLURM Command:
sbatch --mem=4G --cpus-per-task=8 my_job_script.sh
PBS Command:
qsub -l mem=4G,ncpus=8 my_job_script.sh
S
SHMEM (Cray SHMEM): A library for one-sided communication in parallel applications.
Module:
module load cray-shmem
SLURM (Simple Linux Utility for Resource Management): A job scheduler used on many Cray systems.
Commands:
sbatch: Submit a job script.
squeue: Check the status of jobs.
scancel: Cancel a job.
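For example, a minimal batch script might look like the following; the job name, node and task counts, time limit, and executable name are placeholders to replace with your own values.
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

srun -n 4 ./my_app    # launch the application on the allocated resources
Submit the script with sbatch my_job_script.sh and monitor it with squeue.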
T
TensorFlow: An open-source platform for machine learning.
Module:
module load tensorflow
U
User Access Node (UAN): A critical component that acts as a “gateway” to the supercomputer. It is a dedicated server or node where you log in to interact with the system, submit jobs, manage files, and perform development tasks. High-performance compute nodes (the powerful “brain” of the supercomputer) are not directly accessed for these activities; instead, you use the UAN to prepare your work.
UAN Key Features:
Development Environment: The UAN provides tools for coding, compiling, debugging, and optimizing your programs. It is where you set up applications before running them on the compute nodes.
Job Submission: From the UAN, submit workloads (such as simulation or analysis tasks) to the job scheduler, which then runs tasks on the compute nodes.
File Management: The UAN is where you can access and manage files stored in the system.
Access Point: Users connect to the UAN through protocols like SSH (Secure Shell) to securely log in and work on the supercomputer.
The UAN serves as the central point for interaction with the larger computing system.
V
Vectorization: Techniques for optimizing code to take advantage of vector instructions.
Compiler Flags:
-h vector3
Directives:
#pragma ivdep
W
Workload Managers: Software that orchestrates the execution of jobs in a high-performance computing environment. Examples include SLURM and PBS.