HPE Cray Programming Environment User Guide: HPCM

About

The HPE Cray Programming Environment User Guide: HPCM includes programming environment and user access concepts, configuration information, component overviews, and relevant examples.

This publication is intended for software developers, engineers, scientists, and other users of the HPE Cray Programming Environment (CPE).

Cray Programming Environment (CPE) Components

The CPE provides tools designed to maximize developer productivity, application scalability, and code performance. It includes compilers, analyzers, optimized libraries, and debuggers. It also provides a variety of parallel programming models that allow users to make appropriate choices, based on the nature of existing and new applications.

CPE components include:

  • HPE Cray Compiling Environment (CCE) - CCE consists of compilers that perform code analysis during compilation and automatically generate highly optimized code. Compilers support numerous command-line arguments to provide manual control over compiler behavior and optimization. Supported languages include Fortran, C and C++, and UPC (Unified Parallel C).

  • HPE Cray Scientific and Math Libraries (CSML) - CSML is a set of high performance libraries that provide portability for scientific applications by implementing APIs for arrays (NetCDF), dense linear algebra (BLAS, LAPACK, ScaLAPACK) and fast Fourier transforms (FFTW).

  • HPE Cray Message Passing Toolkit (CMPT) - CMPT is a collection of software libraries, including the Message Passing Interface (MPI), that enable data transfers between nodes running in parallel applications. CMPT libraries support practical, portable, efficient, and flexible mechanisms for performing data transfers between parallel processes.

  • HPE Cray Environment Setup and Compiling Support (CENV) - CENV provides libraries that support code compilation and setting up the development environment. It comprises compiler drivers, and the CPE API, which is a software package used for building module files.

  • HPE Cray Performance Measurement & Analysis Tools (CPMAT) - CPMAT provides tools to analyze the performance and behavior of programs that are run on Cray systems, and a Performance API (PAPI).

  • HPE Cray Debugging Support Tools (CDST) - CDST provides debugging tools, including gdb4hpc, Valgrind4hpc and Sanitizers4hpc.

Cray Environment setup and compiling support

HPE Cray Environment (CENV) provides libraries that support code compilation and development environment setup. It comprises compiler drivers, utilities, and the CPE API (craype-api), which is a software package used for building module files.

Modules and modulefiles

CPE Environment Modules enables users to modify their environment dynamically by using modulefiles. The module command provides a user interface to the Modules package. The module command system interprets modulefiles, which contain Tool Command Language (Tcl) code, and dynamically modifies shell environment variables such as PATH and MANPATH.

Sites can alternatively enable Lmod to handle modules. Both module systems use the same module names and syntax shown in command-line examples.

Tip: Use either Environment Modules or Lmod on a per-system basis. The systems are mutually exclusive and cannot both run on the same system.

The /etc/cray-pe.d/cray-pe-configuration.sh and /etc/cray-pe.d/cray-pe-configuration.csh configuration files allow sites to customize the default environment. The system administrator can also create modulefiles for a product set to support user-specific needs. For more information about the Environment Modules software package, see the module(1) and modulefile(4) manpages.

Lmod

In addition to the default Environment Modules system, CPE offers support for Lmod as an alternative module management system. Lmod is a Lua-based module system that loads and unloads modulefiles, handles path variables, and manages library and header files. The CPE implementation of Lmod is hierarchical, managing module dependencies and ensuring any module a user has access to is compatible with other loaded modules. Features include:

  • Lmod is set up to automatically load a default set of modules. The default set includes one each of compiler, network, CPU, and MPI modules. Users may choose to load different modules. However, it is recommended that, at minimum, a compiler, network, CPU, and an MPI module be loaded, to ensure optimal assistance from Lmod.

  • Lmod supports loading multiple different compiler modules concurrently by loading a dominant core-compiler module and one or more supporting mixed-compiler modules. See Lmod Mixed Compiler Support for details.

  • Lmod uses “families” of modules to flag circular conflicts, which is most apparent when module details are displayed through module show and when users attempt to load conflicting modules.

  • Environment Modules and Lmod modules use the same names, so all command examples work similarly.

Tip: Environment Modules and Lmod are mutually exclusive, and both cannot run on the same system. Contact the system administrator about setting Lmod as the default module management system.

For more Lmod information, see The User Guide for Lmod.

About Cray Compiling Environment

Module: PrgEnv-cray

Command: ftn, cc, CC

Compiler-specific manpages: crayftn(1), craycc(1), crayCC(1) - available only when the compiler module is loaded

Online help: ftn -help, cc -help, CC -help

Documentation: See Additional Resources

To use the Cray Compiling Environment (CCE), load the PrgEnv-cray module.

user@hostname> module load PrgEnv-cray

CCE provides Fortran, C and C++ compilers that perform substantial analysis during compilation and automatically generate highly optimized code. The compilers support numerous command-line arguments that enable manual control over compiler behavior and optimization. For more information about the Cray Fortran, C, and C++ compiler command-line arguments, see the crayftn(1), craycc(1), and crayCC(1) manpages, respectively.

PrgEnv modules provide wrappers (cc, CC, ftn) for both CCE and third-party compiler drivers. These wrappers call the correct compiler with appropriate options to build and link applications with the relevant libraries, as required by the loaded modules. (Only dynamic linking is supported.) These wrappers replace direct calls to compiler drivers in Makefiles and build scripts.
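
For example, a minimal sketch of a build that uses the wrappers in place of direct compiler calls (file names are illustrative):

user@hostname> cc -c io_utils.c
user@hostname> ftn -c solver.f90
user@hostname> ftn io_utils.o solver.o -o app.x

The same commands work unchanged under any PrgEnv module, because the wrappers select the underlying compiler and libraries from the loaded modules.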

One of the most useful compiler features is the ability to generate annotated loopmark listings showing which optimizations were performed and where. Together with compiler messages, these listings can help locate areas in the code that compile without error but are not fully optimized. For more detailed information about generating and reading loopmark listings, see the crayftn(1), craycc(1), and crayCC(1) manpages, the Cray Fortran Reference Manual (S-3901), and the HPE Performance Analysis Tools User Guide (S-8014). See Additional Resources for direct links to these publications.
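
As a hedged sketch, the listing option described in crayftn(1) generates a loopmark listing alongside compilation (the exact option and listing-file name may vary by CCE release; the source file name is illustrative):

user@hostname> ftn -h list=a solver.f90 -o solver.x
user@hostname> less solver.lst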

In many cases, code that is not properly optimizing can be corrected without substantial recoding by applying the right pragmas or directives. For more information about compiler pragmas and directives, see the intro_directives(1) man page.

Third-Party compilers

CPE supports the following third-party compilers:

  • AOCC

  • AMD ROCm

  • Intel

  • GNU

  • NVIDIA

The compilers and their respective dependencies, including wrappers and mappings (for example, mapping cc to gcc in PrgEnv-gnu), are loaded using the module load <modulename> command. For example:

user@hostname> module load PrgEnv-gnu

About AOCC

Module: PrgEnv-aocc

Command: ftn, cc, CC

Documentation: AOCC Documentation

CPE enables, but does not bundle, the AMD Optimizing C/C++ Compiler (AOCC). CPE provides a bundled package of support libraries to install into the programming environment to enable AOCC and CPE utilities such as debuggers and performance tools.

  • If not available on the system, contact a system administrator to install AOCC and the support bundle.

  • To use AOCC, load the PrgEnv-aocc module:

    user@hostname> module load PrgEnv-aocc
    

About the AMD ROCm compiler

Module: PrgEnv-amd

Command: ftn, cc, CC

Documentation: https://rocmdocs.amd.com/en/latest/

CPE enables, but does not bundle, the AMD ROCm Compiler. CPE provides a bundled package of support libraries to install into the programming environment to enable this compiler and CPE utilities, such as debuggers and performance tools. Contact your system administrator to install ROCm and the support bundle if these resources are not available on the system.

The “amd” module provided by CPE is loaded automatically when PrgEnv-amd is loaded. This module supports the AMD ROCm C/C++/Fortran compilers and enables access to AMD-compatible libraries.

The “rocm” toolkit provided by CPE is optional and must be loaded by the user. The ROCm toolkit module extends the AMD compiler module to enable support for ROC Profiler, ROC Tracer, HIP, and ROCm. The ROCm module enables access to AMD accelerators for all programming environments.

Load the PrgEnv-amd module to use AMD:

user@hostname> module load PrgEnv-amd

About the Intel compiler

Module: PrgEnv-intel

Command: ftn, cc, CC

Documentation: Intel oneAPI Website

CPE enables, but does not bundle, the Intel® oneAPI for Linux compiler. CPE provides a bundled package of support libraries to install into the programming environment to enable the Intel compiler and CPE utilities such as debuggers and performance tools.

Intel oneAPI includes the “classic” compilers (icc, icpc, and ifort) as well as new versions of each (icx, icpx, and ifx). Because ifx is an experimental Fortran compiler, Intel encourages users to stay with the “classic” Fortran compiler (ifort) along with the new C/C++ compilers (icx and icpx).

Because all of the Intel compilers come in the same package, the PrgEnv-intel meta-module now has three options for the “intel” sub-module. They are:

  1. intel ( icx, icpx, ifort ) - PrgEnv-intel defaults to this option because it is Intel’s recommendation

  2. intel-classic ( icc, icpc, ifort ) - All “classic” Intel compilers

  3. intel-oneapi ( icx, icpx, ifx ) - All “new” Intel compilers, where ifx is “beta” per Intel

    • To use the Intel DPC++/C++ compiler, load the PrgEnv-intel module:

      user@hostname> module load PrgEnv-intel
      
    • To use the Intel C++ Compiler Classic instead, switch to the intel-classic module:

      user@hostname> module swap intel intel-classic
      

About the NVIDIA compiler

Module: PrgEnv-nvidia

Command: ftn, cc, CC

Documentation: NVIDIA HPC Compilers User’s Guide

CPE enables, but does not bundle, the Nvidia Compilers. CPE provides a bundled package of support libraries to install into the programming environment to enable this compiler and CPE utilities, such as debuggers and performance tools. Contact your system administrator to install Nvidia and the support bundle if these resources are not available on the system.

The “nvidia” module provided by CPE is loaded automatically when PrgEnv-nvidia is loaded. This module supports the Nvidia C/C++/Fortran compilers and enables access to Nvidia-compatible libraries.

The “cuda” toolkit module extends the Nvidia compiler module to enable support for the NVIDIA CUDA compiler, libraries, debuggers, profilers, and other utilities for developing applications targeting NVIDIA GPUs. This module is required to interface with NVIDIA accelerators for all programming environments.

Load the PrgEnv-nvidia module to use Nvidia:

user@hostname> module load PrgEnv-nvidia

About GNU

Module: PrgEnv-gnu

Command: ftn, cc, CC

Compiler-specific manpages: gcc(1), gfortran(1), g++(1) - available only when the compiler module is loaded.

Documentation: GCC Online Documentation

CPE bundles and enables the open-source GNU Compiler Collection (GCC).

  • To use GCC, load the PrgEnv-gnu module:

    user@hostname> module load PrgEnv-gnu
    

Programming languages

The following programming languages are bundled with and supported by the CCE:

  • Fortran - The CCE Fortran compiler supports the Fortran 2018 standard (ISO/IEC 1539:2018), with some exceptions and deferred features as noted elsewhere.

    • Documentation is available in HPE Cray Fortran Reference Manual (S-3901) and also in manpages, beginning with the crayftn(1) manpage. Where information in the manuals differs from the manpage, the information in the manpage is presumed to be more current.

    • For the current direct link to the reference manual, see Additional Resources.

  • C/C++ - The default C/C++ compiler is based on Clang/LLVM.

    • Supports Unified Parallel C (UPC), an extension of the C programming language designed for high performance computing on large-scale parallel systems.

    • Documentation is provided in Clang Documentation, clang(1) man page, and HPE Cray Clang C and C++ Quick Reference (S-2179).

    • For the current direct link to the quick reference, see Additional Resources.

The following third-party programming languages are bundled with the Programming Environment:

  • Python

  • R

HPE Cray Scientific and Math Libraries

Modules: cray-libsci, cray-libsci_acc, cray-fftw, cray-hdf5, cray-hdf5-parallel, cray-netcdf, cray-netcdf-hdf5parallel

Manpages: intro_libsci(3s), intro_libsci_acc(3s), intro_fftw3(3s) - available only when the associated module is loaded.

The HPE Cray Scientific and Math Libraries (CSML) are a collection of numerical routines optimized for best performance on HPE Cray Supercomputer systems. These libraries satisfy dependencies for many commonly used applications on HPE Cray systems for a wide variety of domains. If the module for a CSML package is loaded, all relevant headers and libraries for these packages are added to the compile and link lines of the cc, ftn, and CC CPE drivers. You must load the cray-hdf5 module (a dependency) before loading the cray-netcdf module.
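
For example, a minimal sketch of loading the NetCDF stack so the wrappers add the headers and libraries automatically (the source file name is illustrative):

user@hostname> module load cray-hdf5
user@hostname> module load cray-netcdf
user@hostname> ftn write_data.f90 -o write_data.x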

The CSML collection contains the following scientific libraries:

  • BLAS (Basic Linear Algebra Subprograms)

  • CBLAS (Collection of wrappers providing a C interface to the Fortran BLAS library)

  • LAPACK (Linear Algebra PACKage)

  • LAPACKE (C interfaces to LAPACK Routines)

  • BLACS (Basic Linear Algebra Communication Subprograms)

  • ScaLAPACK (Scalable Linear Algebra PACKage)

  • FFTW3 (the Fastest Fourier Transforms in the West, release 3)

  • HDF5 (Hierarchical Data Format)

  • NetCDF (Network Common Data Format)

Cray Message Passing Toolkit

Module: cray-mpich

Manpage: intro_mpi(3) - available only when the associated module is loaded.

Website: http://www.mpi-forum.org/

Cray Message Passing Toolkit (CMPT) is a collection of message-passing libraries that aid in parallel programming.

MPI is a widely used parallel programming model that establishes a practical, portable, efficient, and flexible standard for passing messages between ranks in parallel processes. Cray MPI is derived from Argonne National Laboratory MPICH and implements the MPI-3.1 standard as documented by the MPI Forum in MPI: A Message Passing Interface Standard, Version 3.1.

MPI supports both OpenFabrics Interfaces (OFI) and Unified Communication X (UCX) network modules, with OFI typically the default. The two versions are binary compatible, so recompiling or relinking an application is not necessary when switching between them. Beyond general performance differences, where one module might perform better than the other for a given application, the OFI version has a known limitation when establishing initial connections for applications that use an all-to-all communication pattern or a many-to-one pattern at very high scale. In these situations, unload the OFI modules, load the UCX modules, rerun the application, and compare its performance with the OFI run.
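
For example, to try the UCX version of an already-built application (the same module swaps are shown in the run examples later in this guide):

user@hostname> module swap craype-network-ofi craype-network-ucx
user@hostname> module swap cray-mpich cray-mpich-ucx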

Support for MPI varies depending on system hardware. To see which functions and environment variables the system supports, check the intro_mpi(3) manpage.

Debugger support tools

CPE includes the following debugging tools:

  • Gdb4hpc - A command-line interactive parallel debugger that allows debugging of applications at scale. A good all-purpose debugger for tracking down bugs, analyzing hangs, and determining the causes of crashes.

  • Valgrind4hpc - A parallel debugging tool used to detect memory leaks and parallel application errors.

  • Sanitizers4hpc - A parallel debugging tool used to detect memory access or leak issues at runtime using information from LLVM sanitizers.

  • Stack Trace Analysis Tool (STAT) - A single merged stack backtrace tool used to analyze application behavior at the function level. Helps trace down the cause of crashes.

  • Abnormal Termination Processing (ATP) - A scalable core file generation and analysis tool for analyzing crashes, with a selection algorithm to determine which core files to dump. ATP helps to determine the cause of crashes.

  • Cray Comparative Debugger (CCDB) - Not a traditional debugger, but rather a tool to run and step through two versions of the same application side-by-side to help determine where they diverge.

All CPE debugger tools support C/C++, Fortran, and Unified Parallel C (UPC).

Tool infrastructure

CPE provides several tools for tool developers to enhance their own debuggers for use with the CPE:

  • Common Tools Interface (CTI) - Offers a simple, WLM-agnostic API to support tools across all HPE Cray Supercomputing systems.

  • Multicast Reduction Network (MRNET) - Provides a scalable communication infrastructure for tool libraries.

  • Dyninst - Provides dynamic instrumentation libraries.

HPE Cray Performance Measurement and Analysis Tools

The HPE Cray Performance Measurement and Analysis Tools (CPMAT) suite reduces the time needed to port and tune applications. It provides an integrated infrastructure for measurement, analysis, and visualization of computation, communication, I/O, and memory utilization to help users optimize programs for faster execution and more efficient computing resource usage.

The toolset allows developers to perform sampling, profiling, and tracing experiments on executables, extracting information at the program, function, loop, and line level. It supports programs written in Fortran, C/C++ (including UPC), and HIP, using MPI, OpenMP, CUDA, or a combination of these programming models. It also supports profiling applications built with CCE, AMD, and GNU compilers.

Performance analysis consists of three basic steps:

  1. Instrument the program to specify what kind of data to collect under what conditions.

  2. Execute the instrumented executable to generate and capture data.

  3. Analyze the resulting data.

Three interfaces exist:

  • perftools-lite-* - Simple interface that produces reports to stdout. There are four perftools-lite submodules:

    • perftools-lite - Lowest overhead sampling experiment identifies key program bottlenecks.

    • perftools-lite-events - Produces a summarized trace; a good tool for detailed MPI statistics, including synchronization overhead.

    • perftools-lite-loops - Provides loop work estimates (must be used with CCE).

    • perftools-lite-hbm - Reports memory traffic information (CCE, x86-64 systems only). See the perftools-lite(4) manpage for details.

  • perftools - Advanced interface that provides full-featured data collection and analysis capability, including full traces with timeline displays; a workflow sketch follows this list. It includes the following components:

    • pat_build - Utility instruments programs for performance data collection.

    • pat_report - After using pat_build to instrument the program, setting runtime environment variables, and executing the program, use pat_report to generate text reports from the resulting data and to export the data for use in other applications. See the pat_report(1) manpage for details.

    • CrayPat runtime library - Collects specified performance data during program execution. See the intro_craypat(1) manpage for details.

  • pat_run - Launches a dynamically linked program instrumented for performance analysis. After a successful run, the collected data can be explored further with the pat_report and Cray Apprentice2 tools. See the pat_run(1) manpage for details.
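
A hedged sketch of the pat_build/pat_report workflow described above, using the -u (trace user functions) option documented in pat_build(1); the program name, launch options, and experiment-directory name are illustrative:

$ module load perftools
$ make program                        # rebuild with perftools loaded
$ pat_build -u program                # writes the instrumented binary program+pat
$ srun -n 64 ./program+pat            # running it produces an experiment directory
$ pat_report program+pat+12345-64s    # generate text reports from the collected data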

Also included:

  • PAPI - The PAPI library, from the Innovative Computing Laboratory at the University of Tennessee, Knoxville, is distributed with the performance tools. PAPI allows applications or custom tools to interface with hardware performance counters made available by the processor, network, or accelerator vendor. Performance tools components use PAPI internally to collect CPU, GPU, and network performance counters for derived metrics, observations, and performance reporting. A simplified user interface is provided for accessing counters; it does not require the source code modification that using PAPI directly does.

  • Cray Apprentice2 - An interactive X Window System tool for visualizing and manipulating performance analysis data captured during program execution.

  • pat_view - Aggregates and presents multiple sampling experiments for program scaling analysis. See the pat_view(1) manpage for more information.

  • Reveal - Extends performance tools technology by combining performance statistics and program source code visualization with compiler optimization feedback to better identify and exploit parallelism and to pinpoint memory bandwidth sensitivities in an application. Reveal lets users navigate source code to highlighted dependencies or bottlenecks during optimization. Using the program library provided by CCE and the collected performance data, users can navigate the source code to identify which high-level loops could benefit from OpenMP parallelism or from loop-level optimizations such as exposing vector parallelism. Reveal provides dependency and variable scoping information for those loops and assists the user with creating parallel directives.

Use performance tools to:

  • Identify bottlenecks

  • Find load-balance and synchronization issues

  • Find communication overhead issues

  • Identify loops for parallelization

  • Map memory bandwidth utilization

  • Optimize vectorization within application code

  • Collect application energy consumption information

  • Collect scaling information for application code

  • Interpret performance data

More information is available in the HPE Performance Analysis Tools User Guide (S-8014). For the current direct link to this publication, see Additional Resources.

About CPE Deep Learning Plugin

Modules: craype-dl-plugin-py3, craype-dl-plugin-py2

Commands: import dl_comm as cdl, help(cdl), help(cdl.gradients)

Manpage: intro_dl_plugin(3)

The CPE Deep Learning Plugin (CPE DL Plugin) is a highly tuned communication layer for performing distributed deep learning training. The CPE DL Plugin provides a high performance gradient-averaging operation and routines to facilitate process identification, job size determination, and broadcasting of initial weights and biases. The routines can be accessed through the plugin’s C or Python APIs. The Python API provides support for TensorFlow, PyTorch, Keras, and NumPy.

For more information about the CPE DL Plugin routines, see the intro_dl_plugin(3) manpage.
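
As an illustrative sketch based on the module and commands listed above, the Python API can be explored interactively (the availability of python3 on the login node is an assumption):

user@hostname> module load craype-dl-plugin-py3
user@hostname> python3
>>> import dl_comm as cdl
>>> help(cdl)
>>> help(cdl.gradients)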

Configure the development environment with modules

Each modulefile contains information needed to configure the shell for an application. After the Modules package is initialized, the environment can be modified on a per-module basis using the module command. Typically, modulefiles instruct the module command to alter or set shell environment variables, such as $PATH, $MANPATH, and so forth. Multiple users can share modulefiles on a system, and users can create their own to supplement or replace the shared modulefiles.

Add or remove modulefiles from the current environment as needed. The environment changes contained in a modulefile also can be summarized through the module command. If no arguments are given, a summary of the module usage and subcommands are shown. The subcommand and its associated arguments describe the action for the module command to take.

Unless noted otherwise, the commands described in this section work for both the default module system and Lmod. Also, modules and modulefiles listed in the examples in this section are for demonstration purposes only. Actual versions may differ from versions on the current system.

Getting started

After logging in, load the needed programming environment. For example:

user@hostname> module load PrgEnv-<compiler>

Module versions are for example purposes only and may vary from those on the system.

Listing loaded modules

To list loaded modules, enter:

user@hostname> module list
Currently Loaded Modulefiles:
1) craype-x86-rome           4) perftools-base/23.12.0     7) cray-mpich/8.1.28
2) craype-network-ofi        5) cce/17.0.0                 8) cray-libsci/23.12.5
3) libfabric/1.13.1          6) craype/2.7.30              9) PrgEnv-cray/8.4.0

Module versions are for example purposes only and may vary from those on the system.

Listing available programming modules

To list available programming modules, enter:

user@hostname> module avail PrgEnv
--------------------------- /opt/cray/pe/modulefiles ---------------------------
PrgEnv-amd/8.6.0      (D)   PrgEnv-cray/8.6.0    (L,D)   PrgEnv-intel/8.6.0   (D)
PrgEnv-aocc/8.6.0     (D)   PrgEnv-gnu-amd/8.6.0 (D)     PrgEnv-nvidia/8.6.0  (D)
PrgEnv-cray-amd/8.6.0 (D)   PrgEnv-gnu/8.6.0     (D)

Module versions are for example purposes only and may vary from those on the system.

Listing available modules

To list all available modules, enter:

user@hostname> module avail

To list all available modules of a certain type (for example module avail cce), enter:

user@hostname> module avail cce
--------------------------- opt/cray/pe/lmod/modulefiles/mix_compilers ---------------------------
  cce-mixed/16.0.1    cce-mixed/16.0.0    cce-mixed/17.0.0 (D)

------------------------------ /opt/cray/pe/lmod/modulefiles/core --------------------------------
  cce/16.0.0    cce/16.0.1    cce/17.0.0 (L,D)

Load modules

To load the default version of a module, enter, for example:

user@hostname> module load cce

To load a specific version of a module, enter, for example:

user@hostname> module load cce/<version>

Unloading modules

To remove a module, enter, for example:

user@hostname> module unload cray-libsci

Changing module versions

To swap out the default module for a specific version, enter, for example:

user@hostname> module switch cce cce/<version>

Changing module versions using the cpe module

The cpe module specifies all CPE modules associated with a given monthly CPE release. The module name is cpe/<date>, where <date> is the release date in the format yy.mm. The purpose of cpe is to enable users to switch currently loaded CPE modules to the version provided in a given monthly release by using a single command. All subsequently loaded modules treat the version associated with cpe/<date> as the default version.

For example, if cray-mpich/8.1.3 and cray-libsci/21.09.1.1 are included in cpe/21.09 (September 2021) and the user currently has cray-mpich/8.1.2 and cray-libsci/20.12.1.2 loaded, switch both to the September 2021 versions by entering:

user@hostname> module load cpe/21.09

Unloading cpe does not restore the previously loaded module versions and, in fact, has no effect on currently loaded modules. To compensate for this limitation, the cpe directory contains restore_system_defaults scripts:

user@hostname> source /opt/cray/pe/cpe/21.09/restore_system_defaults.sh

Changing programming environments

Use module swap to change between programming environments. For example:

user@hostname> module swap PrgEnv-cray PrgEnv-gnu

Displaying module information

To display information about module conflicts and links, enter, for example:

user@hostname> module show perftools
--------------------------------------------------------------------------------------------------------------
   /opt/cray/pe/lmod/modulefiles/perftools/23.05.0/perftools.lua:
--------------------------------------------------------------------------------------------------------------
prereq("perftools-base")
family("perftools")
help([[
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
This instrumentation module enables the full functionality of CrayPat, which
includes a wealth of performance measurement, analysis and presentation options.
Data collection is controlled through pat_build and the CrayPat runtime environment
variables. CrayPat supports sampling experiments, which interrupt the program at
specified intervals or when a specified counter overflows, and tracing experiments,
which count some event such as the number of times a specific function is executed.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

]])
setenv("CRAYPAT_COMPILER_OPTIONS","1")

user@hostname> module show PrgEnv-cray
-------------------------------------------------------------------
family("PrgEnv")
help([[    The PrgEnv-cray modulefile loads the Cray Programming Environment,
    which includes the Cray Compiling Environment (CCE).
    This modulefile defines the system paths and environment variables
    needed to build an application using CCE for supported
    Cray systems.

    This module loads the following modules:
     - cray-dsmml
     - cray-mpich
     - cray-libsci

    NOTE: This list is defined in /etc/cray-pe.d/cray-pe-configuration.sh.]])
whatis("Enables the Programming Environment using the cray compilers.")
setenv("PE_ENV","CRAY")
load("craype")
load("cray-dsmml")
load("cray-mpich")
load("cray-libsci")

Swapping other programming environment components

Switching the module environment does not completely change the run-time environment for products that contain dynamically linked libraries, such as MPI, because the runtime linker caches dynamic libraries as specified by /etc/ld.so.conf. To use a nondefault version of a dynamic library at run time, prepend CRAY_LD_LIBRARY_PATH to LD_LIBRARY_PATH.

To revert the environment to an earlier version of cray-mpich 8.0, for example, enter:

user@hostname> module swap cray-mpich/8.0.5.4 cray-mpich/8.0.3
user@hostname> export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

Change the run-time linking search path behavior by using the PE_LD_LIBRARY_PATH environment variable. For example, enter:

user@hostname> export PE_LD_LIBRARY_PATH=system
user@hostname> module swap cray-mpich/8.0.5.4 cray-mpich/8.0.3

If the PE_LD_LIBRARY_PATH environment variable is set to system, CPE modules directly interact with LD_LIBRARY_PATH. Otherwise, if the variable is not set or set to any other value, CPE modules retain the default behavior, leaving LD_LIBRARY_PATH under user control.

Lmod mixed compiler support

CPE Lmod supports loading multiple different compiler modules concurrently by loading a dominant core-compiler module and one or more supporting mixed-compiler modules. Core-compiler modules are located in the core directory. Loading a core-compiler module sets Lmod hierarchy variables. After the core-compiler module is loaded, module avail lists the mixed-compiler modules available for loading.

The CPE Lmod mixed compiler support provides the flexibility for users to choose which compiler modules (including user-generated compiler modules) to mix together; however, only compiler modules released by CPE are supported. Because users can generate their own compiler modules, CPE cannot guarantee that all mixed-compiler modules shown for core-compiler modules are compatible.

Example:

Loading the CCE and GCC modules together with CCE as the dominant core compiler and GCC as the supporting mixed compiler (Note that command output text is abbreviated, and module versions are generalized.):

user@hostname> module load PrgEnv-cray
user@hostname> module avail
...
----- /opt/cray/pe/lmod/modulefiles/mix_compilers ----
   ...
    gcc-mixed/[version]
   ...

user@hostname> module load gcc-mixed
user@hostname> module list

Currently Loaded Modules:
  #)cce/[version] #)PrgEnv-cray/[version] #)gcc-mixed/[version]

Compiling an application

Build an MPI application

PREREQUISITES

  • cray-mpich must be loaded.

OBJECTIVE

Build an MPI application using Cray’s cc compiler driver.

PROCEDURE

  1. Verify the correct modules are loaded.

    hostname$ module list
    Currently Loaded Modulefiles:
    1) craype-x86-rome
    2) craype-network-ofi
    3) libfabric/1.13.1
    4) perftools-base/23.12.0
    5) cce/17.0.0
    6) craype/2.7.30
    7) cray-mpich/8.1.28
    8) cray-libsci/23.12.5
    9) PrgEnv-cray/8.4.0
    
  2. Change to the directory where the application is located.

    hostname$ cd /lus/<USERNAME>/
    
  3. Create an application.

    See Example MPI Program Source for a sample “Hello World” MPI application.

  4. Build the application.

    hostname$ cc mpi_hello.c -o mpi_hello.x
    

Example MPI program source

mpi_hello.c

/* MPI hello world example */
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv)
{
   int rank;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   printf("Hello from rank %d\n", rank);
   MPI_Finalize();
   return 0;
}

Running an application

You can control HPE Slingshot network resources on systems running Slurm or PBS Pro with PALS.

Running an application with Slurm in batch mode

PREREQUISITES

OBJECTIVE

This procedure creates a launch script and submits it as a job using Slurm.

PROCEDURE

  1. Load CPE modules:

    MPI: MPI modules are loaded by default.

  2. If using Cray MPICH, Slurm must use the cray_shasta MPI type. Determine the default MPI type for Slurm:

    user@hostname> scontrol show config | grep MpiDefault
    MpiDefault              = cray_shasta
    

    If the default Slurm MPI type is not cray_shasta, either add the --mpi=cray_shasta option to each srun command, or set the SLURM_MPI_TYPE environment variable to cray_shasta:

    user@hostname> export SLURM_MPI_TYPE=cray_shasta
    
  3. Change to the directory where the application is located:

    user@hostname> cd /lus/<USERNAME>/
    
  4. Create a launch script.

    IMPORTANT: If your login shell does not match the batch script shell (for example, your login shell is tcsh, but the batch script uses bash), the module environment might not be initialized. To fix this issue, add -l to the first line of the batch script (for example, #!/bin/bash -l).

    To launch the application with sbatch, add srun to the launch script.

    MPI: This example is specific to the “Hello World” MPI application running on four nodes (see Example MPI program source).

    #!/bin/bash
    #SBATCH -N4
    #SBATCH --ntasks-per-node=1
    ulimit -s unlimited ## in case not set by default
    srun -N4 --ntasks-per-node=1 ./mpi_hello.x
    exit 0
    
  5. Assign permissions to the launch.sh script to ensure it is executable:

    user@hostname> chmod u+x launch.sh
    
  6. Launch the batch script:

    user@hostname> sbatch launch.sh
    Submitted batch job 1065736
    
  7. Check job output:

    MPI:

    user@hostname> cat slurm-1065736.out
    Hello from rank 1
    Hello from rank 3
    Hello from rank 0
    Hello from rank 2
    

    Troubleshooting: Add ldd to the job script to ensure that the correct libraries are being loaded.

    user@hostname> ldd ./mpi_hello.x
    

Running an application with Slurm in interactive mode

PREREQUISITES

OBJECTIVE

This procedure launches an application using the Slurm srun command.

PROCEDURE

  1. Load CPE modules.

    Note: MPI modules are loaded by default.

  2. Determine the default MPI type for Slurm. Slurm must use the cray_shasta MPI type if you are using Cray MPICH.

    user@hostname> scontrol show config | grep MpiDefault
    MpiDefault              = cray_shasta
    

    If the default Slurm MPI type is not cray_shasta, either add the --mpi=cray_shasta option to each srun command, or set the SLURM_MPI_TYPE environment variable to cray_shasta:

    user@hostname> export SLURM_MPI_TYPE=cray_shasta
    
  3. Change to the directory where the application is located:

    user@hostname> cd /lus/<USERNAME>/
    
  4. Execute the application with srun:

    MPI:

    user@hostname> srun -N<nodes> --ntasks-per-node=<number_tasks_per_node> ./<app_exe>
    

    Example using the Example MPI program source:

    user@hostname> srun -N4 --ntasks-per-node=1 ./mpi_hello.x
    Hello from rank 1
    Hello from rank 2
    Hello from rank 3
    Hello from rank 0
    

    Example using UCX instead of OFI:

    user@hostname> module swap craype-network-ofi craype-network-ucx
    user@hostname> module swap cray-mpich cray-mpich-ucx
    user@hostname> srun -N4 --ntasks-per-node=1 ./mpi_hello.x
    Hello from rank 1
    Hello from rank 2
    Hello from rank 3
    Hello from rank 0
    

Running an application with PBS Pro in batch mode

PREREQUISITES

  • PBS Pro workload manager must be installed and configured.

  • The application must be compiled (see Build an MPI application).

OBJECTIVE

This procedure creates a launch script and submits it as a PBS job using PALS.

IMPORTANT: Due to the way PALS is integrated with PBS, the job-specific temporary directory (TMPDIR) is only created on the head node of the job. This scenario can cause application failure if it tries to create temporary files or directories. To work around this problem, add export TMPDIR=/tmp to the job script before calling aprun or mpiexec.

PROCEDURE

  1. Change to the directory where the application is located:

    user@hostname> cd /lus/<USERNAME>
    
  2. Create a launch script launch.sh.

    IMPORTANT: If your login shell does not match the batch script shell (for example, your login shell is tcsh, but the batch script uses bash), the module environment might not be initialized. To fix this issue, add -l to the first line of the batch script (for example, #!/bin/bash -l).

    MPI: This example launch script is specific to the “Hello World” MPI application (see Example MPI program source) running on four nodes.

    #!/bin/bash
    #PBS -l walltime=00:00:30
    echo start job $(date)
    module load cray-pals
    echo "mpiexec hostname"
    mpiexec hostname
    echo "mpiexec -n 4 /lus/<USERNAME>/hello_mpi"
    mpiexec -n4 /lus/<USERNAME>/mpi_hello.x
    echo end job $(date)
    exit 0
    
  3. Assign the required permissions to the launch.sh script to ensure it is executable:

    user@hostname> chmod u+x launch.sh
    
  4. Launch the batch script:

    user@hostname> qsub -l select=4,place=scatter launch.sh
    
  5. Check job output:

    user@hostname> cat launch.sh.o426757
    Hello from rank 3
    Hello from rank 2
    Hello from rank 1
    Hello from rank 0
    

Running an application with PBS Pro in interactive mode

PREREQUISITES

  • PBS Pro workload manager must be installed and configured.

  • The application must be compiled (see Build an MPI application).

OBJECTIVE

This procedure interactively submits a job to PBS using the PALS mpiexec command.

PROCEDURE

  1. Initiate an interactive session:

    user@hostname> qsub -I
    qsub: waiting for job 4071.pbs-host to start
    qsub: job 4071.pbs-host ready
    user@hostname>
    
  2. Load the PrgEnv-cray, cray-pals, and cray-pmi modules:

    user@hostname> module load PrgEnv-cray; module load cray-pals; module load cray-pmi
    
  3. Acquire information about mpiexec:

    user@hostname> type mpiexec
    mpiexec is /opt/cray/pe/pals/<version>/bin/mpiexec
    
  4. Change to the directory where the application is located:

    user@hostname> cd /lus/<USERNAME>/
    
  5. Run the executable MPI program:

    user@hostname> mpiexec -n4 ./mpi_hello.x
    Hello from rank 1
    Hello from rank 2
    Hello from rank 3
    Hello from rank 0
    

    Example using UCX Instead of OFI:

    user@hostname> module swap craype-network-ofi craype-network-ucx
    
    Inactive Modules:
      1) cray-mpich
    
    user@hostname> module swap cray-mpich cray-mpich-ucx
    user@hostname> mpiexec -n4 ./mpi_hello.x
    Hello from rank 1
    Hello from rank 2
    Hello from rank 3
    Hello from rank 0
    

Debugging an application

Debug a hung application using gdb4hpc

PREREQUISITES

Before completing this procedure, make sure that the targeted application has been:

  • Compiled with -g to display debugging symbols. Compiling with -O0 to disable compiler optimizations is advised but not required.

  • Launched with srun.
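
For example, a sketch of satisfying these prerequisites with the sample program from Example MPI program source (node counts are illustrative):

$ cc -g -O0 mpi_hello.c -o mpi_hello.x
$ srun -N4 --ntasks-per-node=1 ./mpi_hello.x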

OBJECTIVE

The procedure details how to debug a hung application using gdb4hpc.

PROCEDURE

  1. Load the gdb4hpc module:

    $ module load gdb4hpc
    
  2. Launch gdb4hpc:

    $ gdb4hpc
    
  3. Attach to the application with gdb4hpc:

    a. Choose a process set handle. All process sets are represented as a named debugger variable. Debugger variables are prefixed with a $ in the form $<name> (for example, $app or $a). The name can be any variable name of your choosing; pick whatever is easiest to remember.

    b. Determine the application identifier. For Slurm applications launched with srun, this identifier is a jobid.stepid, which can be determined with commands such as squeue (for the jobid) and sstat (for the stepid); see the example after this list. By default, jobs have only one stepid, which starts at 0. For applications run with mpiexec, the pid of the mpiexec process can be supplied.

    c. Using the process set handle and application identifier information in the previous steps, use the attach command to attach onto the running application. In the following example, $a is the process set handle and 1840118.0 is the application identifier.

    $ dbg all> attach $a 1840118.0
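
    For step 3b, a hedged example of finding the job and step IDs under Slurm (the job ID is illustrative):

    $ squeue -u $USER         # note the JOBID column, for example 1840118
    $ sstat -j 1840118        # lists steps for the job; the first step is 1840118.0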
    
  4. Conduct a traditional parallel debugging session. gdb4hpc commands include:

    • backtrace - Displays stack frames. Pass in an argument to limit the number of stack frames displayed (such as backtrace -5). See help backtrace.

    • frame - Displays only the current stack frame.

    • up - Moves the current stack frame up. Pass in an argument to move up the specified number of stack frames (up -3).

    • down - Moves the current stack frame down. Pass in an argument to move down the specified number of stack frames (down -5).

    • watchpoint - Sets an access or write watchpoint.

    • assign - Assigns a debugger convenience variable or application variable (see help assign for more details).

    • gdbmode - Drops directly into the gdb interpreter (see help gdbmode for more details).

    • kill - Kills the application with SIGKILL and ends the debug session.

    • release - Detaches and resumes the application. Ends the debug session.

    • quit - Exits gdb4hpc and releases the application.

  5. Fix any pinpointed bugs, recompile, and verify that the fix works.

    Tips:

    • Prevent printing of backtrace information for all ranks if you are attaching an application:

      $ dbg all> set print entry-frame false
      
    • Set the value to true to re-enable display of entry frames.

    • While gdb4hpc is running, use the help command to get more information about command usage. For example, to find information about the launch command, enter:

      $ help launch
      

Debug hung applications with STAT

PREREQUISITES

  • The STAT module must be loaded. To load the system default version:

    module load cray-stat
    

OBJECTIVE

This procedure debugs a hung application using the Cray Stack Trace Analysis Tool (STAT).

PROCEDURE

  1. Attach to the hung application with STAT through the workload manager job launcher, using either stat-gui or stat-cl. As seen in Figure 1, 19800 is the pid of the srun process.

    $ stat-cl 19800
    

Figure 1: Attach a Hung Application in STAT

STAT launches its daemons, gathers a stack trace from each process, and merges them into a prefix tree, as seen in Figure 2.

Figure 2: STAT Prefix Tree

  2. Analyze the merged backtrace using stat-view or stat-gui.

  3. Choose additional debugging steps based on the nature of the hang. Available stat-gui tools allow you to:

    • Narrow down to the trace steps exhibiting the bug by clicking Shortest Path (or Longest Path).

    • Adjust the sample size to look at the function level, or down to the function and line level, by clicking Sample.

    • Identify stack traces visited by the least or most number of tasks to identify outliers by clicking Least Task or Most Task.

    • Step through the temporal order of the stack trace by clicking Back TO or Forward TO. Right-click on tasks that have made the least progress to View Source code.

    • Gather X number of stack traces over time by clicking Sample Multiple.

    • Choose a subset of equivalent classes to feed to a debugger by clicking Eq C.

  4. Narrow down the search space to a specific function, and use a traditional debugger like gdb4hpc or valgrind4hpc (depending on the bug’s nature) to find the bug.

Debug crashed applications with ATP

PREREQUISITES

  • ATP module must be loaded (this is the default).

  • Target application must be dynamically or statically linked against the ATP support library.

  • Target application must be compiled with -g to keep debugging symbols.

  • (Optional) Include /opt/cray/pe/atp/libAtpDispatch.so in /etc/opt/slurm/plugstack.conf to enable the ATP Slurm plugin.

OBJECTIVE

This procedure debugs a crashed application using the Cray Abnormal Termination Processing (ATP) debugger. ATP is a first-line tool for diagnosing crashing applications.

If a parallel application crashes, ATP can produce a merged stack trace tree, providing an overview of the entire job state. ATP also selectively produces core files from crashing processes (or ranks). If further debugger support is required, the user may opt to rerun the job under the Cray parallel debugger, gdb4hpc.

PROCEDURE

  1. Load the ATP module if not already loaded:

    $ module load atp
    
  2. Set ATP_ENABLED=1.

  3. (Optional) Set environmental variables.

    With the exception of ATP_ENABLED, ATP does not usually need other environment variables. If necessary, runtime and output behavior can be modified using:

    • ATP_CONSOLE_OUTPUT: Default enabled. If enabled, ATP produces an overview of the crashed program and writes it to standard error. This overview provides rank information, the signal that caused the crash, and the crash location and assertion, if available.

    • ATP_HOLD_TIME: Default 0 minutes. If set to a nonzero value, ATP pauses for the specified number of minutes after detecting a crash. The job is held in a stopped state so a debugger, like GDB4hpc, can attach for further debugging.

    • ATP_MAX_ANALYSIS_TIME: Default 300 seconds. After sending a crash analysis request to the ATP backend process, the ATP frontend process waits the given number of seconds for crash analysis to occur. If this timeout expires, ATP assumes that the backend process was unsuccessful and continues job termination.

    • ATP_MAX_CORES: Default 20 core files. After crash analysis completion, ATP selects a subset of ranks from which to dump core files. The maximum number of core files is limited by this variable. If set to 0, core file dumping is disabled.

    • ATP_CORE_FILE_DIRECTORY: Default current directory. ATP writes core files from the selected subset of ranks to the given directory.

  4. Run the application.
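
    A minimal sketch combining steps 2 through 4 under Slurm (the node count and optional variable are illustrative):

    $ export ATP_ENABLED=1
    $ export ATP_MAX_CORES=8        # optional: limit the number of core files
    $ srun -N4 --ntasks-per-node=1 ./mpi_hello.x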

  5. Examine the ATP output. While handling a crash, ATP prints the following message:

    Application is crashing. ATP analysis proceeding...
    

    It proceeds to list each process and the reason for its failure, if crashed:

    Processes died with the following statuses:
      <0 > Reason: '<RUNNING>' Address: 0x7ffff7bab697 Assertion: ''
      <1 2 3 > Reason: 'SIGSEGV /SEGV_MAPERR' Address: 0x0 Assertion: ''
    

    In the example directly above, process 0 did not crash and was still running. Processes 1, 2, and 3 experienced segfaults.

  6. Gather the merged backtrace and core files. After displaying a summary of the job status, ATP writes selected core files and a graph visualization of the complete stack trace tree:

     Producing core dumps for ranks 3 1 2
     3 cores written in /cray/css/users/adangelo/stash/atp/tests
     View application merged backtrace tree with: stat-view atpMergedBT.dot
     You may need to: module load cray-stat
    

    By default, ATP writes the files atpMergedBT.dot and atpMergedBT_line.dot in the current working directory. atpMergedBT.dot is a function-level view of the stack trace tree, and atpMergedBT_line.dot is a source-line level view of the stack trace tree. Core files can be analyzed using the GNU Debugger, gdb.

  7. Examine those files with stat-view or gdb4hpc:

    $ module load cray-stat
    $ stat-view atpMergedBT_line.dot
    
  8. Fix any pinpointed bugs, recompile, and verify that the fixes work.

Debug crashing applications with gdb4hpc

PREREQUISITES

The targeted application must be compiled with -g -O0 to keep debugging symbols and disable compiler optimizations.

OBJECTIVE

This procedure debugs a crashing application using gdb4hpc.

PROCEDURE

  1. Load the gdb4hpc module:

    $ module load gdb4hpc
    
  2. Before launching the application:

    a. Determine the process set handle. Choose a name that is easy to remember, such as $app or $a. Array syntax notation specifies the number of processing elements, equivalent to the Slurm srun -n option.

    b. Determine additional WLM-specific settings.

    c. Determine any additional arguments. See help launch for available launcher arguments.

  3. Launch the application under gdb4hpc control:

    $ launch $a{<number of inferiors/ranks>} [--launcher-args="<optional WLM specific settings>"] [<optional_launch_args>] <path_to_executable>
    

    After the launch is complete, the initial entry point is displayed. Note the process set notation of the application (for example, {0..1023}), which lists the process elements in the set.

  4. Conduct a traditional parallel debugging session. gdb4hpc commands include:

    • backtrace - Displays stack frames. Pass in an argument to limit the number of stack frames displayed (such as backtrace -5). See help backtrace.

    • frame - Displays only the current stack frame.

    • up - Moves up the current stack frame. Pass in an argument to move up the specified number of stack frames (up -3).

    • down - Moves the current stack frame down. Pass in an argument to move down the specified number of stack frames (down -5).

    • watchpoint - Sets an access or write watchpoint.

    • assign - Assigns a debugger convenience variable or application variable.

    • gdbmode - Drops directly into the gdb interpreter.

    • kill - Kills the application with SIGKILL and ends the debug session.

    • release - Detaches and resumes the application. Ends the debug session.

    • quit - Exits gdb4hpc and releases the application.

  5. Fix any pinpointed bugs, recompile, and verify that the fix works.

Debug applications with valgrind4hpc to find common errors

PREREQUISITES

The target application must be:

  • Dynamically linked.

  • Compiled with -g to keep debugging symbols.

OBJECTIVE

Find common issues like memory leaks using valgrind4hpc.

Valgrind4hpc is a Valgrind-based debugging tool which detects memory leaks and errors in parallel applications. Valgrind4hpc aggregates any duplicate messages across ranks to help provide an understandable picture of program behavior. Valgrind4hpc manages starting and redirecting output from many copies of Valgrind, as well as deduplicating and filtering Valgrind messages.

PROCEDURE

  1. Load the valgrind4hpc module (if not already loaded):

    $ module load valgrind4hpc
    
  2. Run the memcheck tool to look for memory leaks:

    $ valgrind4hpc -n1024 --launcher-args="--exclusive --ntasks-per-node=32" \
    $PWD/build_cray/apps/transpose_matrix -- -c -M 31 -n 1000
    

    Use these common valgrind4hpc arguments:

    • -n, --num-ranks=<n> - Number of job ranks to pass to the workload manager (for example, Slurm).

    • -l, --launcher-args="<args>" - Additional workload manager arguments, such as rank distribution settings.

    • -o, --outputfile=<file> - Redirects all Valgrind4hpc error output to file.

    • -v, --valgrind-args="<args>" - Arguments to pass to the Valgrind instance.

      • For example, --valgrind-args="--track-origins=yes --leak-check=full" tracks the exact origin of every memory leak, at the cost of performance.

  3. Examine the Valgrind4hpc output.

    Valgrind4hpc detects a potential memory error, such as an uninitialized read/write or a memory leak, and displays an error block containing the affected ranks and a backtrace of where the error occurred. Note that for errors stemming from an invalid use of system library routines, the backtrace mentions internal library functions.

  4. Fix any pinpointed bugs, recompile, and verify that the fixes worked.

Debugging applications with Sanitizers4hpc to find common errors

PREREQUISITES

The target application must be:

  • Built with instrumentation for LLVM or GPU sanitizers (for example, -fsanitize=address).

  • Compiled with -g to keep debugging symbols.

OBJECTIVE

Find memory access or leak issues at runtime using information from LLVM Sanitizers.

Sanitizers4hpc is an aggregation tool to collect and analyze LLVM Sanitizers output at scale. The Clang AddressSanitizer, LeakSanitizer, and ThreadSanitizer tools are supported. Additionally, the AMD GPU Sanitizer library and the Nvidia Compute Sanitizer are also supported. Sanitizers4hpc manages the launch of your job through the currently running workload manager. See the sanitizers4hpc(1) man page for details.
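
For example, a hedged sketch of building an application with AddressSanitizer instrumentation before running it under Sanitizers4hpc (the source file name is illustrative):

$ cc -g -fsanitize=address buggy_app.c -o a.out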

PROCEDURE

  1. Load the sanitizers4hpc module:

    $ module load sanitizers4hpc
    
  2. Run the application by supplying workload manager job launch arguments and the target binary. For example, to run the binary a.out with four ranks:

    $ sanitizers4hpc --launcher-args="-n4" -- ./a.out binary_argument
    

    When a memory error is encountered, the linked LLVM Sanitizer produces error reports for each affected rank. Sanitizers4hpc processes these error reports and aggregates them for easier analysis. For example, when the application encounters an invalid read off the end of a buffer on four ranks, Sanitizers4hpc generates a single error report noting the error on all four ranks but reporting it, based on AddressSanitizer information, at the same place in the source file.

    RANKS: <0-3>
    AddressSanitizer: heap-buffer-overflow on address
    READ of size 4 at 0x61d000002680 thread T0
    ...
     #1 0x328dc3 in main /source.c:37:22
    ...
    SUMMARY: AddressSanitizer: heap-buffer-overflow /source.c:52:15 in main
    

The sanitizers4hpc(1) man page also includes information on where to find documentation for the supported sanitizer tools.

Debug two versions of the same application side-by-side with CCDB

PREREQUISITES

Two versions of the same application must be available for comparison.

OBJECTIVE

Debug an application by comparing it to a similar, yet different and working application, to see differences in data at every stage of execution.

LIMITATIONS

Applications must be similar enough to step through execution steps in parallel, but different enough to see data changes through those execution steps.

PROCEDURE

  1. Load the CCDB module.

    $ module load cray-ccdb
    
  2. Launch CCDB.

    $ ccdb
    

    The CCDB window appears.

  3. Populate launch specifications for both applications.

    a. Enter Application, Launcher Args, and Number of PEs details if resources have already been allocated and a qsub session started.

    b. Enter a Batch Type in each Launch Specification if resources have not been allocated.

  4. Double-click the source file for each test application to view the source code.

  5. Generate an Assertion Script.

    a. Left-click on any line number.

    b. Click Build Assert.

    The Assertion Script Dialog window opens.

    c. Enter Name of the Source File, Line number, Variable, and Decomposition information for both App0 and App1.

    d. Click Add Assert.

    e. Select Save Script to save the Assertion script.

    f. Select Start to run the assertion script.

  6. Open an Assertion Script.

    a. Select View from the menu bar.

    b. Hover over Assertion Scripts in the View drop-down list.

    c. Select an Assertion Script.

    The CCDB Assertion Script dialog opens with the assertion loaded.

  7. Alternatively, use CCDB controls to step through the two applications to determine where they break down or diverge.

  8. Click the red FAILURE boxes to access failure details.

    Tip: For more information on using CCDB, click ? in the current window or Help from the main CCDB window menu bar.

  9. After narrowing down the search space, use a traditional debugger like gdb4hpc or valgrind4hpc (depending on the bug’s nature) to find the bug. After fixing it, run the old and new versions side-by-side in CCDB to verify that the bug was fixed.

Profiling an application

Identify application bottlenecks

PREREQUISITES

  • CPE must be installed.

OBJECTIVE

This procedure instruments applications, runs them, and creates detailed output highlighting application bottlenecks.

PROCEDURE

  1. Load the perftools-base module if it is not already loaded:

    $ module load perftools-base
    
  2. Load the perftools-lite instrumentation module:

    $ module load perftools-lite
    
  3. Compile and link the program:

    $ make program
    
  4. Run the program:

    $ srun a.out
    

    After program execution completes, perftools-lite generates:

    • A text report to stdout, profiling program behavior, identifying where the program spends its execution time, and offering recommendations for further analysis and possible optimizations.

    • An experiment data directory, containing files which can be used to examine program behavior more closely using Cray Apprentice2 or pat_report.

    • A report file, data-directory/rpt-files/RUNTIME.rpt, containing the same information written to stdout.

  5. Review the profiling reports written to stdout. To get additional information without re-running, use the pat_report utility on the experiment directory (such as my_app.mpi+68976-16s) produced from a profiling run to generate new text reports. For example:

    $ pat_report -O calltree+src my_app.mpi+68976-16s
    

    Tip: For additional help, run pat_help from the command line. Also, refer to the HPE Performance Analysis Tools User Guide (S-8014).