perftools-lite

an easy to use version of the CrayPat Performance Measurement and Analysis Tool

Author:

Hewlett Packard Enterprise Development LP.

Copyright:

Copyright 2019-2021,2023-2024 Hewlett Packard Enterprise Development LP.

Manual section:

4

DESCRIPTION

Perftools-Lite is a simplified, easy-to-use version of the Cray Performance Measurement and Analysis Tool set. Perftools-Lite provides basic performance analysis information automatically, with a minimum of user interaction, and yet offers information useful to users wishing to explore their program’s behavior further using the full CrayPat tool set.

To use Perftools-Lite, follow these steps, assuming the perftools-base module is already loaded.

Load one of the perftools-lite instrumentation modules.

$ module load perftools-lite

Available modules:

  • perftools-lite — Default profile

  • perftools-lite-events — Event profile

  • perftools-lite-gpu — GPU kernel and data movement along with event profile (see WARNINGS below)

  • perftools-lite-loops — Loop estimates, for use with Reveal (see WARNINGS below)

  • perftools-lite-hbm — High bandwidth memory data (for X86-64 systems only)

Compile and link the program.

$ make program

Run the program on the Cray system.

$ srun a.out

At the end of normal program execution, Perftools-Lite produces the following output:

  • A text report to stdout, profiling the program’s behavior, identifying where the program spends its execution time, and offering recommendations for further analysis and possible optimizations.

  • An experiment data directory, with contents including the following.

      • An rpt-files subdirectory, which contains a copy of the text report.

      • An ap2-files subdirectory, which contains processed data files, in the format required to examine the program’s behavior more closely using Cray Apprentice2 or pat_report.

      • An xf-files subdirectory, which contains the raw data files written by the instrumented executable.

  • One or more MPICH_RANK_ORDER_FILE files (each with different suffixes), containing suggestions for optimizing MPI rank placement in subsequent program runs. The number and types of files produced is determined by the information captured during program execution. The files can include rank reordering suggestions based on sent message data from MPI functions, time spent in user functions, or a hybrid of the two.
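As a quick check after a run, the expected layout of the experiment data directory can be verified with a small shell helper. This is only a sketch: the function name summarize_expdir is hypothetical, and the directory path must be supplied by the user, because the name of the experiment data directory depends on the run.

```shell
# Hypothetical helper: report which of the documented subdirectories
# exist inside an experiment data directory passed as the first argument.
summarize_expdir() {
  expdir=$1
  for sub in rpt-files ap2-files xf-files; do
    if [ -d "$expdir/$sub" ]; then
      echo "$sub: present"
    else
      echo "$sub: missing"
    fi
  done
}

# Usage (substitute the directory created by your own run):
#   summarize_expdir /path/to/experiment_directory
```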

OPTIONS

The Perftools-Lite instrumentation modules support five basic experiments:

  • perftools-lite — A sampling experiment, which reports execution time, aggregate MFLOP count, the top time-consuming functions and routines, MPI behavior in user functions (if the application is an MPI program), and generates the data files listed above.

  • perftools-lite-events — A tracing experiment, which generates a profile of the top functions traced as well as node observations and possible rank order suggestions.

  • perftools-lite-gpu — A tracing experiment that focuses on the program’s use of GPU accelerators. Includes data movement between host and device(s), in addition to the functionality of the perftools-lite-events module (see WARNINGS below).

  • perftools-lite-loops — Loop estimates, for use with Reveal (see WARNINGS below).

  • perftools-lite-hbm — (X86-64 systems only) Identifies program locations responsible for high bandwidth from memory, on Intel processors.

To perform a different experiment, swap to one of the other instrumentation modules and relink or recompile the application.

To disable Perftools-Lite during a build, unload the perftools-lite instrumentation module. To re-enable Perftools-Lite, load the desired perftools-lite instrumentation module. Once built using Perftools-Lite, an executable is instrumented and will initiate CrayPat functionality at runtime whether or not the Perftools-Lite module is loaded.

EXAMPLES

A typical Perftools-Lite session follows these steps. Assuming the perftools-base module is loaded already, load the perftools-lite module.

$ module load perftools-lite

Compile and link the program.

$ make program

Any .o files generated during this step are saved automatically.

Run the program.

$ srun a.out

Review the resulting reports from the profiling experiment. When you’re ready to continue with another experiment, delete or rename the a.out file.

$ rm a.out

This forces the next make invocation to relink the program for the new experiment.

Swap to a different instrumentation module.

$ module swap perftools-lite perftools-lite-events

Rerun make to relink the program.

$ make program

Because your .o files were saved during the earlier compile step, this merely relinks your program.

Run the program again.

$ srun a.out

Review the resulting reports and data files, and determine whether you want to explore your program’s behavior further using the full CrayPat tool set or use one of the MPICH_RANK_ORDER_FILE files to create a customized rank placement. (For more information about customized rank placements, see the instructions contained in the MPICH_RANK_ORDER_FILE and the intro_mpi(3) man page.)
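Applying a suggested rank placement might look like the sketch below. It assumes the usual Cray MPICH convention, in which a custom placement is read from a file named MPICH_RANK_ORDER in the working directory when MPICH_RANK_REORDER_METHOD is set to 3; confirm this against the instructions inside the generated file and intro_mpi(3). The function name use_rank_order is hypothetical, and the file to pass in is whichever generated variant you selected.

```shell
# Hypothetical helper: activate a chosen rank order file for the next run.
# $1 is the path to one of the generated MPICH_RANK_ORDER_FILE variants;
# its suffix depends on the run, so no specific name is assumed here.
use_rank_order() {
  src=$1
  if [ ! -f "$src" ]; then
    echo "error: $src not found" >&2
    return 1
  fi
  # Assumption: MPI reads ./MPICH_RANK_ORDER when the reorder method is 3.
  cp "$src" MPICH_RANK_ORDER && export MPICH_RANK_REORDER_METHOD=3
}
```

After calling the helper in the run directory, rerun the program as before (for example, with srun a.out).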

If you decide to continue using the full CrayPat tool set, unload the perftools-lite instrumentation module and load the perftools instrumentation module.

$ module unload perftools-lite
$ module load perftools

To determine whether a binary has already been instrumented using CrayPat or CrayPat-lite, use the strings utility to search for CrayPat/X. If the binary is instrumented, the search returns the CrayPat version number and other information. For example, if the binary is named a.out:

$ strings a.out | grep CrayPat/X
CrayPat/X: Version 7.1.2 Revision 8b25b9a92 11/01/19 19:44:58
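The same check can be wrapped in a small function for use in build or job scripts. This is a minimal sketch: the function name is_instrumented is hypothetical, and it relies only on the strings and grep commands shown above.

```shell
# Hypothetical helper: succeed (exit status 0) if the file named by $1
# contains the CrayPat/X marker string, and fail otherwise.
is_instrumented() {
  strings "$1" | grep -q 'CrayPat/X'
}

# Usage:
#   if is_instrumented a.out; then echo "instrumented"; fi
```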

WARNINGS

Executables that are created during a make invocation and subsequently used to manage make-related tasks are instrumented by default if a perftools-lite module is loaded. To control which executables are instrumented and which are not, set the environment variables CRAYPAT_LITE_WHITELIST and CRAYPAT_LITE_BLACKLIST, respectively, to the corresponding executable file name(s). See the pat_build(1) man page for more details.
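For illustration, the variables might be set before running make as follows. The names here are assumptions: a.out stands for the application to be profiled, and gen_tables is a hypothetical helper executable built and run by the makefile. The syntax for listing several names is not covered here; see pat_build(1).

```shell
# Assumption: a.out is the application that should be instrumented.
export CRAYPAT_LITE_WHITELIST=a.out

# Alternatively, name an executable that should not be instrumented
# (gen_tables is a hypothetical build helper):
# export CRAYPAT_LITE_BLACKLIST=gen_tables
```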

Perftools-Lite requires compute nodes to have shared root access. If your program run ends with a “cannot access …. No such file or directory” error message, DSL shared root is not initialized. In this case, setting CRAY_ROOTFS to DSL resolves the problem and permits Perftools-Lite to work.
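In a Bourne-style shell, the setting described above would typically be applied in the job script or session before the program is launched:

```shell
# Set CRAY_ROOTFS as described above, then rerun the program
# (for example, with: srun a.out).
export CRAY_ROOTFS=DSL
```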

By default, data files are written to the execution directory. This directory must reside on a file system that supports record locking, such as the Lustre file system or a similar high-performance file system. If necessary, set the environment variable PAT_RT_EXPDIR_BASE to point to an existing directory that resides on a high-performance file system.
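Because PAT_RT_EXPDIR_BASE must point to an existing directory, a job script can verify the directory before exporting the variable. The following is a sketch only: the function name set_expdir_base is hypothetical, and the path in the usage comment is a placeholder for a directory on your site's Lustre or similar file system.

```shell
# Hypothetical helper: export PAT_RT_EXPDIR_BASE only if the directory
# given as $1 already exists; otherwise report an error and fail.
set_expdir_base() {
  dir=$1
  if [ ! -d "$dir" ]; then
    echo "error: $dir does not exist" >&2
    return 1
  fi
  export PAT_RT_EXPDIR_BASE="$dir"
}

# Usage (placeholder path):
#   set_expdir_base /path/on/lustre/pat_data
```

The helper checks only that the directory exists; it does not verify that the underlying file system supports record locking.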

The behavior of CrayPat when writing data files is described in more detail in the environment variable PAT_RT_EXPFILE_MAX description. Both the environment variables PAT_RT_EXPDIR_BASE and PAT_RT_EXPFILE_MAX are described in the intro_craypat(1) man page.

perftools-lite-loops disables all OpenMP optimizations, including API calls such as omp_get_wtime(). In order to compile codes containing such OpenMP API calls, conditional compilation should be used. For implementations supporting a preprocessor, this can be done using the _OPENMP macro. For example:

#if defined(_OPENMP)
  time = omp_get_wtime();
#endif

In order to conditionally compile Fortran code, conditional compilation sentinels recognized by the OpenMP standard should be used.

perftools-lite-loops does not support full-trace mode (PAT_RT_SUMMARY set to 0) in that it does not record temporal information about loop execution.

perftools-lite-gpu includes the -g cuda tracegroup. This may obscure OpenMP regions in the reported profile if these regions are called within CUDA functions. This experiment is therefore not recommended for profiling OpenMP regions, and users seeking this information are instead directed to the perftools-lite-events module.

SEE ALSO

intro_craypat(1), pat_build(1), pat_help(1), pat_report(1), pat_run(1), grid_order(1), reveal(1)

intro_mpi(3)

perftools-base(4)

accpc(5), cray_pm(5), cray_rapl(5), hwpc(5), cray_cassini(5), uncore(5), papi_counters(5)