Creating a sampling experiment with perftools-lite

Author:

Hewlett Packard Enterprise Development LP.

Copyright:

Copyright 2024-2025 Hewlett Packard Enterprise Development LP.

Setting up RajaPerf (optional)

For reference, at the time of writing, these steps set up raja-perf to build with CCE and cray-mpich:

> mkdir build_mpi_cce
> cd build_mpi_cce
> cmake -DENABLE_MPI=On -DMPI_CXX_HEADER_DIR=$MPICH_DIR/include \
-DMPI_CXX_COMPILER=$MPICH_DIR/bin/CC \
-DMPI_libmi.so_LIBRARY=$MPICH_DIR/lib/libmpi.so \
-DCMAKE_CXX_COMPILER=CC -DMPI_CXX_LIB_NAMES=libmi.so ..
   ... about 75 lines of cmake output ...

Building and running with perftools-lite

Load the perftools-lite module before building the application:

> module load perftools-lite

Build the application:

> make -j
[  1%] Building CXX object blt/thirdparty_builtin/googletest-master-2020-01-07/googletest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
[  1%] Building CXX object blt/tests/smoke/CMakeFiles/blt_mpi_smoke.dir/blt_mpi_smoke.cpp.o
[  2%] Building CXX object tpl/RAJA/tpl/camp/CMakeFiles/camp.dir/src/errors.cpp.o
      ...
[100%] Built target test-raja-perf-suite.exe

Then run it. Note that everything before `./bin/raja-perf.exe` is specific to your environment:

> srun -p bardpeak -n 2 --exclusive ./bin/raja-perf.exe -pftol 0.05  -k Apps_HALOEXCHANGE_FUSED
srun: job 9135947 queued and waiting for resources
srun: job 9135947 has been allocated resources
CrayPat/X:  Version 24.11.0 Revision 31a512b4d sles15.5_x86_64
10/02/24 19:11:06
... output from the test program ...

#################################################################
#                                                               #
#            CrayPat-lite Performance Statistics                #
#                                                               #
#################################################################

CrayPat/X:  Version 24.11.0 Revision 31a512b4d sles15.5_x86_64  10/02/24 19:11:06
Experiment:                  lite  lite-samples
Number of PEs (MPI ranks):      2
Numbers of PEs per Node:        2
... the rest of the perftools default report ...

The default report

The printed default report contains a set of tables that give a good basic view of what is going on in your program. Its contents are described on the getting started page: The default report

Going further

The perftools-lite module is one of a family of “lite” options that enable sampling in different contexts.

  • perftools-lite — Default profile

  • perftools-lite-events — Event profile

  • perftools-lite-gpu — GPU kernel and data movement along with event profile

  • perftools-lite-loops — Loop estimates (Cray CCE compiler only)

  • perftools-lite-hbm — High bandwidth memory data (for X86-64 systems only)

The workflow for each of these is the same as for perftools-lite; they just generate more specialized reports. See the man page for more details: perftools-lite
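
As a rough sketch of that shared workflow, switching this example to the loop-estimate profile might look like the following. The choice of perftools-lite-loops, the `LITE_MODULE` variable, and the `make clean` step are assumptions made for illustration; the module setup at your site may need different commands.

```shell
# Illustrative choice; any module from the list above could be used instead.
LITE_MODULE=perftools-lite-loops

# Only attempt the swap where the module command is available.
if command -v module >/dev/null 2>&1; then
    # Replace the default lite module with the specialized one.
    module unload perftools-lite
    module load "$LITE_MODULE"

    # Rebuild so the application is built with the new module loaded
    # (assumes a Makefile with a clean target, as CMake generates).
    make clean
    make -j
fi
```

After the rebuild, the application is run exactly as before, and the specialized report is printed at the end of the run.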

perftools-lite saves the experiment results in the current directory. They are filed under the name of the executable, with a decoration to distinguish different runs; in this example, `raja-perf.exe+349876-26340827s`. You can use `pat_report` or Apprentice 3 to look at these results.
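
Because the decoration changes from run to run, it can be handy to pick up the newest results automatically. The snippet below is a minimal sketch of that idea: `latest_experiment` is a hypothetical helper (not part of perftools), and the report file name is an arbitrary choice. It assumes the results are named `<executable>+<decoration>` as in the example above, and it only calls `pat_report` if the command is found.

```shell
# Hypothetical helper: print the most recently modified experiment
# entry for the given executable name in the current directory.
latest_experiment() {
    exe="$1"
    # Results are assumed to be named <executable>+<decoration>.
    # No match is not an error; in that case nothing is printed.
    ls -dt -- "$exe"+* 2>/dev/null | head -n 1 || true
}

exp_dir=$(latest_experiment raja-perf.exe)
if [ -n "$exp_dir" ] && command -v pat_report >/dev/null 2>&1; then
    # Regenerate the text report and save it to a file (-o names the output).
    pat_report -o "$exp_dir.report.txt" "$exp_dir"
fi
```

Run from the directory where the application was launched, this leaves a text report next to the newest experiment results.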

pat_report

Apprentice 3