Getting started with perftools
- Author:
Hewlett Packard Enterprise Development LP.
- Copyright:
Copyright 2024-2025 Hewlett Packard Enterprise Development LP.
Introduction
Perftools is a sophisticated suite of tools for investigating and uncovering a broad range of bottlenecks in an HPC application: CPU computation, GPU computation, communication, I/O, and more. Its extensive options let a user deeply understand what is happening inside their application.
But that isn’t this document.
This guide aims to get you up and running with the _minimum_ amount you need to know.
The basic idea
Perftools starts with an instrumented run of your application set up to record performance information in an experiment directory you can examine later.
There are two basic mechanisms for instrumenting your program that you need to know:
- Sampling is the less invasive strategy; it stops your running program periodically and records what it is doing. Typically, you start with sampling to determine the nature of your bottlenecks.
- Tracing modifies the program by inserting code that explicitly counts how many times a particular block of code is executed. It can provide more detailed information than sampling alone, but requires rebuilding your application.
Creating your first sampling experiment
For the example walkthroughs we will be using the RajaPerf suite, available on GitHub. It's not required in order to use perftools, but it makes a good workbench for trying it out, since it can be built for multiple configurations and different-sized jobs.
Perftools is designed to work in multiple environments, so how you run the experiment will depend on your environment.
Using pat_run
The `pat_run` utility instruments and runs the application in a single command and is often the easiest way to get started.
Instructions are here:
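As a sketch of what this looks like in practice (the module name, job launcher, and flags below are assumptions that depend on your site; `raja-perf.exe` stands in for your own executable):

```shell
# Make pat_run available (perftools-base is the assumed module name).
module load perftools-base

# Launch the application under pat_run; it instruments the binary on
# the fly and writes an experiment directory when the run finishes.
srun -n 64 pat_run ./raja-perf.exe
```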
Using the compiler wrappers from CCE
The Cray Programming Environment is integrated so that sampling can be enabled simply by loading one of the `perftools-lite` modules.
Instructions are here:
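A minimal sketch of the lite workflow, assuming a CPE login node and a build that uses the CPE compiler wrappers (module names, the source file, and the launch command are placeholders):

```shell
# Load the lite instrumentation module (perftools-base first is assumed).
module load perftools-base perftools-lite

# Rebuild with the usual CPE compiler wrapper; with the module loaded,
# the wrapper instruments the executable automatically.
cc -o myapp myapp.c

# Run normally; a sampling report is printed at job exit and an
# experiment directory is saved for later inspection.
srun -n 64 ./myapp
```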
The default report
The default report has tables chosen to point you toward the most common potential bottlenecks; it might suggest a solution or how to investigate further.
It can contain the following sections:
- A summary of the run parameters: when, where, and how the experiment was run.
- The flat profile: the functions in your program sorted by how much time the application spent in them.
- The hierarchical table: a call-stack-oriented view of the time usage.
- MPI utilization
- Memory caching efficiency
- Energy and power usage
- File I/O time
- Lustre filesystem information
The exact set will depend on the data collected when running the experiment.
The report contains a brief description for each section with instructions to get more documentation. There’s a sample output here:
Viewing an experiment
The raw data for an experiment is always saved into an experiment directory whose name contains the name of your executable plus an index to distinguish different runs. For this particular run, that was `raja-perf.exe+592077-23709654s`. For this tiny example it was 11 MB, but for a long-running job collecting a lot of data it could be in the terabytes.
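If you want to check how large an experiment directory has grown before copying or archiving it, plain `du` works (the directory name here is the one from this run; yours will differ):

```shell
# Report the total on-disk size of the experiment directory.
du -sh raja-perf.exe+592077-23709654s
```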
Apprentice 3
Perftools supplies the graphical tool Apprentice 3 to easily navigate the available text reports, as well as multiple graphical visualizations. There are desktop clients for Windows and Mac that can access your experiment via ssh (no need to copy that 4 TB experiment), or you can use a remote desktop or X redirection to run it from the login node: run it as `app3`.
See the guide here:
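For the login-node route, a sketch (the experiment directory name is the one from the run above; the trailing `&` just keeps your shell free while the GUI is open):

```shell
# Open the experiment in Apprentice 3 over X redirection or a remote desktop.
app3 raja-perf.exe+592077-23709654s &
```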
pat_report
`pat_report` can generate a wide variety of text reports from a saved experiment directory. Note, though, that the set of reports available will depend on what data was gathered for the experiment.
Since `pat_report` needs to be powerful enough to handle the gamut of HPC applications and their users, `man pat_report` can be a bit daunting.
A good place to start is the `-O` option, which outputs a large set of predefined reports, for example `pat_report -O callers <experiment-directory>`.
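A sketch of putting `-O` to work (the `-O -h` listing and the exact set of keywords vary between perftools versions, so treat this as an assumption to verify against `man pat_report`; the directory name is from the run above):

```shell
# List the predefined -O report keywords this installation supports.
pat_report -O -h

# Generate one of them: callers shows who called the expensive functions.
pat_report -O callers raja-perf.exe+592077-23709654s
```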
You should also be aware of the `-T` option, which disables thresholds. `pat_report` normally filters out tiny entries; these clutter up the results of a real run, but the filtering might remove everything if you're trying out a "hello world" test app.
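For a toy run, combining `-T` with a predefined report keeps those tiny entries visible (a sketch; `calltree` is assumed to be one of the available `-O` keywords, and the directory name is from the run above):

```shell
# Disable thresholding so even sub-percent entries appear in the table.
pat_report -T -O calltree raja-perf.exe+592077-23709654s
```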