Getting started with perftools
- Author:
Hewlett Packard Enterprise Development LP.
- Copyright:
Copyright 2024-2025 Hewlett Packard Enterprise Development LP.
Introduction
Perftools is a sophisticated suite of tools for investigating and uncovering a broad range of bottlenecks in an HPC application: CPU computation, GPU computation, communication, I/O, and more. Its extensive options let a user deeply understand what is happening inside their application.
But that isn’t this document.
This guide aims to get you up and running with the _minimum_ amount you need to know.
The basic idea
Perftools starts with an instrumented run of your application set up to record performance information in an experiment directory you can examine later.
There are two basic mechanisms for instrumenting your program that you need to know:
- Sampling is the less invasive strategy; it stops your running program periodically and records what it is doing. Typically, you start with sampling to determine the nature of your bottlenecks.
- Tracing modifies the program by inserting code that explicitly counts how many times a particular block of code is executed. It can provide more detailed information than sampling alone, but requires rebuilding your application.
Creating your first sampling experiment
For the example walkthroughs we will be using the RajaPerf suite, available on GitHub. It's not required in order to use perftools, but it makes a good workbench for trying it out, since it can be built for multiple configurations and different-sized jobs.
Perftools is designed to work in multiple environments, so how you run the experiment will depend on your environment.
Using pat_run
The `pat_run` utility instruments and runs the application in a single command and is often the easiest way to get started.
Instructions are here:
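As a sketch of what this looks like in practice (the module name, job launcher, and flags below are assumptions that depend on your site; `raja-perf.exe` stands in for your own executable):

```shell
# Make pat_run available (perftools-base is the assumed module name).
module load perftools-base

# Launch the application under pat_run; it instruments the binary on
# the fly and writes an experiment directory when the run finishes.
srun -n 64 pat_run ./raja-perf.exe
```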
Using the compiler wrappers from CCE
The Cray Programming Environment is integrated so that sampling can be enabled simply by loading one of the `perftools-lite` modules.
Instructions are here:
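A minimal sketch of the lite workflow, assuming a CPE login node and a build that uses the CPE compiler wrappers (module names, the source file, and the launch command are placeholders):

```shell
# Load the lite instrumentation module (perftools-base first is assumed).
module load perftools-base perftools-lite

# Rebuild with the usual CPE compiler wrapper; with the module loaded,
# the wrapper instruments the executable automatically.
cc -o myapp myapp.c

# Run normally; a sampling report is printed at job exit and an
# experiment directory is saved for later inspection.
srun -n 64 ./myapp
```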
The default report
The default report has tables chosen to point you toward the most common potential bottlenecks; it might suggest a solution or how to investigate further.
It can contain the following sections:
- A summary of the run parameters: when, where, and how the experiment was run.
- The flat profile: the functions in your program sorted by how much time the application spent in them.
- The hierarchical table: a call-stack-oriented view of the time usage.
- MPI utilization
- Memory caching efficiency
- Energy and power usage
- File I/O time
- Lustre filesystem information
The exact set will depend on the data collected when running the experiment.
The report contains a brief description for each section with instructions to get more documentation. There’s a sample output here:
Viewing an experiment
The raw data for an experiment is always saved into an experiment directory whose name contains the name of your executable plus an index to distinguish different runs. For this particular run, that was `raja-perf.exe+592077-23709654s`. For this tiny example it was 11 MB, but for a long-running job collecting a lot of data it could be in the terabytes.
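If you want to check how large an experiment directory has grown before copying or archiving it, plain `du` works (the directory name here is the one from this run; yours will differ):

```shell
# Report the total on-disk size of the experiment directory.
du -sh raja-perf.exe+592077-23709654s
```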
Apprentice 3
Perftools supplies the graphical tool Apprentice 3 to easily navigate the available text reports, as well as multiple graphical visualizations. There are desktop clients for Windows and Mac that can access your experiment via ssh (no need to copy that 4 TB experiment), or you can use a remote desktop or X redirection to run it from the login node: run it as `app3`.
See the guide here:
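For the login-node route, a sketch (the experiment directory name is the one from the run above; the trailing `&` just keeps your shell free while the GUI is open):

```shell
# Open the experiment in Apprentice 3 over X redirection or a remote desktop.
app3 raja-perf.exe+592077-23709654s &
```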
pat_report
`pat_report` can generate a wide variety of text reports from a saved experiment directory. Note, though, that the set of reports available will depend on what data was gathered for the experiment.
Since `pat_report` needs to be powerful enough to handle the gamut of HPC applications and their users, `man pat_report` can be a bit daunting.
A good place to start is the `-O` option, which outputs a large set of predefined reports, for example `pat_report -O callers <experiment-directory>`.
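A sketch of putting `-O` to work (the `-O -h` listing and the exact set of keywords vary between perftools versions, so treat this as an assumption to verify against `man pat_report`; the directory name is from the run above):

```shell
# List the predefined -O report keywords this installation supports.
pat_report -O -h

# Generate one of them: callers shows who called the expensive functions.
pat_report -O callers raja-perf.exe+592077-23709654s
```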
You should also be aware of the `-T` option, which disables thresholds. `pat_report` normally filters out tiny entries; these clutter up the results of a real run, but the filtering might remove everything if you're trying out a "hello world" test app.
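For a toy run, combining `-T` with a predefined report keeps those tiny entries visible (a sketch; `calltree` is assumed to be one of the available `-O` keywords, and the directory name is from the run above):

```shell
# Disable thresholding so even sub-percent entries appear in the table.
pat_report -T -O calltree raja-perf.exe+592077-23709654s
```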