Cray Performance and Analysis Tools
- Author:
Hewlett Packard Enterprise Development LP.
- Copyright:
Copyright 2023-2025 Hewlett Packard Enterprise Development LP.
Overview
The Performance Analysis Tools (Perftools) are a suite of utilities that enable users to capture and analyze performance data generated during program execution, thereby reducing the time to port and tune applications. These tools provide an integrated infrastructure for measurement, analysis, and visualization of computation, communication, I/O, and memory utilization to help users optimize programs for faster execution and more efficient computing resource usage. The data collected and analyzed by these tools help users answer two fundamental developer questions: What is the performance of my program? and How can I make it perform better?
The toolset allows developers to perform profiling, sampling, and tracing experiments on executables, extracting information at the program, function, loop, and line level. Programs written in Fortran, C/C++ (including UPC), Python, MPI, SHMEM, OpenMP, CUDA, HIP, OpenACC, or a combination of these languages and models, are supported. Profiling is supported for applications built with the HPE Cray Compiling Environment (CCE), AMD, AOCC, GNU, Intel, Intel OneAPI, or Nvidia HPC SDK compilers. However, not all combinations of programming models are supported, and not all compilers are supported on all platforms.
Use performance tools to:
Identify bottlenecks
Find load-balance and synchronization issues
Find communication overhead issues
Identify loops for parallelization
Map memory bandwidth utilization
Optimize vectorization
Collect application energy consumption information
Collect scaling information
Interpret performance data
Introduction
Performance analysis consists of three basic steps:
Instrument the program to specify what kind of data to collect under what conditions.
Execute the instrumented executable to generate and capture designated data.
Analyze the data.
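The three steps above can be sketched as a short shell script using pat_build and pat_report. This is a minimal illustration and not a definitive recipe: the program name ./my_app, the srun launcher with 4 ranks, and the EXP_DIR placeholder for the experiment data directory are all assumptions, and module setup and compilation are omitted. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the instrument / execute / analyze cycle.
# Assumptions (not from the text above): the program is ./my_app, the job
# launcher is srun with 4 ranks, and EXP_DIR stands in for the experiment
# data directory produced by the instrumented run.
APP="${APP:-./my_app}"
EXP_DIR="${EXP_DIR:-my_app+pat+EXPERIMENT}"   # placeholder, not a real name
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print commands only

run() {
    # Print the command in dry-run mode; otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# 1. Instrument: pat_build writes an instrumented copy named <program>+pat.
run pat_build "$APP"

# 2. Execute the instrumented copy to generate and capture the data.
run srun -n 4 "${APP}+pat"

# 3. Analyze: generate a text report from the experiment data directory.
run pat_report "$EXP_DIR"
```

Setting DRY_RUN=0 executes the commands instead of printing them; in that case EXP_DIR must be set to the directory the instrumented run actually created.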
Tip
If you haven’t used perftools before, there’s a walkthrough of the process here: Getting started with perftools
There are three programming interfaces available:
Perftools-lite
Perftools-lite: Simple interface that produces reports to stdout. There are five Perftools-lite submodules:
perftools-lite - Lowest overhead sampling experiment identifies key program bottlenecks.
perftools-lite-events - Produces a summarized trace; a good tool for detailed MPI statistics, including synchronization overhead.
perftools-lite-loops - Provides loop work estimates (must be used with CCE).
perftools-lite-gpu - Focuses on the program’s use of GPU accelerators.
perftools-lite-hbm - Reports memory traffic information (must be used with CCE; Intel processors only).
See the perftools-lite man page for details.
Perftools
Perftools: Advanced interface that provides full-featured data collection and analysis capability, including full traces with timeline displays. It includes the following components:
pat_build - Utility that instruments programs for performance data collection.
pat_report - After using pat_build to instrument the program, setting the runtime environment variables and executing the program, use pat_report to generate text reports from the resulting data and export the data to other applications. See the pat_report man page for details.
CrayPat runtime library - Collects specified performance data during program execution. See the intro_craypat man page for details.
Perftools-preload
Perftools-preload: Runtime instrumentation version of the performance analysis tools, which eliminates the need to instrument an executable program with pat_build. perftools-preload acquires performance data about the program, providing access to nearly all of the performance analysis features available when executing a program instrumented with pat_build. See the perftools-preload man page for more details.
pat_run - An option for programs built with or without perftools-preload. The program is instrumented during runtime, and collected data can be explored further with pat_report and Apprentice3 tools. See the pat_run man page for details.
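As a rough illustration of the runtime-instrumentation path, the sketch below launches an unmodified program under pat_run and then passes the resulting data to pat_report. The program name ./my_app, the srun launcher with 4 ranks, and the EXP_DIR placeholder are assumptions for illustration only; consult the pat_run man page for the options supported on a given system. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of runtime instrumentation with pat_run (no pat_build step).
# Assumptions: ./my_app is the program, srun is the job launcher, and
# EXP_DIR stands in for the experiment data directory the run produces.
APP="${APP:-./my_app}"
EXP_DIR="${EXP_DIR:-my_app+EXPERIMENT}"   # placeholder, not a real name
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print commands only

run() {
    # Print the command in dry-run mode; otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Launch the original executable under pat_run; data is collected at runtime.
run srun -n 4 pat_run "$APP"

# Explore the collected data with pat_report.
run pat_report "$EXP_DIR"
```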
Experiments available include:
Sampling experiment - A lightweight experiment that interrupts the program at specific intervals to gather data.
Profiling experiment - A tracing experiment that summarizes collected data.
Tracing experiment - A full-trace experiment that provides detailed information.
Also included:
PAPI - The PAPI library, from the Innovative Computing Laboratory at the University of Tennessee in Knoxville, is distributed with HPE Performance Analysis Tools. PAPI allows applications or custom tools to interface with hardware performance counters made available by the processor, network, or accelerator vendor. Perftools components use PAPI internally for CPU, GPU, network, power, and energy performance counter collection for derived metrics, observations, and performance reporting. A simplified user interface is provided for accessing counters; unlike using PAPI directly, it does not require source code modification.
Apprentice3 - An interactive X Window System tool for visualizing and manipulating performance analysis data captured during program execution. Mac and Windows clients are also available.
pat_view - Aggregates and presents multiple sampling experiments for program scaling analysis. See the pat_view man page for more information.
pat_info - Generates a quick summary statement of the contents of a CrayPat experiment data directory.
pat_opts - Displays compile and link options used to prepare files for performance instrumentation.
Overview of Apprentice3
Apprentice3 is the next-generation GUI tool, showcasing new or updated features:
interactive performance reports
flame graph visualization
improved timeline view
New features and improvements will be rolled out in Apprentice3 in subsequent releases.
The user guide for Apprentice3 is here:
Man Pages
These man pages introduce and explain various components of the Performance Analysis Tools (Perftools):