intro_craypat

Introduction to the Cray Performance Analysis Tool

Author:

Hewlett Packard Enterprise Development LP.

Copyright:

Copyright 2019-2024 Hewlett Packard Enterprise Development LP.

Manual section:

1

DESCRIPTION

CrayPat is a performance analysis tool used to evaluate program behavior on Cray supercomputer systems. Its major components are described in the following man pages, which are accessible when the perftools-base module is loaded.

perftools(1)

The CrayPat runtime environment is a collection of libraries that collect and record performance data during program execution.

pat_run(1)

A utility used to link libraries from the runtime environment with an existing program at runtime (via LD_PRELOAD).

perftools-lite(4)

A collection of convenience modules. If one is loaded when a program is built, it automatically provides for the collection of a specific kind of performance data.

pat_build(1)

The utility used by the perftools-lite modules. When the perftools module is loaded, it may be used directly to customize the performance data to be collected.

pat_report(1)

The first-level data analysis tool, used to produce text reports and export data for additional analysis.

app2(1)

Apprentice2, a second-level data analysis tool, used to visualize, manipulate, explore, and compare sets of program performance data in a GUI environment.

app3(1)

Apprentice3, another second-level data analysis tool, which will eventually replace Apprentice2.

papi(3)

Cray adds some components to PAPI for those who prefer to access performance counters using that API.

All Cray-developed performance analysis tools, including the man pages and help system, are available only when the perftools-base module is loaded, with the exception of the PAPI components, which can also be accessed when the papi module is loaded. The perftools-base and papi modules are mutually exclusive. One or the other may be loaded, but not both at the same time.

Note on terminology: Performance data can be collected either by tracing or sampling. In tracing, data is collected on entry and exit from instrumented functions or regions. In sampling, data is collected periodically based on expiration of a timer or overflow of a hardware counter. Both tracing and sampling support the following two methods of recording the data.

The default method of recording data is to summarize it at runtime. This method supports showing profiles of functions and regions with time spent or sample counts and any other collected data. The sizes of the data files produced are proportional to the number of unique program locations (including callstacks) for which data is collected, and so remain relatively small even for programs with long execution times. The data is typically written just prior to the end of execution.

The other data recording method is a full trace of collected data values with timestamps. This method supports showing data values on a time line, in addition to profiles. The sizes of the data files produced are proportional to the total execution time, and so can become very large. Data is typically written periodically during execution, but cannot be viewed until execution completes.

Note that for either sampling or tracing, full trace data files can be requested by setting the environment variable PAT_RT_SUMMARY to 0, or by using the -t option of pat_run.

Online Help

When the perftools-base module is loaded, the following tools and man pages are available, in addition to those listed above.

intro_craypat(1)

Same as the perftools(1) man page.

pat_help(1)

Extensive help and usage examples, including the CrayPat FAQ.

pat_view(1)

A viewer to compare CrayPat data from multiple executions. It can be used to visualize scaling issues.

pat_info(1)

The experiment data directory and data file query tool.

app3(1)

Cray Apprentice3: An update of Apprentice2 with new graphics features.

grid_order(1)

An MPI rank order list generator.

intro_papi(3)

Same as the papi(3) man page.

perftools-base(4)

Introduction to the CrayPat fundamental module.

accpc(5)

Basic information about accelerator performance counters.

cray_cassini(5)

Basic information about Cassini NIC performance counters.

cray_pm(5)

Running Average Power Limit (RAPL) and Cray Power Management (PM) counters (HPE Cray EX series systems only).

hwpc(5)

Basic information about hardware performance counters.

papi_counters(5)

Introduction to the CrayPat PAPI event counters.

uncore(5)

Uncore (off-core) performance counters.

For complete lists of the hardware counter events currently supported, organized by processor family, execute the pat_help utility and select the counters topic.

BASIC USAGE

CrayPat is a highly flexible tool with many options. Basic usage, however, is straightforward.

Start with a debugged and executable program. CrayPat is a performance analysis tool, not a debugging tool. The program must be capable of running to planned completion or intentional termination before CrayPat can be useful.

Ensure the perftools-base module is loaded. This provides access to man pages, to utilities such as pat_run, pat_report, pat_build, pat_help, grid_order, and Cray Apprentice2, and to additional instrumentation modules. This module does not affect program behavior and can be left loaded even when not collecting performance data.

$ module load perftools-base

Execute the program with the pat_run utility. For example, to execute a program using pat_run and the SLURM Workload Manager:

$ srun [srun-options] pat_run myprogram

This enables the recording of performance data collected by sampling, supplemented by tracing functions from selected libraries, without any special preparation of the program.

Report the results. If the pat_run -r option is used to execute the program, a report is generated at the end of the execution. Otherwise, soon after execution completes, use the pat_report(1) utility to generate a text report that displays and interprets the data that was collected.

$ pat_report experiment_data_directory

Afterwards, use the optional Cray Apprentice2 utility (app2(1)) to view the experiment_data_directory and manipulate the report data using GUI tools.
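The basic steps above can be collected into a small script. The following is a minimal sketch, not a definitive recipe: PROGRAM and EXPDIR are hypothetical placeholders, the run helper is introduced here only for illustration, and by default each command is printed rather than executed so the sketch can be inspected on a system without CrayPat. Set DRY_RUN=0 to execute the commands.

```shell
#!/bin/sh
# Dry-run sketch of the basic workflow: each step is printed, not executed,
# unless the caller sets DRY_RUN=0.
# PROGRAM and EXPDIR are hypothetical placeholders; substitute your own.
PROGRAM=./myprogram
EXPDIR=experiment_data_directory

# run: print the command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

run module load perftools-base   # make the tools and man pages available
run srun pat_run "$PROGRAM"      # collect data from the unmodified program
run pat_report "$EXPDIR"         # produce a text report from the data
```

To have the report generated at the end of execution instead, the second step could pass the -r option to pat_run and the third step could be dropped.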

ADVANCED USAGE

As in basic usage, ensure that the perftools-base module is loaded.

Load a programming environment module. This ensures that the correct links and libraries are in place for your choice of compiler.

$ module load PrgEnv-compiler

Load an accelerator module. (Optional) If your system is equipped with GPU accelerators, this ensures that the correct links and libraries are in place for your choice of accelerator.

$ module load craype-accel-GPU

Select an instrumentation module. After the perftools-base module is loaded, use the module avail perftools command to see the list of instrumentation modules that are available.

$ module avail perftools

For a list of instrumentation modules, see the perftools-base(4) man page.

Load an instrumentation module. Loading an instrumentation module ensures that the correct links and libraries are in place for instrumenting your program. For example:

$ module load perftools

If necessary, recompile and relink your program. CrayPat requires access to the object (and archive, if any) files created during compilation. If you did not use your compiler’s option (typically, -c) to preserve all .o files (and .a files, if any) created during compilation, or if you are using the CrayPat API to insert specific instructions into your source code, you must recompile or remake your program after loading an instrumentation module. At a minimum, you should relink your program to ensure that it has access to the CrayPat libraries.

Instrument your program. Use the pat_build utility to insert analytical instructions at key points in the program. If using any of the lite modules, the instrumented copy is saved under the original program name. Otherwise, the instrumented copy is saved under a new name with the extension +pat and the original program remains unchanged.

For example, to instrument a program so that it traces MPI function calls and functions defined in the program source files, enter this command.

$ pat_build -g mpi -u myprogram

This produces the instrumented executable, myprogram+pat.

Note: Special characters in an executable file name, such as the colon (:) or at-sign (@) characters, may prevent an experiment data directory with the default name from being processed correctly. This can be remedied either in the pat_build invocation, by specifying a name for the instrumented program that avoids those characters, or at run-time, by using the PAT_RT_EXPDIR_NAME environment variable to specify a name for the directory that avoids those characters.
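As an illustration of the run-time remedy, the following sketch derives a directory name that avoids the colon and at-sign characters and assigns it to PAT_RT_EXPDIR_NAME. The executable name shown is hypothetical, and replacing the characters with underscores is just one possible choice.

```shell
# Hypothetical executable name containing characters that may cause trouble.
program='solver:v2@site+pat'

# Replace ':' and '@' with '_' to obtain a safe directory name.
safe_name=$(printf '%s' "$program" | tr ':@' '__')

# Ask the runtime to use the safe name for the experiment data directory.
export PAT_RT_EXPDIR_NAME="$safe_name"
echo "PAT_RT_EXPDIR_NAME=$PAT_RT_EXPDIR_NAME"
```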

The pat_build utility is a powerful tool that supports a wide variety of options and arguments. For more information about using pat_build, see the pat_build(1) man page.

Set runtime environment variables. If desired, set environment variables to control the behavior of the instrumented executable during execution. For example, to collect full trace data rather than summarized data, enter this sh(1) command:

$ export PAT_RT_SUMMARY=0

The CrayPat runtime environment variables are described in greater detail elsewhere in this man page.

Execute the instrumented program. During execution, the specified performance analysis data is collected in real-time and written to one or more data files.

By default, data files are written to an experiment_data_directory under the execution directory. This directory must reside on a file system that supports record locking, such as the Lustre file system or a similar high-performance file system. If necessary, set the environment variable PAT_RT_EXPDIR_BASE to point to an existing directory that resides on a high-performance file system.
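A minimal sketch of setting PAT_RT_EXPDIR_BASE follows. SCRATCH_BASE and the craypat_experiments subdirectory are hypothetical names; the fallback to a temporary directory exists only so that the sketch runs anywhere, and a temporary directory is usually not on a file system of the kind described above.

```shell
# SCRATCH_BASE is a hypothetical location. On a real system it should name a
# directory on Lustre or a similar file system that supports record locking.
SCRATCH_BASE=${SCRATCH_BASE:-$(mktemp -d)}
expdir_base="$SCRATCH_BASE/craypat_experiments"

# The variable must point to an existing directory, so create it first.
mkdir -p "$expdir_base"
export PAT_RT_EXPDIR_BASE="$expdir_base"
echo "experiment data will be written under $PAT_RT_EXPDIR_BASE"
```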

The behavior of CrayPat when writing data files is described in more detail in the environment variable PAT_RT_EXPFILE_MAX description.

To execute an instrumented program using the SLURM Workload Manager:

$ srun [srun-options] myprogram+pat

Report the results. The pat_report utility can produce a text report from the data files under the xf-files subdirectory of the experiment_data_directory, provided the instrumented executable (myprogram+pat) is available to provide the mapping from addresses to function names and source line numbers.

This mapping is done only once and preserved in the files under the ap2-files subdirectory. Afterwards, access to the instrumented executable is no longer required. For more information, see the FILES section of this man page and the pat_report(1) man page.

The default report produced by pat_report is tailored to the data that was collected, and may provide all the information needed for performance analysis and tuning, but many options are available to fine-tune the report, expose more details, or provide alternative views. For example, the -v option expands the Notes preceding each table in a report to show how the table was specified and how the data was aggregated and thresholded. For more information, see the pat_report(1) man page.

Generating Reports at Runtime

Optionally, CrayPat can be configured to generate reports automatically at the end of program execution. There are several ways to do this:

  • Use the -r option of pat_run.

  • Use Perftools-lite. Perftools-lite generates reports automatically by default. For more information, see the perftools-lite(4) man page.

  • Use a pat_build option: pat_build -D report=y a.out. For more information, see the pat_build(1) man page.

  • Before executing the instrumented executable, set the report-command argument of the PAT_RT_REPORT_CMD environment variable to $CRAYPAT_ROOT/bin/pat_report and the report-options argument to the desired pat_report options. For more information, see the description of PAT_RT_REPORT_CMD elsewhere in this man page and the pat_report(1) man page.

RUNTIME ENVIRONMENT VARIABLES

The runtime environment variables communicate directly with an executing instrumented executable and affect how the data is collected and saved. Detailed descriptions of each environment variable are provided following the summary.

Runtime Environment Variables Summary

In the case of all toggles, a nonzero value (such as 1) is on (enabled) and 0 is off (disabled).

PAT_RT_ACC_ACC_TO_ACC_BINS

A comma-separated list of data transfer sizes used as the bin boundaries of the accelerator-to-accelerator data transfer histogram.

Default: 16,256,4kb,64kb,1mb,16mb

PAT_RT_ACC_ACTIVITY_BUFFER_SIZE

Specify the size in bytes of the buffer used to collect records for the accelerator time line view.

Default: 1MB

PAT_RT_ACC_ACTIVITY_QUEUE_BUFFER_SIZE

Specify the size in bytes of the buffer used to process accelerator activity records.

Default: 8MB

PAT_RT_ACC_CID_TO_EID_BUFFER_SIZE

Specify the size in bytes of the buffer used to collect ROC-tracer callback records.

Default: 1MB

PAT_RT_ACC_FORCE_SYNC

Toggle: force accelerator synchronization in order to enable collection of accelerator time for asynchronous events.

Default: 1

PAT_RT_ACC_HOST_FROM_ACC_BINS

A comma-separated list of data transfer sizes used as the bin boundaries of the host-from-accelerator data transfer histogram.

Default: 16,256,4kb,64kb,1mb,16mb

PAT_RT_ACC_HOST_TO_ACC_BINS

A comma-separated list of data transfer sizes used as the bin boundaries of the host-to-accelerator data transfer histogram.

Default: 16,256,4kb,64kb,1mb,16mb

PAT_RT_ACC_HOST_TO_HOST_BINS

A comma-separated list of data transfer sizes used as the bin boundaries of the host-to-host data transfer histogram.

Default: 16,256,4kb,64kb,1mb,16mb

PAT_RT_ACC_IGNORE_VERSION

Specify if the major version of CUDA or HIP used to build the CrayPat runtime library and the major version of CUDA or HIP library used by the instrumented program at runtime may differ.

Default: unset

PAT_RT_ACC_MODEL_NAME

Overrides the accelerator model name used at runtime to support performance counter groups. This environment variable can be useful when working with accelerator drivers that fail to fully populate system files.

Default: unset

PAT_RT_ACC_RECORD

Overrides the programming model for which performance data is collected.

Default: If CRAYPAT_OMP_TOOL is enabled, ompt. Otherwise, unset

PAT_RT_BUILD_ENV_IGNORE

Specify embedded runtime environment variables to be ignored.

Default: 1

PAT_RT_CALLSTACK

Specify the depth to which to trace call stacks.

Default: 500

PAT_RT_CALLSTACK_BUFFER_SIZE

Specify the size in bytes of the runtime summary buffer used to collect function call stacks.

Default: 4MB

PAT_RT_CALLSTACK_MODE

Specify how function call stacks are determined at runtime, using either libunwind, stack frames, traced functions, or hybrid.

Default: unwind

PAT_RT_COMMENT

Specify string to insert into experiment data files.

Default: unset

PAT_RT_CONFIG_FILE

Specify configuration file(s) containing runtime environment variables.

Default: unset

PAT_RT_EXIT_AFTER_INIT

Toggle: terminate execution after initialization of the CrayPat runtime library.

Default: 0

PAT_RT_EXPDIR_BASE

Identifies the path name of the directory in which to write the experiment data directory.

Default: the current directory

PAT_RT_EXPDIR_CLEANUP

Toggle: set to 0 to retain the experiment data directory upon a FATAL error.

Default: automatically delete on FATAL error

PAT_RT_EXPDIR_FSLOCK

Specify the type of file record-locking attribute.

Default: the attribute in /etc/mtab

PAT_RT_EXPDIR_NAME

Replace the name portion of the experiment data directory with this string.

Default: unset

PAT_RT_EXPDIR_REPLACE

Toggle: set to a nonzero value to enable overwriting the existing experiment data directory.

Default: 0

PAT_RT_EXPERIMENT

Specify the performance analysis experiment to perform.

Default: Automatic Profiling Analysis

PAT_RT_EXPFILE_FIFO

Toggle: create the data file as a named FIFO pipe instead of a regular file.

Default: 0

PAT_RT_EXPFILE_MAX

Specify the maximum number of data files created.

Default: depends on file system

PAT_RT_EXPFILE_PES

Specify the individual PEs from which to collect and record data.

Default: all PEs

PAT_RT_EXPFILE_THREADS

Specify the individual threads from which to collect data.

Default: all threads

PAT_RT_HEAP_BUFFER_SIZE

Specify the size in bytes of the buffer used to collect dynamic heap information.

Default: 2MB

PAT_RT_LIBLUSTRE_DSO

Specify the file name of the LUSTRE API dynamic shared object used to resolve runtime LUSTRE references in the CrayPat runtime library.

Default: /usr/lib64/liblustreapi.so

PAT_RT_LIBPAPI_DSO

Specify the file name of the PAPI dynamic shared object used to resolve runtime PAPI references in the CrayPat runtime library.

Default: libpapi.so.<papi-version>

PAT_RT_MPI_MSG_BINS

Specify the sizes of the histogram bins used to capture MPI messages.

Default: 16,256,4KB,64KB,1MB,16MB

PAT_RT_MPI_MSG_TRACKING

If set to 0, disable data collection for Cray Apprentice2 mosaic view.

Default: 1 (enabled)

PAT_RT_MPI_MSG_TRACKING_BUFFER_SIZE

Size of the buffer used to collect information for the mosaic view in Cray Apprentice2.

Default: 1MB

PAT_RT_MPI_P2P_MAP

If set to 0, disable the capture of the MPI point-to-point (P2P) map.

Default: 1 (enabled)

PAT_RT_MPI_SYNC

Toggle: measure MPI load imbalance by measuring the time spent in barrier and sync calls before entering the collective.

Default: 1 for tracing experiments, 0 for sampling experiments

PAT_RT_MPI_THREAD_REQUIRED

Specifies the MPI thread-level support.

Default: MPI_THREAD_SERIALIZED

PAT_RT_MSG_FILE

If set, specify the file to which to write messages generated by the runtime library.

Default: unset

PAT_RT_MSG_VERBOSE

If set, specify the PEs from which to accept and record info-level messages.

Default: unset

PAT_RT_OMPT_BUFFER_REQUEST_SIZE

Specify the buffer size used for CrayPat OMPT’s ompt_callback_buffer_request_t callback.

Default: 8MiB

PAT_RT_PARALLEL_MAX

Specifies the maximum number of unique call site entries to collect for OpenMP trace points.

Default: 1024

PAT_RT_PERFCTR

Specify the performance counter events to be counted.

Default: unset

PAT_RT_PERFCTR_DISABLE_COMPONENTS

Specifies the names of the PAPI components disabled.

Default: unset

PAT_RT_PERFCTR_FILE

Specify file(s) containing performance counter event specifications.

Default: unset

PAT_RT_PERFCTR_FILE_GROUP

Specify file(s) containing performance counter group definitions.

Default: unset

PAT_RT_PERFCTR_MPX

Controls multiplexing for CPU events.

Default: 0

PAT_RT_PERFCTR_REQUIRED

State of performance counters.

Default: 0

PAT_RT_PROG_MODELS

Specifies the programming models present in the executable file.

Default: determined by pat_build or pat_run during instrumentation

PAT_RT_PYTHON_DSO

For use with Python tracing and statically linked Python executables. Specifies the path to Python’s shared library interpreter (libpython.so).

Default: unset

PAT_RT_RECORD

Specifies the initial data collection and recording state.

Default: unset

PAT_RT_REGION_CALLSTACK

Specify the maximum stack depth for CrayPat API functions PAT_region_begin and PAT_region_end.

Default: 128

PAT_RT_REGION_MAX

Specify the largest numerical ID that may be used as an argument to CrayPat API functions PAT_region_begin and PAT_region_end.

Default: 100

PAT_RT_REPORT_CLEANUP

Specify if or how the data directory created during execution is removed after the runtime report is generated.

Default: skip

PAT_RT_REPORT_CMD

Specify the executable used to generate the on-completion report and the arguments to be passed to it.

Default: pat_report, none

PAT_RT_REPORT_METHOD

Specify the mechanism used to create the on-completion text report.

Default: team

PAT_RT_SAMPLING_DATA

Collect additional data with every sample, or with every n-th sample.

Default: unset

PAT_RT_SAMPLING_INTERVAL

Specify the sampling interval in microseconds.

Default: 10000

PAT_RT_SAMPLING_INTERVAL_TIMER

Specify the type of POSIX interval timer used for sampling-by-time experiments.

Default: 1

PAT_RT_SAMPLING_MASK

Specifies a bitmask that is AND’d with the PC address acquired during sampling.

Default: 0xffffffffff

PAT_RT_SAMPLING_MODE

Specify the mode (0, raw, or bubble) in which trace-enhanced sampling operates.

Default: 0

PAT_RT_SAMPLING_OVERFLOW

Override the overflow value in a predefined event group for sampling by overflow.

Default: unset

PAT_RT_SAMPLING_SIGNAL

Specify the signal issued when a POSIX interval timer expires or a hardware counter overflows.

Default: 27 (SIGPROF)

PAT_RT_SETUP_SIGNAL_HANDLERS

Toggle: ignore received signals in order to produce a more accurate traceback.

Default: 1

PAT_RT_STACK_SIZE

Specify runtime stack size in bytes.

Default: 64MB

PAT_RT_SUMMARY

Toggle: enable runtime summarization of data.

Default: 1

PAT_RT_THREAD_ALLOW

Specify how created threads are monitored and recorded.

Default: 1 (enabled)

PAT_RT_THREAD_CANCEL_NTRIES

Specify how long to wait for created threads to terminate.

Default: 120 (30 seconds)

PAT_RT_THREAD_MAX

Specify the maximum number of threads that can be created and recorded.

Default: 1,000,000

PAT_RT_TRACE_API

Toggle: enable recording of data generated by CrayPat API functions.

Default: 1

PAT_RT_TRACE_DEPTH

Specify the maximum depth of the runtime callstack.

Default: 512

PAT_RT_TRACE_HEAP

Toggle: collect dynamic heap information.

Default: 1

PAT_RT_TRACE_HOOKS

Toggle: enable/disable recording trace data for specified types of compiler-generated hooks.

Default: depends on PAT_RT_SUMMARY

PAT_RT_TRACE_NARGS

Specify the number of function argument values to record.

Default: 0

PAT_RT_TRACE_OVERHEAD

Specify the number of times calling overhead is sampled during program initialization and termination.

Default: 100

PAT_RT_TRACE_PYTHON_GROUPS

Specify a comma-separated list of Python trace groups to trace.

Default: unset

PAT_RT_TRACE_PYTHON_MODULES

A comma-separated list of Python modules to trace, where each element is of the form “package.module” (e.g. sound.effects.echo). All callables defined in the module are traced.

Default: unset

PAT_RT_WRITE_BUFFER_SIZE

Size of single thread data collection buffer in bytes.

Default: 8MB

Runtime Environment Variables Detail

PAT_RT_ACC_ACC_TO_ACC_BINS

A comma-separated list of data transfer sizes used as the bin boundaries of the accelerator-to-accelerator data transfer histogram, a histogram of data transfer frequencies by size.

Default: 16,256,4kb,64kb,1mb,16mb

PAT_RT_ACC_ACTIVITY_BUFFER_SIZE

Specifies the size in bytes of the buffer used to collect records for the accelerator time line view in Cray Apprentice2. Size is not case-sensitive and can be specified in kilobytes (KB), megabytes (MB), or gigabytes (GB).

Default: 1MB

PAT_RT_ACC_ACTIVITY_QUEUE_BUFFER_SIZE

Specify the size in bytes of the buffer used to process accelerator activity records. Size is not case-sensitive and can be specified in kilobytes (KB), megabytes (MB), or gigabytes (GB).

Default: 8MB

PAT_RT_ACC_CID_TO_EID_BUFFER_SIZE

Specify the size in bytes of the buffer used to collect ROC-tracer callback records. Size is not case-sensitive and can be specified in kilobytes (KB), megabytes (MB), or gigabytes (GB).

Default: 1MB

PAT_RT_ACC_FORCE_SYNC

Forces accelerator synchronization in order to enable collection of accelerator time for asynchronous events (see the WARNINGS section below if profiling CUDA codes).

Default: 1 (enabled)

PAT_RT_ACC_HOST_FROM_ACC_BINS

A comma-separated list of data transfer sizes used as the bin boundaries of the host-from-accelerator data transfer histogram, a histogram of data transfer frequencies by size.

Default: 16,256,4kb,64kb,1mb,16mb

PAT_RT_ACC_HOST_TO_ACC_BINS

A comma-separated list of data transfer sizes used as the bin boundaries of the host-to-accelerator data transfer histogram, a histogram of data transfer frequencies by size.

Default: 16,256,4kb,64kb,1mb,16mb

PAT_RT_ACC_HOST_TO_HOST_BINS

A comma-separated list of data transfer sizes used as the bin boundaries of the host-to-host data transfer histogram, a histogram of data transfer frequencies by size.

Default: 16,256,4kb,64kb,1mb,16mb

PAT_RT_ACC_IGNORE_VERSION

Specify if the major version of CUDA or HIP used to build the CrayPat runtime library and the major version of CUDA or HIP library used by the instrumented program at runtime may differ.

Default: unset

PAT_RT_ACC_RECORD

Overrides the programming model for which accelerator performance data is collected. The valid values are:

off

Disables collection of accelerator performance data.

cce

Collect performance data for applications compiled with the CCE and using OpenACC directives.

cuda

Collect performance data for CUDA applications.

hip

Collect performance data for HIP applications.

ompt

Collect performance data for OpenMP application target offload code. Requires CRAYPAT_OMP_TOOL be enabled to take effect.

Default: If CRAYPAT_OMP_TOOL is enabled, ompt. Otherwise, unset

PAT_RT_BUILD_ENV_IGNORE

Specifies a comma-separated list of the runtime environment variables values embedded in the instrumented executable to ignore. A substring that matches any part of an embedded environment variable name is allowed: for example, specifying PAT_RT_ will ignore all embedded runtime environment variables.

The default is to use the values of all environment variables that have been embedded in the instrumented executable using the pat_build rtenv directive. If the same environment variable is also set in the execution environment, the value specified at runtime takes precedence and overrides the value embedded in the instrumented executable.

For more information about the rtenv directive, see the pat_build man page.

Default: unset (use all embedded environment variables)
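The substring matching rule can be illustrated with a short sketch. The is_ignored function below is hypothetical and only mimics the documented rule for a few sample names; it is not how the CrayPat runtime is implemented.

```shell
# Ignore embedded values whose names contain either substring.
export PAT_RT_BUILD_ENV_IGNORE=PAT_RT_SAMPLING,PAT_RT_PERFCTR

# is_ignored NAME: succeed if any listed substring occurs in NAME.
# Illustration of the documented matching rule only.
is_ignored() {
    old_ifs=$IFS
    IFS=,
    for pattern in $PAT_RT_BUILD_ENV_IGNORE; do
        case $1 in
            *"$pattern"*) IFS=$old_ifs; return 0 ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

# Report how each sample embedded variable would be treated.
for name in PAT_RT_SAMPLING_INTERVAL PAT_RT_PERFCTR_FILE PAT_RT_SUMMARY; do
    if is_ignored "$name"; then
        echo "$name: embedded value ignored"
    else
        echo "$name: embedded value used"
    fi
done
```

With this setting, an embedded PAT_RT_SUMMARY value would still be used, because neither substring occurs in its name.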

PAT_RT_CALLSTACK

Specifies the depth to which to trace the call stack for a given function when sampling or tracing. For example, if set to 1, only the caller of the function is recorded. If set to 0, no call stack is recorded.

Default: 500 or to the main function, whichever is less

PAT_RT_CALLSTACK_BUFFER_SIZE

Specifies the size in bytes per thread of the runtime summary buffer used to collect function call stacks. Size is not case-sensitive and can be specified in kilobytes (KB), megabytes (MB), or gigabytes (GB).

Default: 4MB

PAT_RT_CALLSTACK_MODE

Valid values are unwind, frames, trace, and hybrid, where unwind specifies using libunwind’s unw_backtrace2, frames specifies using runtime stack frames, and trace specifies using callers that were traced. hybrid uses frame pointer walking to unwind the call stack, supplementing with information from perftools’s trace stack when needed. Valid frame checks are attempted and on failure may short circuit call stack unwinding. In certain circumstances, libunwind may be used instead.

Default: unwind

PAT_RT_COMMENT

Specifies an arbitrary string that is inserted into the experiment data file. The string is included in the report analysis done by pat_report(1).

Default: unset

PAT_RT_CONFIG_FILE

Specifies one or more configuration files that contain environment variables. Multiple file names are separated with the comma (,) character. Lines in the file that begin with the # character are interpreted as comments and ignored. If the file name specified begins with a question mark (?) character, and the file does not exist or is otherwise inaccessible, no fatal error is generated.

Environment variables are of the form defined by sh(1): name=value

After all files specified by the environment variable PAT_RT_CONFIG_FILE are processed, if the file $HOME/.craypatrc exists, its contents are processed.

The environment variables appear in the file(s) one per line. Each subsequent environment variable name replaces the value of the previous one with the same name. Typically, the environment variable PAT_RT_CONFIG_FILE is used by site administrators to define default system-wide CrayPat runtime environment variables. Users should exercise caution when changing PAT_RT_CONFIG_FILE or adding additional configuration files to it.

Default: unset
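The following sketch shows one way to prepare a configuration file and reference it together with an optional second file. The file names and the values in the file are illustrative assumptions, and a temporary directory is used so that no real settings are touched.

```shell
# Work in a temporary directory so the sketch does not touch real settings.
cfg_dir=$(mktemp -d)

# site.cfg: one name=value per line; lines beginning with '#' are comments.
cat > "$cfg_dir/site.cfg" <<'EOF'
# Site-wide defaults (illustrative values only)
PAT_RT_SUMMARY=1
PAT_RT_CALLSTACK=100
EOF

# The second file is optional: the leading '?' means that a missing file is
# not treated as a fatal error. user.cfg is deliberately not created here.
export PAT_RT_CONFIG_FILE="$cfg_dir/site.cfg,?$cfg_dir/user.cfg"
echo "PAT_RT_CONFIG_FILE=$PAT_RT_CONFIG_FILE"
```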

PAT_RT_EXIT_AFTER_INIT

If nonzero, terminate execution after the initialization of the CrayPat runtime library is complete.

Default: 0

PAT_RT_EXPDIR_BASE

Identifies the path name of the directory in which to write the experiment data directory. For distributed memory applications, if PAT_RT_EXPFILE_MAX is not set to 0, the experiment data directory must reside on a file system that supports record locking, such as Lustre.

Default: the current working directory

PAT_RT_EXPDIR_CLEANUP

Remove the experiment data directory upon a FATAL error. Set this environment variable to 0 to retain the experiment data directory in the case of a FATAL error.

Default: automatically delete on FATAL error

PAT_RT_EXPDIR_FSLOCK

Specifies the type of file record-locking attribute to assign to the file system upon which the experiment data directory is created. This overrides any file record-locking attribute that may be assigned to the file system. The valid values are:

0 or none

No file record-locking is supported.

1 or all

File record-locking is supported across all compute nodes and within the node itself.

local

File record-locking is supported only within the node.

global

File record-locking is supported across all compute nodes.

Default: the attribute assigned to the file system as found in the /etc/mtab file at the time the instrumented executable is executed. For more information, see the mount(8) man page.

PAT_RT_EXPDIR_NAME

Specify a string to define the name of the experiment data directory. By default, the process ID, node ID, and experiment type qualifiers are added to the name of the instrumented executable to define the name of the experiment data directory. Note that a hash of the node name may be used as the node ID, if needed to uniquely identify the node within the system.

Default: unset

A variety of type specifications are also recognized:

%A

name of instrumented executable

%C

target CPU

%E

programming environment

%J

WLM job id

%N

node ID (or a hash of the node name) for node executing PE 0

%P

process ID executing on PE 0

%Q

equivalent to %P-%N%X

%R

number of MPI ranks

%S

date-timestamp

%T

number of OpenMP threads

%X

experiment type

The default using type specifications is %A+%P-%N%X.
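As a brief sketch, the type specifications can be combined into a custom name. The particular combination below is only an example, and it assumes that literal separator characters may be mixed with type specifications in the same way as in the default pattern.

```shell
# %A = name of instrumented executable, %J = WLM job id, %S = date-timestamp.
# Single quotes keep the string literal so that the CrayPat runtime, not the
# shell, interprets the type specifications.
export PAT_RT_EXPDIR_NAME='%A+%J-%S'
echo "PAT_RT_EXPDIR_NAME=$PAT_RT_EXPDIR_NAME"
```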

PAT_RT_EXPDIR_REPLACE

If set to a nonzero value, enable overwriting of the experiment data directory, if it exists. All data in the original experiment data directory is lost.

Default: 0

PAT_RT_EXPERIMENT

Identifies the experiment to perform.

Default: If the pat_build -S option is not specified and none of the tracing options (-w, -u, -T, -t, or -g) are specified, pat_build instruments the executable for Automatic Profiling Analysis by default, as if the -O apa option had been specified. In this case, the PAT_RT_EXPERIMENT value does not override the default at runtime.

If the pat_build -S option is used to specify a sampling experiment, you can then use PAT_RT_EXPERIMENT to specify the sampling experiment to perform.

If one of the tracing options (-w, -u, -T, -t, or -g) is used with pat_build, you can use PAT_RT_EXPERIMENT to specify the tracing experiment to perform.

If a program is instrumented for tracing and you then use PAT_RT_EXPERIMENT to specify a sampling experiment, trace-enhanced sampling is performed, subject to the rules established by the environment variable PAT_RT_SAMPLING_MODE setting.

Depending on the options you select, it is possible to generate extremely large data files. For more information, see the section “Controlling the Size of Experiment Data Files,” below.

The valid experiments are:

samp_pc_time

Samples the program counter at a given time interval. This returns the total program time and the absolute and relative times each program counter was recorded. The default interval is 10,000 microseconds.

The default POSIX interval timer measures monotonic wall-clock time. To select a different timer, set the runtime environment variable PAT_RT_SAMPLING_INTERVAL_TIMER.

samp_pc_ovfl

Samples the program counter at a given overflow of a hardware performance counter. The hardware counter and its overflow value are separated by the @ symbol and specified in a comma-separated list in the environment variable PAT_RT_PERFCTR, i.e., event-name@overflow-value.

samp_cs_time

Samples the call stack at a given time interval. This returns the total program time and the absolute and relative times each call stack was recorded, and is otherwise identical to the samp_pc_time experiment.

samp_cs_ovfl

Samples the call stack at a given overflow of a hardware performance counter. This experiment is otherwise identical to the samp_pc_ovfl experiment.

trace

In a tracing experiment, each selected function that is executed produces a data record in the runtime experiment data file.

The functions to be traced are defined by the pat_build -g, -u, -t, -T, or -w options specified when instrumenting the program. For more information about instrumenting programs for tracing experiments, see the pat_build(1) man page.

Note: Only true function calls can be traced. Function calls that are inlined by the compiler or that have local scope in a compilation unit cannot be traced.

Tracing experiments are also affected by the settings of other environment variables, all of which have names beginning with PAT_RT_TRACE_. These environment variables are described elsewhere in this man page.
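
As an illustration of selecting an experiment at run time, the following sketch chooses call-stack sampling by time for a program that was instrumented with pat_build -S. The executable name and launch command shown in the comment are hypothetical.

```shell
# Select call-stack sampling by time for an executable instrumented with pat_build -S.
export PAT_RT_EXPERIMENT=samp_cs_time

# Launch the instrumented program as usual (hypothetical name and launcher):
#   srun -n 64 ./my_app+pat
echo "experiment: $PAT_RT_EXPERIMENT"
```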

PAT_RT_EXPFILE_FIFO

If nonzero, the experiment data file is created as a named FIFO pipe instead of a regular file. The instrumented executable blocks until the user executes another program that opens the pipe for reading. For more information, see the mkfifo(3) man page.

Default: 0
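
The blocking behavior of a named FIFO can be illustrated with ordinary shell tools, independent of CrayPat. In this sketch a background writer stands in for the instrumented executable and cat stands in for the reading program; the path names are arbitrary.

```shell
# Generic FIFO demonstration; not CrayPat-specific.
dir=$(mktemp -d)
mkfifo "$dir/demo.fifo"

# The writer blocks when opening the FIFO until some process opens it for reading.
( echo "experiment data" > "$dir/demo.fifo" ) &

# Opening the FIFO for reading releases the writer.
received=$(cat "$dir/demo.fifo")
wait
echo "$received"
rm -r "$dir"
```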

PAT_RT_EXPFILE_MAX

The maximum number of experiment data files created.

If no value is specified for this environment variable, the default value is the number of compute nodes, with all PEs on a given node writing to one file associated with that node.

If the file system does not support locking, the default value is the number of PEs (NPES).

If a value of M is specified for this environment variable:

  • If M >= NPES or M < 0, then NPES data files are created.

  • If M == 0, then one data file per node is created.

Otherwise, M files will be created, provided that the file system supports the type of locking required for that number of files. (See PAT_RT_EXPDIR_FSLOCK.)

PAT_RT_EXPFILE_PES

Records data and writes the recorded data to its respective data file only for the specified PEs. If set to *, values from every PE are recorded.

Default: * (all PEs)

If not using the default, the PEs to be recorded are specified in a comma-separated list, with each specification represented as one of the following:

n

Value n.

m-n

Values m through n, inclusive.

*%p

Every pth value of all values.

m-n%p

Every pth value from m through n, inclusive.

For example, the following values are all valid specifications:

0,4,5,10

Record PEs 0, 4, 5, and 10

0-15%4

Record PEs 0, 4, 8, and 12

4-31%8

Record PEs 4, 12, 20, and 28
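
To make the specification syntax concrete, the following sketch expands a single specification into the list of values it selects. The expand_spec function is a hypothetical helper written for this illustration and is not part of CrayPat. It assumes that values are numbered from 0 and that *%p starts at 0. It handles one specification at a time, so a comma-separated list would be expanded element by element.

```shell
# expand_spec SPEC COUNT
# Print the values selected by one specification (n, m-n, *%p, or m-n%p)
# out of the values 0 .. COUNT-1. Illustrative helper only; not part of CrayPat.
expand_spec() {
    spec=$1 count=$2 step=1
    # Split off an optional %p stride.
    case $spec in
        *%*) step=${spec##*%}; spec=${spec%%\%*} ;;
    esac
    # Determine the first and last value of the range.
    case $spec in
        '*') first=0; last=$((count - 1)) ;;
        *-*) first=${spec%-*}; last=${spec#*-} ;;
        *)   first=$spec; last=$spec ;;
    esac
    out=""
    i=$first
    while [ "$i" -le "$last" ]; do
        out="$out${out:+ }$i"
        i=$((i + step))
    done
    echo "$out"
}

expand_spec '0-15%4' 16   # prints: 0 4 8 12
expand_spec '4-31%8' 32   # prints: 4 12 20 28
```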

PAT_RT_EXPFILE_THREADS

Records data only for the specified threads. If set to *, values from every thread are recorded.

Default: * (all threads)

If not using the default, the threads to be recorded are specified in a comma-separated list, with each specification represented as one of the following:

n

Value n.

m-n

Values m through n, inclusive.

*%p

Every pth value of all values.

m-n%p

Every pth value from m through n, inclusive.

For example, the following values are all valid specifications:

0,2

Record threads 0 and 2.

0-7%2

Record threads 0, 2, 4, and 6.

PAT_RT_HEAP_BUFFER_SIZE

Specifies the size in bytes of the runtime summary buffer used to collect dynamic heap information. This environment variable affects tracing experiments only.

Default: 2MB

PAT_RT_LIBLUSTRE_DSO

Specify the file name of the LUSTRE API dynamic shared object used to resolve runtime LUSTRE references in the CrayPat runtime library. If the file name does not start with a slash ‘/’, the directories listed by the environment variable LD_LIBRARY_PATH are searched. If the file name cannot be accessed, execution continues but collection of any LUSTRE information is disabled.

Default: /usr/lib64/liblustreapi.so

PAT_RT_LIBPAPI_DSO

Specify the file name of the PAPI dynamic shared object used to resolve runtime PAPI references in the CrayPat runtime library. If the file name does not start with a slash ‘/’, the directories listed by the environment variable LD_LIBRARY_PATH are searched. If the file name cannot be accessed, execution is determined by the environment variable PAT_RT_PERFCTR_REQUIRED.

Default: libpapi.so.<papi-version>

PAT_RT_MPI_MSG_BINS

Specifies the size boundaries of the histogram bins used to capture MPI messages sent between ranks. The specification is a comma-separated list of values. The maximum number of values indicating each bin size is 30. Zero and infinity are implied.

Note: This environment variable affects data collection only when in runtime summary mode.

Default: 16,256,4KB,64KB,1MB,16MB

PAT_RT_MPI_MSG_TRACKING

If set to 0, data collection for the mosaic view in Cray Apprentice2 is disabled.

Default: 1 (enabled)

PAT_RT_MPI_MSG_TRACKING_BUFFER_SIZE

Specifies the size, in bytes, of the buffer used to collect MPI message passing information for the mosaic view in Cray Apprentice2.

Default: 1MB

PAT_RT_MPI_P2P_MAP

If set to 0, disable the capture of the MPI point-to-point (P2P) map. When enabled, each pair of source rank and callstack is mapped to its destination ranks. Requires PAT_RT_MPI_MSG_TRACKING be enabled.

Default: 1 (enabled)

PAT_RT_MPI_SYNC

Measure load imbalance in programs instrumented to trace MPI functions. If set to 1, this causes the trace wrapper for each collective subroutine to measure the time for a barrier call prior to entering the collective. This time is reported by pat_report(1) in the function group MPI_SYNC, which is separate from the MPI function group.

If PAT_RT_MPI_SYNC is set, the time spent waiting at a barrier and synchronizing processes is reported under MPI_SYNC, while the time spent executing after the barrier is reported under MPI.

To disable measuring MPI barrier and sync times, set this environment variable to 0. This environment variable affects tracing experiments only.

Default: 1 (enabled)

PAT_RT_MPI_THREAD_REQUIRED

Specifies the MPI thread-level support for the instrumented executable to use. If the instrumented executable calls MPI_Init_thread, this environment variable should be set to the required MPI thread-level support supplied to the MPI_Init_thread function. Valid values are MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, and MPI_THREAD_MULTIPLE. Optionally, the leading MPI_THREAD_ may be omitted. For more information, see the MPI_Init_thread(3) man page.

Default: MPI_THREAD_SERIALIZED

PAT_RT_MSG_FILE

Writes messages generated by the runtime library to the specified file. By default, all runtime library messages are written to standard error. If the specified file cannot be created, the messages are written to standard error. If the first character of the specified file name is an exclamation point (!) and the file already exists, the contents of the file are truncated. The specified file should be created on a file system that supports global file locking to reduce the chance of messages from different ranks being interleaved.

Default: unset

PAT_RT_MSG_VERBOSE

If set, specify the PEs which issue info-level messages. If set to *, messages from every PE are issued.

Default: unset

Alternatively, the PEs that issue messages can be specified in a comma-delimited list, with each specification represented as one of the following:

n

Value n.

m-n

Values m through n, inclusive.

*%p

Every pth value of all values.

m-n%p

Every pth value from m through n, inclusive.

PAT_RT_OMPT_BUFFER_REQUEST_SIZE

Specify the buffer size used for CrayPat OMPT’s ompt_callback_buffer_request_t callback.

Default: 8MiB

PAT_RT_PARALLEL_MAX

Specifies the maximum number of unique call site entries to collect for any OpenMP trace points generated by the CCE compiler when the OpenMP programming model is used.

See the pat_build(1) man page for more information about compiler-generated trace points.

Default: 1024

PAT_RT_PERFCTR

Specifies the performance counter events to be monitored during the execution of a program instrumented for tracing experiments.

Counter events are specified in a comma-separated list. Event names and groups from any and all components may be mixed as needed; the tool is able to parse the list and determine which event names or group numbers apply to which components. To list the names of the individual events on your system, use the papi_avail(1) and papi_native_avail(1) utilities.

Depending on the counter selected, individual counter events can be specified in one of several ways:

  • use the performance counter event name, as given by papi_avail or papi_native_avail

  • use the performance counter event name followed by the @ symbol and a value, to indicate a non-default overflow value used by the sampling-by-overflow experiments

  • Additionally, if the event name is surrounded by parentheses and the event name is determined to be invalid, no WARNING message is issued and the name is ignored.

See the accpc(5), cray_cassini(5), cray_pm(5), cray_rapl(5), hwpc(5), nwpc(5), and uncore(5) man pages for general descriptions of the various performance counters. For complete lists of the hardware counter events currently supported organized by processor family, execute the pat_help utility and select the counters topic.

Note: To properly gain access to some of the performance counter domains, you must specify additional information to the workload manager. See the cray_cassini(5), cray_pm(5), cray_rapl(5), nwpc(5), and uncore(5) man pages for instructions on how to request access to these respective performance counter domains.

In addition, this environment variable supports the use of keywords. The keywords currently recognized are:

  • domain:u — specify that hardware counters are active in the user’s domain

  • domain:k — specify that hardware counters are active in the kernel (OS) domain

  • domain:x — specify that hardware counters are active in the exception domain

  • mpx, mpx:1, mpx:on — enable multiplexing for CPU events

  • mpx:0, mpx:off — disable multiplexing for CPU events

  • mpx:auto — enable multiplexing only if required to count all CPU events

  • mpx:trim — avoid multiplexing by eliminating excess CPU events

  • devx:1, devx:on - enable expanding Cassini events across NIC devices

  • devx:0, devx:off - disable expanding Cassini events across NIC devices

PAT_RT_PERFCTR cannot be set to the name of a derived metric, only to the names of counter events or groups of counter events. A derived metric appears in the report if all of the events required for that metric are collected.

Default: unset
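
As a minimal sketch of an event list, the following setting requests three PAPI preset events. Whether these events are available depends on the processor, so treat the names as assumptions and check them with papi_avail first.

```shell
# Count total instructions and total cycles. PAPI_L2_DCM is wrapped in
# parentheses so that it is silently ignored if it is not valid on this system.
export PAT_RT_PERFCTR='PAPI_TOT_INS,PAPI_TOT_CYC,(PAPI_L2_DCM)'
echo "$PAT_RT_PERFCTR"

# For a sampling-by-overflow experiment, an overflow value follows the @ symbol
# (the value here is arbitrary):
#   export PAT_RT_PERFCTR='PAPI_TOT_CYC@1000000'
```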

PAT_RT_PERFCTR_DISABLE_COMPONENTS

Specifies, in a comma-separated list, the names of the PAPI components to be disabled. Use the papi_component_avail utility to list the names of the available components. When a PAPI component is disabled, event names defined in the component are not recognized.

Default: unset

PAT_RT_PERFCTR_FILE

Specifies, in a comma-separated list, the names of one or more files that contain performance counter specifications. Within the files, lines beginning with the # character are interpreted as comments and ignored. See PAT_RT_PERFCTR for a description of an event specification.

Default: unset

PAT_RT_PERFCTR_FILE_GROUP

Specifies, in a comma-separated list, the names of one or more files that contain performance counter group definitions. A group definition consists of at least one valid performance counter event. Use the papi_avail and papi_native_avail utilities to determine the names of valid events.

The format of the file is: group-name=event1,…

The definition of the group is terminated with a <newline> character. There may be multiple unique group names defined in a single file. Lines that do not match this syntax are ignored.

If the first file name in the list is the character 0 (zero), the default counter groups are not loaded and therefore are not available for selection using PAT_RT_PERFCTR.

The file containing the group definitions for the default groups is in $CRAYPAT_ROOT/share/counters/.

Default: unset
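
The following sketch writes a group definition file in the group-name=event1,… format and then selects one of the groups. The group names and the temporary file are hypothetical, and the PAPI preset events are assumed to be available on the target processor.

```shell
# Write a group definition file, one group per line (hypothetical group names).
groups_file=$(mktemp)
cat > "$groups_file" <<'EOF'
my_cycles=PAPI_TOT_CYC,PAPI_TOT_INS
my_cache=PAPI_L1_DCM,PAPI_L2_DCM
EOF

# Make the groups available, then select one of them by name.
export PAT_RT_PERFCTR_FILE_GROUP=$groups_file
export PAT_RT_PERFCTR=my_cycles
```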

PAT_RT_PERFCTR_MPX

Controls multiplexing for CPU events. By default multiplexing is disabled.

The valid options are:

0

Disable multiplexing

1

Enable multiplexing

auto

Enable multiplexing only if it is required to count every event

trim

Avoid multiplexing by removing excess events

Default: 0

PAT_RT_PERFCTR_REQUIRED

Specifies the state of performance counters at runtime. By default, if underlying conditions prevent access to performance counters, the instrumented executable still executes but no performance counter information is collected. If this environment variable is set to a value greater than zero, the instrumented executable terminates immediately. If it is set to a value less than zero, performance counter access is disabled and ignored.

Default: 0

PAT_RT_PROG_MODELS

Specifies the programming models present in the executable file. The pat_build and pat_run utilities use heuristics to determine the programming models, such as MPI and OpenMP, present in the file targeted for instrumentation. In some cases, such as when the external symbols identifying a programming model are loaded at runtime using dlopen and dlsym library calls, the programming models cannot be readily identified. This may result in invalid runtime behavior by the instrumented program. Use this environment variable to more accurately represent the programming models inherent in the executable file. One or more of the following values separated by commas may be specified: mpi, shmem, thread, omp, upc, caf, pgas, acc, cuda, and hip. If the list starts with a comma (,), a vertical bar (|), or a plus sign (+), the values that follow are interpreted relative to the programming models determined by pat_build and pat_run.

Default: determined by pat_build or pat_run during instrumentation

PAT_RT_PYTHON_DSO

For use with Python tracing and statically linked Python executables. Specifies the path to Python’s shared library interpreter (libpython3.X.so.Y.Z). When unset, pat_run sets this appropriately if used in conjunction with Python in an Anaconda environment. Set PAT_RT_PYTHON_DSO to 0 to disable this behavior. Statically linked Python executables do not contain a “libpython” filename when the following command is executed:

$ ldd $(which python3)

Default: unset

PAT_RT_RECORD

Specifies the initial data collection and recording state for the instrumented executable. If set to zero, no performance data is collected or recorded when the program starts execution. Use the PAT_record API call to turn on data collection and recording; see the pat_build(1) man page for more information.

Default: unset

PAT_RT_REGION_CALLSTACK

Specifies the depth of the stack for which the CrayPat API functions PAT_region_begin and PAT_region_end are maintained. In other words, it is the maximum number of consecutive PAT_region_begin references that can be made without an intervening PAT_region_end. Setting this environment variable to zero (0) disables data collection for all regions. This environment variable affects tracing experiments only.

Default: 128

PAT_RT_REGION_MAX

Specifies the largest numerical ID that may be used as an argument to the CrayPat API functions PAT_region_begin and PAT_region_end. Values greater than this cause the API function to be ignored. Setting this environment variable to zero (0) disables data collection for all regions. This environment variable affects tracing experiments only.

Default: 100

PAT_RT_REPORT_CLEANUP

If the report directive is set to y in pat_build when the program is instrumented, a textual report is written to stdout when the instrumented executable successfully completes execution. This environment variable specifies how the data directory created during execution is removed after the report is produced. The valid values are skip, 0, force, and 1, where skip or 0 do not remove the data directory and force or 1 remove the data directory if the report generation was successful.

Default: skip

PAT_RT_REPORT_CMD

This environment variable supports two or more comma-separated arguments, report-command and report-options, which can be used to specify the pathname of the executable file that produces the text report and then a comma-separated list of one or more report options to be passed to pat_report.

If only report-command is set, a default text report is produced when the program terminates successfully. If report-options are also included, you can control the content and format of the resulting report. The valid report-options are listed in the pat_report(1) man page.

Defaults:

  • report-command — $CRAYPAT_ROOT/bin/pat_report

  • report-options — none

PAT_RT_REPORT_METHOD

If the report directive is set to y in pat_build when the program is instrumented, a textual report is written to stdout when the instrumented executable successfully completes execution. This environment variable defines the mechanism used to create the text report. Valid values are 0, skip, ignore, pe, and team. The pe argument uses one PE to control all aspects of report generation. The team argument uses all PEs to share control of all aspects of report generation. Use ignore for perftools-lite experiments, if report generation causes an unacceptable increase in application runtime and you would prefer to use pat_report on the data directory on the login node to generate the report after the job execution is completed.

To disable report generation, set this environment variable to 0.

Default: team

PAT_RT_SAMPLING_DATA

The valid options are:

cray_pm

Cray Power Management counters

cray_rapl

RAPL energy counters

hbm

Program locations responsible for high memory bandwidth.

heap

heap (see mallinfo(3))

memory

current memory state

perfctr

selected performance counters as specified by PAT_RT_PERFCTR and PAT_RT_PERFCTR_FILE

rusage

resource usage (see getrusage(2))

sheap

shared heap for programs that use DMAPP

By default, if this environment variable is set, additional perfctr data is collected once for every sampled program counter address, and any other type of data once for every 100 samples. Alternatively, an option may be followed by @ratio to indicate the frequency at which the data is to be collected. For example, if ratio is 1, the additional data requested is collected each time the program counter is sampled. If ratio is 1000, the additional data requested is collected once every 1000 program counter samples.

Default: not set
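
For instance, assuming a sampling experiment is being run, the following sketch requests resource usage data at a ratio of 1000 rather than the default of once every 100 samples.

```shell
# Collect resource usage data once every 1000 program counter samples.
export PAT_RT_SAMPLING_DATA='rusage@1000'
echo "option: ${PAT_RT_SAMPLING_DATA%@*}, ratio: ${PAT_RT_SAMPLING_DATA#*@}"
```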

PAT_RT_SAMPLING_INTERVAL

Specifies the interval, in microseconds, at which the instrumented executable is sampled.

To specify a random interval, use the following format:

lower-bound,upper-bound[,seed]

After a sample is captured, the interval used for the next sampling interval is generated using rand(3) and will be between lower-bound and upper-bound. The initial seed (seed) for the sequence of random numbers is optional. See srand(3) for more information.

This environment variable affects sampling experiments. It can also be used to control trace-enhanced sampling experiments, provided the program is instrumented for tracing but the environment variable PAT_RT_EXPERIMENT is used to specify a sampling-type experiment, and subject to the environment variable PAT_RT_SAMPLING_MODE setting.

Default: 10000 (microseconds)
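
The two forms of the interval can be sketched as follows. The bounds and seed are arbitrary values chosen for illustration.

```shell
# Fixed interval: sample every 50000 microseconds instead of the default 10000.
export PAT_RT_SAMPLING_INTERVAL=50000

# Random interval: each interval lies between 5000 and 15000 microseconds,
# with 42 as the optional seed. This assignment replaces the one above.
export PAT_RT_SAMPLING_INTERVAL=5000,15000,42
```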

PAT_RT_SAMPLING_INTERVAL_TIMER

Specifies the type of POSIX interval timer used for sampling-by-time experiments. The following values are valid:

0

wall-clock (real) time

1

wall-clock (real) time guaranteed to be monotonic

This environment variable affects sampling experiments. It can also be used to control trace-enhanced sampling experiments, provided the program is instrumented for tracing but the environment variable PAT_RT_EXPERIMENT is used to specify a sampling-type experiment, and subject to the environment variable PAT_RT_SAMPLING_MODE setting. See the timer_create(2) man page for more information.

Default: 1

PAT_RT_SAMPLING_MASK

Specifies a bitmask that is AND’d with the PC address acquired during a sampling experiment. This can reduce the number of unique addresses collected. The default value is 0xffffffffff and is specified in hexadecimal notation.
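
The effect of the mask can be reproduced with shell arithmetic. In this sketch the mask clears the low four bits, so every address within the same 16-byte block collapses to one value. The addresses are made up, and the 0x prefix in the variable's value is an assumption based on the default shown above.

```shell
# Clear the low four bits so that all addresses in a 16-byte block map to one value.
mask=0xfffffffff0
for pc in 0x7f12345671 0x7f12345678 0x7f1234567c; do
    printf '0x%x\n' $(( $pc & $mask ))   # each line prints 0x7f12345670
done
export PAT_RT_SAMPLING_MASK=$mask
```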

PAT_RT_SAMPLING_MODE

Specifies the mode in which trace-enhanced sampling operates. Trace-enhanced sampling allows a sampling experiment to be executed on a program instrumented for tracing. It affects both user-defined functions and predefined function groups. The value for mode may be one of the following.

0, ignore

Ignore trace-enhanced sampling. The normal tracing experiment is performed.

1, raw

Enable raw sampling. Any traced functions present in the instrumented executable are ignored.

3, bubble

Enable bubble sampling. Traced functions and any functions they call return a sample PC address mapped to the traced function.

When set to a non-zero value, all sampling experiments and parameters that control sampling apply to the executing instrumented executable. Tracing records are not produced.

The value that indicates the mode may be followed by a comma-separated value that indicates the depth of the call stack trace performed during bubble sampling. It is ignored for other modes of sampling. The default depth is 2.
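
As a sketch, the following settings run a sampling experiment on a program that was instrumented for tracing, using bubble sampling with a call stack depth of 4. The depth value is arbitrary.

```shell
# The program was instrumented for tracing; request a sampling experiment instead.
export PAT_RT_EXPERIMENT=samp_pc_time
# Mode 3 selects bubble sampling; the value after the comma is the call stack depth.
export PAT_RT_SAMPLING_MODE=3,4
```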

PAT_RT_SAMPLING_OVERFLOW

Specifies the overflow value for a sampling-by-overflow experiment. When the environment variable PAT_RT_PERFCTR specifies a predefined event group (such as hbm) that includes a default overflow value, PAT_RT_SAMPLING_OVERFLOW can be used to override the default. Similarly, it can be used to override an overflow value specified in a file named by PAT_RT_PERFCTR_FILE.

PAT_RT_SAMPLING_SIGNAL

Specifies the signal that is issued when a POSIX interval timer expires or a hardware performance counter overflows.

This environment variable affects sampling experiments. It can also be used to control trace-enhanced sampling experiments, provided the program is instrumented for tracing but the environment variable PAT_RT_EXPERIMENT is used to specify a sampling-type experiment, and subject to the environment variable PAT_RT_SAMPLING_MODE setting.

This environment variable accepts the names of signals as given in the signal(7) man page; for example, SIGALRM, SIGPROF, etc. A signal specified by number is also accepted. Note that a given signal may be used by other components or features of the instrumented executable, and some signals may interfere with CrayPat initialization or runtime data collection.

Default: 27 (SIGPROF)

PAT_RT_SETUP_SIGNAL_HANDLERS

If zero, the CrayPat runtime library does not catch signals that the program receives; this results in an incomplete experiment file but a more accurate traceback for an aborted program with a core dump.

Default: 1

PAT_RT_STACK_SIZE

Specifies the size in bytes of the MAIN thread’s runtime stack. This size is used to determine the validity of a frame pointer while unwinding the call stack. This value may be increased to accommodate large data objects defined within a function. This value may be decreased if a segmentation fault occurs as a result of CrayPat following invalid frame pointer information while unwinding the call stack.

Default: 64MB

PAT_RT_SUMMARY

If set to a nonzero value, runtime summarization is enabled and the data collected is aggregated. This greatly reduces the size of the resulting experiment data files but at the cost of fine-grain detail. Formal parameter values and function return values are not recorded.

If set to 0, runtime summarization is disabled and the experiment data files contain a full trace of the performance data with timestamps.

Disabling runtime summarization can be valuable, particularly if you plan to use Cray Apprentice2 to study your data. However, be advised that setting this environment variable to 0 can produce enormous experiment data files, unless you also use the CrayPat API to limit data collection to a specified region of your program.

For more information, see the section “Controlling the Size of Experiment Data Files,” below.

Default: 1 (enabled)

PAT_RT_THREAD_ALLOW

Specifies how created threads are monitored and recorded. If set to a nonzero value, every thread created after the main function has executed is monitored and its data recorded. Set to zero to ignore all data collection for created threads.

Default: 1 (enabled)

PAT_RT_THREAD_CANCEL_NTRIES

Specifies the number of attempts the main thread makes in waiting for all created threads to terminate. An attempt is made every 0.25 seconds. Once all attempts have been completed by the main thread, the rest of the shutdown procedures can complete.

Note: Any threads not shut down after ntries attempts will have their collected data recorded at that point. To avoid the delay, it is best practice to either exit or join threads before program exit.

Default: 120 (30 seconds)

PAT_RT_THREAD_MAX

Specifies the maximum number of threads that can be created and for which data is recorded. See PAT_RT_EXPFILE_THREADS to manage the recording of data for individual threads.

Default: 1,000,000

PAT_RT_TRACE_API

If 0, suppress the events and any data records produced by all embedded CrayPat API functions in the instrumented executable. For more information about the CrayPat API, see the pat_build(1) man page.

Default: 1 (enabled)

PAT_RT_TRACE_DEPTH

Specifies the maximum depth of the runtime call stack for traced functions during runtime summarization.

Default: 512

PAT_RT_TRACE_HEAP

If set to 0, disable the collection of dynamic heap information. This environment variable affects tracing experiments only.

Default: 1 (enabled), if malloc is present

PAT_RT_TRACE_HOOKS

Enable/disable instrumentation inserted as a result of tracing options specified when compiling the program. (See pat_build(1).) The syntax is a comma-separated list of compiler instrumentation types and toggles in the form name:a,name:a…, where name represents the nature of the compiler instrumentation and a is either zero to disable the specified event or nonzero to enable it. If no name is specified and PAT_RT_TRACE_HOOKS is set to zero, all compiler-instrumented tracing is disabled.

Note: PAT_RT_TRACE_HOOKS interacts with PAT_RT_SUMMARY. For more information, see Default, below.

The valid values for name are:

acc

GPU accelerator events

func

function entry and return events

loops; pgo

loop timing events

omp

OpenMP events via CCE tracepoints

Default: 1 (collect data for all compiler-inserted trace points) if PAT_RT_SUMMARY is unset or set to a non-zero value (that is, if runtime summarization is enabled); acc:1,omp:1 (collect data for GPU accelerator events and OpenMP events but ignore all other compiler-inserted trace points) if PAT_RT_SUMMARY is set to 0 (that is, if runtime summarization is disabled).
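
The name:a syntax can be sketched as follows; the particular choice of events is only an example.

```shell
# Keep OpenMP events but disable function entry and return events.
export PAT_RT_TRACE_HOOKS=omp:1,func:0

# To disable all compiler-instrumented tracing instead:
#   export PAT_RT_TRACE_HOOKS=0
```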

PAT_RT_TRACE_NARGS

Specify the number of function argument values to record each time a function instrumented for tracing is called. This applies only when runtime summarization is disabled.

Default: 0

PAT_RT_TRACE_PYTHON_GROUPS

A comma-separated list of Python trace groups to trace. Valid trace group names include:

tensorflow or tensorflow_v2

TensorFlow 2 API callables

tensorflow_v1

TensorFlow 1 API callables

tensorflow_keras or tensorflow_keras_v2

TensorFlow 2 Keras API callables

tensorflow_keras_v1

TensorFlow 1 Keras API callables

tensorflow_estimators or tensorflow_estimators_v2

TensorFlow 2 Estimators API callables

tensorflow_estimators_v1

TensorFlow 1 Estimators API callables

torch or pytorch

PyTorch API callables

Each list element may optionally append a colon-separated list of the form :depth:prefix1:prefix2:…&include1&include2… to the trace group name. depth defaults to 1 and is the number of recursive calls within the trace group to trace. Each prefix specifies the prefix of function names not to trace. Unless prefix1 is prepended with a “-”, private functions (those beginning with “_”) are not traced. Each include names a function to trace, regardless of prefix. Unless include1 is prepended with a “-”, __init__ is traced.

Default: unset

PAT_RT_TRACE_PYTHON_MODULES

A comma-separated list of Python modules to trace, where each module is of the form “package.module”, i.e. the module’s __name__ attribute. For example, “sound.effects.echo” matches a module imported via “import sound.effects.echo”, “from sound.effects import echo as e”, or “from sound.effects.echo import *”. A callable is traced if it is defined in “package.module”, based on the callable’s __module__ attribute. Each list element may optionally append a colon-separated list of the form :depth:prefix1:prefix2:…&include1&include2… to the module name. depth defaults to unlimited and is the number of recursive calls within the module to trace. Each prefix specifies the prefix of function names not to trace. Unless prefix1 is prepended with “-”, private functions (those beginning with “_”) are not traced. Each include names a function to trace, regardless of prefix. Unless include1 is prepended with a “-”, __init__ is traced.

Default: unset
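
Using the sound.effects.echo module named above, a minimal sketch of a module specification with an explicit depth might look like this. The depth of 2 is arbitrary.

```shell
# Trace callables defined in sound.effects.echo, limited to a depth of 2.
export PAT_RT_TRACE_PYTHON_MODULES='sound.effects.echo:2'
echo "$PAT_RT_TRACE_PYTHON_MODULES"
```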

PAT_RT_TRACE_OVERHEAD

Specify the number of times the functions used to calculate the calling overhead are called upon runtime initialization and termination. To suppress overhead calculations, set this to 0. The larger the value, the more accurate the overhead calculation.

Default: 100

PAT_RT_WRITE_BUFFER_SIZE

Specify the size, in bytes, of a buffer that collects measurement data for a single thread.

Default: 8MB

CONTROLLING THE SIZE OF EXPERIMENT DATA FILES

Depending on the nature of your experiment, the data file created by CrayPat can be quite large. To keep data files down to reasonable sizes, use the runtime environment variables. The particular runtime environment variables you can use will vary depending on the type of experiment being conducted.

When running a sampling-type experiment, use the following runtime environment variables to reduce the amount of sampling data collected:

PAT_RT_CALLSTACK

PAT_RT_EXPFILE_PES

PAT_RT_SAMPLING_INTERVAL

PAT_RT_SUMMARY

When running a tracing-type experiment, use the following runtime environment variables to reduce the amount of tracing data collected:

PAT_RT_CALLSTACK

PAT_RT_EXPFILE_PES

PAT_RT_PERFCTR

PAT_RT_SUMMARY

To increase the disk space that an experiment data file can consume, see the man page for your shell for the appropriate command.

For more information about controlling data file size, see the pat_help system.
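
As an illustration for a sampling-type experiment, the following sketch combines three of the variables listed above to reduce the volume of data. The specific values are arbitrary and should be tuned to the application.

```shell
# Record data for every eighth PE only.
export PAT_RT_EXPFILE_PES='*%8'
# Sample every 50000 microseconds instead of the default 10000.
export PAT_RT_SAMPLING_INTERVAL=50000
# Keep runtime summarization enabled (this is the default).
export PAT_RT_SUMMARY=1
```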

WARNINGS

The perftools-base module and an instrumentation module must be loaded before compiling and linking the original program. The perftools-base module does not affect program behavior and can be left loaded when not collecting performance data.

Users may experience long exit times for instrumented executables running with more than 16 threads per rank. To reduce this exit time set the environment variable PAT_RT_EXPFILE_MAX to 0. The more threads per rank, the more benefit setting this environment variable to 0 will provide.

POSIX interval timers may fail with a FATAL message if threads are scheduled to be oversubscribed. If this happens, relaunch the instrumented executable ensuring the threads are not oversubscribed.

Collection of accelerator performance data for sampling experiments is not supported. To collect accelerator performance data, perform a trace experiment.

Specific to Nvidia GPUs, collection of accelerator performance data is supported when CUDA Multi-Process Service (MPS) is enabled. However, collected counts may be higher than expected, and collection may result in launch or device failures. See the accelerator man page accpc(5) for more information.

An instrumented executable is not compatible with the DARSHAN I/O characterization tool. Disable the DARSHAN environment before executing the instrumented executable.

If using the SLURM workload manager, it is recommended to launch an instrumented application that uses threads with the srun --exclusive option.
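
For threaded runs under SLURM, the thread-related recommendations in this section can be combined as in the sketch below. The executable name a.out+pat is a hypothetical placeholder, and the guard keeps the fragment from failing where srun or the executable is absent.

```shell
# Shorten exit time for runs with many threads per rank.
export PAT_RT_EXPFILE_MAX=0

# Hypothetical name of the instrumented executable.
exe=./a.out+pat

# Launch with --exclusive only if srun and the executable are present.
if command -v srun >/dev/null 2>&1 && [ -x "$exe" ]; then
  srun --exclusive "$exe"
fi
```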

Assembly language functions, static functions, and functions that are inlined cannot be traced. To trace a function that has been inlined, you must first recompile using the appropriate compiler options to disable inlining. See the compiler man pages and pat_build(1) for more information.

To trace a function that has been declared static, change the declaration to be global (e.g., remove the static keyword from the function definition), recompile, and relink. Be aware that if another function at global scope with the same name exists, a link error will occur.

In the event of an asynchronous program halt, such as occurs when an instrumented UPC application calls the function upc_global_exit(), the resulting experiment data file may not be fully formed.

Instrumented executables that execute the fork(2) or clone(2) system calls may exhibit undefined or unexpected behavior, including the program hanging indefinitely.

If a shared object contains a DT_RUNPATH attribute that determines the resolution of a function symbol, then tracing that function changes the behavior of any program in which the function is called from within that shared object.

Instrumented executables that enable the dl trace group disable calls to the dlclose(3) function so a unique virtual address range is used for each loaded dynamic shared object. This ensures symbols in each DSO map to a unique address in reports. This may increase the executable’s use of virtual memory.

Instrumented executables that enable the dl trace group and call the dlopen(3) function invoked by a shared object with a DT_RUNPATH attribute may fail to find a file with the specified name, or may access the wrong file with the specified name.

A warning will be issued if a program that calls dlopen is instrumented for sampling (asynchronously) and is executed to collect data in runtime summary mode (see the environment variable PAT_RT_SUMMARY description). Data collected in this scenario may be corrupted even if the instrumented executable finishes executing successfully, and pat_report may not interpret the data correctly. To work around this, execute the instrumented executable with PAT_RT_SUMMARY set to 0, to collect the data in trace mode.

If using Multiple Program Multiple Data (MPMD) mode, the executables specified to the WLM launch utility must either all be instrumented or all be uninstrumented. Undefined results occur if instrumented and non-instrumented executables are mixed in the same launch.

When the job is run, a separate experiment data directory is created for each executable. Reports from each created directory can be produced with pat_report, but there is currently no way to produce a report that combines data from multiple experiment directories.

When a runtime report is requested, either through perftools-lite or by setting the environment variable PAT_RT_REPORT_CMD, each experiment data directory will contain a summary report: rpt-files/RUNTIME.rpt. Only the copy containing data from rank zero will be copied to stdout.

FILES

$HOME/.craypatrc

Contains CrayPat runtime environment variables and provides configuration for all instrumented executables executed by the user.

$CRAYPAT_ROOT/share

The directory containing subdirectories of reference files, including predefined trace groups, performance counters, versioning, and other information.

a.out+pat+PID-nodes|t

Depending on the nature of the program and the environmental conditions in effect at the time of program execution, the instrumented executable, when executed, generates an experiment data directory with this name, where:

a.out

is the name of the original program

PID

is the process ID assigned to the instrumented executable at runtime

node

is the physical node ID upon which the rank zero process was executed

s|t

is a one-letter code indicating the type of experiment performed, either s for sampling or t for tracing

Using type specifications, this is represented as %A+%P-%N%X.

By default, the experiment data directory is created under the current working directory, but this location can be changed by setting the environment variable PAT_RT_EXPDIR_BASE.
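
To illustrate the naming scheme above, the following sketch splits a directory name into its parts using POSIX parameter expansion. The directory name is a hypothetical example, not taken from a real run, and the sketch assumes the node ID contains no hyphen.

```shell
# Hypothetical experiment data directory name.
name='a.out+pat+12345-67s'

prog=${name%+pat+*}       # text before the last "+pat+"     -> a.out
rest=${name##*+pat+}      # text after the last "+pat+"      -> 12345-67s
pid=${rest%%-*}           # text before the first "-"        -> 12345
suffix=${rest#*-}         # text after the first "-"         -> 67s
node=${suffix%?}          # all but the final character      -> 67
kind=${suffix#"$node"}    # the final character, s or t      -> s

# Translate the one-letter code into the experiment type.
case $kind in
  s) kind_name=sampling ;;
  t) kind_name=tracing ;;
  *) kind_name=unknown ;;
esac

printf '%s pid=%s node=%s type=%s\n' "$prog" "$pid" "$node" "$kind_name"
```

For the name shown, this prints "a.out pid=12345 node=67 type=sampling".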

Performance data files associated with this executable run are stored in this experiment data directory and include:

xf-files

A subdirectory containing one or more .xf files generated during the run. To save disk space, this subdirectory may be deleted once the ap2-files directory has been generated.

ap2-files

A subdirectory containing one or more .ap2 files automatically generated during the first invocation of pat_report on the experiment data directory. The .ap2 files in the directory contain all the information from the original .xf files, but in the more portable Cray Apprentice2 format. This subdirectory can also be created without using pat_report, by running an executable instrumented for a Perftools-lite experiment.

Note: The most significant difference between .xf and .ap2 format is that .xf files require the original instrumented executable and dynamic libraries to be available to provide mapping from addresses to function names and source line numbers, while .ap2 files incorporate this data mapping and are self-contained. Therefore the .ap2 format is recommended if you wish to preserve the data for future reference.

By default, this subdirectory is created in the experiment data directory, but the location can be changed by using the pat_report -o option.

rpt-files

A subdirectory containing one or more text report files generated during the run or by pat_report.

html-files

A subdirectory containing one or more reports in HTML format, which are produced by using pat_report with the -f html option. These files can be opened with any web browser, or opened from the command line on Macintosh or Linux systems by using the open filename.html or xdg-open filename.html commands, respectively. By default, this subdirectory is created in the experiment data directory, but the location can be changed by using the pat_report -o option.

plot-files

A subdirectory containing one or more reports in gnuplot format. These files are created by running an instrumented executable with PAT_RT_SUMMARY set to 0 and PAT_RT_SAMPLING_DATA set to a supported value (e.g., cray_pm or cray_rapl), and then using pat_report with the -f plot option. The resulting files can be viewed either by invoking pat_report on the experiment data directory or using gnuplot. By default, this subdirectory is created in the experiment data directory, but the location can be changed by using the pat_report -o option.
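
The steps for producing plot files can be sketched as follows. The experiment data directory name is a hypothetical placeholder, and the guard means pat_report runs only where it is installed and the directory exists.

```shell
# Collect full (non-summarized) data from a supported sampled data source.
export PAT_RT_SUMMARY=0
export PAT_RT_SAMPLING_DATA=cray_pm   # cray_rapl is the other value named above

# Run the instrumented executable with your WLM launcher here (not shown).

# Hypothetical experiment data directory produced by that run.
expdir='a.out+pat+12345-67s'

# Generate gnuplot-format reports into the plot-files subdirectory.
if command -v pat_report >/dev/null 2>&1 && [ -d "$expdir" ]; then
  pat_report -f plot "$expdir"
fi
```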

index.ap2

An index data file created as a map to the data within the ap2-files directory.

build-options.apa

File containing recommended parameters for re-instrumenting the program for more detailed performance analysis. This is generated by running an executable instrumented for Automatic Profiling Analysis (pat_build -O apa) and then running pat_report on the resulting experiment data directory.
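
A rough sketch of the sequence that produces this file is shown below. The program and directory names are hypothetical placeholders, and each command is guarded so the fragment is harmless where the tools are not available. See pat_build(1) for how to apply the resulting file when re-instrumenting.

```shell
# Hypothetical names used only for illustration.
prog='a.out'
expdir='a.out+pat+12345-67s'
apa_file="$expdir/build-options.apa"

# Step 1: instrument the program for Automatic Profiling Analysis.
if command -v pat_build >/dev/null 2>&1 && [ -x "$prog" ]; then
  pat_build -O apa "$prog"
fi

# Step 2: run the instrumented executable with your WLM launcher (not shown).

# Step 3: process the experiment data directory; this generates the .apa file.
if command -v pat_report >/dev/null 2>&1 && [ -d "$expdir" ]; then
  pat_report "$expdir"
fi

# Report whether the file is present yet.
if [ -f "$apa_file" ]; then
  echo "found $apa_file"
fi
```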

MPICH_RANK_ORDER*

One or more files containing options for rerunning MPI applications with optimized rank orders. This file is generated either manually, using the grid_order utility, or automatically, by running a performance analysis experiment using Perftools-lite.

SEE ALSO

intro_craypat(1), pat_build(1), pat_opts(1), pat_help(1), pat_report(1), pat_run(1), grid_order(1)

timer_create(2)

intro_mpi(3), intro_pmi(3)

perftools-base(4), perftools-lite(4), perftools-preload(4)

accpc(5), cray_pm(5), cray_rapl(5), hwpc(5), cray_cassini(5), uncore(5), papi_counters(5)

signal(7)