ATP - Operation

ATP is disabled for instances of applications launched from within a debugger.

Load ATP Plugin

Slurm SPANK plugins must be recompiled for every major Slurm version. ATP automatically rebuilds and enables its plugin when the ATP module is loaded. The script can also be run manually, with diagnostic output, by executing

$ATP_INSTALL_DIR/slurm/generate_slurm_config.sh

Automatic plugin configuration can be disabled by running

$ATP_INSTALL_DIR/slurm/generate_slurm_config.sh -d

About Backtrace Trees

All stack backtraces of the application processes are gathered into a merged stack backtrace tree and written to disk as the files atpMergedBT.dot (with function-level aggregation) and atpMergedBT_line.dot (with line-level aggregation). An overview of the stack backtrace tree, including the set of ranks receiving each fatal signal, is written to stderr. If Linux core dumping is enabled (see ulimit or limit in your shell documentation), a heuristically selected set of processes also dump their cores.

The backtrace tree files atpMergedBT.dot and atpMergedBT_line.dot can be viewed with the Stack Trace Analysis Tool viewer stat-view (included in the Cray Debugger Support Tools; module load cray-stat), or with the DOT file viewer dotty, which can be found on most Linux systems. The merged stack backtrace tree provides a concise yet comprehensive overview of the application at the time of its termination.
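
For example, assuming the cray-stat module is available on the system, the line-aggregated tree could be opened with:

module load cray-stat
stat-view atpMergedBT_line.dot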

When viewing a backtrace tree file in stat-view, each node in the tree represents a stack backtrace frame of the running job. A full backtrace for a particular rank can be read by starting at the root node (labeled “/”) and reading from parent to child on any given limb of the tree. The edges between stack frame nodes are labeled with the set of ranks that had backtraces containing the next stack frame node.

The syntax for edge labeling is count:[rank list]; for example: 10:[0-8,10]. If the rank list is abbreviated with an ellipsis (…), click on the node to pop up a complete list of the ranks.

The coloration of nodes is arbitrary, but nodes that have the same set of ranks in common will have the same color. This enables you to see at a glance which regions of the backtrace involve the same ranks.

About Core Dumps

If Linux core dumping is enabled (see ulimit or limit in your shell documentation), ATP selects a single representative from each leaf node of the merged stack backtrace tree to dump a core file. Each core file is named core.atp.apid.rank.
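
For example, in bash, core dumping can be enabled for the current shell before launching the application:

ulimit -c unlimited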

When viewing the resulting core files, remember that they do not all represent the exact same moment in time. Once the first process traps, there is an indeterminate latency before the others can be stopped. Some processes may have proceeded to barriers and gotten stuck, some may trap moments later, and others may still be running freely when told to dump core.

Users can control, to some degree, the set of core dumps created by ATP. To do so, it is necessary to understand the algorithm used to select the processes to dump.

Minidump core files can be copied separately from ATP analysis for later use by setting ATP_MINIDUMP_DIR to a cross-mounted directory. The ATP runtime library will copy Minidump files to this directory before ATP analysis begins. This can be helpful if the system’s workload manager ends the job before full analysis is able to complete.
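
For example, the following directs the copies to a hypothetical cross-mounted scratch directory (the path is illustrative):

export ATP_MINIDUMP_DIR=/lus/scratch/$USER/atp-minidumps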

The copied minidump files can be parsed and printed using the included minidump_stackwalk utility:

$ATP_INSTALL_DIR/bin/minidump_stackwalk

This utility produces a complete stack trace for the single process associated with the minidump file, as well as a listing of all shared libraries loaded by the process at the time of the crash.
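
For example, assuming a copied minidump file named 12345.dmp in the minidump directory (the file name is illustrative):

$ATP_INSTALL_DIR/bin/minidump_stackwalk $ATP_MINIDUMP_DIR/12345.dmp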

Lightweight Corefiles can also be generated alongside system core files by setting ATP_LIGHTWEIGHT_COREFILES=1. Lightweight Corefiles contain human-readable stack trace information for all threads of the process that generated them. With this setting, each node will produce a lightweight corefile containing stack traces for each of its processes.
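
For example:

export ATP_LIGHTWEIGHT_COREFILES=1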

About the Core Selection Algorithm

A typical stack backtrace shows the progression of user-written routines being called. Often, though, the call sequence continues to library interface routines and then further into internal library routines. Since these are not user-written routines, users generally have little interest in debugging them.

ATP’s algorithm for choosing processes to core dump walks the merged stack backtrace tree and stops when the first library interface routine is called. This effectively prunes the tree to contain only user-written routines. From this pruned tree, the first process in each leaf node is dumped; the pruning significantly reduces the number of leaf nodes, and therefore the number of core files produced.

ATP’s process for distinguishing between user-written and library routines is not foolproof. The environment variable ATP_USER_LIB_INTERFACE_PREFIXES can be used to specify routine name prefixes to prune from the backtrace tree. For example, any routine name beginning with MPI_ could be assumed to be a library interface and pruned. Because it depends on a list of prefixes, though, ATP might inadvertently include library routines with names having prefixes not on the list.

If ATP routinely exceeds the specified maximum number of core dumps permissible (ATP_MAX_CORES), examine the core dump backtraces to identify additional routines that can be pruned using this environment variable. In general, if the core files produced by ATP routinely show backtraces containing the same library routine, a prefix matching that routine should be added to the list.
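
As a sketch, the following prunes the MPI interface along with a hypothetical site library whose routines begin with mylib_ (the comma-separated format is an assumption; consult the atp man page for the exact separator):

export ATP_USER_LIB_INTERFACE_PREFIXES=MPI_,mylib_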

The environment variable ATP_CORE_HOSTS can be set to a comma-separated list of hostnames. Instead of using automatic core selection, ATP will dump cores for all ranks on the listed hosts if a crash occurs.

Similarly, ATP_CORE_RANKS can be set to a list of rank IDs that will produce corefiles upon an application crash.
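
For example (the hostnames and rank IDs are illustrative, and a comma-separated format is assumed for ATP_CORE_RANKS as well):

export ATP_CORE_HOSTS=nid000012,nid000013
export ATP_CORE_RANKS=0,64,128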

About Hold Time

After detecting a crash, ATP is able to hold a dying application to allow the user to attach to it with a debugger, such as gdb4hpc. To do so, set the ATP_HOLD_TIME environment variable to the number of minutes desired to hold the application.
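
For example, to hold a crashing application for 30 minutes:

export ATP_HOLD_TIME=30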

Once attached, the debugging session can last as long as the batch system allows (which in turn depends on the compute node resources requested when beginning the session), so use ATP_HOLD_TIME to define the time needed to attach to the application, not the total time needed for the debugging session.

If the debugging session takes less than the ATP_HOLD_TIME to complete, abort the job on exiting the debugger. Otherwise, the compute node resources will be tied up until ATP_HOLD_TIME expires.

Note: If ATP_HOLD_TIME is set, core dumping is disabled.

About Signals

ATP performs signal analysis for the following signals:

SIGILL
SIGTRAP
SIGABRT
SIGFPE
SIGBUS
SIGSEGV
SIGSYS
SIGTERM (disabled by default)
SIGXCPU
SIGXFSZ

By default, SIGTERM is not processed by ATP, as it typically does not generate core dumps. SIGTERM can be caught by setting the environment variable ATP_IGNORE_SIGTERM to 1.
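
To enable this behavior, set the variable before launching the application:

export ATP_IGNORE_SIGTERM=1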

Most queuing systems will abort a job that has exceeded its wall clock limit by sending it SIGTERM. Exceeding this limit may be a bug worth investigating. Be aware that the amount of time between when the queuing system sends SIGTERM and then follows up with SIGKILL is site-customizable. If sufficient time is not configured, this feature may not be able to complete its task; 30 seconds is typically sufficient.

SIGTERM is also typically delivered when a job is deleted from the queuing system; ATP processing is likely less desirable in that case. To control this setting on a site-wide basis, see the earlier discussion of the /etc rc scripts.

In the case of aborts by SIGTERM, or any asynchronous signal, the concept of the first process to die is not meaningful. All processes are dying because they were told to die. Consequently, the order in which the processes die is arbitrary.

About GPU Support

By default, ATP uses an internal library to analyze processes and collect stack trace information. However, to fully support GPU debugging on AMD systems, ATP will make use of the AMD GPU debugger, rocgdb, to analyze a program that may have crashed inside a GPU kernel. Likewise, for debugging support on systems using an Nvidia GPU, the cuda-gdb version corresponding to the currently loaded runtime is used for kernel crash analysis.

ATP will automatically detect when rocgdb or cuda-gdb is required to analyze the process by looking for the AMD HIP or CUDA runtime libraries in the program’s dependency list. The version and location of the rocgdb / cuda-gdb to be used are sourced from the currently loaded ROCm / Nvidia environment module file.

To manually enable the GDB data collection backend, set the environment variable ATP_GDB_BINARY to the path of the GDB binary to be used to analyze the current program.

To disable the automatic detection and usage of GDB for analysis, set ATP_GDB_BINARY to 0.
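
For example, to force analysis with a specific rocgdb (the path shown is illustrative):

export ATP_GDB_BINARY=/opt/rocm/bin/rocgdb

Or, to disable GDB-based analysis entirely:

export ATP_GDB_BINARY=0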

About Node Free Space Checks

ATP supports running node checks at a regular interval. Checks are not enabled by default; to enable them, set ATP_RUNTIME_CHECKS to 1. Once enabled, checks run on each node every 5 seconds. To customize this interval, set ATP_CHECK_INTERVAL to the desired interval in seconds.

The default ATP runtime check monitors free space in the job’s temporary directory (usually set by the system workload manager or via the TMPDIR environment variable). If a node’s temporary directory is full, a warning naming the affected node is printed in the job output. A full temporary directory usually prevents ATP from shipping the support files needed to perform analysis if the job encounters a fatal signal.

By default, the required temporary free space is 64 MB per node. To adjust this threshold for your own requirements, set ATP_MINIMUM_TMP_FREE to the required free space in MB.
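
As an illustration, the following enables the checks, lengthens the interval to 30 seconds, and raises the free space threshold to 256 MB:

export ATP_RUNTIME_CHECKS=1
export ATP_CHECK_INTERVAL=30
export ATP_MINIMUM_TMP_FREE=256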

About Custom Runtime Checks

ATP also supports running a custom check script or binary by setting ATP_CHECK_COMMAND to the path of the binary or script. At each check interval, the command is run on each node.

A return code of 0 from the command allows the process to continue. A non-zero return code starts ATP analysis. This can be used to check your own invariants for the job, and to start a stack trace dump if an invariant fails.
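
As a sketch, the following hypothetical check script triggers ATP analysis when a sentinel file appears in the job's temporary directory; the file name and invariant are purely illustrative:

#!/bin/sh
# Fail the check (non-zero exit) if a sentinel file exists; ATP then
# starts its analysis. Exit 0 to let the job continue undisturbed.
if [ -e "${TMPDIR:-/tmp}/abort_marker" ]; then
    exit 1
fi
exit 0

Install it by pointing ATP_CHECK_COMMAND at the script (the path is illustrative):

export ATP_CHECK_COMMAND=/path/to/check.sh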

Performing a Manual Dump

It is possible to force ATP analysis on a running program. To do so under ALPS, use apkill to send a SIGABRT (-6) to the application.

$ apkill -SIGABRT <apid>

To do so under Slurm, use scancel to send SIGABRT to the srun step that launched the application.

$ scancel -s ABRT <job_id>.<srun_step_id>

The application will abort and perform the ATP exit processing; however, the reason reported for the application failure will be the user-induced “Aborted.”