ATP - Overview
Abnormal Termination Processing (ATP) is a tool that monitors user applications on Cray systems. If an application receives a fatal signal, ATP handles the signal and analyzes the dying application.
Enabling ATP
To use ATP, you must do the following:
Ensure that the atp module is loaded. If it is not, run module load atp.
Ensure that the target application is built with debug symbols (compiler-specific, usually -g).
Optional: rebuild the target application. When the ATP module is loaded, applications built with the Cray or GNU compilers are automatically linked against the ATP signal handler.
Set the environment variable ATP_ENABLED=1. This enables ATP launch support in aprun or srun: when enabled, the ATP system is initialized automatically as part of launching the application (CLE 3.1 or later). It also tells the ATP application signal handler to register on application startup.
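The steps above can be sketched as a short shell session. This is a minimal sketch rather than a site-specific recipe: the function name enable_atp and the application name my_app are hypothetical, and the build and launch commands appear only as comments because the exact compiler and launcher options depend on your system.

```shell
# Hypothetical helper: prepare the current shell for an ATP-enabled run.
enable_atp() {
    # Load the atp module if the `module` command exists in this shell.
    if command -v module >/dev/null 2>&1; then
        module load atp || echo "warning: could not load the atp module" >&2
    else
        echo "warning: 'module' command not found; skipping module load" >&2
    fi
    # Enable ATP launch support and signal-handler registration.
    export ATP_ENABLED=1
}

enable_atp

# Then build with debug symbols and launch as usual, for example:
#   cc -g -o my_app my_app.c
#   srun ./my_app        # or: aprun ./my_app
```

Because ATP_ENABLED is exported, it is inherited by the launcher started from the same shell.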
Customer sites can enable ATP by default by defining ATP_ENABLED=1 site-wide.
For CLE release 6.0 and later, do this by editing the files /etc/opt/cray/pe/admin-pe/site-config.csh and /etc/opt/cray/pe/admin-pe/site-config.sh in the PE image.
Individual users can override site preferences by setting ATP_ENABLED to either 0 or 1.
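To see which setting a session actually has, a small helper such as the following may be useful. It is only a sketch: atp_status is a hypothetical name, and the "unset" and "unrecognized" branches simply report what the shell sees, since only the values 0 and 1 are documented here.

```shell
# Hypothetical helper: describe the ATP_ENABLED setting seen by this shell.
atp_status() {
    case "${ATP_ENABLED-}" in
        1)  echo "enabled" ;;
        0)  echo "disabled" ;;
        "") echo "unset" ;;
        *)  echo "unrecognized value: ${ATP_ENABLED}" ;;
    esac
}

# Report the value inherited from the site or login configuration.
atp_status

# A user overrides the site preference for the current session.
export ATP_ENABLED=0
atp_status
```

The override lasts only for the current shell and its children unless it is also placed in a startup file.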
When the atp module is loaded, ATP sets the MPICH_ABORT_ON_ERROR, SHMEM_ABORT_ON_ERROR, and DMAPP_ABORT_ON_ERROR environment variables. With these set, MPI, SHMEM, and DMAPP applications raise a signal when they detect usage errors instead of printing to stderr and exiting, which allows ATP to handle the fatal signal and perform its analysis.
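One way to confirm that these variables are present after loading the module is to print them. The sketch below assumes only that the variables are exported to the environment; show_abort_vars is a hypothetical name, and no particular value is assumed.

```shell
# Hypothetical helper: print the abort-on-error variables named above.
show_abort_vars() {
    for name in MPICH_ABORT_ON_ERROR SHMEM_ABORT_ON_ERROR DMAPP_ABORT_ON_ERROR; do
        # printenv exits non-zero when the variable is not in the environment.
        if value=$(printenv "$name"); then
            echo "$name=$value"
        else
            echo "$name is not set"
        fi
    done
}

show_abort_vars
```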
Note: The ATP module controls application link-time behavior and some ATP setup; it does not by itself enable ATP at application run time. Run-time enabling is controlled by the ATP_ENABLED environment variable. ATP run-time behavior can be disabled either by unloading the ATP module or by setting ATP_ENABLED to 0. Be aware, however, that qsub sessions may in some cases re-enable ATP, depending on your site configuration. Disabling ATP by setting ATP_ENABLED to 0 in your shell rc script always works.
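The rc-script approach mentioned in the note could look like the following sketch. The helper name disable_atp_in_rc is hypothetical, and the default file assumes a bash user; csh-family shells would need a setenv line in their own startup file instead.

```shell
# Hypothetical helper: add "export ATP_ENABLED=0" to an rc file once.
# The default path assumes bash; pass another file for a different shell.
disable_atp_in_rc() {
    rc_file="${1:-$HOME/.bashrc}"
    line='export ATP_ENABLED=0'
    # Append only if the exact line is not already present.
    if ! grep -qxF "$line" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}

# Example (commented out so that pasting this block does not edit your rc file):
#   disable_atp_in_rc "$HOME/.bashrc"
```

The helper only appends; it does not remove or reorder any existing ATP_ENABLED assignment in the file, so a later line that sets a different value would still take effect.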
Required Slurm configuration for CS clusters
The ATP Slurm plugin must be configured on CS cluster systems that use the Slurm workload manager. To enable the ATP Slurm plugin, a system administrator should add the following line to the system’s Slurm plugin stack configuration file (plugstack.conf):
optional /opt/cray/pe/atp/libAtpDispatch.so
By default, the plugstack.conf file is located at /etc/slurm/plugstack.conf. However, this location can be reconfigured. An administrator can find the current location of the plugstack.conf file using the following command:
scontrol show config | grep PlugStackConfig
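As an illustration, the lookup and a check for the ATP plugin line can be combined. This is a sketch under stated assumptions: plugstack_path and plugstack_has_atp are hypothetical helper names, and the parser assumes that scontrol prints each setting as "Name = value" on its own line.

```shell
# Hypothetical helper: read "scontrol show config" output on stdin and
# print the value of the PlugStackConfig entry.
plugstack_path() {
    awk -F'=' '$1 ~ /^[ \t]*PlugStackConfig[ \t]*$/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim surrounding whitespace
        print $2
        exit
    }'
}

# Hypothetical helper: succeed if the given plugstack file has an
# uncommented line that references libAtpDispatch.so.
plugstack_has_atp() {
    grep -Eq '^[[:space:]]*[^#[:space:]].*libAtpDispatch\.so' "$1"
}

# Example use on a system where scontrol is available:
#   conf=$(scontrol show config | plugstack_path)
#   plugstack_has_atp "$conf" && echo "ATP plugin configured in $conf"
```

Reading the scontrol output from stdin keeps the parser usable on saved output as well as on a live system.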
If required, the administrator can update the path by editing the slurm.conf configuration file. The location of the slurm.conf file can be found with the following command:
scontrol show config | grep SLURM_CONF