Cray Debugging Support Tools
Implementation
CDST is available on HPE Cray EX and HPE Cray supercomputer systems, HPE Apollo 2000 Gen10Plus systems, HPE Apollo 80 systems, and Cray XC and CS systems; however, not all tools are supported on all platforms. See the specific platform user guides for details.
Introduction
Cray Debugging Support Tools (CDST) is a suite of tools for debugging parallel applications.
ATP - Abnormal Termination Processing
Abnormal Termination Processing (ATP) is a tool that monitors user applications on Cray systems. If an application receives a fatal signal, ATP handles the signal and analyzes the dying application.
Overview
Enabling ATP
Required Slurm configuration for CS clusters
Operation of ATP
Load ATP Plugin
About Backtrace Trees
About Core Dumps
About the Core Selection Algorithm
About Hold Time
About Signals
About GPU Support
About Node Free Space Checks
About Custom Runtime Checks
Performing a manual dump
Environment variables
User-configurable settings for ATP to modify behavior at runtime
Compiler-specific details
Intel Fortran
GNU Fortran
Examples
STAT - Stack Trace Analysis Tool
The Stack Trace Analysis Tool (STAT) is a scalable, lightweight debugger for parallel applications. STAT works by gathering stack traces from all of a parallel application’s processes and merging them into a compact form. The resulting output shows where in the code each application process is executing, which can help locate a bug. The STAT package includes three commands for invoking and controlling STAT and for analyzing its output.
man pages
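To illustrate the merging idea in the STAT description above, here is a minimal Python sketch that folds per-rank stack traces into a prefix tree, so ranks that share a call path collapse into one branch. It is not STAT's implementation or output format; the function names (`merge_stack_traces`, `format_tree`) and the sample traces are hypothetical and exist only for this example.

```python
def merge_stack_traces(traces):
    """Merge per-rank stack traces into a prefix tree.

    traces: dict mapping rank -> list of frame names, outermost frame first.
    Returns a nested dict: frame -> {"ranks": [...], "children": {...}}.
    """
    root = {}
    for rank, frames in sorted(traces.items()):
        level = root
        for frame in frames:
            # Reuse the node if another rank already reached this frame
            # through the same call path; otherwise create it.
            node = level.setdefault(frame, {"ranks": [], "children": {}})
            node["ranks"].append(rank)
            level = node["children"]
    return root


def format_tree(tree, depth=0):
    """Render the merged tree as indented lines, one per frame."""
    lines = []
    for frame, node in tree.items():
        lines.append(f"{'  ' * depth}{frame} ranks={node['ranks']}")
        lines.extend(format_tree(node["children"], depth + 1))
    return lines


# Hypothetical traces: three ranks wait in a barrier, one is still computing.
sample_traces = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "compute"],
    3: ["main", "solve", "MPI_Barrier"],
}

merged = merge_stack_traces(sample_traces)
print("\n".join(format_tree(merged)))
```

In the merged view, the one rank that took a different path (rank 2 here) stands out immediately, which is the kind of outlier a merged stack trace is meant to expose.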
CCDB - Cray Comparative Debugger
The Cray Comparative Debugger (CCDB) is Cray’s next-generation debugging tool. CCDB features a graphical user interface that extends the comparative debugging capabilities of gdb4hpc, enabling users to easily compare data structures between two executing applications.
User Guide
man pages
CTI - Common Tools Interface
The Common Tools Interface (CTI) is an infrastructure framework to enable tools to launch, interact with, and run utilities alongside applications on HPC systems.
man pages
gdb4hpc
gdb4hpc is a GDB-based parallel debugger for applications compiled with the CCE, PGI, GNU, and Intel Fortran, C, and C++ compilers.
Guides and Tutorials
Getting Started Guide
The getting started guide covers the following topics and more:
The help system
Debugging basics
Procsets
Focusing on a subset of an application’s ranks
HPC Features Tutorial
The tutorial covers the following unique HPC-centric features of gdb4hpc and more:
Comparative debugging
Assertion scripts
Decompositions
Shell commands and output piping
Array slicing
Debugging an MPI/CUDA GPU Application Tutorial
This tutorial shows how to use gdb4hpc to debug a multinode MPI application that uses CUDA compute kernels. The tutorial is written with CUDA/NVIDIA GPUs in mind, but the concepts apply to HIP/AMD GPUs as well.
Handling Arrays
This guide covers gdb4hpc’s enhancements to GDB’s array-handling tools.
Parallel Programming Library Support
gdb4hpc has extra support for some popular parallel programming libraries.
VSCode Extension Guide
A guide to gdb4hpc’s VS Code extension.
Python Debugging
gdb4hpc has extra support for debugging Python applications.
gdb4hpc man Pages and Reference Material
sanitizers4hpc
Sanitizers4hpc is an aggregation tool that collects and analyzes LLVM sanitizer output at scale.
Guides
man pages
valgrind4hpc
Valgrind4hpc is a Valgrind-based debugging tool to aid in the detection of memory leaks and errors in parallel applications. Valgrind4hpc aggregates any duplicate messages across ranks to help provide an understandable picture of program behavior.
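The aggregation of duplicate messages across ranks can be pictured with a short Python sketch. This only illustrates the general idea; it does not reflect Valgrind4hpc's actual implementation or output format, and the function name `aggregate_messages` and the sample messages are hypothetical.

```python
def aggregate_messages(messages):
    """Group identical messages and record which ranks reported each one.

    messages: iterable of (rank, text) pairs.
    Returns a list of (text, sorted_ranks) tuples in first-seen order.
    """
    ranks_by_text = {}
    for rank, text in messages:
        # A set avoids counting the same rank twice for one message.
        ranks_by_text.setdefault(text, set()).add(rank)
    return [(text, sorted(ranks)) for text, ranks in ranks_by_text.items()]


# Hypothetical per-rank messages, loosely modeled on memory-checker output.
sample_messages = [
    (0, "Invalid read of size 4"),
    (1, "Invalid read of size 4"),
    (2, "16 bytes definitely lost"),
    (3, "Invalid read of size 4"),
]

for text, ranks in aggregate_messages(sample_messages):
    print(f"ranks {ranks}: {text}")
```

Four raw messages reduce to two distinct reports, each tagged with the ranks that produced it, so a repeated error appears once instead of once per rank.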