ATP - Debugging a segfault
In this example, we will use ATP to debug a segfault in a simple MPI C program. The program is supposed to print its arguments. Can you see where the error is?
#include <stdlib.h>
#include <stdio.h>
#include "mpi.h"
int main(int argc, char** argv) {
    int myRank;
    MPI_Init(&argc, &argv);
    // Get rank number
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    printf("Arguments for rank %d: ", myRank);
    for (int i = 0; i < argv; i++) {
        printf("%s ", argv[i]);
    }
    printf("\n");
}
Build the application. The ATP module does not need to be loaded at build time, but the application must be compiled with debug symbols (-g) for ATP to report source files and line numbers.
cc -g -O2 crash.c -o a.out
Load the ATP module if it’s not already loaded. Set ATP_ENABLED=1 and run the application.
module load atp
ATP_ENABLED=1 srun -n2 ./a.out
Once the segfault is encountered, ATP will print an aggregated backtrace tree of the state of the ranks.
ATP analysis of Slurm job 1034635.0 is starting...
Processes died with the following statuses:
<0-1> Reason: 'SIGSEGV /SEGV_MAPERR' Address: 0x20 Assertion: ''
<0-1> #11 __strlen_avx2 + 0x37
<0-1> #10 __vfprintf_internal + 0x1b1e
<0-1> #9 __printf + 0xa9
<0-1> #8 free_mem + 0x19dd1
<0-1> #7 MPIR_Init_thread + 0x1ee
<0-1> #6 MPL_env2str + 0x1b
<0-1> #5 crash.c:17
<0-1> #4 __no_mmap_for_malloc + 0x14d
<0-1> #3 __libc_start_main + 0xee
<0-1> #2 __do_fini + 0x56
<0-1> #1 crash + 0x19ef
<0-1> #0 _start + 0x29
Producing core dumps for ranks 0
1 core written in /home/users/adangelo
The <0-1> prefix on each stack line shows that ranks 0 and 1 had identical frames for the whole trace. Neither rank diverged before the crash, so ATP aggregated both together before displaying the stack.
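To make the rank prefix easier to read, here is a minimal sketch of how a sorted list of ranks could be collapsed into a label like <0-1>. The function name format_rank_set is hypothetical and is not part of ATP. The comma-separated form for non-contiguous ranks (for example <0-2,5>) is an assumption made for illustration, not something shown in the output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not an ATP API): writes a sorted, duplicate-free
 * list of ranks as a compact label such as "<0-1>" or "<0-2,5>".
 * Returns the number of characters written (excluding the terminator),
 * or -1 if buf is too small. */
int format_rank_set(const int *ranks, size_t n, char *buf, size_t buflen)
{
    size_t pos;
    int w;

    assert(buf != NULL);
    assert(ranks != NULL || n == 0);

    w = snprintf(buf, buflen, "<");
    if (w < 0 || (size_t)w >= buflen)
        return -1;
    pos = (size_t)w;

    for (size_t i = 0; i < n; ) {
        size_t j = i;

        /* Extend the run while the next rank is consecutive. */
        while (j + 1 < n && ranks[j + 1] == ranks[j] + 1)
            j++;

        const char *sep = (i == 0) ? "" : ",";
        if (j == i)
            w = snprintf(buf + pos, buflen - pos, "%s%d", sep, ranks[i]);
        else
            w = snprintf(buf + pos, buflen - pos, "%s%d-%d",
                         sep, ranks[i], ranks[j]);
        if (w < 0 || (size_t)w >= buflen - pos)
            return -1;
        pos += (size_t)w;
        i = j + 1;
    }

    w = snprintf(buf + pos, buflen - pos, ">");
    if (w < 0 || (size_t)w >= buflen - pos)
        return -1;
    pos += (size_t)w;
    return (int)pos;
}
```

Called with the ranks {0, 1} from this job, the sketch produces the same <0-1> label that prefixes every frame in the trace. If the ranks had crashed in different places, each distinct stack would carry its own, smaller rank set.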
There are 12 frames reported (#0 through #11). However, most are displayed only as offsets into functions or modules, such as the top frame, __strlen_avx2 + 0x37. This indicates that no debugging information was available for that frame, so no source file or line number could be determined.
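The two kinds of frame location can be told apart mechanically. The following is a small sketch that classifies the location text of a frame as either "symbol + offset" (no debug information) or "file:line" (debug information present). The names frame_kind, frame_loc, and parse_frame_location are hypothetical and exist only for this illustration. The parser assumes the two textual forms seen in the trace above and is not a description of how ATP itself works.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical types for illustration; not an ATP API. */
enum frame_kind { FRAME_UNKNOWN, FRAME_SYMBOL_OFFSET, FRAME_SOURCE_LINE };

struct frame_loc {
    enum frame_kind kind;
    char name[128];       /* symbol name or source file name */
    unsigned long value;  /* byte offset or line number */
};

/* Parses the location part of a frame, for example
 * "__strlen_avx2 + 0x37" or "crash.c:17". */
struct frame_loc parse_frame_location(const char *text)
{
    struct frame_loc loc;
    char extra;

    assert(text != NULL);
    memset(&loc, 0, sizeof loc);
    loc.kind = FRAME_UNKNOWN;

    /* "symbol + 0xNN": only an offset into a function is known.
     * The trailing %c rejects input with unexpected extra text. */
    if (sscanf(text, "%127s + %lx %c", loc.name, &loc.value, &extra) == 2) {
        loc.kind = FRAME_SYMBOL_OFFSET;
        return loc;
    }

    /* "file:line": debug information was available for this frame. */
    if (sscanf(text, "%127[^:]:%lu %c", loc.name, &loc.value, &extra) == 2) {
        loc.kind = FRAME_SOURCE_LINE;
        return loc;
    }

    /* Neither form matched; clear any partial result. */
    memset(&loc, 0, sizeof loc);
    loc.kind = FRAME_UNKNOWN;
    return loc;
}
```

Applied to the trace above, only the text "crash.c:17" would be classified as FRAME_SOURCE_LINE. Every other frame falls into FRAME_SYMBOL_OFFSET, which matches the observation that only the application code built with -g carries line information.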
One frame does have source file and line information: frame 5, which points to line 17 of crash.c. Let’s take a look at that line:
printf("%s ", argv[i]);
That line looks fine, at least from here. Next, we can use core files to take a look at the actual values for argv and i at the point where the application crashed.