cray_upc_team_alltoall

Date:

11-18-2014

NAME

cray_upc_team_alltoall, cray_upc_team_alltoall_nb, cray_upc_team_alltoall_nbi - Cray UPC team all-to-all collectives

SYNOPSIS

#include <upc_collective_cray.h>

void cray_upc_team_alltoall( shared void *dst, shared void *src,
                             size_t nbytes, cray_upc_team_t team );
upc_handle_t cray_upc_team_alltoall_nb( shared void *dst, shared void *src,
                                        size_t nbytes, cray_upc_team_t team );
upc_handle_t cray_upc_team_alltoall_nbi( shared void *dst, shared void *src,
                                         size_t nbytes, cray_upc_team_t team );

dst

A shared array large enough for each thread in the team to receive nbytes bytes from each thread in the team.

src

A shared array that, on each thread, contains nbytes bytes of data destined for each thread in the team.

nbytes

The number of bytes each thread receives from each thread in the team.

team

A valid UPC team handle.

IMPLEMENTATION

Cray Linux Environment (CLE)

DESCRIPTION

The cray_upc_team_alltoall blocking, non-blocking (_nb), and non-blocking implicit (_nbi) functions transfer nbytes bytes of data from each thread in a team to every other thread in that team.

The src and dst pointers point to the first element of distinct shared arrays on the thread that is rank 0 in the team. The i-th block of data in the source array with affinity to the thread that is rank j in the team is transferred to the j-th block of data in the destination array with affinity to the thread that is rank i in the team. For a team consisting of all threads in the application in order, this is equivalent to the standard upc_all_exchange() collective.

Data with affinity to a thread T is neither read nor written until that thread has entered the collective, similar to the UPC_IN_MYSYNC synchronization of the standard UPC collectives.

Completion of these collective routines is split into two operations. The collective is locally complete on thread T when all of the data in the destination array with affinity to T has been written. The data in the source array with affinity to T may still be read after local completion on T.

The collective is globally complete when it is locally complete on every thread in the team. The cray_upc_team_alltoall() function does not return until the collective is globally complete. The cray_upc_team_alltoall_nb() and cray_upc_team_alltoall_nbi() routines return after all threads in the team have entered the collective. The standard UPC non-blocking synchronization routines (upc_sync(), upc_synci(), upc_sync_attempt(), upc_synci_attempt()) can be used to query or wait for local completion of the non-blocking collectives.

Since the collectives are globally complete when they are locally complete on all threads in the team, any synchronization over the entire team after local completion can be used to wait for global completion. Notably, initiation of another non-blocking all-to-all collective is sufficient, as demonstrated in the pipelining example below.

RETURN VALUE

The cray_upc_team_alltoall() and cray_upc_team_alltoall_nbi() functions have no return value. The cray_upc_team_alltoall_nb() function returns a handle that can be used to locally complete the operation.

EXAMPLE

The following demonstrates one possible way to set up a pipelined operation using cray_upc_team_alltoall_nb.

int i;
upc_handle_t h = UPC_COMPLETE_HANDLE;
shared char *d[2], *s[2];

     // Allocate and initialize symmetric memory for pipelined alltoall
     d[0] = (shared char *) upc_all_alloc( 4 * THREADS, THREADS * nbytes );
     d[1] = d[0] + THREADS * THREADS * nbytes;
     s[0] = d[1] + THREADS * THREADS * nbytes;
     s[1] = s[0] + THREADS * THREADS * nbytes;
     initialize_data( d, s );

     for ( i=0 ; ; i=1-i ) {
         // Check for completion and prepare this round's data
         if ( finished() ) break;
         compute_local_source( s[i] );

         // Wait for results of previous round's alltoall
         upc_sync( h );

         // d[1-i] is ready
         update_local_from_global( d[1-i], s[i] );

         // Start this round's alltoall and take advantage of the
         // synchronization implicit in the call to recognize that
         // all ranks have called upc_sync() on the previous round's
         // alltoall.
         h = cray_upc_team_alltoall_nb( d[i], s[i], nbytes, CRAY_UPC_TEAM_ALL );

         // s[1-i] is free to reuse as s[i] in next iteration
     }
     upc_sync( h );
     cray_upc_team_barrier( CRAY_UPC_TEAM_ALL );

NOTES

The synchronization behavior of the Cray non-blocking operations is significantly different from that of the standard UPC collectives because different completion policies apply to the source and destination arrays.

SEE ALSO

cray_upc_team_t(3c), upc_all_exchange(3c), upc_sync(3c), upc_synci(3c), upc_sync_attempt(3c), upc_synci_attempt(3c)