intro_pgas

Date:

11-27-2018

NAME

intro_pgas, pgas, upc, coarray, intro_upc, intro_coarray - Introduce Partitioned Global Address Space (PGAS) programming models

IMPLEMENTATION

Cray Linux Environment (CLE)

DESCRIPTION

Cray supports two Partitioned Global Address Space (PGAS) programming models: Unified Parallel C (UPC) and Fortran. The latter was formerly known as Coarray Fortran (CAF) until coarrays were adopted by the Fortran 2008 standard. Cray supports the UPC 1.3 specification.

A PGAS application passes data between cooperating parallel processes. These parallel processes are called threads in UPC and images in Fortran. For consistency and to avoid confusion with OpenMP and pthreads, this man page uses the term processing element (PE). Because the PGAS runtime library is shared by both languages, messages printed by the runtime refer to PEs by a PE number, which is equal to the UPC thread number and one less than the Fortran image number. Thus PE 0 refers to UPC thread 0 and Fortran image 1.

Unlike MPI and SHMEM, the PGAS programming models primarily provide a programming language rather than programming library approach to writing Single Program Multiple Data (SPMD) applications. Communication between PEs is performed using assignment statements where the variables on either side have affinity to different PEs.

In the PGAS model, memory accessible to all PEs is said to be remotely accessible. This memory is partitioned such that each partition has affinity to a particular PE. A PE that references its local partition, the one having affinity to itself, executes a traditional load or store instruction. A PE that references a remote partition, one with affinity to a different PE, requires network communication if the two PEs are executing on different nodes. Within a node, communication occurs via shared memory.
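
As an illustration, the following minimal UPC sketch has each PE write into the partition with affinity to its right-hand neighbor; whether that assignment is a simple store, a shared-memory copy, or network communication depends entirely on where the two PEs are placed:

#include <stdio.h>
#include <upc.h>

/* One element per thread; element i has affinity to thread i. */
shared int data[THREADS];

int main( void )
{
    /* Assignment whose target may have affinity to a different PE. */
    data[(MYTHREAD + 1) % THREADS] = MYTHREAD;

    upc_barrier;   /* wait until every PE has written its element */

    printf( "PE %d received %d\n", MYTHREAD, data[MYTHREAD] );
    return 0;
}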

Compiling and Linking

Compiling a PGAS application requires the PrgEnv-cray module to be loaded.

The -hupc option is required to enable recognition of UPC syntax because it is not part of the standard C language.

On some platforms, the -hcaf option is required to enable Fortran's PGAS functionality; it is disabled by default on those platforms to avoid incompatibilities or resource contention with commonly used third-party libraries such as OpenMPI and MVAPICH.

For code samples refer to EXAMPLES. The following commands create an executable file:

% cc -hupc hello.c -o hello
% ftn hello.f90 -o hello

An executable can be created by linking together various object files that were generated from source code written in standard C, UPC, C++, and Fortran. Any compiler can be used to link the object files; however, if C++ is used, the C++ compiler is required so that the C++ runtime library is included.

% cc -hupc x.o y.o z.o
% CC x.o y.o z.o
% ftn x.o y.o z.o

For information about linking PGAS applications to use huge pages, see the intro_hugepages(1) man page.

Launching a PGAS Application

The examples below use the Application Level Placement Scheduler (ALPS) launcher used on legacy Cray platforms. Other launchers (e.g., SLURM) are also supported; refer to your platform's documentation for details on how to launch jobs on compute nodes.

Launch the PGAS application using 128 PEs:

% aprun -n 128 ./hello

To reserve a specific amount of symmetric heap space, set the XT_SYMMETRIC_HEAP_SIZE environment variable to the desired number of bytes. The suffixes k, K, m, M, g, and G are permitted to simplify requests for large values:

% export XT_SYMMETRIC_HEAP_SIZE=512M
% aprun -n 128 ./hello

The aprun syntax for launching Multiple Program, Multiple Data (MPMD) applications can be overloaded to give certain PEs exclusive access to a node by setting the PMI_JOB_IS_SPMD environment variable:

% export PMI_JOB_IS_SPMD
% aprun -n 1 -N 1 ./hello : -n 127 ./hello

This syntax is useful when some PEs require more memory than others. See the aprun(1) man page for more information about launching applications.

Mixing PGAS with MPI

UPC and Fortran are compatible with MPI. The programmer is responsible for calling MPI_Init(3) and MPI_Finalize(3) from their main program. All of the PGAS languages automatically initialize themselves prior to execution of the main program and finalize themselves upon completion of the main program, so unlike MPI there is no need to explicitly call an initialization or finalization function for PGAS. Note that deadlock scenarios are possible if care isn’t taken to ensure forward progress of both the PGAS runtime and MPI.
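
A minimal sketch of a UPC program that also uses MPI (error handling is omitted, and how PGAS and MPI communication interleave in a real application must be designed to avoid the deadlock scenarios noted above):

#include <stdio.h>
#include <mpi.h>
#include <upc.h>

int main( int argc, char *argv[] )
{
    int rank, size;

    /* The UPC runtime is already initialized when main() begins;
       MPI must be initialized and finalized explicitly. */
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    printf( "UPC thread %d of %d is MPI rank %d of %d\n",
            MYTHREAD, THREADS, rank, size );

    upc_barrier;                      /* PGAS synchronization */
    MPI_Barrier( MPI_COMM_WORLD );    /* MPI synchronization */

    MPI_Finalize();
    return 0;
}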

CCE’s PGAS runtime is designed to work with the HPE Cray MPI implementation. It may work with other third-party MPI implementations, but this is neither tested nor supported. PGAS support in the compiler is disabled by default on HPE platforms where third-party MPI libraries are used.

For more information about using MPI, refer to the mpi(3) man page.

FORTRAN FEATURES

This section lists some commonly used features. For a complete explanation of Fortran syntax and semantics, refer to the Fortran standard, ISO/IEC 1539-1:2010.

this_image()

Evaluates to an image number in the range 1 to num_images(), inclusive.

num_images()

Evaluates to the number of PEs in the job, equal to the aprun -n value.

lock(), unlock()

Lock synchronization.

sync all

Image control statement. Global barrier.

sync images

Image control statement. Selective synchronization.

sync memory

Image control statement. Fence to ensure that changes to remotely accessible memory are visible to other images.

critical, end critical

Begin and end a mutually exclusive region.

FORTRAN EXTENSIONS

Cray supports PGAS extensions to Fortran that are not part of the Fortran standard. Refer to the man pages for more information.

CO_SUM()

Collective sum.

CO_BCAST()

Collective broadcast.

CO_MIN(), CO_MAX()

Collective minimum and maximum.

UPC FEATURES

This section lists some commonly used features. For a complete explanation of UPC syntax and semantics, refer to the UPC Language Specification, Version 1.3, available here: http://code.google.com/p/upc-specification/downloads/list.

Predefined Identifiers

The following provide processing element (PE) information:

  • MYTHREAD evaluates to a thread number in the range 0 to THREADS-1, inclusive.

  • THREADS evaluates to the number of PEs in the job.

Pointer-to-Shared

The following interfaces, declared in upc.h, enable inspection and manipulation of pointers-to-shared.

  • upc_addrfield(3c)

  • upc_phaseof(3c)

  • upc_resetphase(3c)

  • upc_threadof(3c)

Privatizability

The following interfaces, declared in upc_castable.h, enable the casting of a pointer-to-shared to a normal pointer when the target is directly addressable by the thread (usually because it resides on the same node).

  • upc_cast(3c)

  • upc_castable(3c)

  • upc_thread_info(3c)
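
A minimal sketch of using upc_cast() from the list above; it assumes the UPC 1.3 behavior that upc_cast() returns NULL when the target is not directly addressable by the calling thread:

#include <upc.h>
#include <upc_castable.h>

shared int counters[THREADS];

int main( void )
{
    /* The element with affinity to this thread is always castable,
       so it can be updated through an ordinary C pointer. */
    int *local = (int *) upc_cast( &counters[MYTHREAD] );

    if ( local != NULL )
        *local = MYTHREAD;              /* direct load/store access */
    else
        counters[MYTHREAD] = MYTHREAD;  /* fall back to a shared access */

    upc_barrier;
    return 0;
}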

Shared Memory Management

The following interfaces, declared in upc.h, provide for dynamic allocation/deallocation of shared objects.

  • upc_all_alloc(3c)

  • upc_alloc(3c)

  • upc_all_free(3c)

  • upc_free(3c)
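
A minimal sketch using two of the interfaces above; it assumes the standard UPC semantics that upc_all_alloc() is collective and returns the same pointer-to-shared on every thread, and that upc_all_free() (UPC 1.3) is its collective counterpart:

#include <upc.h>

int main( void )
{
    /* Collectively allocate THREADS blocks of 100 ints; block i has
       affinity to thread i, and all threads see the same pointer. */
    shared [100] int *buf =
        (shared [100] int *) upc_all_alloc( THREADS, 100 * sizeof(int) );

    buf[MYTHREAD * 100] = MYTHREAD;     /* first element of the local block */

    upc_barrier;

    upc_all_free( buf );                /* collective deallocation */
    return 0;
}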

Lock Synchronization

The following interfaces, declared in upc.h, enable lock-based synchronization between threads.

  • upc_all_lock_alloc(3c)

  • upc_global_lock_alloc(3c)

  • upc_lock(3c)

  • upc_lock_attempt(3c)

  • upc_all_lock_free(3c)

  • upc_lock_free(3c)

  • upc_unlock(3c)
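
A minimal sketch of lock-based mutual exclusion using the interfaces above; it assumes upc_all_lock_alloc() is collective and returns the same lock pointer to every thread, with upc_all_lock_free() as the UPC 1.3 collective deallocator:

#include <upc.h>

shared int total;                  /* has affinity to thread 0 */

int main( void )
{
    upc_lock_t *lock = upc_all_lock_alloc();

    if ( MYTHREAD == 0 )
        total = 0;
    upc_barrier;                   /* make the initialization visible */

    upc_lock( lock );              /* mutual exclusion around the update */
    total += 1;
    upc_unlock( lock );

    upc_barrier;
    upc_all_lock_free( lock );     /* collective deallocation */
    return 0;
}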

Blocking Bulk Data Movement

The following interfaces, declared in upc.h, enable blocked copying of data to/from shared buffers.

  • upc_memcpy(3c)

  • upc_memget(3c)

  • upc_memput(3c)

  • upc_memset(3c)
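
A minimal sketch using upc_memget() from the list above to pull a neighbor's block of shared data into private memory:

#include <upc.h>

#define N 1024

shared [N] double src[THREADS][N];   /* one block of N doubles per thread */

int main( void )
{
    double local[N];

    /* Fill the block with affinity to this thread, then fetch the
       right-hand neighbor's block into private memory. */
    for ( int i = 0; i < N; i++ )
        src[MYTHREAD][i] = MYTHREAD + i;

    upc_barrier;

    upc_memget( local, src[(MYTHREAD + 1) % THREADS], N * sizeof(double) );

    upc_barrier;
    return 0;
}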

Non-blocking Bulk Data Movement

The following interfaces, declared in upc_nb.h, enable non-blocking copying of data to/from shared buffers.

  • upc_memcpy_nb(3c)

  • upc_memcpy_nbi(3c)

  • upc_memget_nb(3c)

  • upc_memget_nbi(3c)

  • upc_memput_nb(3c)

  • upc_memput_nbi(3c)

  • upc_sync_attempt(3c)

  • upc_sync(3c)

  • upc_synci_attempt(3c)

  • upc_synci(3c)
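
A minimal sketch using the UPC 1.3 non-blocking interfaces above; it assumes the explicit-handle form, where the transfer returns a upc_handle_t that is later completed with upc_sync():

#include <upc.h>
#include <upc_nb.h>

#define N 4096

shared [N] double data[THREADS][N];

int main( void )
{
    double local[N];

    for ( int i = 0; i < N; i++ )
        data[MYTHREAD][i] = MYTHREAD;

    upc_barrier;

    /* Start the transfer, overlap it with independent work, then wait. */
    upc_handle_t h = upc_memget_nb( local, data[(MYTHREAD + 1) % THREADS],
                                    N * sizeof(double) );

    /* ... unrelated computation could be placed here ... */

    upc_sync( h );      /* block until this particular transfer completes */

    upc_barrier;
    return 0;
}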

Collectives

The following interfaces, declared in upc_collective.h, provide common collective operations over shared arrays:

  • upc_all_broadcast(3c)

  • upc_all_exchange(3c)

  • upc_all_gather(3c)

  • upc_all_gather_all(3c)

  • upc_all_permute(3c)

  • upc_all_reduce(3c)

  • upc_all_prefix_reduce(3c)

  • upc_all_scatter(3c)
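
A minimal sketch of upc_all_broadcast(), patterned after the example in the UPC collectives specification; the ALLSYNC flags make the call synchronize all threads on entry and exit:

#include <upc.h>
#include <upc_collective.h>

#define NELEMS 16

shared [] int A[NELEMS];                 /* source block, affinity to thread 0 */
shared [NELEMS] int B[NELEMS * THREADS]; /* one destination block per thread */

int main( void )
{
    if ( MYTHREAD == 0 )
        for ( int i = 0; i < NELEMS; i++ )
            A[i] = i;

    /* Copy thread 0's block into a block with affinity to every thread. */
    upc_all_broadcast( B, A, NELEMS * sizeof(int),
                       UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC );

    return 0;
}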

Timing

The following interfaces, declared in upc_tick.h, permit timing regions of UPC code.

  • upc_ticks_now

  • upc_ticks_to_ns
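
A minimal sketch of timing a region with these interfaces; it assumes upc_tick_t behaves as an unsigned integer type whose differences can be converted to nanoseconds with upc_ticks_to_ns():

#include <stdio.h>
#include <upc.h>
#include <upc_tick.h>

int main( void )
{
    upc_tick_t start = upc_ticks_now();

    upc_barrier;                       /* region being timed */

    upc_tick_t end = upc_ticks_now();

    printf( "PE %d: barrier took %llu ns\n", MYTHREAD,
            (unsigned long long) upc_ticks_to_ns( end - start ) );
    return 0;
}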

Atomics

The following interfaces, declared in upc_atomic.h, enable atomic access to shared objects.

  • upc_all_atomicdomain_alloc

  • upc_all_atomicdomain_free

  • upc_atomic_strict

  • upc_atomic_relaxed

  • upc_atomic_isfast

Cray also provides intrinsic AMO (atomic memory operation) extensions, described in the amo(3i) man page. These extensions are suffixed with _upc and operate on long types as defined there; 32-bit integer variants of each intrinsic are suffixed with _i_upc.
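
A minimal sketch of the UPC 1.3 atomics interfaces listed above: the domain is created collectively for 64-bit integer adds, and the hint argument of 0 (no hint) is an assumption here; consult upc_atomic.h for the defined hint values:

#include <stdint.h>
#include <upc.h>
#include <upc_atomic.h>

shared int64_t counter;                  /* affinity to thread 0 */

int main( void )
{
    /* Collectively create an atomic domain supporting UPC_ADD on UPC_INT64. */
    upc_atomicdomain_t *dom =
        upc_all_atomicdomain_alloc( UPC_INT64, UPC_ADD, 0 );

    int64_t one = 1;

    if ( MYTHREAD == 0 )
        counter = 0;
    upc_barrier;

    /* Atomically add 1 to the shared counter; no fetched value is requested. */
    upc_atomic_relaxed( dom, NULL, UPC_ADD, &counter, &one, NULL );

    upc_barrier;
    upc_all_atomicdomain_free( dom );
    return 0;
}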

I/O

The following interfaces, declared in upc_io.h, provide collective file I/O operations.

  • upc_all_fclose()

  • upc_all_fcntl()

  • upc_all_fget_size()

  • upc_all_fopen()

  • upc_all_fpreallocate()

  • upc_all_fread_list_local()

  • upc_all_fread_list_local_async()

  • upc_all_fread_list_shared()

  • upc_all_fread_list_shared_async()

  • upc_all_fread_local()

  • upc_all_fread_local_async()

  • upc_all_fread_shared()

  • upc_all_fread_shared_async()

  • upc_all_fseek()

  • upc_all_fset_size()

  • upc_all_fsync()

  • upc_all_ftest_async()

  • upc_all_fwait_async()

  • upc_all_fwrite_list_local()

  • upc_all_fwrite_list_local_async()

  • upc_all_fwrite_list_shared()

  • upc_all_fwrite_list_shared_async()

  • upc_all_fwrite_local()

  • upc_all_fwrite_local_async()

  • upc_all_fwrite_shared()

  • upc_all_fwrite_shared_async()

UPC EXTENSIONS

Cray extensions to UPC that are not part of the UPC Language Specification are listed here.

Note: A number of former extensions to UPC 1.2 have been standardized in UPC 1.3, including non-blocking bulk copies (upc_nb.h), privatizability (upc_castable.h) and timing (upc_tick.h) interfaces. These interfaces have been removed from the upc_cray.h header and moved into new headers as required by the UPC 1.3 specification. Additionally, some of the semantics and interfaces have been changed slightly, so existing users of these interfaces may need to update their applications.

Team Collectives

The following interfaces, declared in upc_collective_cray.h, provide common collective operations on a subset (team) of threads. These are loosely based on the UPC Collectives Library 2.0 proposal, with changes to argument ordering to better match existing practice in UPC and with no explicit initialization required.

  • CRAY_UPC_TEAM_ALL

  • CRAY_UPC_TEAM_NODE

  • cray_upc_op_create(3c)

  • cray_upc_op_free(3c)

  • cray_upc_type_size(3c)

  • cray_upc_team_rank(3c)

  • cray_upc_team_alltoall(3c)

  • cray_upc_team_alltoall_nb(3c)

  • cray_upc_team_alltoall_nbi(3c)

  • cray_upc_team_size(3c)

  • cray_upc_team_split(3c)

  • cray_upc_team_free(3c)

  • cray_upc_team_barrier(3c)

  • cray_upc_team_allreduce(3c)

Node Affinity

UPC has a flat PE space that does not provide knowledge of PE placement. The following allow an application to be aware of which PEs share a node.

  • MYNODE - Similar to MYTHREAD, but evaluates to a node number in the range 0 to NODES - 1, inclusive.

  • NODES - Similar to THREADS, but evaluates to the number of nodes used by the application, equal to the ceiling of the aprun -n value divided by the -N value.

  • upc_nodeof() - Similar to upc_threadof(); returns the index of the node of the thread that has affinity to the shared object pointed to by ptr.
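
A minimal sketch using these identifiers; the header name is an assumption here (the other Cray extensions in this man page are declared in upc_cray.h, as noted in the Thread Hot section below):

#include <stdio.h>
#include <upc.h>
#include <upc_cray.h>   /* assumed location of the node-affinity extensions */

int main( void )
{
    printf( "Thread %d of %d runs on node %d of %d\n",
            MYTHREAD, THREADS, MYNODE, NODES );
    return 0;
}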

Examples

Example 1: Fortran (hello.f90)

program main
    print *,"Hello from Fortran image",this_image(),"of",num_images()
end program

Example 2: UPC (hello.c)

#include <stdio.h>
#include <upc.h>

int main( void )
{
    printf( "Hello from UPC thread %d of %d.\n", MYTHREAD, THREADS );
    return 0;
}

Cray Thread Hot UPC

The following interfaces, declared in upc_cray.h, provide a mechanism for using UPC syntax and semantics on a per-thread basis (e.g., with OpenMP or POSIX threads). In this section, the word thread refers to an OpenMP/POSIX thread. The feature introduces the concept of a thread region: a group of threads that issue independent UPC operations. It is implemented as UPC library extensions and can be used to increase injection bandwidth and better utilize network resources for certain use cases.

Important Caveats for thread hot UPC

  • Collective operations are not supported from within a thread region. If the user desires to use any collective operation (e.g. upc_barrier) from inside a thread region, they must first issue a upc_fence from each thread in the region and use the thread synchronization mechanism of their choice to ensure that only a single thread in the region issues the UPC collective. Any UPC collectives called from within a thread region must be called in the same order on each process.

  • Any UPC operation that takes place before entering a thread region, or after leaving a thread region, will conform to UPC ordering semantics with respect to operations taking place within each individual thread of a thread region.

  • No ordering guarantees are provided for operations that are issued by different threads in the same thread region. It is up to the programmer to guarantee that their application is race free between threads. This can be done through the use of a upc_fence followed by a thread synchronization mechanism (e.g. OpenMP barrier or pthread mutex). It is not sufficient to use only a UPC fence or a thread synchronization mechanism, both must be used together to enforce ordering across threads.

  • All UPC operations that take place within a single thread will conform with normal UPC ordering semantics.

  • Fencing operations (e.g. upc_fence) inside of a thread region will fence operations that were issued prior to entering the thread region by the parent thread and any operation that was issued by the calling thread prior to the fence. A fence will not apply to operations that are issued by any other thread.

  • Nested parallelism, that is entering a new thread region from an existing thread region, is supported.

  • No UPC operation should be issued from a thread not participating in a thread region until after the thread region(s) have exited (between the cray_upc_thread_region_prologue and cray_upc_thread_region_epilogue calls). It is valid for a parent thread to become a member of a thread region.

  • It is important that the user set the environment variable CRAY_PGAS_MAX_CONCURRENCY to the number of threads that will participate inside a thread region.

Cray Thread Hot UPC Calls

  • cray_upc_thread_region_prologue(3c) - create a new thread region

  • cray_upc_thread_region_begin(3c) - register a thread as a member of a thread region

  • cray_upc_thread_region_end(3c) - de-register a member thread of a thread region

  • cray_upc_thread_region_epilogue(3c) - end a thread region

Cray relaxed ordering extensions

There are two compiler directives available to users that can relax ordering of references to improve performance under certain circumstances.

The first directive is pgas defer_sync. Details are available in the defer_sync(7) manpage. Normally, the compiler must guard against references that could violate program ordering semantics by synchronizing them. When a user knows that a reference will not target memory overlapping any past or future reference between fences/barriers, the pgas defer_sync directive can be used to defer that synchronization until the next fence instruction.
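
A hypothetical sketch of the directive in UPC code; the #pragma pgas defer_sync spelling and its exact placement rules should be confirmed in the defer_sync(7) man page:

#include <upc.h>

shared double results[THREADS];

void store_result( double value )
{
    /* The programmer asserts that nothing else references this element
       before the next fence, so synchronization of the put is deferred. */
#pragma pgas defer_sync
    results[(MYTHREAD + 1) % THREADS] = value;

    /* ... other, independent work ... */

    upc_fence;   /* the deferred put is complete once the fence returns */
}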

The second directive is pgas buffered_async. Details are available in the buffered_async(7) manpage. It can be used in place of a defer_sync directive to batch small references into bulk data transfers. No ordering or correctness guarantees are made between buffered-async (BA) and non-BA references, and no progress guarantees are made unless both the source and target are actively making BA references or are inside a barrier/fence. The total size of the local buffers used for BA operations is controlled via the CRAY_PGAS_BA_BUF_SIZE environment variable described below.

ENVIRONMENT VARIABLES

The following environment variables affect PGAS behavior.

XT_SYMMETRIC_HEAP_SIZE

Controls the size (in bytes) of the symmetric heap. The value is interpreted as a number of bytes unless the number is followed by a single character acting as a multiplier: k or K multiplies by 2**10 (kilobytes), m or M multiplies by 2**20 (megabytes), and g or G multiplies by 2**30 (gigabytes). For example, the string 20m yields 20*2**20 bytes, or 20 megabytes. Only one multiplier character is recognized, so 20kk does not produce the same value as 20m, nor do invalid strings such as 20MB produce the desired result.

  • Default: 64M

CRAY_PGAS_SYMMETRIC_HEAP_HBM_MODE

Describes the KNL memory policy for the symmetric heap and can be one of M[andatory], P[referred], or I[nterleaved].

  • If Mandatory is specified, bind the entire symmetric heap to high bandwidth memory. If the requested amount of high bandwidth memory is not available, the program will abort.

  • If Preferred is specified, attempt to use high bandwidth memory for the symmetric heap. If the requested amount of high bandwidth memory is not available, fall back to using system memory.

  • If Interleaved is specified, bind the entire symmetric heap to high bandwidth memory. If the requested amount of high bandwidth memory is not available, the program will abort. Page allocations will be interleaved across all HBM numa nodes.

  • Note: This feature is only supported on KNL systems. In order to use this feature, the user is required to load the cray-memkind module and re-link their application.

  • Default: unset (use system memory for symmetric heap)

PGAS_ERROR_FILE [stderr | stdout]

To redirect error messages issued by the PGAS library (libpgas) to stdout, set this variable to stdout.

  • Default: stderr

PGAS_MEMINFO_DISPLAY

To examine memory usage in a UPC or coarray Fortran application, set this variable to a value of 1 or greater. This causes libpgas to display information about the job’s memory allocation during initialization in a format similar to the output produced by the SHMEM_MEMINFO_DISPLAY environment variable.

  • Note: This feature is only supported on XC systems.

PMI_JOB_IS_SPMD

If set, PMI treats a job launched via the ALPS MPMD format as an SPMD job instead. This enables PGAS jobs launched using the syntax

aprun -n X ./a.out : -n Y ./a.out

to run on Cray systems. This type of launch may be desired in order to assign extra resources such as memory to one or more PEs in the job.

  • Note: PGAS does not support MPMD launches with different executables on Cray systems. When using this environment variable, take care to make certain that the executables are identical. If they are not identical, unpredictable results will occur.

  • Default: Not enabled

CRAY_PGAS_MAX_CONCURRENCY

By default, the PGAS runtime uses locks to ensure thread safety if the number of PEs on a node is less than the number of available schedulable CPUs (including hyper-threads). In some cases this can cause noticeable overhead when the application does not actually need thread safety (no OpenMP, no POSIX threads, etc.); setting this variable to 1 may improve performance in such cases. Setting it to a value greater than 1 may be necessary if the number of schedulable CPUs (including hyper-threads) is less than the total number of threads on the node.

  • Default: unset (autodetected)

CRAY_PGAS_XPMEM_LIMIT

Controls the size (in bytes) of the shared memory segments that map memory from PEs located on the same compute node. The value is interpreted in the same way as XT_SYMMETRIC_HEAP_SIZE described above. Setting this value too small may result in undefined behavior if any PE addresses local memory outside of this limit. Set this environment variable only if insufficient virtual address space errors occur at runtime, and take application behavior into account when choosing a value.

  • Default: unset (use default size limits)

CRAY_PGAS_XPMEM_MAP_IMMEDIATE

If set, force shared memory segments between PEs on the same compute node to be created upfront at application launch instead of being deferred until actual use. Use of this environment variable may result in insufficient virtual address space errors. This environment variable should only be set if it is not viable to amortize the costs of attaching shared memory segments during runtime.

  • Default: unset (map shared memory segments only when targeted)

CRAY_PGAS_BA_BUF_SIZE

Controls the amount of memory (in bytes) used for buffered_async buffers on each node. The value is interpreted in the same way as XT_SYMMETRIC_HEAP_SIZE described above.

  • Note: This only applies to applications that have the buffered_async directive applied to references. By default, no buffer space is allocated.

  • Default: unset (defaults to none, or reserve a maximum of half the symmetric heap size when the pgas buffered_async directive is used)

NOTES

Printing a UPC pointer-to-shared

The UPC specification does not extend the printf family of functions to add a format specifier for a UPC pointer-to-shared. Therefore, a UPC pointer-to-shared must be printed by printing its components:

printf( "thread %d phase %d address %p\n",
         upc_threadof( ptr ), upc_phaseof( ptr ),
         upc_addrfield( ptr ) );

Supported number of PEs on a node

The application may encounter runtime errors due to lack of resources, including virtual address space, if the number of PEs exceeds the number of physical cores on the node. If these errors occur, some of the resource limits can be worked around to increase the maximum number of PEs per node. For example, the hard ulimits for the data and stack segments can be lowered, or the CRAY_PGAS_XPMEM_LIMIT environment variable can be set to the maximum amount of memory any single PE in the application uses (if that value is known) to reduce the runtime library's use of virtual address space.
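
For example (the values shown are illustrative only and must be chosen to match the application's actual per-PE memory use):

% export CRAY_PGAS_XPMEM_LIMIT=2G
% aprun -n 256 -N 128 ./hello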

SEE ALSO

aprun(1), craycc(1), crayCC(1), crayftn(1), intro_hugepages(1)

amo(3i)

defer_sync(7), buffered_async(7)