intro_OpenMP - Introduction to the OpenMP parallel programming model
Cray Linux Environment (CLE)
OpenMP is a parallel programming model that is portable across shared memory architectures from Cray and other vendors. The OpenMP Application Program Interface Specification is available at http://www.openmp.org/specifications. This man page describes the implementation and use of OpenMP in the context of the Cray Compiling Environment (CCE) and features unique to the Cray implementation.
By default, OpenMP is disabled in CCE and must be explicitly enabled using the -fopenmp compiler command line option.
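For example, assuming the usual CCE compiler driver names (cc, CC, and ftn; these wrapper names are an assumption, not part of this page):

    cc  -fopenmp app.c      # C
    CC  -fopenmp app.cpp    # C++
    ftn -fopenmp app.f90    # Fortran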
CCE supports full OpenMP 5.0 and partial OpenMP 5.1 and 5.2. The following OpenMP 5.1 features are supported:
inoutset dependence type
primary policy for the proc_bind clause
present behavior for defaultmap (C/C++ only)
masked construct without filter clause (Fortran only)
metadirective dynamic user condition and target_device selectors (Fortran only)
error directive (Fortran only)
compare clause on atomic construct (C/C++ only)
assume and assumes directives (Fortran only)
nothing directive (Fortran only)
The following OpenMP 5.2 features are supported:
otherwise clause for metadirective (Fortran only)
CCE’s OpenMP implementation has the following limitations:
The device clause is not supported. The other mechanisms for selecting a default device are supported: the OMP_DEFAULT_DEVICE environment variable and the omp_set_default_device API.
The only API calls allowed in target regions are: omp_is_initial_device, omp_get_thread_num, omp_get_num_threads, omp_get_team_num, and omp_get_num_teams.
User-defined reductions in Fortran are not supported in target regions or task reductions.
An untied task that starts execution on a thread and suspends will always resume execution on that same thread.
declare simd functions will not vectorize if inlining is disabled or the function definition is not visible at the callsite.
simd loops containing function calls will not vectorize if inlining is disabled or the function definitions are not visible.
The requires directive is parsed, but any clauses that are not yet supported produce compile errors, as allowed by the OpenMP specification. The requires directive clauses dynamic_allocators and reverse_offload currently result in compile time errors.
The loop directive is honored where semantics are observable, but no additional optimization is performed. That is, the loop directive is treated as distribute for bind(teams); do/for for bind(parallel); and simd for bind(thread). For C/C++, the stand-alone loop directive is supported, but it is not supported when expressed as part of a combined or composite construct.
The order(concurrent) clause is accepted, but is not used for additional optimization.
omp_pause_resource and omp_pause_resource_all only cause the CCE offloading runtime library to relinquish GPU memory; no other resources are relinquished.
Several hints are parsed and accepted, but no action is taken: simd nontemporal, atomic hints, nonmonotonic loop scheduling, and the close modifier.
The mutexinoutset and inoutset dependence types are accepted but treated as inout dependence types.
The uses_allocators clause is accepted, but its usefulness is limited because CCE does not currently support the OpenMP memory allocation APIs in target regions for non-host devices.
The assume and assumes directives are accepted, but they are not currently used for additional optimization.
The error directive with the at(execution) clause is only supported for the host device; the at(compilation) clause may be used in non-host target regions.
Non-rectangular loop collapse is functionally supported, but the collapse depth may be limited to exclude non-rectangular loops from the collapse group. This may result in partial collapse or no collapse at all (i.e., collapse depth one).
Non-contiguous target update directives are not supported for array slices specified with an array of arbitrary indices, even though this is valid array syntax in Fortran. Non-contiguous updates are supported for normal array slice expressions that specify a lower bound, upper bound, and optional stride.
On a non-CPU target (i.e., an NVIDIA or AMD GPU), orphaned workshare constructs and some API calls (omp_get_thread_num and omp_get_num_threads) are not supported; the code will fail to link.
Concurrent asynchronous map clauses to the same variable are not fully supported and can result in unspecified behavior. Asynchronous map clauses (i.e., map clauses on a directive with a nowait clause) are currently implemented with synchronous present table semantics. That is, any implied data transfers are enqueued for asynchronous completion, but reference count updates are applied immediately. Issues only arise when the same variable appears in asynchronous map clauses concurrently and one of the map clauses triggers an allocation or deallocation (i.e., the reference count increases to one or decreases to zero). There are no issues with concurrent map clauses on independent variables or concurrent map clauses that are synchronous (i.e., the construct does not have a nowait clause). This issue can be avoided by wrapping asynchronous map clauses in enclosing synchronous data regions, which ensures that all underlying allocations and deallocations occur synchronously, as shown in the sketch following this list.
omp scan is functionally supported, but loops will be limited to a single thread.
The conditional modifier for lastprivate is functionally supported on GPU, but runs with a single thread.
Default and named mappers in Fortran are supported for top-level objects (scalars and arrays) on map clauses; only default mappers are supported for update directives; mappers are not currently supported for derived type members that do not have the pointer or allocatable attribute.
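For example, the following C sketch (with hypothetical names) shows the workaround for asynchronous map clauses mentioned above: the enclosing synchronous target data region performs the allocation and deallocation, so the map clause on the nowait construct can only trigger transfers:

    void scale_async(double *a, int n)
    {
        /* Synchronous data region: the device allocation on entry and
           deallocation on exit happen synchronously. */
        #pragma omp target data map(tofrom: a[0:n])
        {
            /* Asynchronous construct: its map clause can no longer trigger
               an allocation or deallocation, only transfers. */
            #pragma omp target teams distribute parallel for nowait \
                    map(tofrom: a[0:n])
            for (int i = 0; i < n; i++)
                a[i] *= 2.0;

            #pragma omp taskwait  /* complete the deferred target task */
        }
    }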
CCE OpenMP Offloading Support
OpenMP target directives are supported for targeting AMD GPUs, NVIDIA GPUs, or the current CPU target. An appropriate accelerator target module must be loaded in order to use target directives. Alternatively, the -fopenmp-targets= flag can be used to set the offload target for C and C++ source files (but the desired GPU architecture type must also be specified explicitly).
When the accelerator target is a non-CPU target (i.e., an NVIDIA or AMD GPU), CCE generally maps omp teams to the GPU’s coarse-grained level of parallelism (threadblocks for NVIDIA or work groups for AMD). omp parallel generally maps to the GPU’s fine-grained level of parallelism (threads within a threadblock for NVIDIA and work items for AMD). In the case of nested omp parallel, the fine-grained level of parallelism is applied to the outermost omp parallel. Any inner omp parallel constructs are serialized and therefore limited to a single observable GPU thread.
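For example, in the following sketch (a hypothetical saxpy-style kernel), the teams construct supplies the coarse-grained parallelism and the parallel worksharing loop supplies the fine-grained parallelism:

    void saxpy(double a, const double *x, double *y, int n)
    {
        /* teams -> threadblocks/work groups; parallel for -> threads/work items */
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }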
For Fortran, omp simd constructs are generally ignored. However, if no omp parallel is present, then CCE performs aggressive autothreading with a preference for loops with an omp simd construct. This autothreading is mapped to the GPU fine-grained parallelism.
For C/C++, omp simd constructs are ignored.
Note: Prior to CCE 16.0, the policy for OpenMP offloading support was different: omp simd was mapped to the GPU fine-grained level of parallelism and omp parallel was generally ignored.
When compiling for GPU targets, it is important to note that general per-thread forward progress is not guaranteed by the underlying GPU hardware or software model. Specifically, individual GPU threads (work items) within a warp (wavefront) are not guaranteed to provide independent forward progress. A warp will make overall forward progress, but when threads within a warp diverge there is no guarantee on the priority of those individual threads relative to one another. In particular, attempting to execute an algorithm with a per-thread lock (or any per-thread synchronization algorithm where one thread waits indefinitely upon another thread) can deadlock when those threads are mapped to GPU threads in the same warp. This issue does not apply to OpenMP constructs, including omp critical, as the OpenMP implementation properly handles the hardware constraints. This issue also does not occur between GPU thread blocks (work groups), so it can be avoided by limiting each team to a single thread with a thread_limit(1) clause on the teams construct, as shown in the sketch below.
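A minimal sketch of that workaround: thread_limit(1) gives each team a single thread, so threads that wait on one another can never share a warp:

    void per_iteration_sync(int *flag, int n)
    {
        /* One thread per team: any inter-iteration waiting happens only
           across thread blocks (work groups), which is safe. */
        #pragma omp target teams distribute thread_limit(1) \
                map(tofrom: flag[0:n])
        for (int i = 0; i < n; i++)
            flag[i] = 1;  /* placeholder for work involving per-thread waiting */
    }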
When the accelerator target is the host, and a teams construct is encountered, the number of teams that execute the region will be determined by the num_teams clause if it is present. If the clause is not present, the number of teams will be determined by the nthreads-var ICV if it is set to a value greater than 1. Otherwise, it will execute with one team.
Multiple Device Support
CCE currently supports use of only one GPU device per process. The default device may be changed with the OMP_DEFAULT_DEVICE environment variable or the omp_set_default_device API (the device clause is not yet supported), allowing an application to select any visible GPU. The default device may be changed while an application is running, but only one GPU may be used at a time; use of the prior GPU must be complete before the default device is changed. Any device allocations and outstanding transfers or kernel launches will be lost if appropriate synchronization is not used prior to changing the default device.
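A minimal sketch of switching the default device safely (the device numbers are hypothetical):

    #include <omp.h>

    void use_devices_in_sequence(double *a, int n)
    {
        #pragma omp target map(tofrom: a[0:n]) nowait  /* runs on device 0 */
        for (int i = 0; i < n; i++)
            a[i] += 1.0;

        #pragma omp taskwait            /* complete all work on device 0 */

        if (omp_get_num_devices() > 1)
            omp_set_default_device(1);  /* later constructs use device 1 */
    }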
The current CCE implementation only tracks one copy of the default-device-var internal control variable per process rather than one copy per data environment. As a result, if one thread or task changes the default device then that change will be visible globally by all other threads or tasks.
Mechanisms provided by the underlying GPU vendor runtime can be used to control the devices visible to CCE’s OpenMP runtime. Specifically, CUDA_VISIBLE_DEVICES can be used for NVIDIA GPUs and ROCR_VISIBLE_DEVICES can be used for AMD GPUs; please refer to the appropriate GPU vendor documentation for more details. These mechanisms allow limiting and reordering the devices visible to a process and can be used as an alternative mechanism for selecting the default device for an OpenMP application.
Printing from GPU kernels
Standard C printf function calls are supported from OpenMP offload GPU regions compiled for NVIDIA and AMD GPU targets.
Fortran PRINT statements are supported, with limitations, when called from OpenMP offload regions compiled for AMD GPU targets. The current implementation supports PRINT statements with a single scalar value of type character, integer, real, or complex. Other uses of Fortran PRINT will compile successfully but will result in a warning message at runtime.
OpenMP Memory Allocators
CCE’s current allocator implementation supports the pinned allocator trait when targeting an NVIDIA or AMD GPU. Allocating pinned memory results in an underlying call to cudaMallocHost or hipMallocHost.
CCE also provides an extension for allocating GPU managed memory: the cray_omp_get_managed_memory_allocator_handle API will return an OpenMP allocator handle that results in an underlying call to cudaMallocManaged or hipMallocManaged.
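The following sketch shows both mechanisms using the standard OpenMP allocator API; the assumption that cray_omp_get_managed_memory_allocator_handle takes no arguments is based on the description above:

    #include <stddef.h>
    #include <omp.h>

    void allocator_sketch(size_t n)
    {
        /* Standard allocator with the pinned trait: backed by
           cudaMallocHost/hipMallocHost on GPU targets. */
        omp_alloctrait_t traits[1] = { { omp_atk_pinned, omp_atv_true } };
        omp_allocator_handle_t pinned =
            omp_init_allocator(omp_default_mem_space, 1, traits);
        double *p = omp_alloc(n * sizeof(double), pinned);

        /* Cray extension: managed-memory allocator, backed by
           cudaMallocManaged/hipMallocManaged. */
        omp_allocator_handle_t managed =
            cray_omp_get_managed_memory_allocator_handle();
        double *m = omp_alloc(n * sizeof(double), managed);

        omp_free(p, pinned);
        omp_free(m, managed);
        omp_destroy_allocator(pinned);
    }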
GPU Atomic Operations
When supported by the target GPU, atomic directives are lowered into native atomic instructions. Otherwise, atomicity is guaranteed with a native atomic compare-and-swap loop. GPU atomic operations are not supported for data sizes larger than 64 bits (e.g., double-precision complex). OpenMP atomic operations are “device scope”, providing coherence only with threads on the same device.
GPUs often support a number of native floating-point atomic instructions. When available, CCE generates native floating-point atomic instructions for atomic add operations; otherwise, an atomic compare-and-swap loop is generated.
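For example, a minimal sketch of a floating-point atomic add in a target region (in practice a reduction clause would usually perform better; the atomic form is shown to illustrate the lowering):

    double device_sum(const double *x, int n)
    {
        double sum = 0.0;
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: sum)
        for (int i = 0; i < n; i++) {
            /* Lowered to a native atomic add when the GPU supports it,
               otherwise to a compare-and-swap loop. */
            #pragma omp atomic update
            sum += x[i];
        }
        return sum;
    }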
For AMD MI250X GPUs, native floating-point atomic instructions are only safe for coarse-grained memory; floating-point atomic instructions operating on fine-grained memory will be silently ignored. In general, memory granularity cannot be determined statically, so by default CCE will always generate atomic compare-and-swap loops for floating-point atomic operations. (Integer atomic instructions, including atomic compare-and-swap, are safe for any memory granularity.) The -munsafe-fp-atomics compiler flag may be used to enable generation of native floating-point atomic instructions, but with this flag users are responsible for ensuring that atomic operations do not target fine-grained memory.
GPU Unified Memory
By default, CCE always allocates GPU memory and issues explicit memory transfers for OpenMP map clauses and data constructs. If all variables that are accessed in GPU regions are properly mapped, this implementation works properly for GPUs with and without unified memory support.
Both NVIDIA and AMD GPUs provide support for “managed memory” that can be accessed from both the CPU and GPU, even if the system does not support unified memory for all CPU memory. By default, CCE will still allocate GPU memory when mapping a variable that was allocated with managed memory, but the CRAY_ACC_USE_UNIFIED_MEM environment variable (described below) can be used to skip GPU allocations and transfers when mapping a range of managed memory.
For systems that support full unified memory, setting the CRAY_ACC_USE_UNIFIED_MEM environment variable causes CCE to skip all GPU allocations and transfers.
The omp requires unified_shared_memory directive can be used to require unified memory. CCE will issue a compile-time error if this directive is used for targets that do not support unified memory for all CPU addresses. Currently, CCE only supports omp requires unified_shared_memory for AMD MI250X GPUs. The CRAY_ACC_USE_UNIFIED_MEM environment variable is implied when using omp requires unified_shared_memory, and therefore it does not need to be explicitly set. However, the AMD environment variable HSA_XNACK=1 must be set to enable unified memory for AMD MI250X GPUs, otherwise the CCE OpenMP library will issue a runtime error when a variable is first mapped.
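A minimal sketch of a translation unit using the directive (per the above, this currently compiles only for AMD MI250X targets and requires HSA_XNACK=1 at run time):

    #pragma omp requires unified_shared_memory

    void axpy(double a, const double *x, double *y, int n)
    {
        /* No map clauses are needed: host pointers are directly
           dereferenceable on the device. */
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }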
For AMD MI250X GPUs, unified memory requires compiling device code with XNACK support enabled. CCE always compiles OpenMP device code for AMD GPUs with the default XNACK “any” mode, so the resulting device code will work properly when running with or without unified memory. CCE does not currently provide a way to override the XNACK compilation mode for OpenMP device code or to build fat binaries with different XNACK modes.
It is possible to provide unified memory with a variety of different hardware and software implementations (e.g., physically unified memory or physically separate CPU and GPU memory with in-place accesses and/or automatic page migration). Unified memory support in CCE relies entirely upon the underlying GPU vendor’s hardware and software implementation. For AMD MI250X GPUs, there is physically separate CPU and GPU memory, and unified memory is provided through a combination of in-place accesses and automatic page migration. The page migration policy may differ depending on how memory was allocated – please refer to AMD documentation for full detail on page migration policies.
It is important to note that CCE uses a non-standard heap configuration, differing from the standard Linux GNU libc malloc/free heap configuration, which may affect the page migration policy for heap allocations. In particular, heap allocations satisfied by mmap may have different migration policies. CCE uses the GNU libc heap implementation, but it issues custom mallopt settings upon application startup that tune the underlying heap allocator to always prefer mmap, even for large allocations. These custom settings can be disabled with an environment variable.
AMD GPU Memory Granularity
AMD GPUs support two types of memory granularity for global GPU memory: coarse and fine. Coarse-grained memory only ensures cross-device coherence at kernel boundaries while fine-grained memory allows coherence within a running kernel.
For AMD MI250X GPUs, memory granularity has both functional and performance implications:
Native floating-point atomic instructions are only safe for coarse-grained memory; floating-point atomic instructions operating on fine-grained memory will be silently ignored. (Integer atomic instructions, including atomic compare-and-swap, are safe for any memory granularity.)
Fine-grained memory is cached differently than coarse-grained memory, potentially affecting performance.
CCE does not automatically alter the memory granularity of any memory range; instead, it adopts the default granularity provided by the underlying AMD implementation. This results in different granularity depending on how the memory is allocated:
map clauses and the omp_target_alloc API allocate GPU memory with hipMalloc, which provides coarse-grained memory by default.
OpenMP allocators with the pinned allocator trait allocate memory with hipMallocHost, which provides fine-grained memory by default.
The allocator returned by the cray_omp_get_managed_memory_allocator_handle API allocates memory with hipMallocManaged, which provides fine-grained memory by default.
All other standard host allocations provide fine-grained memory by default.
In general, fine-grained memory is only relevant for OpenMP applications that use unified or managed memory. OpenMP applications that do not rely on unified or managed memory, and instead explicitly map all variables accessed on the GPU, will operate entirely on coarse-grained memory.
Please refer to AMD documentation for full detail on memory granularity.
OpenMP Implementation-defined Behavior
The OpenMP Application Program Interface Specification presents a list of implementation-defined behaviors. The Cray-specific implementation is described in the following sections.
When multiple threads access the same shared memory location and at least one of the accesses is a write, the threads should be ordered by explicit synchronization to avoid data races and the potential for non-deterministic results. Always use explicit synchronization for any access smaller than one byte.
OpenMP uses the Internal Control Variables (ICVs) defined by the OpenMP specification. Notable initial values in the Cray implementation include the following: nthreads-var is 1, dyn-var is TRUE, run-sched-var and def-sched-var are static, the thread stack size is 128 MB, wait-policy-var is AUTO, and max-active-levels-var is 4095. The number of threads used for a parallel region behaves according to Algorithm 2-1 of the specification. Threads may be dynamically created up to an upper limit of 4 times the number of cores per node; it is up to the programmer to limit oversubscription. For static schedules, the chunk size is rounded up to improve alignment for vectorized loops.
Dynamic Adjustment of Threads
The internal control variable dyn-var is enabled by default. Threads may be dynamically created up to an upper limit of 4 times the number of cores per node. It is up to the programmer to limit oversubscription.
If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads specified for the parallel region exceeds the number that the runtime system can supply, the program terminates. The number of physical processors actually hosting the threads at any given time is fixed at program startup and is constrained by the initial CPU affinity mask of the process. The OMP_NESTED environment variable and the omp_set_nested() call control nested parallelism. To enable nesting, set OMP_NESTED to true or use the omp_set_nested() call. Nesting is disabled by default.
Directives and Clauses
When supported by the target architecture, atomic directives are lowered into native atomic instructions. Otherwise, atomicity is guaranteed with a native atomic compare-and-swap loop; or if the data size is larger than the native atomic compare-and-swap size, then a lock is used. OpenMP atomic directives are compatible with C11 and C++11 atomic operations, as well as GNU atomic builtins.
- do (Fortran), for (C/C++)
For the schedule(guided,chunk) clause, the size of the initial chunk for the master thread and other team members is approximately equal to the trip count divided by the number of threads.
For the schedule(runtime) clause, the schedule type and, optionally, chunk size can be chosen at run time by setting the OMP_SCHEDULE environment variable. If this environment variable is not set, the default behavior of the schedule(runtime) clause is as if the schedule(static) clause appeared instead.
The iteration count of a collapsed loop is computed using signed 64-bit integers, regardless of how the original induction variables and loop bounds are defined. If schedule(runtime) is specified and run-sched-var is auto, the Cray implementation generates a static schedule.
In the absence of the schedule clause, the default schedule is static and the default chunk size is approximately the number of iterations divided by the number of threads.
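For example, with the following sketch, running with OMP_SCHEDULE="guided,64" selects a guided schedule with chunk size 64, while leaving OMP_SCHEDULE unset behaves as if schedule(static) had been written:

    void scale(double *a, int n)
    {
        #pragma omp parallel for schedule(runtime)  /* schedule from OMP_SCHEDULE */
        for (int i = 0; i < n; i++)
            a[i] *= 2.0;
    }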
If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads specified for the parallel region exceeds the number that the run time system can supply, the program terminates.
The number of physical processors actually hosting the threads at any given time is fixed at program startup and is constrained by the initial CPU affinity mask of the process.
The OMP_NESTED environment variable and the omp_set_nested() call control nested parallelism. To enable nesting, set OMP_NESTED to true or use the omp_set_nested() call. Nesting is disabled by default.
If a variable is declared as private, the variable is referenced in the definition of a statement function, and the statement function is used within the lexical extent of the directive construct, then the statement function references the private version of the variable.
Multiple structured blocks within a single sections construct are scheduled in lexical order, and an individual block is assigned to the first thread that reaches it. It is possible for a different thread to execute each section block, or for a single thread to execute multiple section blocks. There is no guaranteed order of execution of structured blocks within a sections construct.
A single block is assigned to the first thread in the team to reach the block; this thread may or may not be the master thread.
The threadprivate directive specifies that variables are replicated, with each thread having its own copy. If the dynamic threads mechanism is enabled, the definition and association status of a thread’s copy of the variable is undefined, and the allocation status of an allocatable array is undefined.
It is implementation-defined whether the include file omp_lib.h or the module omp_lib (or both) is provided. It is implementation-defined whether any of the OpenMP runtime library routines that take an argument are extended with a generic interface so that arguments of different KIND types can be accommodated in Fortran. Cray provides both omp_lib.h and the module omp_lib, and uses generic interfaces for routines. If an OpenMP runtime library routine is defined to be generic, use of arguments of kinds other than those specified by the OMP_*_KIND constants is undefined.
The omp_get_max_active_levels() routine returns the maximum number of nested parallel levels currently allowed. There is a single max-active-levels-var internal control variable for the entire runtime system. Thus, a call to omp_get_max_active_levels() will bind to all threads, regardless of which thread calls it.
The omp_set_dynamic() routine enables or disables dynamic adjustment of the number of threads available for the execution of subsequent parallel regions by setting the value of the dyn-var internal control variable. The default is on.
The omp_set_max_active_levels() routine sets the max-active-levels-var internal control variable, which defaults to 4095. If the argument is less than 1, the value is set to 1.
The omp_set_nested() routine enables or disables nested parallelism, by setting the nest-var internal control variable. The default is false.
The omp_set_num_threads() routine sets the nthreads-var internal control variable to a positive integer. If the argument is less than 1, nthreads-var is set to 1.
The omp_set_schedule() routine sets the schedule type as defined by the current specification. There are no implementation-defined schedule types.
Cray-specific OpenMP API
The following features and behaviors are not included in the OpenMP specification. They are specific to Cray.
subroutine cray_omp_set_wait_policy ( policy )
  character(*), intent(in) :: policy
This routine allows dynamic modification of the wait-policy-var internal control variable, which corresponds to the OMP_WAIT_POLICY environment variable. The policy argument provides a hint to the OpenMP runtime library about the desired behavior of waiting threads: the acceptable values are ACTIVE and PASSIVE (case-insensitive). It is an error to call this routine within an active parallel region.
The OpenMP runtime library supports the OMP_WAIT_POLICY environment variable to control the wait policy.
This environment variable sets the policy at program launch for the duration of the execution. However, in some circumstances it is useful to override the policy at specific points during the program’s execution: in these circumstances, use cray_omp_set_wait_policy to change the wait policy dynamically.
One example might be an application that requires OpenMP for the first part of the program’s execution, but has a clear point after which OpenMP is no longer needed. Because idle OpenMP threads consume resources while waiting for more work, this results in reduced performance for the remainder of the program’s execution. To improve program performance in this case, use cray_omp_set_wait_policy to change the wait policy from ACTIVE to PASSIVE after the end of the OpenMP section of the code.
To avoid deadlock from waiting and signaling threads using different policies, this routine notifies all threads of the policy change at the same time, regardless of whether they are active or idle.
If the omp_lib module is not used and the kind of the actual argument does not match the kind of the dummy argument, the behavior of the procedure is undefined.
The omp_get_wtime() procedure returns real(kind=8) values instead of double-precision values.
The omp_get_wtick() procedure returns real(kind=8) values instead of double-precision values.
CRAY_ACC_DEBUG Output Routines
When the runtime environment variable CRAY_ACC_DEBUG is set to 1, 2, or 3, CCE writes runtime commentary of accelerator activity to STDERR for debugging purposes; every accelerator action on every PE generates output prefixed with “ACC:”. This may produce a large volume of output and it may be difficult to associate messages with certain routines and/or certain PEs.
With this set of API calls, the programmer can enable or disable output at certain points in the code, and modify the string that is used as the debug message prefix.
Set prefix or get prefix
The cray_acc_set_debug_*_prefix() routines define the string that is used as the prefix, with the default being “ACC:”. The cray_acc_get_debug_*_prefix( void ) routines are provided so that the previous setting can be restored.
Output from the library is printed with a format string starting with “ACC: %s %s”, where the global prefix is printed for the first %s (if not NULL), and the thread prefix is printed for the second %s. The global prefix is shared by all host threads in the application, and the thread prefix is set per-thread. By default, strings used in the %s fields are empty.
The C interface is provided by omp.h:
char *cray_acc_get_debug_global_prefix( void )
void cray_acc_set_debug_global_prefix( char * )
char *cray_acc_get_debug_thread_prefix( void )
void cray_acc_set_debug_thread_prefix( char * )
The Fortran interface is provided by the omp_lib module:

subroutine cray_acc_get_debug_global_prefix( prefix )
  character (:), allocatable, intent(out) :: prefix

subroutine cray_acc_set_debug_global_prefix( prefix )
  character (*), intent(in) :: prefix

subroutine cray_acc_get_debug_thread_prefix( prefix )
  character (:), allocatable, intent(out) :: prefix

subroutine cray_acc_set_debug_thread_prefix( prefix )
  character (*), intent(in) :: prefix
Set and get debug level
To enable debug output, set the level from 1 to 3, with 3 being the most verbose; setting a level less than or equal to 0 disables the debug output. The get versions are provided so that the previous setting can be restored. The thread level is an optional override of the global level.
int cray_acc_get_debug_global_level( void )
void cray_acc_set_debug_global_level( int level )
int cray_acc_get_debug_thread_level( void )
void cray_acc_set_debug_thread_level( int level )
subroutine cray_acc_set_debug_global_level( level )
  integer ( kind = 4 ), intent(in), value :: level

subroutine cray_acc_set_debug_thread_level( level )
  integer ( kind = 4 ), intent(in), value :: level
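For example, a C sketch (using the interfaces above; the prefix string is hypothetical) that enables verbose commentary around a single offload region and then restores the previous level:

    #include <omp.h>

    void debug_one_region(double *a, int n)
    {
        int old_level = cray_acc_get_debug_global_level();
        cray_acc_set_debug_global_prefix("ACC(pe0):");  /* hypothetical prefix */
        cray_acc_set_debug_global_level(3);             /* most verbose */

        #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
        for (int i = 0; i < n; i++)
            a[i] += 1.0;

        cray_acc_set_debug_global_level(old_level);     /* restore */
    }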
If using target directives, a craype-accel module should be loaded to add the necessary compiler options for targeting an accelerator. For example, to target an NVIDIA GPU, load the craype-accel-nvidia<version> module. The module environment forces dynamic linking.
The craype-accel-host module supports compiling and running an OpenMP application on the host processor. This provides source code portability between systems with and without an accelerator.
Compiler Command-line Options
By default, OpenMP is disabled in CCE and must be explicitly enabled using the -fopenmp compiler command line option. The following CCE command-line options affect OpenMP applications.
- -f [no-]openmp
Enables or disables compiler recognition of OpenMP directives (C/C++/Fortran).
- -h [no]omp
Enables or disables compiler recognition of OpenMP directives (Fortran).
- -h acc_model=option[:option]…
Explicitly controls the execution and memory model utilized by the accelerator support system. The option arguments identify the type of behavior desired. There are three option sets. Only one member of a set may be used at a time; however, all three sets may be used together.
Valid -h acc_model=option values are:
Option Set 1:
auto_async_none: Execute kernels and updates synchronously, unless there is an async clause present on the kernels or update directive.
auto_async_kernel: (Default) Execute all kernels asynchronously, ensuring program order is maintained.
auto_async_all: Execute all kernels and data transfers asynchronously, ensuring program order is maintained.
Option Set 2:
no_fast_addr: Use default types for addressing.
fast_addr: (Default) Attempt to use 32-bit integers in all addressing to improve performance. This optimization may result in incorrect behavior for some codes.
Option Set 3:
no_deep_copy: (Default) Do not look inside of an object type to transfer sub-objects. Allocatable members of derived type objects will not be allocated on the device.
deep_copy: (Fortran only) Look inside of derived type objects and recreate the derived type on the accelerator recursively. A derived type object that contains an allocatable member will have memory allocated on the device for the member.
Pass command line arguments to the PTX assembler for OpenMP applications.
Pass command line arguments to the CUDA linker for OpenMP applications.
- -h [no]omp_trace
Enables or disables the insertion of CrayPat OpenMP tracing calls. By default tracing is off.
- -O [no]omp
This option is identical to -h [no]omp.
- -h [no]safe_addr
Provides assurance that most conditionally executed memory references are thread safe, which in turn supports a more aggressive use of speculative writes, thereby improving application performance. If -h nosafe_addr is specified, the optimizer performs speculative stores only when it can prove absolute thread safety using the information available within the application code.
Default: -h safe_addr
- -h threadn
This option controls both OpenMP and autothreading. If n is 0, both OpenMP and autothreading are disabled; for n values 1 through 3, other behaviors are specified. This option is identical to -O threadn and is provided for command-line compatibility between the Cray Fortran and Cray C/C++ compilers. Specifying -h thread1 is equivalent to specifying -h nosafe_addr.
- -O threadn
This option is identical to -h threadn.
This option can be used to disable specified directives or classes of directives, including OpenMP directives, and OpenACC directives.
For OpenMP applications, use both the OMP_NUM_THREADS environment variable to specify the number of threads and a workload manager option to specify the number of CPUs hosting the threads. The number of threads specified by OMP_NUM_THREADS should not exceed the number of cores in the CPU. If neither the OMP_NUM_THREADS environment variable nor the omp_set_num_threads call is used to set the number of OpenMP threads, the system defaults to the maximum number of available CPUs in the initial CPU affinity mask.
The -g option provides debugging support for OpenMP directives identical to specifying the -G0 option. This level of debugging implies -fopenmp, which means that most optimizations are disabled but OpenMP directives are recognized, and -h fp0. To debug without OpenMP, use -g -x omp or -g -fno-openmp, which disables OpenMP and enables debugging.
The CRAY_ACC_MALLOC_HEAPSIZE environment variable specifies the accelerator heap size in bytes; the accelerator heap size defaults to 8 MB. When compiling with the debug option (-g), CCE may require additional memory from the accelerator heap, exceeding the 8 MB default; in this case, there will be malloc failures during compilation. It may be necessary to increase the accelerator heap size to 32 MB (33554432), 64 MB (67108864), or greater.
When CRAY_ACC_DEBUG is set to 1, 2, or 3 (most verbose), the runtime writes commentary of accelerator activity to STDERR for debugging purposes. There is also an API which allows the programmer to enable or disable debug output and set the output message prefix from within the application; see CRAY_ACC_DEBUG Output Routines.
The CRAY_ACC_REUSE_MEM_LIMIT environment variable specifies the maximum number of bytes that the Cray accelerator runtime will hold for later reuse.
By default, the Cray accelerator runtime for NVIDIA GPUs does not release memory back to the CUDA runtime, but instead optimizes performance by holding memory allocations for later reuse. To disable this feature, set CRAY_ACC_REUSE_MEM_LIMIT to 0.
When CRAY_ACC_USE_UNIFIED_MEM is set to a non-empty value, the accelerator runtime library will opportunistically use unified memory. That is, if a particular host address can be accessed directly on the device, then the runtime library will not explicitly allocate device memory and transfer the data between the host and device memories. Instead, an accelerator compute kernel will dereference the original host pointer directly.
This environment variable applies to both OpenACC and OpenMP, including all constructs, clauses, and API functions that make variables and array sections available on the device.
This mode is automatically implied by the omp requires unified_shared_memory directive, with the additional property that running on a GPU without support for unified memory will result in a fatal runtime error.
AMD MI250X GPUs support unified memory for all host addresses when the AMD HSA_XNACK=1 environment variable is set at runtime, enabling unified memory in the GPU. For other AMD and NVIDIA GPUs, a host memory location can only be accessed on the device if that memory was allocated through a HIP or CUDA allocation routine (i.e., it is HIP or CUDA “managed” memory).
When CRAY_ACC_FORCE_EARLY_INIT is set to a non-empty value, the accelerator runtime library will fully initialize all available devices at program startup time. This overrides the default behavior, which is to defer device initialization until first use. Device initialization includes initializing the GPU vendor’s low-level device runtime library (e.g., libcuda for NVIDIA GPUs) and establishing all necessary software contexts for interacting with the device, so that data transfer and kernel launch operations may be issued to the device. The main benefit of early initialization is that it forces all initialization overhead to be incurred consistently, at program startup time.
When set to a non-empty value, causes the accelerator runtime library to strictly follow the OpenMP specification with respect to pointer attach behavior. The OpenMP specification defines that pointer attach will occur for a base pointer and the target of that base pointer if both are mapped on the same construct and at least one of them is newly created in the device data environment on entry to the construct. Pointer attach is not defined to occur if both the base pointer and pointer target were already present prior to entry of the construct. This behavior may surprise some users, so CCE still performs pointer attach in this case. Setting the environment variable CRAY_ACC_DISABLE_EXTRA_ATTACH to a non-empty value will disable this extra, non-standard pointer attach behavior.
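For example, in the following sketch (hypothetical type and names), the base pointer and its target are mapped on the same construct and the target is newly created, so the standard-defined pointer attach occurs:

    typedef struct { double *data; int n; } vec_t;  /* hypothetical type */

    void scale_vec(vec_t *v)
    {
        /* v[0:1] maps the structure; v->data[0:v->n] maps the pointee and
           attaches it to the device copy of v->data. */
        #pragma omp target teams distribute parallel for \
                map(tofrom: v[0:1], v->data[0:v->n])
        for (int i = 0; i < v->n; i++)
            v->data[i] *= 2.0;
    }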
This environment variable is superseded by OMP_DISPLAY_AFFINITY. Cray recommends that users use OMP_DISPLAY_AFFINITY instead of this environment variable.
CRAY_OMP_CHECK_AFFINITY is a run time environment variable. Set it to TRUE to display affinity binding for each OpenMP thread. The messages contain the hostname, process identifier, OS thread identifier, OpenMP thread identifier, and affinity binding.
OMP_DISPLAY_AFFINITY is a runtime environment variable. Set it to TRUE to display formatted affinity binding for each OpenMP thread. The default format includes the hostname, process identifier, OS thread identifier, OpenMP thread identifier, and affinity binding. The format can be changed using the OMP_AFFINITY_FORMAT environment variable, which is documented in the OpenMP 5.0 API Syntax Reference Guide.
The OMP_DYNAMIC environment variable defaults to true.
The OMP_MAX_ACTIVE_LEVELS environment variable defaults to 4095.
The OMP_NESTED environment variable defaults to false.
If the OMP_NUM_THREADS environment variable is not set and you do not use the omp_set_num_threads() routine to set the number of OpenMP threads, the default is the maximum number of available CPUs on the system.
The maximum number of threads per compute node is 4 times the number of allocated processors. If the requested value of OMP_NUM_THREADS is more than the number of threads an implementation can support, the behavior of the program depends on the value of the OMP_DYNAMIC environment variable. If OMP_DYNAMIC is false, the program terminates. If OMP_DYNAMIC is true, it uses up to 4 times the number of allocated processors. For example, on an 8-core Cray XE-series system, this means the program can use up to 32 threads per compute node.
When set to false, the OpenMP runtime does not attempt to set or change affinity binding for OpenMP threads. When not false, this environment variable controls the policy for binding threads to places. Care must be taken when using OpenMP affinity binding with other binding mechanisms or when launching multiple ranks per compute node. Ideally, applications should be launched with appropriate workload manager affinity settings to ensure that each rank receives a unique initial CPU affinity mask with enough CPUs to satisfy the desired number of OpenMP threads per rank. The main thread will initially bind to all CPUs in the initial affinity mask, but after program startup the OpenMP runtime library can then bind the main thread and all worker threads to different CPUs within the initial affinity mask according to the active OpenMP affinity policy.
Valid values for this environment variable are true, false, or auto; or, a comma-separated list of spread, close, and master. A value of true is mapped to spread.
The default value for OMP_PROC_BIND is auto, a Cray-specific extension. The auto binding policy directs the OpenMP runtime library to select an affinity binding setting that it determines to be most appropriate for a given situation. If there is only a single place in the place-partition-var ICV, and that place corresponds to the initial affinity mask of the main thread, then the auto binding policy maps to false (i.e., binding is disabled). Otherwise, the auto binding policy causes threads to bind in a manner that partitions the available places across OpenMP threads.
This environment variable has no effect if OMP_PROC_BIND=false; when OMP_PROC_BIND is not false, then OMP_PLACES defines a set of places, or CPU affinity masks, to which threads are bound. When using the threads, cores, and sockets keywords, places are constructed according to the CPU topology presented by Linux. However, the place list is always constrained by the initial affinity mask of the main thread. As a result, specific numeric CPU identifiers appearing in OMP_PLACES will map onto CPUs in the initial CPU affinity mask. If an application is launched with an unconstrained initial CPU affinity mask, then numeric CPU identifiers will exactly match Linux CPU numbers. If instead an application is launched with a restricted initial CPU affinity mask, then numeric CPU identifier 0 will map to the first CPU in the initial affinity mask for the main thread; identifier 1 will map to the second CPU in the initial mask, and so on. This allows the same OMP_PLACES environment variable to be used for all PEs, even when launching multiple PEs per node. Specifying the appropriate workload manager affinity binding options ensures that each rank begins executing with a non-overlapping initial affinity mask, allowing each instance of the OpenMP runtime to assign thread affinity within those non-overlapping affinity masks.
The default value of OMP_PLACES depends on the value of OMP_PROC_BIND. If OMP_PROC_BIND is auto, then the default value for OMP_PLACES is cores. Otherwise, the default value of OMP_PLACES is threads.
The default value for this environment variable is static. For the schedule(runtime) clause, the schedule type and, optionally, chunk size can be chosen at run time by setting the OMP_SCHEDULE environment variable.
The OMP_STACKSIZE environment variable defaults to 128 MB.
The OMP_THREAD_LIMIT environment variable sets the number of OpenMP threads to use for the entire OpenMP program by setting the thread-limit-var ICV. The Cray implementation defaults to 4 times the number of available processors.
The OMP_WAIT_POLICY environment variable provides a hint to the OpenMP implementation about the desired behavior of waiting threads by setting the wait-policy-var ICV. Possible values are ACTIVE and PASSIVE, as defined by the OpenMP specification, and AUTO, a Cray-specific extension. The default value is AUTO, which directs the OpenMP runtime library to select the most appropriate wait policy for the situation. In general, the AUTO policy behaves like ACTIVE, unless the number of OpenMP threads or the affinity binding results in oversubscription of the available hardware processors. If oversubscription is detected, the AUTO policy behaves like PASSIVE.
A certain amount of overhead is associated with multiprocessing a loop. If the work occurring in the loop is small, the loop can actually run slower by multiprocessing than by single processing. To avoid this, make the amount of work inside the multiprocessed region as large as possible, as is shown in the following examples.
Consider the following code:
      DO K = 1, N
        DO I = 1, N
          DO J = 1, N
            A(I,J) = A(I,J) + B(I,K) * C(K,J)
          END DO
        END DO
      END DO
In this example, the J or I loops can be parallelized. The K loop cannot be parallelized because different iterations of the K loop read and write the same values of A(I,J). Always try to parallelize the outermost DO loop that can be parallelized, because it encloses the most work; in this example, that is the I loop.
This code is a good place to try loop interchange. Although the parallelizable loops are not the outermost ones, the loops can be reordered to make a parallelizable loop the outermost one. Thus, loop interchange would produce the following code:
!$OMP PARALLEL DO PRIVATE(I, J, K)
      DO I = 1, N
        DO K = 1, N
          DO J = 1, N
            A(I,J) = A(I,J) + B(I,K) * C(K,J)
          END DO
        END DO
      END DO
Now the parallelizable loop encloses more work and shows better performance.
In practice, relatively few loops can be reordered in this way. However, it does occasionally happen that several loops in a nest of loops are candidates for parallelization. In such a case, it is usually best to parallelize the outermost one.
Occasionally, the only loop available to be parallelized has a fairly small amount of work. It may be worthwhile to force certain loops to run without parallelism or to select between a parallel version and a serial version, on the basis of the length of the loop.
Such a loop is worth parallelizing only if N is sufficiently large; to overcome the parallel loop overhead, N needs to be around 1000, depending on the specific hardware and the context of the program. The optimized version uses an IF clause on the PARALLEL DO directive:
!$OMP PARALLEL DO IF (N .GE. 1000), PRIVATE(I)
      DO I = 1, N
        A(I) = A(I) + X*B(I)
      END DO
crayftn(1), craycc(1), crayCC(1)
Cray Fortran Reference Manual
Cray C and C++ Reference Manual
The OpenMP Application Program Interface Specification: http://www.openmp.org/specifications