intro_mpi_ucx

intro_mpi_ucx - Introduces the Message Passing Interface (MPI) for the UCX netmod

DESCRIPTION

The Message-Passing Interface (MPI) supports parallel programming across a network of computer systems through a technique known as message passing. The goal of the MPI Forum, simply stated, is to develop a widely used standard for writing message-passing programs. MPI establishes a practical, portable, efficient, and flexible standard for message passing that makes use of the most attractive features of a number of existing message passing systems, rather than selecting one of them and adopting it as the standard.

MPI is a specification (like C or Fortran) that has a number of implementations.

Other sources of MPI information include the man pages for MPI library functions and the following URLs:

http://www.mpich.org
http://www.mpi-forum.org

Cray-MPICH-UCX uses the Unified Communication X (UCX) API as its network interface driver. The provided UCX installation is based off the Mellanox HPC-X® ScalableHPC Software Toolkit. UCX is an open-source collaboration. For more information on the open-source UCX, visit:

https://www.openucx.org

Cray-MPICH-UCX explicitly sets default values for a subset of UCX environment variables that have been tested and determined appropriate. These are documented below, starting with the prefix UCX_.

For a list of UCX environment variables with their default values, use: <path_to_ucx_install>/bin/ucx_info -cf

For a list of available UCX devices and transports, use: <path_to_ucx_install>/bin/ucx_info -d

Note

Changing the values of UCX environment variables to non-default values may lead to undefined behavior. The UCX environment variables are mostly designed for advanced users, or for specific tunings or workarounds recommended by HPE.

Cray-MPICH-UCX does not currently support dynamic process management.

About the MPI UCX Module

Cray MPICH supports runtime switching to the UCX netmod. The cray-mpich-ucx module is provided for runtime loading of the mpich libraries built to support the UCX netmod. To do this load the craype-network-ucx module and module swap between cray-mpich and cray-mpich-ucx modules. The cray-mpich-ucx module is not loaded by default and must be loaded manually.

ENVIRONMENT

Environment variables have predefined values. You can change some variables to achieve particular performance objectives; others are required values for standard-compliant programs.

GENERAL MPICH ENVIRONMENT VARIABLES

MPICH_ABORT_ON_ERROR

If enabled, causes MPICH to abort and produce a core dump when MPICH detects an internal error. Note that the core dump size limit (usually 0 bytes by default) must be reset to an appropriate value in order to enable coredumps.

Default: Not enabled.

MPICH_ASYNC_PROGRESS

If enabled, MPICH will initiate an additional thread to make asynchronous progress on all communication operations including point-to-point, collective, one-sided operations, and I/O. Setting this variable will automatically increase the thread-safety level to MPI_THREAD_MULTIPLE. While this improves the progress semantics, it might cause a small amount of performance overhead for regular MPI operations. The user is encouraged to leave one or more hardware threads vacant in order to prevent contention between the application threads and the progress thread(s). The impact of oversubscription is highly system dependent but may be substantial in some cases, hence this recommendation.

Default: Not enabled.

MPICH_CH4_PROGRESS_POKE

If set to 1, this cvar allows MPI to call progress_test() from within an MPI_Isend() operation. Progress_test() is called only if GPU support is enabled and if the process posted recv request list is non-NULL. If this is set to 0 OR GPU support is disabled, progress_test() is not called from within an MPI_Isend() operation.

Default: 1

MPICH_CH4_PROGRESS_POKE_FREQ

This cvar controls how often a process calls progress_test() from within an MPI_Isend operation. This variable is not relevant if MPICH_CH4_PROGRESS_POKE is 0 OR GPU support is disabled.

Default: 8

MPICH_COLL_SYNC

If enabled, a Barrier is performed at the beginning of each specified MPI collective function. This forces all processes participating in that collective to sync up before the collective can begin.

To disable this feature for all MPI collectives, set the value to 0. This is the default.

To enable this feature for all MPI collectives, set the value to 1.

To enable this feature for selected MPI collectives, set the value to a comma-separated list of the desired collective names. Names are not case-sensitive. Any unrecognizable name is flagged with a warning message and ignored. The following collective names are recognized: MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Alltoallw, MPI_Bcast, MPI_Exscan, MPI_Gather, MPI_Gatherv, MPI_Reduce, MPI_Reduce_scatter, MPI_Scan, MPI_Scatter, and MPI_Scatterv.

Default: Not enabled.

MPICH_ENV_DISPLAY

If set, causes rank 0 to display all MPICH environment variables and their current settings at MPI initialization time.

Default: Not enabled.

MPICH_GPU_SUPPORT_ENABLED

If set to 1, enables GPU support. Currently, AMD and NVIDIA GPUs are supported. If a parallel application is GPU-enabled and performs MPI operations with communication buffers that are on GPU-attached memory regions, MPICH_GPU_SUPPORT_ENABLED needs to be set to 1.

Default: 0

MPICH_GPU_MAX_NUM_STREAMS

Specifies maximum number of GPU stream handles a process can use to implement intra-node D2D, D2H, and H2D operations. This is only applicable to systems with NVIDIA GPUs. This variable defaults to 1, but can be set to a value between 1 and 4. Certain use cases can benefit from MPI issuing asynchronous cudaMemcpy operations on different streams.

Default: 1

MPICH_GPU_IPC_ENABLED

If set to 1, enables GPU IPC support for intra-node GPU-GPU communication operations. Currently, this supports the use of IPC for both AMD and NVIDIA GPUs. If MPICH_GPU_SUPPORT_ENABLED is set to 1, MPICH_GPU_IPC_ENABLED is automatically set to 1. This variable has no effect if MPICH_GPU_SUPPORT_ENABLED is set to 0

Default: 1

MPICH_GPU_IPC_CACHE_MAX_SIZE

Specifies maximum number of GPU IPC cache handles a process can retain in its IPC cache at a given point in time. This value currently defaults to 50. If a process has 50 entries in its IPC cache and a new entry is required, we evict the oldest entry that is still in the IPC cache. For most use cases, the current default should suffice. However, for some use cases that aggressively manage memory on the GPU, adjusting this value lower (such as, 5 or 10) will allow HPE Cray MPI to offer IPC optimizations while also effectively capping how many IPC handles can be retained in the cache.

Default: 50

MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED

If set to 1, enables GPU managed memory support. This setting will allow MPI to properly handle unified memory addresses.

On systems with NVIDIA GPUs, this setting may lead to a small performance overhead. This is because the MPI implementation needs to perform an additional pointer query against the GPU runtime layer for each MPI data transfer operation.

For latency sensitive use cases that do not rely on NVIDIA’s Managed Memory routines, users are advised to set MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED to 0.

This variable has no effect if MPICH_GPU_SUPPORT_ENABLED is set to 0.

Default: 1

MPICH_GPU_EAGER_REGISTER_HOST_MEM

If set to 1, the MPI library registers the CPU-attached shared memory regions with the GPU runtime layers. These shared memory regions are used for small message intra-node CPU-to-GPU and GPU-to-GPU MPI transfers. This optimization helps amortize the cost of registering memory with the GPU runtime layer. MPICH_GPU_EAGER_REGISTER_HOST_MEM is automatically set to 1, if MPICH_GPU_SUPPORT_ENABLED is set to 1. This variable has no effect if MPICH_GPU_SUPPORT_ENABLED is set to 0

Default: 1

MPICH_GPU_IPC_THRESHOLD

This variable determines the threshold for the GPU IPC capability. GPU IPC takes advantage of DMA engines on GPU devices to accelerate data movement operations between GPU devices on the same node. Intra-node GPU-GPU transfers with payloads of size greater than or equal to this value will use the IPC capability. Transfers with smaller payloads will use CPU-attached shared memory regions.

Default: 1024

MPICH_GPU_NO_ASYNC_COPY

This variable toggles an optimization for intra-node MPI transfers involving CPU and GPU buffers. This optimization is enabled by default. If set to 1, it reverts to using blocking memcpy operations for intra-node MPI transfers involving CPU and GPU buffers. Depending on the GPU hardware being used, disabling this optimziation negatively affects the performance of large message intra-node MPI operations involving the CPU-to-GPU data paths.

Default: 0

MPICH_GPU_COLL_STAGING_AREA_OPT

This variable toggles an optimization for certain collective operations involving GPU buffers. The optimization is currently implemented for MPI_Allreduce operations involving large payloads. This optimization is applicable for GPU-GPU transfers involving communication peers that are on the same compute node, or on different compute nodes. If set to 1, this optimization is enabled.

Default: 1

MPICH_MEMORY_REPORT

If set to 1, print a summary of the min/max high water mark and associated rank to stderr.

If set to 2, output each rank’s high water mark to a file as specified using MPICH_MEMORY_REPORT_FILE.

If set to 3, do both 1 and 2.

The detailed report for each rank may be slightly higher than the summary because the summary is collected earlier during finalize and requires MPI collective calls, which may allocate more memory.

Example #1:MPICH_MEMORY_REPORT=1

This summary reports maximum and minimum values and the lowest rank that reported the value (max_loc/min_loc reductions). The by malloc lines are for malloc/free calls. The by mmap lines are for mmap/munmap calls. The by shmget lines are for shmget or SYSCALL(shmget) and shmctl(..RM_ID..) calls.
# MPICH_MEMORY: Max memory allocated by malloc:  3898224 bytes by rank 40
# MPICH_MEMORY: Min memory allocated by malloc:  2805600 bytes by rank 61
# MPICH_MEMORY: Max memory allocated by mmap:    10485760 bytes by rank 0
# MPICH_MEMORY: Min memory allocated by mmap:    10485760 bytes by rank 0
# MPICH_MEMORY: Max memory allocated by shmget:  108821784 bytes by rank 0
# MPICH_MEMORY: Min memory allocated by shmget:  0 bytes by rank 1
# MPICH_MEMORY: Max memory reserved by symmetric heap:  2097152 bytes by rank 0
# MPICH_MEMORY: Min memory reserved by symmetric heap:  2097152 bytes by rank 0
# MPICH_MEMORY: Max memory allocated by symmetric heap: 1048576 bytes by rank 0
# MPICH_MEMORY: Min memory allocated by symmetric heap: 1048576 bytes by rank 0

Example #2:MPICH_MEMORY_REPORT=2

Each rank reports similar high water information to file.rank. Each line in the report begins with rank. This example is for rank 1.
# [1] Max memory allocated by malloc:    2898408 bytes
# [1] Max memory allocated by mmap:      10485760 bytes
# [1] Max memory allocated by shmget:    0 bytes
# [1] Max memory reserved by symmetric heap:  2097152 bytes
# [1] Max memory allocated by symmetric heap: 1048576 bytes

Default: not set (off)

MPICH_MEMORY_REPORT_FILE

Specifies the target path/prefix for the detailed high water mark list generated if MPICH_MEMORY_REPORT is set to 2 or 3. The actual filename for each high water mark report is this path/prefix plus the MPI rank number. If the specified target file cannot be opened, stderr is used.

Default: stderr

MPICH_NO_BUFFER_ALIAS_CHECK

If enabled, the buffer alias error check for collectives is disabled. The MPI standard does not allow aliasing of type OUT or INOUT parameters on the same collective function call. The use of MPI_IN_PLACE is required in these scenarios. A check is in place to detect this condition and report the error. To bypass this check, set MPICH_NO_BUFFER_ALIAS_CHECK to any value.

Default: Not enabled.

MPICH_OPT_THREAD_SYNC

Controls the mechanism used to implement thread-synchronization inside the MPI library. If set to 1, an optimized synchronization implementation is used. If set to 0, the library falls back to using the pthread mutex based thread-synchronization implementation. This variable is applicable only if the MPI_THREAD_MULTIPLE threading level is requested by the application during MPI initialization.

Default: 1

MPICH_OPTIMIZED_MEMCPY

Specifies which version of memcpy to use. Valid values are:

0

Use the system (glibc) version of memcpy.

1

Use an optimized version of memcpy if one is available for the processor being used.

2

Use a highly optimized version of memcpy that provides better performance in some areas but may have performance regressions in other areas, if one is available for the processor being used.

Default: 1

MPICH_RANK_REORDER_DISPLAY

If enabled, causes rank 0 to display which node each MPI rank resides in. The rank order can be manipulated via the MPICH_RANK_REORDER_METHOD environment variable.

Default: Not enabled.

MPICH_RANK_REORDER_FILE

If MPICH_RANK_REORDER_METHOD is set to 3 and this variable is set, then the value of this variable is the file name that MPI will check for rank reordering information. If this variable is not set, then MPI will check the default file name, MPICH_RANK_ORDER. Default: Not set

MPICH_RANK_REORDER_METHOD

Overrides the default scheme of assigning MPI rank identifiers to processes. Note that this does not change where the application launcher places processes. This feature requires the use of HPE Cray PMI. To display the MPI rank assignment information, enable MPICH_RANK_REORDER_DISPLAY.

MPICH_RANK_REORDER_METHOD accepts the following values:

0
Specifies round-robin assignment. Sequential MPI ranks are assigned to the next process on the next node in the list. When every node has been used, the rank assignment starts over again with the first node.

For example, an 8-process job launched on 4 dual-core nodes would be assigned as
NODE   RANK
  0    0&4
  1    1&5
  2    2&6
  3    3&7
For example, an 8-process job launched on 2 quad-core nodes would be assigned as
NODE   RANK
  0    0&2&4&6
  1    1&3&5&7
A 24-process job launched on three 8-core nodes would be assigned as
NODE   RANK
  0    0&3&6&9&12&15&18&21
  1    1&4&7&10&13&16&19&22
  2    2&5&8&11&14&17&20&23
If the last node is not fully populated with MPI processes, no additional ranks can be assigned on that node. A 20-process job launched on three 8-core nodes with round-robin assignment would be assigned as::
::

    NODE   RANK
      0    0&3&6&9&12&14&16&18
      1    1&4&7&10&13&15&17&19
      2    2&5&8&11
1
Specifies SMP-style assignment. For a multi-core node, sequential MPI ranks are placed on the same node.

For example, an 8-process job launched on 4 dual-core nodes would be assigned as::
::

    NODE   RANK
      0    0&1
      1    2&3
      2    4&5
      3    6&7
An 8-process job launched on 2 quad-core nodes would be assigned as
NODE   RANK
  0    0&1&2&3
  1    4&5&6&7
A 24-process job launched on three 8-core nodes would be assigned as
NODE   RANK
  0    0&1&2&3&4&5&6&7
  1    8&9&10&11&12&13&14&15
  2    16&17&18&19&20&21&22&23
2
Specifies folded-rank assignment. Sequential MPI ranks are assigned on the next node in the list. When every node has been used, instead of starting over with the first node again, the rank assignment starts at the last node, going back to the first. For quad-core or larger nodes, this fold is repeated.

For example, an 8-process job on 4 dual-core nodes would be assigned as
NODE   RANK
  0    0&7
  1    1&6
  2    2&5
  3    3&4
An 8-process job on 2 quad-core nodes would be assigned as
NODE   RANK
  0    0&3&4&7
  1    1&2&5&6
A 24-process job launched on three 8-core nodes would be assigned as
NODE   RANK
  0    0&5&6&11&12&17&18&23
  1    1&4&7&10&13&16&19&22
  2    2&3&8&9&14&15&20&21
3

Specifies a custom rank assignment defined in the file named MPICH_RANK_ORDER. The MPICH_RANK_ORDER file must be readable by the first process of the program, and reside in the current running directory. The order in which the ranks are listed in the file determines which ranks are assigned closest to each other, starting with the first node in the list. To help with creating this file, consider using the grid_order tool from the Perftools package.

The PALS launcher forwards stdin to the first process it starts. If your application requires stdin to be available to rank 0, any rank reorder arrangement must not reorder the original rank 0. If your application does not read from stdin, rank 0 can be reordered.

For example:

0-15

Assigns the ranks in SMP-style order (see above).

15-0

For dual-core processors, assigns ranks 15&14 on the first node, ranks 13&12 on the next node, and so on. For quad-core processors, assigns ranks 15&14&13&12 on the first node, ranks 11&10&9&8 on the next node, and so on.

4,1,5,2,6,3,7,0,…

Assigns the first n ranks listed on the first node, the next n ranks on the next node, and so on, where n is the number of processes launched on each node.

You can use combinations of ranges (8-15) or individual rank numbers in the MPICH_RANK_ORDER file. The number of ranks listed in this file must match the number of processes launched.

A # denotes the beginning of a comment. A comment can start in the middle of a line and will continue to the end of the line.

MPICH_RMA_MAX_PENDING

Determines how many RMA network operations may be outstanding at any time. RMA operations beyond this max will be queued and only issued as pending operations complete.

Default: 64

MPICH_RMA_SHM_ACCUMULATE

If set to 1, enables accumulate operations using shm shared memory. If set to 0, disables shm and accumulates will use other implementations. It also sets the default for the window hint “disable_shm_accumulate” to true if MPICH_RMA_SHM_ACCUMULATE is 0, and false if MPICH_RMA_SHM_ACCUMULATE is 1.

Default: 1

MPICH_SINGLE_HOST_ENABLED

If enabled, prevents MPICH from using networking hardware when all ranks are on a single host. This avoids the unnecessary consumption of networking resources. This feature is only usable when MPI Spawn is not possible due to all possible ranks in MPI_UNIVERSE_SIZE already being part of MPI_COMM_WORLD.

Default: Enabled

MPICH_VERSION_DISPLAY

If enabled, causes MPICH to display the HPE Cray MPI version number as well as build date information. The version number can also be accessed through the attribute CRAY_MPICH_VERSION.

Default: Not enabled.

GPU-NIC ASYNC ENVIRONMENT VARIABLE

MPICH_GPU_USE_STREAM_TRIGGERED

If set, causes MPICH to allow using GPU-NIC Async Stream Triggered (ST) GPU communication operations.

Default: Not enabled.

MPICH_GPU_USE_KERNEL_TRIGGERED

If set, causes MPICH to allow using GPU-NIC Async Kernel Triggered (KT) GPU communication operations.

Default: Not enabled.

MPICH_GPU_USE_STREAM_TRIGGERED_SET_SIGNAL

If set, causes MPICH to allow using stream triggered GPU communication operations with atomic set operations for signaling purpose.

Default: Not enabled.

MPICH_MAX_TOPS_COUNTERS

Specifies the maximum number of HW counters to be opened for performing the triggered operations required to support the ST and KT GPU-NIC communication operations. Triggered operation should be enabled for this variable to take effect.

Default: 64.

UNIFIED COMMUNICATION X (UCX) ENVIRONMENT VARIABLES

MPICH_UCX_VERBOSE

If set to 1, more verbose output will be displayed during MPI_Init to confirm the UCX network interface driver is being used. Set this to 2 to display additional UCX configuration details, as well as a subset of UCX endpoint settings which detail the specific transports used on a per-rank basis. Set this to 3 to display all the above, and an UCX endpoint setting for each rank. This may be helpful for debugging purposes.

Default: not set

MPICH_UCX_RC_MAX_RANKS

By default, CrayMPICH selects either the UCX rc or ud transports for inter-node messaging. If a job is launched with MPICH_UCX_RC_MAX_RANKS ranks or less, the rc transport is selected. If more ranks are launched, the UCX ud transport is chosen. The rc transport provides good performance but does not scale well due to resource requirements. The ud transport scales very well and uses a limited amount of resources. Set this variable to change the default job size cut-off between the rc and ud transports. Selecting a non-default transport via the UCX_TLS environment variable overrides this setting.

Default: 8

UCX_IB_REG_METHODS

This is a UCX ENV variable. It specifies which memory registration method is used by UCX. By default, the rcache users-pace memory registration cache method is used to provide the best performance. However, in certain cases at high scale, when a large amount of memory is being registered with the device, the rcache method may run out of resources. In this case, an error similar to “UCX ERROR ibv_exp_reg_mr(address=0xnn, length=nn, access=0xf) failed: Cannot allocate memory” may occur. To work around this limitation, it may be necessary to request the direct memory registration method by setting this variable: export UCX_IB_REG_METHODS=direct.

Default: rcache

UCX_NET_DEVICES

This is a UCX ENV variable. It specifies the network devices for UCX to use. By default, UCX attempts to use them all in an optimal manner. If multiple NICs are available on a node, you may use this environment variable to limit your application to using only a subset of NICS. For example, if the node has four Mellanox NICS available (mlx5_0:1, mlx5_1:1, mlx5_2:1 mlx5_3:1), and you want to limit use to only use mlx5_0 and mlx5_2, you would set: export UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1

Default: all

UCX_LOG_LEVEL

This is a UCX ENV variable. If set to “warn”, UCX will issue warnings if it detects the system does not enable certain features that may impact performance. It may also issue warnings if not all user resources have been freed by the application during finalize. Normally these warnings are harmless, since resources will be freed upon application termination. To enable these warnings, set this variable: export UCX_LOG_LEVEL=warn

Default: error

UCX_TLS

This is a UCX ENV variable. It controls the transports that UCX uses for the job. More than one transport can be specified, as any given rank can communicate with other ranks that are on different nodes, to itself, or to ranks that are on the same node. By default, for jobs using MPICH_UCX_RC_MAX_RANKS or fewer, CrayMPICH-UCX will use UCX_TLS=rc,self,sm. For jobs using more than MPICH_UCX_RC_MAX_RANKS, a default of UCX_TLS=ud,self,sm is used. The ud transport is highly recommended when running applications at scale.

For more details on which transports UCX is using, see the MPICH_UCX_VERBOSE option.

Default behavior if unset:
For jobs sizes of <- MPICH_UCX_RC_MAX_RANKS ranks, default is "rc,self,sm"
For job sizes of > MPICH_UCX_RC_MAX_RANKS ranks, default is "ud,self,sm"

UCX_UD_TIMEOUT

This is a UCX ENV variable. It controls the maximum timeout for UD connections.

Default: 10min

COLLECTIVE ENVIRONMENT VARIABLES

MPICH_ALLGATHER_VSHORT_MSG

Adjusts the cutoff point at and below which the architecture-specific optimized gather/bcast algorithm is used instead of the optimized ring algorithm for MPI_Allgather. The gather/bcast algorithm is better suited for small messages.

Defaults:
For communicator sizes of <= 512 ranks, 1024 bytes.
For communicator sizes of > 512 ranks, 4096 bytes.

MPICH_ALLGATHERV_VSHORT_MSG

Adjusts the cutoff point at and below which the architecture-specific optimized gatherv/bcast algorithm is used instead of the optimized ring algorithm for MPI_Allgatherv. The gatherv/bcast algorithm is better suited for small messages.

Defaults:
For communicator sizes of <= 512 ranks, 1024 bytes.
For communicator sizes of > 512 ranks, 4096 bytes.

MPICH_ALLREDUCE_BLK_SIZE

Specifies the block size (in bytes) to use when dividing very large Allreduce messages into smaller blocks for better performance. The value is interpreted as bytes, unless the string ends in a K, which indicates kilobytes, or M, which indicates megabytes. Valid values are between 8192 and MAX_INT.

Default: 716800 bytes

MPICH_ALLREDUCE_GPU_MAX_SMP_SIZE

When GPU support is enabled, this variable specifies the maximum message size (in bytes) for which an SMP-aware allreduce algorithm is used. Larger allreduce messages will use a reduce-scatter-allgather algorithm. A value of 0 specifies an SMP-aware allreduce algorithm for all message sizes. The value is interpreted as bytes, unless the string ends in a K, which indicates kilobytes, or M, which indicates megabytes.

Default: 1024 bytes

MPICH_ALLREDUCE_MAX_SMP_SIZE

Specifies the maximum message size (in bytes) for which an SMP-aware allreduce algorithm is used. Larger allreduce messages will use a reduce-scatter-allgather algorithm. A value of 0 specifies an SMP-aware allreduce algorithm for all message sizes. The value is interpreted as bytes, unless the string ends in a K, which indicates kilobytes, or M, which indicates megabytes.

Default: 262144 bytes

MPICH_ALLREDUCE_NO_SMP

If set, MPI_Allreduce uses an algorithm that is not smp-aware. This provides a consistent ordering of the specified allreduce operation regardless of system configuration.

Note: This algorithm may not perform as well as the default smp-aware algorithms as it does not take advantage of rank topology.

Default: not set

MPICH_ALLTOALL_BLK_SIZE

Specifies the chunk size in bytes for the MPI_Alltoall chunking algorithm. Larger messages will be broken into chunks of this size. Only applies to Slingshot 11.

Default: 16384

MPICH_ALLTOALL_CHUNKING_MAX_NODES

Adjusts the cut-off point at and below which the MPI_Alltoall chunking algorithm is used. The chunking algorithm sends large messages in chunks of size MPICH_ALLTOALL_BLK_SIZE bytes. By default, the MPI_Alltoall chunking algorithm is used for communicators spanning a smaller number of nodes, and the MPI_Alltoall throttled (non-chunking) algorithm is used for communicators spanning more nodes than this value. To disable the chunking algorithm entirely, set this to 0. Only applies to Slingshot 11.

Default: 90

MPICH_ALLTOALL_SHORT_MSG

Adjusts the cut-off points at and below which the store and forward Alltoall algorithm is used for short messages. The default value is dependent upon the total number of ranks in the MPI communicator used for the MPI_Alltoall call.

Defaults:
if communicator size <= 1024, 512 bytes
if communicator size > 1024 and <= 65536, 256 bytes
if communicator size > 65536 and <= 131072, 128 bytes
if communicator size > 131072, 64 bytes

MPICH_ALLTOALL_SYNC_FREQ

Adjusts the number of outstanding messages (the synchronization frequency) each rank participating in the Alltoall algorithm will allow. The defaults vary for each call, depending on several factors, including number of ranks on a node participating in the collective, and the message size.

Default: Varies from 1 to 24

MPICH_ALLTOALLV_THROTTLE

Sets the per-process maximum number of outstanding Isends and Irecvs that can be posted concurrently for the MPI_Alltoallv and MPI_Alltoallw algorithms. This setting also applies to the non-blocking MPI_Ialltoallv and MPI_Ialltoallw algorithms that use throttling. For sparsely-populated or small message Alltoallv/w data, setting this to a higher value may improve performance. For heavily-populated large message Alltoallv/w data, or when running at high process-per-node counts, consider decreasing this value to improve performance.

Default: 8

MPICH_BCAST_INTERNODE_RADIX

Used to set the radix of the inter-node tree. This can be set to any integer value greater than or equal to 2.

Default: 4

MPICH_BCAST_INTRANODE_RADIX

Used to set the radix of the intra-node tree. This can be set to any integer value greater than or equal to 2.

Default: 4

MPICH_BCAST_ONLY_TREE

If set to 1, MPI_Bcast uses an smp-aware tree algorithm regardless of data size. The tree algorithm generally scales well to high processor counts.

If set to 0, MPI_Bcast uses a variety of algorithms (tree, scatter, or ring) depending on message size and other factors.

Default: 1

MPICH_COLL_OPT_OFF

If set, disables collective optimizations which use nondefault, architecture-specific algorithms for some MPI collective operations. By default, all collective optimized algorithms are enabled.

To disable all collective optimized algorithms, set MPICH_COLL_OPT_OFF to 1.

To disable optimized algorithms for selected MPI collectives, set the value to a comma-separated list of the desired collective names. Names are not case-sensitive. Any unrecognizable name is flagged with a warning message and ignored. For example, to disable the MPI_Allgather optimized collective algorithm, set MPICH_COLL_OPT_OFF=mpi_allgather.

The following collective names are recognized: MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Bcast, MPI_Gatherv, MPI_Scatterv, MPI_Igatherv, and MPI_Iallreduce.

Default: Not enabled.

MPICH_ENABLE_HCOLL

This enables the use of Mellanox’s HCOLL collectives offload feature when the UCX netmod is being used. The HCOLL libraries must be in the library search path, and HCOLL must be configured on the system. HCOLL will give optimized performance for some collectives at the cost of higher MPI communicator creation time. This feature will not be available if optimizations are disabled through MPICH_COLL_OPT_OFF.

Default: Not enabled

MPICH_GATHERV_MAX_TMP_SIZE

Only applicable to the Gatherv tree algorithm. Sets the maximum amount of temporary memory Gatherv will allow a rank to allocate when using the tree-based algorithm. Each rank allocates a different amount, with many allocating no extra memory. If any rank requires more than this amount of temporary buffer space, a different algorithm is used.

Default: 512M

MPICH_GATHERV_MIN_COMM_SIZE

Cray MPI offers two optimized Gatherv algorithms: a tree algorithm for small messages and a permission-to-send algorithm for larger messages. Set this value to the minimum communicator size to attempt use of either of the Cray optimized Gatherv algorithms. Smaller communicator sizes will use the ANL MPI_Gatherv algorithm.

Default: 64

MPICH_GATHERV_SHORT_MSG

Adjusts the cutoff point at and below which the optimized tree MPI_Gatherv algorithm is used instead of the optimized permission-to-send algorithm. The cutoff is in bytes, based on the average size of the variable MPI_Gatherv message sizes.

Default: 131072

MPICH_GPU_ALLGATHER_VSHORT_MSG_ALGORITHM

If set to 1, enables optimizations for small message MPI_Allgather operations with GPU-attached payloads. This variable is only relevant if MPICH_GPU_SUPPORT_ENABLED is set to 1 and MPICH_GPU_COLL_STAGING_AREA_OPT is also set to 1

Default: 1

MPICH_GPU_ALLREDUCE_BLK_SIZE

Controls the size of the GPU-attached staging buffer user for GPU-kernel-based optimizations for MPI_Allreduce, MPI_Reduce and MPI_Reduce_scatter_block. Defaults to 8MB per process. There is evidence that suggests that larger values (~ 64MB) can offer improved Allreduce performance for very large payloads (100s of MB). Setting the default conservatively for now but allowing for additional tuning opportunities for specific use cases in the future. This variable is relevant only if MPICH_GPU_ALLREDUCE_USE_KERNEL and MPICH_GPU_SUPPORT_ENABLED are also set.

Default: 8388608

MPICH_GPU_ALLREDUCE_KERNEL_THRESHOLD

MPI_Allreduce collectives with payloads equal to or larger than this threshold can utilize the GPU kernel-based optimization. This variable is relevant only if MPICH_GPU_ALLREDUCE_USE_KERNEL and MPICH_GPU_SUPPORT_ENABLED are also set.

Default: 131072

MPICH_GPU_ALLREDUCE_USE_KERNEL

If set, adds a hint that the use of device kernels for reduction operations is desired. MPI is not guaranteed to use a device kernel for all reduction operations. This variable is relevant only if MPICH_GPU_SUPPORT_ENABLED is set to 1. GPU kernel-based optimizations are currently disabled for Reductions that involve the use of non-contig MPI datatypes. This feature is also used only when user buffers are on GPU-attached memory regions. This optimization is applicable for MPI_Allreduce, MPI_Reduce and MPI_Reduce_Scatter_block.

Default: 1

MPICH_GPU_REDUCE_KERNEL_THRESHOLD

MPI_Reduce collectives with payloads equal to or larger than this threshold can utilize the GPU kernel-based optimization. This variable is relevant only if MPICH_GPU_ALLREDUCE_USE_KERNEL and MPICH_GPU_SUPPORT_ENABLED are also set.

Default: 2048

MPICH_GATHERV_SYNC_FREQ

Only applicable to the Gatherv permission-to-send algorithm. Adjusts the number of outstanding receives the root for Gatherv will allow.

Default: 16

MPICH_IALLGATHERV_THROTTLE

Sets the per-process maximum number of outstanding Isends and Irecvs that can be posted concurrently for the throttled MPI_Iallgatherv algorithm. This only applies if the throttled MPI_Iallgatherv algorithm is explicitly requested by setting MPICH_IALLGATHERV_INTRA_ALGORITHM=throttled. This algorithm may be beneficial when using a small number of ranks per node. By default a recursive_doubling, brucks or ring algorithm is chosen based on data size and other parameters.

Default: 6

MPICH_IGATHERV_MIN_COMM_SIZE

Set this value to the minimum communicator size to trigger use of the Cray optimized Igatherv permission-to-send algorithm. Smaller communicator sizes send without permission.

Default: 1000

MPICH_IGATHERV_SYNC_FREQ

Adjusts the maximum number of receives the root rank of the Cray optimized Igatherv algorithm can have outstanding.

Default: 100

MPICH_REDUCE_NO_SMP

If set, MPI_Reduce uses an algorithm that is not smp-aware. This provides a consistent ordering of the specified reduce operation regardless of system configuration.

Note: This algorithm may not perform as well as the default smp-aware algorithms as it does not take advantage of rank topology.

Default: not set

MPICH_REDUCE_SCATTER_COMMUTATIVE_LONG_MSG_SIZE

This environment variable applies to MPI_Reduce_scatter and MPI_Reduce_scatter_block. For the reduce_scatter functions, this variable specifies the cutoff size of the send buffer (in bytes) at and above which a pairwise exchange algorithm is attempted. In addition, the op must be commutative and the communicator size less than or equal to MPICH_REDUCE_SCATTER_MAX_COMMSIZE for the pairwise exchange algorithm to be used. For smaller send buffers, a recursive halving algorithm is used.

Default value: 524288

MPICH_REDUCE_SCATTER_MAX_COMMSIZE

This environment variable applies to MPI_Reduce_scatter and MPI_Reduce_scatter_block. For the reduce_scatter functions, this variable specifies the maximum communicator size that triggers use of the pairwise exchange algorithm, provided the op is commutative. The pairwise exchange algorithm is not well-suited for scaling to high process counts, so for larger communicators, a recursive halving algorithm is used by default instead.

Default value: 1000

MPICH_SCATTERV_MAX_TMP_SIZE

Only applicable to the Scatterv tree algorithm. Sets the maximum amount of temporary memory Scatterv will allow a rank to allocate when using the tree-based algorithm. Each rank allocates a different amount, with many allocating no extra memory. If any rank requires more than this amount of temporary buffer space, a different algorithm is used.

Default: 512M

MPICH_SCATTERV_MIN_COMM_SIZE

Cray MPI offers two optimized Scatterv algorithms: a tree algorithm for small messages and a staggered send algorithm for larger messages. Set this value to the minimum communicator size to attempt use of either of the Cray optimized Scatterv algorithms. Smaller communicator sizes will use the ANL MPI_Scatterv algorithm.

Default: 64

MPICH_SCATTERV_SHORT_MSG

Adjusts the cutoff point at and below which the optimized tree MPI_Scatterv algorithm is used instead of the optimized staggered send algorithm. The cutoff is in bytes, based on the average size of the variable MPI_Scatterv message sizes.

Default behavior if unset is:

For communicator sizes of < or = 512 ranks, 2048 bytes

For communicator sizes of > 512 ranks, 8192 bytes

MPICH_SCATTERV_SYNCHRONOUS

Only applicable to the ANL non-optimized Scatterv algorithm. The ANL MPI_Scatterv algorithm uses asynchronous sends for communicator sizes less than 200,000 ranks. If set, this environment variable causes the ANL MPI_Scatterv algorithm to switch to using blocking sends, which may be beneficial with large data sizes or high process counts.

For communicator sizes equal to or greater than 200,000 ranks, the blocking send algorithm is used by default.

Default: not enabled

MPICH_SCATTERV_SYNC_FREQ

Only applicable to the Scatterv staggered send algorithm. Adjusts the number of outstanding sends the root for Scatterv will use.

Default: 16

MPICH_SHARED_MEM_COLL_OPT

By default, the MPICH library will use the optimized shared-memory based design for collective operations. The supported collective operations are: MPI_Allreduce, MPI_Barrier, and MPI_Bcast.

To disable all available shared-memory optimizations, set MPICH_SHARED_MEM_COLL_OPT to 0.

To enable this feature for a specific set of collective operations, set MPICH_SHARED_MEM_COLL_OPT to a comma-separated list of collective names. For example, to enable this optimization for MPI_Bcast only, set MPICH_SHARED_MEM_COLL_OPT=MPI_Bcast. To enable this optimization for MPI_Allreduce only, set MPICH_SHARED_MEM_COLL_OPT=MPI_Allreduce. Unsupported names are flagged with a warning message and ignored.

Default: set

MPI-IO ENVIRONMENT VARIABLES

MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY

If enabled, displays the assignment of MPIIO collective buffering aggregators for reads/writes of a shared file, showing rank and node ID (nid). For example:

Aggregator Placement for /lus/scratch/myfile
RankReorderMethod=3  AggPlacementStride=-1
 AGG    Rank       nid
 ----  ------  --------
    0       0  nid00578
    1       4  nid00579
    2       1  nid00606
    3       5  nid00607
    4       2  nid00578
    5       6  nid00579
    6       3  nid00606
    7       7  nid00607

Default: Not enabled.

MPICH_MPIIO_AGGREGATOR_PLACEMENT_STRIDE

Partially controls to which nodes MPIIO collective buffering aggregators are assigned. See the notes below on the order of nodes. Network traffic and resulting I/O performance may be affected by the assignments.

If set to 1, consecutive nodes are used. The number of aggregators assigned per node is controlled by the cb_config_list hint. By default, no more than one aggregator per node will be assigned if there are at least as many nodes as aggregators.

If set to a value greater than 1, node selection is strided across the available nodes by this value. If the stride times the number of aggregators exceeds the number of nodes, the assignments will wrap around, which is usually not optimal for performance.

If set to -1, node selection is strided across available nodes by the value of the number of nodes divided by the number of aggregators (integer division, minimum value of 1). The purpose is to spread out the nodes to reduce network congestion.

Note: The order of nodes can be shown by setting the MPICH_RANK_REORDER_DISPLAY environment variable. This lists in rank order (rank for MPI_COMM_WORLD) the node on which each rank resides. When there is more than one rank per node, the node ID is repeated. When MPI has not done any rank reordering, all the ranks for the first node are listed first, then all the ranks for the second node are listed second, etc. When rank reordering has been done, see the MPICH_RANK_REORDER_METHOD environment variable, the order of the nodes can be very different. To spread the aggregators across the nodes, if MPICH_RANK_REORDER_METHOD=3, MPIIO sorts the list by nid and then by rank on that node, for the node order used when assigning aggregators. This has the desired effect of spreading the aggregators across the nodes assigned to the job.The current implementation is not file-specific. That is, the environment variable applies to all files opened with MPI_File_open().

Default: -1

MPICH_MPIIO_CB_ALIGN

Sets the default value for the cb_align hint. Files opened with MPI_File_open wil have this value for the cb_align hint unless the hint is set on a per file basis with either the MPICH_MPIIO_HINTS environment variable or from within a program with the MPI_Info_set() call.

Note: Only MPICH_MPIIO_CB_ALIGN == 2 is fully supported. Other values are for internal testing only.

Default: 2

MPICH_MPIIO_DVS_MAXNODES

Note: This environment variable in relevant only for file systems accessed from HPE system compute nodes via DVS server nodes; e.g. GPFS or PANFS.

As described in the dvs(5) man page, the environment variable DVS_MAXNODES can be used to set the stripe width—that is, the number of DVS server nodes—used to access a file in “stripe parallel mode.” For most files, and especially for small files, setting DVS_MAXNODES to 1 (“cluster parallel mode”) is preferred.

The MPICH_MPIIO_DVS_MAXNODES environment variable enables you to leave DVS_MAXNODES set to 1 and then use MPICH_MPIIO_DVS_MAXNODES to temporarily override DVS_MAXNODES when it is advantageous to specify wider striping for files being opened by the MPI_File_open() call. The range of values accepted by MPICH_MPIIO_DVS_MAXNODES goes from 1 to the number of server nodes specified on the mount with the nnodes mount option.

DVS_MAXNODES is not set by default. Therefore, for MPICH_MPIIO_DVS_MAXNODES to have any effect, DVS_MAXNODES must be defined before program startup and defined using exactly three characters, where the characters specify the decimal value and the remainder are underscore characters: for example, DVS_MAXNODES=12_. If DVS_MAXNODES is not defined or defined incorrectly, MPI-IO will ignore MPICH_MPIIO_DVS_MAXNODES. A warning message is issued if the value requested by the user does not match the value actually used by DVS.

MPICH_MPIIO_DVS_MAXNODES interacts with MPICH_MPIIO_HINTS. To determine the striping actually used, the order of precedence is:

striping_factor set using MPICH_MPIIO_HINTS, if set

striping_factor set using MPI_Info_set(), if set

MPICH_MPIIO_DVS_MAXNODES value, if set

DVS_MAXNODES, if set

DVS maxnodes=n mount option, if specified

Default: unset

MPICH_MPIIO_HINTS

If set, override the default value of one or more MPI I/O hints. This also overrides any values that were set by using calls to MPI_Info_set in the application code. The new values apply to the file the next time it is opened using an MPI_File_open() call.

After the MPI_File_open() call, subsequent MPI_Info_set calls can be used to pass new MPI I/O hints that take precedence over some of the environment variable values. Other MPI I/O hints such as striping_factor, striping_unit, cb_nodes, and cb_config_list cannot be changed after the MPI_File_open() call, as these are evaluated and applied only during the file open process.

An MPI_File_close call followed by an MPI_File_open call can be used to restart the MPI I/O hint evaluation process.

The syntax for this environment variable is a comma-separated list of specifications. Each individual specification is a pathname_pattern followed by a colon-separated list of one or more key=value pairs. In each key=value pair, the key is the MPI-IO hint name, and the value is its value as it would be coded for an MPI_Info_set library call.

For example:

MPICH_MPIIO_HINTS=spec1[,spec2,...]
Where each specification has the syntax:

pathname_pattern:key1=value1[:key2=value2:...]
The pathname_pattern can be an exact match with the filename argument used in the MPI_File_open() call or it can be a pattern as described below.

When a file is opened with MPI_File_open(), the list of hint specifications in the MPICH_MPIIO_HINTS environment variable is scanned. The first pathname_pattern matching the filename argument in the MPI_File_open() call is selected. Any hints associated with the selected pathname_pattern are applied to the file being opened. If no pattern matches, no hints from this specification are applied to the file.

The pathname_pattern follows standard shell pattern-matching rules with these meta-characters:

----------------------------------------------------------------
Pattern    Description
----------------------------------------------------------------
*          Match any number of characters
?          Match any single character
[a-b]      Match any single character between a and b, inclusive
\          Interpret the meta-character that follows literally
----------------------------------------------------------------
The simplest pathname_pattern is . Using this results in the specified hints being applied to all files opened with *MPI_File_open(). Use of this wildcard is discouraged because of the possibility that a library linked with the application may also open a file for which the hints are not appropriate. * The following example shows how to set hints for a set of files. The final specification in this example, for file /scratch/user/me/dump., has two *key=value pairs.

MPICH_MPIIO_HINTS=file1:direct_io=true,file2:romio_ds_write=disable,\
/scratch/user/me/dump.*:romio_cb_write=enable:cb_nodes=8
The following MPI-IO key values are supported on HPE systems.

abort_on_rw_error

If set to enable, causes MPI-IO to abort immediately after issuing an error message if an I/O error occurs during a system read() or system write() call. The valid values are enable and disable. See the MPICH_MPIIO_ABORT_ON_RW_ERROR environment variable for more details.

Default: disable

cb_align

Specifies which alignment algorithm to use for collective buffering. If set to 2, an algorithm is used to divide the I/O workload into Lustre stripe-sized pieces and assigns these pieces to collective buffering nodes (aggregators) so that each aggregator always accesses the same set of stripes and no other aggregator accesses those stripes. This is generally the optimal collective buffering mode as it minimizes the Lustre file system extent lock contention and thus reduces system I/O time.

Historically there have been a few different collective buffering alignment algorithms used on HPE systems. Currently only one of them, algorithm 2, is supported. The alignment value of 1 is no longer supported. The alignment values of 0 and 3 are not fully supported but are for internal testing only. Other algorithms may be supported in the future.

Default: 2

cb_buffer_size

Sets the buffer size in bytes for collective buffering.

This hint is not used with the current default collective buffering algorithm.

cb_config_list
Specifies by name which nodes are to serve as aggregators. The syntax for the value is:
#name1:maxprocesses[,name2:maxprocesses,...]#
Where name is either * (match all node names) or the name returned by MPI_Get_processor_name, and maxprocesses specifies the maximum number of processes on that node to serve as aggregators. If the value of the cb_nodes hint is greater than the number of compute nodes, the value of maxprocesses must be greater than 1 in order to assign the required number of aggregators. When the cb_align hint is set to 2 (the default), the aggregators are assigned using a round-robin method across compute nodes.

The pair of # characters beginning and ending the list are not part of the normal MPIIO hint syntax but are required. Because colon (:) characters are used in both this list and in the MPICH_MPIIO_HINTS environment variable syntax, the # characters are required in order to determine the meaning of colon (:) character.

This value cannot be changed after the file is opened.

Default: :

cb_nodes

Specifies the number of aggregators used to perform the physical I/O for collective I/O operations when collective buffering is enabled. On multi-core nodes, all cores share the same node name.

With the current default collective buffering algorithm, the best value for cb_nodes is usually the same as striping_factor (in other words, the stripe count).

Default: striping_factor

cray_cb_nodes_multiplier

Specifies the number of collective buffering aggregators (cb_nodes) per OST for Lustre files. In other words, the number of aggregators is the stripe count (striping_factor) times the multiplier. This may improve or degrade I/O performance, depending on the file locking mode and other conditions. When the locking mode is 0, a multiplier of 1 is usually best for writing the file. When the locking mode is 1 or 2, a multiplier of 2 or more is usually best for writing the file. If a locking mode is specified and both cb_nodes and cray_cb_nodes_multiplier hints are set, the cb_nodes hint is ignored. See cray_cb_write_lock_mode.

When reading a file with collective buffering, a multiplier of 2 or more often improves read performance.

Note: If the number of aggregators exceeds the number of compute nodes, performance generally won’t improve over 1 aggregator per compute node.

Default: 1

cray_cb_write_lock_mode

Specifies the file locking mode for accessing Lustre files. These modes do not apply when accessing other file systems. Valid values are:

0

Standard locking mode. Extent locks are held by each MPI rank accessing the file. The extent of each lock often exceeds the byte range needed by the rank. Locks are revoked and reissued when the extent of a lock held by one rank conflicts with the extent of a lock needed by another rank.

1

Shared lock locking mode. A single lock is shared by all MPI ranks that are writing the file. This lock mode is only applicable when collective buffering is enabled and is only valid if the only accesses to the file are writes and all the writes are done by the collective buffering aggregators. The romio_no_indep_rw hint must be set to true to use this locking mode. This is an explicit assertion that all file accesses will be with MPI collective I/O. Setting the romio_no_indep_rw hint to true also sets romio_cb_write and romio_cb_read to enable. Any other MPI I/O accesses will cause the program to abort and any non-MPI I/O access may cause the program to hang. Both HDF5 and netCDF do both collective and independent I/O so this locking mode is not appropriate for these APIs.

2

Lock ahead locking mode. Sets of non-overlapping extent locks are acquired ahead of time by all MPI ranks that are writing the file and the acquired locks are not expanded beyond the size requested by the ranks. This locking mode is only applicable when collective buffering is enabled but unlike locking mode 1, MPI independent I/O and non-MPI I/O are also allowed. However, to be a performance benefit, the majority of the I/O should be MPI collective I/O. This supports file access patterns such as that done by HDF5 and netCDF where some MPI independent I/O might occur in addition to MPI collective I/O. The romio_no_indep_rw hint does not need to be set to true. Also see the cray_cb_lock_ahead_num_extents hint.

Locking modes 1 and 2 reduce lock contention between multiple clients and therefore support greater parallelism by allowing multiple aggregators per OST to efficiently write to the file. Set the cray_cb_nodes_multiplier hint to 2 or more to get the increased parallelism. The optimal value depends on file system characteristics. Note that if lock mode 2 is not supported, a warning will be printed and the lock mode will be reset to 0.

Default: 0

direct_io

Enables the O_DIRECT mode for the specified file. The user is responsible for aligning the write or read buffer on a getpagesize() boundary. MPI-IO checks for alignment and aborts if it is not aligned. Valid values are true or false.

Default: false.

ind_rd_buffer_size

Specifies in bytes the size of the buffer to be used for data sieving on read.

Default: 4194304

ind_wr_buffer_size

Specifies in bytes the size of the buffer to be used for data sieving on write.

Default: 524288

romio_cb_read

Enables collective buffering on read when collective IO operations are used. Valid values are enable, disable, and automatic. In automatic mode, whether or not collective buffering is done is based on runtime heuristics. When MPICH_MPIIO_CB_ALIGN is set to 2, the heuristics favor collective buffering.

Default: automatic.

romio_cb_write

Enables collective buffering on write when collective IO operations are used. Valid values are enable, disable, and automatic. In automatic mode, whether or not collective buffering is done is based on runtime heuristics. When MPICH_MPIIO_CB_ALIGN is set to 2, the heuristics favor collective buffering.

Default: automatic.

romio_ds_read

Specifies if data sieving is to be done on read. Valid values are enable, disable, and automatic.

Default: disable

romio_ds_write

Specifies if data sieving is to be done on write. Valid values are enable, disable, and automatic. When set to automatic, data sieving on write is turned off if the MPI library has been initialized with MPI_THREAD_MULTIPLE. Setting the value enable will turn on data sieving on write irrespective of the thread environment and is safe as long as MPI-IO routines aren’t called concurrently from threads in a rank. Additionally, in order to avoid data corruption, it is necesary that data sieving is disabled if single-threaded applications write to a file using multiple communicators.

Default: automatic

romio_no_indep_rw

Enables an optimization in which only the aggregators open the file, thus limiting the number of system open calls. For this hint to be valid, all I/O on the file must be done by MPI collective I/O calls (that is, no independent I/O) and collective buffering must not be disabled. Valid values are true or false.

Default: false.

striping_factor

Specifies the number of Lustre file system stripes (stripe count) to assign to the file. This has no effect if the file already exists when the MPI_File_open() call is made. File striping cannot be changed after a file is created. Currently this hint applies only when MPICH_MPIIO_CB_ALIGN is set to 2.

The value 0 denotes the default value for the Lustre file system, or the value for the directory in which the file is created if the lfs setstripe command was used to set the stripe count of the directory to a value other than the system default. The value -1 means using all available OSTs for striping.

Default: 0

striping_unit

Specifies in bytes the size of the Lustre file system stripes (stripe size) assigned to the file. This has no effect if the file already exists when the MPI_File_open() call is made. File striping cannot be changed after a file is created. Currently this hint applies only when MPICH_MPIIO_CB_ALIGN is set to 2.

Default: the default value for the Lustre file system, or the value for the directory in which the file is created if the lfs setstripe command was used to set the stripe size of the directory to a value other than the system default.

overstriping_factor

Specifies the number of Lustre file system stripes (stripe count) to assign to the file when more stripes than the available OSTs are needed. This has no effect if the file already exists when the MPI_File_open() call is made. File overstriping cannot be changed after a file is created. Currently this hint applies only when MPICH_MPIIO_CB_ALIGN is set to 2. This hint will take precedence when used along with striping_factor.

The value 0 denotes the default value for the Lustre file system, or the value for the directory in which the file is created if the lfs setstripe command was used to set the overstripe count of the directory to a value other than the system default. The value -1 means using all available OSTs for overstriping.

Default: 0

MPICH_MPIIO_HINTS_DISPLAY

If enabled, causes rank 0 in the participating communicator to display the names and values of all MPI-IO hints that are set for the file being opened with the MPI_File_open call.

Default: not enabled.

MPICH_MPIIO_OFI_STARTUP_CONNECT

By default, OFI connections between ranks are set up on demand. This allows for optimal performance while minimizing memory requirements in Slingshot-10 and Infiniband systems. However, for MPIIO jobs requiring large number of PEs and IO aggregators, it may be beneficial to create OFI connections between PEs and IO aggregators in a coordinated manner at file open. If enabled, this feature will create connections between all ranks on each node and the IO aggregators in the job during MPI_File_open.

This option is not beneficial on a Slingshot-11 system.

Default: not enabled.

MPICH_MPIIO_OFI_STARTUP_NODES_AGGREGATOR

If MPICH_MPIIO_OFI_STARTUP_CONNECT is enabled, this specifies the number of nodes that will concurrently attempt to connect to each aggregator during the MPI_File_open. Increasing the value may be improve performance of the MPI_File_open in some configurations when MPICH_MPIIO_OFI_STARTUP_CONNECT is enabled.

Default: 2

MPICH_MPIIO_STATS

If set to 1, a summary of file write and read access patterns is written by rank 0 to stderr. This information provides some insight into how I/O performance may be improved. The information is provided on a per-file basis and is written when the file is closed. It does not provide any timing information.

If set to 2, a set of data files are written to the working directory, one file for each rank, with the filename prefix specified by the MPICH_MPIIO_STATS_FILE environment variable. The data is in comma-separated values (CSV) format, which can be summarized with the cray_mpiio_summary script in the /opt/cray/pe/mpich/version/ofi/mpich-cray/version*/bin* directory. Additional example scripts are provided in that directory to further process and display the data.

Default: not set

MPICH_MPIIO_STATS_FILE

Specifies the filename prefix for the set of data files written when MPICH_MPIIO_STATS is set to 2. The filename prefix may be a full absolute pathname or a relative pathname.

Summary plots of these files can be generated using the cray_mpiio_summary script from the /opt/cray/pe/mpich/version/ofi/mpich-cray/version*/bin* directory. Other example scripts for post-processing this data can also be found in /opt/cray/pe/mpich/version/ofi/mpich-cray/version*/bin*.

Default: cray_mpiio_stats

MPICH_MPIIO_STATS_INTERVAL_MSEC

Specifies the time interval in milliseconds for each MPICH_MPIIO_STATS data point.

Default: 250

MPICH_MPIIO_TIMERS

If set to 0 or not set at all, no timing data is collected.

If set to 1, timing data for different phases in MPI_IO is collected locally by each MPI process. During MPI_FILE_close the data is consolidated and printed. Some timing data is displayed in seconds, other data is displayed in clock ticks, possibly scaled down. The relative values of the reported times are more important to the analysis than the absolute time. See also MPICH_MPIIO_TIMERS_SCALE

More detailed information about MPI_IO performance can be obtained by using the MPICH_MPIIO_STATS feature and by using the CrayPat and Aprentice2 Timeline Report of I/O bandwidth.

Default: 0

MPICH_MPIIO_TIMERS_SCALE

Specifies the power of 2 to use to scale the times reported by MPICH_MPIIO_TIMERS. The raw times are collected in clock ticks. This generally is a very large number and reducing all the times by the same scaling factor makes for a more compact display.

If set to 0, or not set at all, MPI-IO automatically determines a scaling factor to limit the report times to 9 or fewer digits. This auto-determined value is displayed. To make run to run comparisons, you can set the scaling factor to your preferred value.

Default: 0

MPICH_MPIIO_TIME_WAITS

If set to non-zero, time how long this rank has to wait for other ranks to catch up. This separates true metadata time from imbalance time.

This is disabled when MPICH_MPIIO_TIMERS is not set. Otherwise it defaults to 1.

Default: 1

DYNAMIC PROCESS MANAGEMENT (DPM) ENVIRONMENT VARIABLES

HPE Cray MPICH supports Dynamic Process Management (DPM). This allows MPI applications to spawn additional ranks. Some special settings are required on different platforms and launchers in order to enable DPM. This document describes environment variables and command line options for enabling and configuring DPM on different platforms and launchers.

PMI uses the PMI_UNIVERSE_SIZE environment variable to indicate to MPI

the maximum number of ranks that it can have running in the same job. This directly translates to the MPI_UNIVERSE_SIZE attribute. Slurm currently sets this variable to the number of ranks in the base job-step. You can override this environment variable in the launched environment, though the launcher may still limit the number of spawned processes.

In order to save resources, the MPI library does not enable support

for the fabric if it was launched onto a single host. Fabric support is needed for DPM. If you intend to use DPM or intercommunicators with single-host applications, then set MPICH_SINGLE_HOST_ENABLED=0 to force the MPI library to initialize fabric support.

Slingshot Fabric with Cassini NICs

MPICH uses the Process Management Interface (PMI) to request the launcher to start additional processes. On Slurm systems, the PMI_SPAWN_SRUN_ARGS environment variable controls how PMI interacts with Slurm. By default it is set to “–mpi=cray_shasta –exclusive”. You may wish to add a –cpu-bind= option to the environment variable. See Slurm’s srun(1) man page for more options.

On Cassini systems, add –network to both the original srun and the PMI_SPAWN_SRUN_ARGS environment variable. It should be set to “–network=single_node_vni,job_vni,def_tles=0”. This will enable Slurm to subdivide the job’s slots into separate job-steps. The single_node_vni option will ensure that Slingshot security tokens are provided even if the application only runs on a single host. The job_vni option will ensure that an intra-job security token is provided. The def_tles option will prevent a limited Cassini resource that is rarely used from being exchausted by DPM.

With HPE’s Parallel Application Launch Service (PALS), PBS needs to be told that the job will need an intra-job security token. PBS ignores this option by default if the job only includes a single node. Therefore your qsub command line may look like “qsub -v HPE_SLINGSHOT_OPTS=get_job_vni:single_node_vni …”. This requires that the HPE Slingshot PBS hook be installed and configured.

With PALS, applications must be launched through mpiexec with the –job-aware option. This causes PALS to track the compute slots reserved through PBS. On Cassini systems, add “–network def_tles=0”. The def_tles option will prevent a limited Cassini resource that is rarely used from being exhausted by DPM. If you wish to use DPM on a single host, add –single-node-vni to cause PALS to provide HPE Slingshot VNI security tokens, which are not normally used on single hosts. See the mpiexec(1) man page for more information.

Slingshot Fabric with Mellanox NICs

Do not use Cassini-specific –network options with Mellanox NICs.

MPICH_DPM_DIR

Specifies the file path used for the dynamic process management directory. The directory must be cross-mounted on the compute nodes in order to be visible to MPI during an application’s execution. If MPICH_DPM_DIR is not set, MPI will attempt to create the directory in the user’s home directory. Note that the directory (and any files in the directory) are managed by MPI, and should not be directly modified by the user.

Default: not set

HPE Cray MPI Compiler Wrappers

HPE Cray MPI provides simple compiler wrappers for invoking a C/C++/Fortran compiler and linking the corresponding HPE Cray MPI library. Linking of additional libraries is the responsibility of the end user, and the compiler wrappers are not intented to replace linking behaviors provided by the craype compiler wrappers.

Compiling using the HPE Cray MPI compiler wrappers

To compile with the compiler wrappers, select the wrapper for your specific language and the wrapper will invoke the correct compiler and link the correct MPI library

· For C code, mpicc invokes the C compiler and links the main MPI library

· For C++ code, mpicxx or mpic++ invokes the C++ compiler and links the main MPI library

· For Fortran code, mpifort, mpif77, or mpif90 invokes the Fortran compiler and link the main and Fortran MPI libaries

By default, the wrappers will search for Huge Pages (libhugetlbfs) and XPMEM (libxpmem) and add them to the link line if it finds them. These are important, yet optional, MPI library optimizations. The Huge Pages library will be checked for in /usr/lib64. The XPMEM library will check for the environment variable CRAY_XPMEM_POST_LINK_OPTS for the location of the xpmem library. That environment variable is set by the XPMEM module. The following flags have been added to the wrappers to toggle that behavior by turning off automatic huge pages linking, xpmem linking, and both respectively:

· -no-auto-hugetlbfs

· -no-auto-xpmem

· -no-auto-cray-opts

The wrappers will also detect if -lhugetlbfs or -lxpmem are present as a wrapper argument, and in those cases it will not look to add the same library automatically.

Compiling GPU aware codes using the HPE Cray MPI Compiler wrappers

While the HPE Cray MPI compiler wrappers are not aware of the HPE Cray MPI GTL libraries for use with GPU codes, they can be simply used to compile for use with GPUs. Relative symlinks are provided in the library directories pointing to the location of the gtl libraries within a given version of HPE Cray MPI. Simply link the appropriate gtl library for your intended GPU.

Example of compiling a GPU code written in C with the HPE Cray MPI compiler wrappers:

For CUDA codes

· mpicc -lmpi_gtl_cuda

For ROCM codes

· mpicc -lmpi_gtl_hsa

For Intel ZE codes

· mpicc -lmpi_gtl_ze