intro_mpi

intro_mpi - Introduces the Message Passing Interface (MPI)

DESCRIPTION

The Message-Passing Interface (MPI) supports parallel programming across a network of computer systems through a technique known as message passing. The goal of the MPI Forum, simply stated, is to develop a widely used standard for writing message-passing programs. MPI establishes a practical, portable, efficient, and flexible standard for message passing that draws on the most attractive features of a number of existing message-passing systems, rather than selecting one of them and adopting it as the standard.

MPI is a specification (like C or Fortran) that has a number of implementations.

Other sources of MPI information include the man pages for MPI library functions and the following URLs:

http://www.mpich.org
http://www.mpi-forum.org

The default netmod that HPE Cray MPI uses is libfabric (OFI). Libfabric is an open source project maintained by a subgroup of the OpenFabrics Alliance. For more information, visit:

http://ofiwg.github.io/libfabric
http://www.openfabrics.org

The MPI library explicitly sets default values for a subset of OFI environment variables that have been tested and determined to be appropriate. These variables, which start with the prefix FI_, are documented below.

Note

Changing the values of OFI environment variables to non-default values may lead to undefined behavior. The OFI environment variables are mostly designed for advanced users, or for specific tunings or workarounds recommended by HPE.

Alternative netmods for Cray MPICH

In addition to OFI, libraries and documentation are provided for the UCX netmod. To use the UCX netmod, unload the craype-network-ofi and cray-mpich modules and then load the craype-network-ucx and cray-mpich-ucx modules. Doing so also makes the UCX-specific intro_mpi man page available to view.

Note

OFI is the only supported netmod on HPE Slingshot 11 systems. The UCX netmod is not supported with HPE Cray MPI on HPE Slingshot 11 systems.

About the MPI Module

The MPI module is not loaded by default. Users must either load the module manually or have their system administrator select system-wide default preferences in /etc/bash.bashrc.local and /etc/csh.cshrc.local. Therefore, if your code uses MPI, verify that the cray-mpich module is loaded before compiling or linking. This ensures that your code is linked using the -lmpich option.
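
As a quick pre-build check, the following sketch tests whether cray-mpich appears in the LOADEDMODULES environment variable, which the module system maintains as a colon-separated list of loaded modules. The function name module_loaded is illustrative only and is not part of any Cray tool.

```bash
#!/bin/bash
# Return success if a module named "$1" (with or without a "/version"
# suffix) appears in LOADEDMODULES.
module_loaded() {
    local name=$1 entry
    local IFS=:
    for entry in $LOADEDMODULES; do
        case $entry in
            "$name"|"$name"/*) return 0 ;;
        esac
    done
    return 1
}

if module_loaded cray-mpich; then
    echo "cray-mpich is loaded"
else
    echo "cray-mpich is not loaded; run: module load cray-mpich" >&2
fi
```

Because the match stops at the "/" that precedes the version, a loaded cray-mpich-ucx module is not mistaken for cray-mpich.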

GPU Support in Cray MPICH

HPE Cray MPI offers “GPU Aware” MPI support for applications that perform MPI operations with communication buffers on GPU-attached memory regions. HPE Cray MPI is tightly integrated with the rest of the Cray PE stack to offer GPU support and currently supports NVIDIA and AMD GPU devices.

HPE Cray MPI supports the following technologies for MPI operations involving GPU-attached memory regions:

· GPU-NIC RDMA (for inter-node MPI transfers)

· GPU Peer2Peer IPC (for intra-node MPI transfers)

The following sections include sample recipes for compiling, linking, and running GPU-enabled parallel applications:

Section 1 includes relevant details for NVIDIA GPUs.

Section 2 includes relevant details for AMD GPUs.

Section 3 includes a set of recommendations to specify process-to-NIC affinity.

Section 1. Using HPE Cray MPI’s GPU support with NVIDIA GPUs

NVIDIA GPU support in Cray MPI is available for usage models involving PrgEnv-nvidia, PrgEnv-cray, and PrgEnv-gnu flavors. However, there are subtle variations in the level of GPU support offered for each PrgEnv flavor.

PrgEnv Support

Cray PE will provide PrgEnv-nvidia for:

· CPU-only applications,

· CPU/GPU applications with CUDA and CPU codes in the same file, and

· CPU/GPU applications that use OpenMP offload

Cray PE will provide PrgEnv-cray for:

· CPU-only applications and

· CPU/GPU applications with CUDA and CPU codes in different files.

The Cray (CC, ftn, cc) drivers must be used at link time, and the CUDA runtime must be included.

Cray PE will provide PrgEnv-gnu for:

· CPU-only applications,

· CPU/GPU applications with CUDA and CPU codes in the same file.

The nvcc host compiler and GCC versions must be compatible (nvcc currently supports GCC 9.x). Mixing Fortran and CUDA in the same source file is not supported with gfortran. CPU/GPU applications that use OpenMP offload are not supported.

Section 1 (a). Compiling and linking NVIDIA GPU-enabled parallel applications

This section describes a sample recipe for compiling and linking GPU-enabled parallel applications on a system with NVIDIA GPUs.

For the purpose of this illustration, the HPE Cray EX system consists of compute nodes that are based on NVIDIA GPUs and HPE Slingshot-11 NICs. The system software stack on the HPE Cray EX system supports the latest NVIDIA cudatoolkit version:
$ module load PrgEnv-cray
$ module load cray-mpich
$ module load cudatoolkit
$ module load craype-accel-nvidia90
The exact flavor of the craype-accel-nvidia module depends on the underlying system architecture. Users are encouraged to use the “module help” command to learn more about the module flavors and the hardware architectures they support.

NVIDIA’s GDRCopy is a low-latency GPU memory copy library that allows the CPU to directly map and access GPU-attached memory regions. This library is packaged and released along with the rest of the system software stack on the HPE Cray EX system. Cray MPI relies on the GDRCopy layer to optimize small message intra-node and inter-node MPI transfers involving communication buffers that are resident on NVIDIA GPU devices. While GDRCopy is an optional layer, users are advised to add the following linker flags to enable optimizations that leverage the GDRCopy capability:
-L/usr/lib64 -lgdrapi

Section 1 (b). Running NVIDIA GPU-enabled parallel applications

The following is a simple recipe for running GPU-enabled parallel applications on a system with NVIDIA GPUs. This example assumes the use of 4 GPUs per node across 64 compute nodes.

Cray MPI will neither select a default NVIDIA GPU device for a given process, nor initialize a default CUDA context for that process. The intent is to allow users to retain the flexibility of establishing the process-to-CPU and the process-to-GPU mapping to match an application’s requirements.

An end user can establish process-to-CPU mapping via Slurm runtime options. There are two recommended ways to establish process-to-GPU mapping:

CUDA_VISIBLE_DEVICES

NVIDIA offers the CUDA_VISIBLE_DEVICES environment variable to limit the number of GPU devices that are available for a given process. A simple way of initializing CUDA_VISIBLE_DEVICES is to use the SLURM_LOCALID environment variable. The following snippet illustrates this example:
$ cat select_gpu_device
#!/bin/bash

export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID

exec "$@"
$ export MPICH_GPU_SUPPORT_ENABLED=1
$ srun -p <GPU_Partition> -n256 -N64 --ntasks-per-node=4 \
    --cpu-bind=map_cpu:0,16,32,48 ./select_gpu_device  ./exe
Assuming the compute node has four NVIDIA GPUs, each MPI rank is essentially associated with a specific GPU device. This also ensures that two MPI ranks do not share the same GPU device in this specific example. Hence, for each MPI rank, the default CUDA context is initialized by the GPU runtime layer only on the GPU device that is associated with it via the CUDA_VISIBLE_DEVICES variable.

Users have the flexibility of initializing CUDA_VISIBLE_DEVICES via other mechanisms.

MPI ranks sharing a single GPU device

For use cases that involve multiple MPI ranks sharing the same GPU device, enabling CUDA’s Multi-Process Service (MPS) is strongly recommended. More information on MPS is available here: https://docs.nvidia.com/deploy/mps/index.html
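
A wrapper in the style of select_gpu_device can map several node-local ranks to one device. This is a sketch only: the script name select_gpu_shared, the helper gpu_for_local_rank, and the GPUS_PER_NODE variable are hypothetical, and the modulo mapping is one possible policy rather than something Cray MPI prescribes.

```bash
#!/bin/bash
# select_gpu_shared (hypothetical name): lets several node-local ranks
# share a GPU by wrapping the local rank ID around the GPU count.

# Print the GPU index for a node-local rank ID.
gpu_for_local_rank() {
    local local_id=$1 gpus_per_node=$2
    echo $(( local_id % gpus_per_node ))
}

# GPUS_PER_NODE is an assumed, user-provided setting (default: 4).
GPUS_PER_NODE=${GPUS_PER_NODE:-4}
CUDA_VISIBLE_DEVICES=$(gpu_for_local_rank "${SLURM_LOCALID:-0}" "$GPUS_PER_NODE")
export CUDA_VISIBLE_DEVICES

# Run the wrapped command, preserving argument boundaries.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

With 8 ranks per node and 4 GPUs, local ranks 0 and 4 would both be assigned GPU 0 under this mapping.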

CUDA APIs

An alternative approach involves the use of explicit CUDA API calls to detect the number of GPU devices and bind each process to a specific GPU device. In order to perform this mapping each MPI process needs to be aware of its “local id” within a compute node. If an application chooses to establish the process-to-GPU mapping before MPI_Init, the SLURM_LOCALID environment variable can be used.

Note

Slurm’s “--gpus-per-task” flag may also be used to specify process-to-GPU affinity settings. However, Slurm may choose to use cgroups to implement the required affinity settings. Typically, the use of cgroups has the downside of preventing the use of GPU Peer2Peer IPC mechanisms. By default, Cray MPI uses IPC to implement intra-node, inter-process MPI data movement operations that involve GPU-attached user buffers. When Slurm’s cgroups settings are in effect, users are advised to set MPICH_SMP_SINGLE_COPY_MODE=NONE or MPICH_GPU_IPC_ENABLED=0 to disable the use of IPC-based implementations. Note that disabling IPC has a noticeable impact on intra-node MPI performance when GPU-attached memory regions are involved.
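
For example, a job that relies on --gpus-per-task might disable IPC before launch. The settings below are a sketch; the srun line is shown as a comment because partition names and task counts are site-specific, and its values simply reuse those from the recipe above.

```bash
# GPU support must be enabled for GPU-attached communication buffers.
export MPICH_GPU_SUPPORT_ENABLED=1
# Disable GPU Peer2Peer IPC, since cgroups may prevent its use.
export MPICH_GPU_IPC_ENABLED=0

# Launch line (adjust to the site):
# srun -p <GPU_Partition> -n256 -N64 --ntasks-per-node=4 --gpus-per-task=1 ./exe
```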

Section 2. Using HPE Cray MPI’s GPU support with AMD GPUs

AMD GPU support in Cray MPI is available for usage models involving PrgEnv-amd, PrgEnv-cray, and PrgEnv-gnu flavors.

For AMD GPUs, each of these PrgEnv flavors can be used for CPU-only applications, CPU/GPU applications with ROCm and CPU codes in the same file, and CPU/GPU applications that use OpenMP offload.

Section 2 (a). Compiling and linking AMD GPU-enabled parallel applications

This section describes a sample recipe for compiling and linking GPU-enabled parallel applications on a system with AMD GPUs.

For the purpose of this illustration, the HPE Cray EX system consists of compute nodes that are based on AMD GPUs and HPE Slingshot-11 NICs. The system software stack on the HPE Cray EX system supports the latest AMD ROCm version:
$ module load PrgEnv-cray
$ module load cray-mpich
$ module load rocm
$ module load craype-accel-amd-gfx90a
The exact flavor of the craype-accel-amd-gfx module depends on the underlying system architecture. Users are encouraged to use the “module help” command to learn more about the module flavors and the hardware architectures they support.

Section 2 (b). Running AMD GPU-enabled parallel applications

The following is a simple recipe for running GPU-enabled parallel applications on a system with AMD GPUs. This example assumes the use of 4 GPUs per node across 64 compute nodes.

Cray MPI will neither select a default GPU device for a given process, nor initialize a default AMD GPU context for that process. The intent is to allow users to retain the flexibility of establishing the process-to-CPU and the process-to-GPU mapping to match an application’s requirements.

An end user can establish process-to-CPU mapping via Slurm runtime options. There are two recommended ways to establish process-to-GPU mapping:

ROCR_VISIBLE_DEVICES

AMD offers the ROCR_VISIBLE_DEVICES environment variable to limit the number of GPU devices that are available for a given process. A simple way of initializing ROCR_VISIBLE_DEVICES is to use the SLURM_LOCALID environment variable. The following snippet illustrates this example:
$ cat select_gpu_device
#!/bin/bash

export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID

exec "$@"
$ export MPICH_GPU_SUPPORT_ENABLED=1
$ srun -p <GPU_Partition> -n 256 -N64 --ntasks-per-node=4 \
        --cpu-bind=map_cpu:0,16,32,48 ./select_gpu_device  ./exe
Assuming the compute node has four AMD GPUs, each MPI rank detects the availability of a single GPU device. However, each rank is essentially associated with a different GPU device to ensure that two MPI ranks do not share the same GPU device. Hence, for each MPI rank, the default GPU context is initialized by the GPU runtime layer only on the GPU device that is visible to it.

Users have the flexibility of initializing ROCR_VISIBLE_DEVICES via other mechanisms.

Note that process-to-CPU and process-to-GPU affinity settings are highly system architecture specific. Users may customize the above recipes to best fit their target architectures.

HIP (ROCm) APIs

An alternative approach involves the use of explicit HIP API calls to detect the number of GPU devices and bind each process to a specific GPU device. In order to perform this mapping each MPI process needs to be aware of its “local id” within a compute node. If an application chooses to establish the process-to-GPU mapping before MPI_Init, the SLURM_LOCALID environment variable can be used.

Section 3. Mapping processes to network interfaces

On compute nodes that offer multiple GPU devices and multiple Network Interface Controllers (NICs), Cray MPI provides a flexible way to select the ideal mapping between a process and its default NIC.

For GPU-enabled parallel applications that involve MPI operations that access application arrays resident on GPU-attached memory regions, users can set MPICH_OFI_NIC_POLICY to GPU. In this case, for each MPI process, Cray MPI strives to select a NIC device that is closest to the GPU device being used.

For CPU-enabled applications, or for GPU-enabled applications that do not involve MPI operations that access GPU-attached memory regions, users can set MPICH_OFI_NIC_POLICY to NUMA. In this case, for each MPI process, Cray MPI strives to select a NIC device that is closest to the CPU NUMA domain being used.

For custom NIC selection requirements, please refer to the section on MPICH_OFI_NIC_POLICY in this man page.
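
The two recommendations above can be sketched as a small helper. The function name nic_policy_for and its "gpu"/"cpu" argument are hypothetical; only the MPICH_OFI_NIC_POLICY variable and the GPU and NUMA values come from this man page.

```bash
# Print the NIC policy suggested above for a given kind of application.
# Pass "gpu" when MPI calls are given GPU-resident buffers.
nic_policy_for() {
    case $1 in
        gpu) echo GPU ;;   # NIC closest to the GPU device in use
        *)   echo NUMA ;;  # NIC closest to the CPU NUMA domain in use
    esac
}

MPICH_OFI_NIC_POLICY=$(nic_policy_for gpu)
export MPICH_OFI_NIC_POLICY
```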

GPU-NIC Async Communication Strategies

On modern heterogeneous supercomputing systems that are comprised of compute blades that offer CPUs and GPUs, it is necessary to move data efficiently between these different compute engines across a high-speed network. While current generation scientific applications and systems software stacks are “GPU-aware”, especially from the point of view of RDMA, CPU threads are still required to orchestrate data moving communication operations and inter-process synchronization operations. Naturally, this requirement results in all communication and synchronization operations occurring at GPU kernel boundaries. An application process running on the CPU first synchronizes with the local GPU device to ensure that the compute kernel has completed. Next, an application process initiates, progresses, and completes inter-process communication/synchronization operations. Typically, subsequent compute kernels can be launched only after the inter-process communication operations have completed. Owing to this behavior, current generation GPU-aware parallel applications are affected by potentially expensive synchronization points that require the CPU to synchronize with the GPU and NIC devices. In addition, the overhead of launching compute kernels is in the critical path and an iterative-parallel application experiences this overhead each time a new kernel is offloaded to the GPU.

HPE Cray MPI offers early experimental support for advanced GPU-centric communication operations, called “GPU-NIC Async” strategies, to address these problems on emerging supercomputing systems, specifically on platforms consisting of AMD and NVIDIA GPU devices. These strategies also leverage some of the key technologies offered by the HPE Slingshot network. The currently supported strategies are Stream Triggered (ST) and Kernel Triggered (KT).

· Stream Triggered (ST) Communication

The Stream Triggered (ST) technology enables users to offload both computation and the communication control paths to the GPU. In this approach, the CPU creates network command descriptors and appends them to the NIC command queue. These command descriptors have special attributes that allow them to be “triggered” at a later point in time when certain conditions have been satisfied. In addition, the CPU also creates control operations and appends them to the GPU stream. These operations will be executed by the GPU Control Processor in sequential order, relative to other operations in the GPU stream. When the control operations are executed by the GPU Control Processor, they act as “triggers” that initiate the execution of the previously appended network command descriptors in the NIC’s command queues. Additionally, this approach also allows the GPU Control Processor to synchronize with the Cassini NIC to determine the successful completion of communication operations. This synchronization step involves the use of hardware counters in the Cassini NIC. Thus, ST minimizes the need for synchronization between the CPU and GPU.

· Kernel Triggered (KT) Communication

While ST minimizes synchronization overheads between CPU and GPU, it fundamentally requires communication operations to be performed at kernel boundaries. In addition, kernel launch/teardown overheads still exist in the critical path for parallel applications. Kernel Triggered (KT) is another strategy that can address these limitations. This approach also relies on the CPU to create network command descriptors beforehand and append them to the NIC command queues. The KT strategy allows a GPU thread to trigger these network command descriptors, directly from within a GPU kernel. Hence, KT allows application developers to define both computation and communication operations inside GPU kernels. By removing the need for communication operations to be performed only at kernel boundaries, this strategy can be used to define long running (or persistent) GPU kernels that can perform computation, communication, and synchronization operations. Therefore, KT can potentially reduce the overheads associated with repeated kernel launch/teardown operations. In addition, sophisticated kernels can be designed to overlap compute with communication and synchronization operations to achieve latency hiding and improve application scaling and efficiency.

Cray MPI-specific APIs are introduced to support the ST and KT communication strategies. Please refer to the Cray MPI GPU-NIC Async API section for details on these APIs. Environment variables related to Cray MPI GPU-NIC Async communication operations include the following:

· MPICH_GPU_USE_STREAM_TRIGGERED

· MPICH_GPU_USE_KERNEL_TRIGGERED

· MPICH_GPU_USE_STREAM_TRIGGERED_SET_SIGNAL

· MPICH_MAX_TOPS_COUNTERS

· MPICH_GPU_DISABLE_STREAM_TRIGGERED_TOPS

Please refer to the respective environment variable descriptions for more details.

ENVIRONMENT

Environment variables have predefined values. You can change some variables to achieve particular performance objectives; others are required values for standard-compliant programs.

At launch, MPICH checks /etc/cray-mpich.conf for environment variable values to use. These enable administrators to set site-wide defaults. It is essential that this file is the same on each cluster host in order to prevent random and difficult to debug MPI errors. The file may contain blank lines and comment lines starting with “#”. Lines with variables are formatted as KEY=VALUE. The KEY is set unless already set in the environment. If a line is formatted as KEY:force=VALUE, then the value will overwrite any previous value.
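
The precedence rules can be illustrated with a small shell model. This is only a sketch of the behavior described above, not the parser used inside the MPI library; the function name apply_conf, the temporary file, and the sample values are made up for the example.

```bash
#!/bin/bash
# Model of the KEY=VALUE and KEY:force=VALUE rules described above.
apply_conf() {
    local file=$1 line key value
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            ''|'#'*) continue ;;   # blank line or comment
            *=*) ;;                # candidate variable line
            *) continue ;;         # anything else is ignored in this model
        esac
        key=${line%%=*}
        value=${line#*=}
        if [ "${key%:force}" != "$key" ]; then
            # KEY:force=VALUE overwrites any previous value.
            export "${key%:force}=$value"
        elif [ -z "${!key+set}" ]; then
            # KEY=VALUE applies only if KEY is not already set.
            export "$key=$value"
        fi
    done < "$file"
}

conf=$(mktemp)
cat > "$conf" <<'EOF'
# site-wide defaults (example values only)
MPICH_ENV_DISPLAY=1
MPICH_RMA_MAX_PENDING:force=128
EOF

export MPICH_ENV_DISPLAY=0        # kept: a plain KEY=VALUE line does not override
export MPICH_RMA_MAX_PENDING=32   # replaced: the :force line overrides
apply_conf "$conf"
echo "MPICH_ENV_DISPLAY=$MPICH_ENV_DISPLAY"
echo "MPICH_RMA_MAX_PENDING=$MPICH_RMA_MAX_PENDING"
rm -f "$conf"
```

In this model the user's MPICH_ENV_DISPLAY=0 survives, while MPICH_RMA_MAX_PENDING ends up as 128 because of the :force suffix.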

GENERAL MPICH ENVIRONMENT VARIABLES

MPICH_ABORT_ON_ERROR

If enabled, causes MPICH to abort and produce a core dump when MPICH detects an internal error. Note that the core dump size limit (usually 0 bytes by default) must be reset to an appropriate value in order to enable core dumps.

Default: Not enabled.
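
A minimal sketch of the two steps, using the shell's built-in ulimit. The value 1 is an assumption; this page only says the variable must be enabled.

```bash
# Raise the core file size limit for this shell and its child processes.
ulimit -c unlimited || echo "could not raise the core file size limit" >&2
# Assumed value: any setting that enables the feature; 1 is used here.
export MPICH_ABORT_ON_ERROR=1
```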

MPICH_ASYNC_PROGRESS

If enabled, MPICH will initiate an additional thread to make asynchronous progress on all communication operations including point-to-point, collective, one-sided operations, and I/O. Setting this variable will automatically increase the thread-safety level to MPI_THREAD_MULTIPLE. While this improves the progress semantics, it might cause a small amount of performance overhead for regular MPI operations. The user is encouraged to leave one or more hardware threads vacant in order to prevent contention between the application threads and the progress thread(s). The impact of oversubscription is highly system dependent but may be substantial in some cases, hence this recommendation.

Default: Not enabled.

MPICH_COLL_SYNC

If enabled, a Barrier is performed at the beginning of each specified MPI collective function. This forces all processes participating in that collective to sync up before the collective can begin.

To disable this feature for all MPI collectives, set the value to 0. This is the default.

To enable this feature for all MPI collectives, set the value to 1.

To enable this feature for selected MPI collectives, set the value to a comma-separated list of the desired collective names. Names are not case-sensitive. Any unrecognizable name is flagged with a warning message and ignored. The following collective names are recognized: MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Alltoallw, MPI_Bcast, MPI_Exscan, MPI_Gather, MPI_Gatherv, MPI_Reduce, MPI_Reduce_scatter, MPI_Scan, MPI_Scatter, and MPI_Scatterv.

Default: Not enabled.
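
Because unrecognized names are ignored with only a warning, it can help to check a list before launch. The sketch below compares a comma-separated list against the names given above, ignoring case. The helper unrecognized_collectives is hypothetical and handles name lists only, not the values 0 and 1.

```bash
#!/bin/bash
# Collective names that this man page lists as recognized (lowercased).
RECOGNIZED="mpi_allgather mpi_allgatherv mpi_allreduce mpi_alltoall
mpi_alltoallv mpi_alltoallw mpi_bcast mpi_exscan mpi_gather mpi_gatherv
mpi_reduce mpi_reduce_scatter mpi_scan mpi_scatter mpi_scatterv"

# Print, in lowercase, every name in a comma-separated list that is not
# in the recognized set. Prints nothing if all names are recognized.
unrecognized_collectives() {
    printf '%s\n' "$1" | tr ',' '\n' | tr '[:upper:]' '[:lower:]' |
        grep -vxF -f <(printf '%s\n' $RECOGNIZED)
}

# Synchronize only before broadcasts and all-reduces.
MPICH_COLL_SYNC="MPI_Bcast,MPI_Allreduce"
if [ -z "$(unrecognized_collectives "$MPICH_COLL_SYNC")" ]; then
    export MPICH_COLL_SYNC
fi
```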

MPICH_ENV_DISPLAY

If set, causes rank 0 to display all MPICH environment variables and their current settings at MPI initialization time.

Default: Not enabled.

MPICH_GPU_SUPPORT_ENABLED

If set to 1, enables GPU support. Currently, AMD and NVIDIA GPUs are supported. If a parallel application is GPU-enabled and performs MPI operations with communication buffers that are on GPU-attached memory regions, MPICH_GPU_SUPPORT_ENABLED needs to be set to 1.

Default: 0

MPICH_GPU_IPC_ENABLED

If set to 1, enables GPU IPC support for intra-node GPU-GPU communication operations. Currently, this supports the use of IPC for both AMD and NVIDIA GPUs. If MPICH_GPU_SUPPORT_ENABLED is set to 1, MPICH_GPU_IPC_ENABLED is automatically set to 1. This variable has no effect if MPICH_GPU_SUPPORT_ENABLED is set to 0.

Default: 1

MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED

If set to 1, enables GPU managed memory support. This setting will allow MPI to properly handle unified memory addresses.

On systems with NVIDIA GPUs, this setting may lead to a small performance overhead. This is because the MPI implementation needs to perform an additional pointer query against the GPU runtime layer for each MPI data transfer operation.

For latency sensitive use cases that do not rely on NVIDIA’s Managed Memory routines, users are advised to set MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED to 0.

This variable has no effect if MPICH_GPU_SUPPORT_ENABLED is set to 0.

Default: 1

MPICH_GPU_EAGER_REGISTER_HOST_MEM

If set to 1, the MPI library registers the CPU-attached shared memory regions with the GPU runtime layers. These shared memory regions are used for small message intra-node CPU-to-GPU and GPU-to-GPU MPI transfers. This optimization helps amortize the cost of registering memory with the GPU runtime layer. MPICH_GPU_EAGER_REGISTER_HOST_MEM is automatically set to 1 if MPICH_GPU_SUPPORT_ENABLED is set to 1. This variable has no effect if MPICH_GPU_SUPPORT_ENABLED is set to 0.

Default: 1

MPICH_GPU_IPC_THRESHOLD

This variable determines the threshold for the GPU IPC capability. GPU IPC takes advantage of DMA engines on GPU devices to accelerate data movement operations between GPU devices on the same node. Intra-node GPU-GPU transfers with payloads of size greater than or equal to this value will use the IPC capability. Transfers with smaller payloads will use CPU-attached shared memory regions.

Default: 1024
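
The threshold rule can be modeled in a few lines. This sketch assumes the threshold is expressed in bytes, which this page does not state explicitly. The function gpu_transfer_path and its output labels are illustrative and do not query the MPI library.

```bash
# Print which path an intra-node GPU-GPU transfer of the given payload
# size would take under the threshold rule described above.
gpu_transfer_path() {
    local bytes=$1 threshold=${MPICH_GPU_IPC_THRESHOLD:-1024}
    if (( bytes >= threshold )); then
        echo "ipc"
    else
        echo "shared-memory"
    fi
}

gpu_transfer_path 512     # below the default threshold
gpu_transfer_path 1024    # at the default threshold
```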

MPICH_GPU_NO_ASYNC_COPY

This variable toggles an optimization for intra-node MPI transfers involving CPU and GPU buffers. This optimization is enabled by default. If set to 1, it reverts to using blocking memcpy operations for intra-node MPI transfers involving CPU and GPU buffers. Depending on the GPU hardware being used, disabling this optimization negatively affects the performance of large message intra-node MPI operations involving the CPU-to-GPU data paths.

Default: 0

MPICH_GPU_COLL_STAGING_AREA_OPT

This variable toggles an optimization for certain collective operations involving GPU buffers. The optimization is currently implemented for MPI_Allreduce operations involving large payloads. This optimization is applicable for GPU-GPU transfers involving communication peers that are on the same compute node, or on different compute nodes. If set to 1, this optimization is enabled.

Default: 1

MPICH_MEMORY_REPORT

If set to 1, print a summary of the min/max high water mark and associated rank to stderr.

If set to 2, output each rank’s high water mark to a file as specified using MPICH_MEMORY_REPORT_FILE.

If set to 3, do both 1 and 2.

The detailed report for each rank may be slightly higher than the summary because the summary is collected earlier during finalize and requires MPI collective calls, which may allocate more memory.

Example #1: MPICH_MEMORY_REPORT=1

This summary reports maximum and minimum values and the lowest rank that reported the value (max_loc/min_loc reductions). The by malloc lines are for malloc/free calls. The by mmap lines are for mmap/munmap calls. The by shmget lines are for shmget or SYSCALL(shmget) and shmctl(..RM_ID..) calls.
# MPICH_MEMORY: Max memory allocated by malloc:  3898224 bytes by rank 40
# MPICH_MEMORY: Min memory allocated by malloc:  2805600 bytes by rank 61
# MPICH_MEMORY: Max memory allocated by mmap:    10485760 bytes by rank 0
# MPICH_MEMORY: Min memory allocated by mmap:    10485760 bytes by rank 0
# MPICH_MEMORY: Max memory allocated by shmget:  108821784 bytes by rank 0
# MPICH_MEMORY: Min memory allocated by shmget:  0 bytes by rank 1
# MPICH_MEMORY: Max memory reserved by symmetric heap:  2097152 bytes by rank 0
# MPICH_MEMORY: Min memory reserved by symmetric heap:  2097152 bytes by rank 0
# MPICH_MEMORY: Max memory allocated by symmetric heap: 1048576 bytes by rank 0
# MPICH_MEMORY: Min memory allocated by symmetric heap: 1048576 bytes by rank 0

Example #2: MPICH_MEMORY_REPORT=2

Each rank reports similar high water information to file.rank. Each line in the report begins with rank. This example is for rank 1.
# [1] Max memory allocated by malloc:    2898408 bytes
# [1] Max memory allocated by mmap:      10485760 bytes
# [1] Max memory allocated by shmget:    0 bytes
# [1] Max memory reserved by symmetric heap:  2097152 bytes
# [1] Max memory allocated by symmetric heap: 1048576 bytes

Default: not set (off)

MPICH_MEMORY_REPORT_FILE

Specifies the target path/prefix for the detailed high water mark list generated if MPICH_MEMORY_REPORT is set to 2 or 3. The actual filename for each high water mark report is this path/prefix plus the MPI rank number. If the specified target file cannot be opened, stderr is used.

Default: stderr
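
Per-rank reports in the format shown in Example #2 can be post-processed with standard tools. The sketch below extracts the malloc high water mark from each report. The helper name malloc_hwm is hypothetical, and the sample input is copied from the example above.

```bash
#!/bin/bash
# Print "<rank> <bytes>" for every malloc high water mark line found in
# the given per-rank report files (or on standard input).
malloc_hwm() {
    awk '/Max memory allocated by malloc:/ {
        rank = $2
        gsub(/\[|\]/, "", rank)   # strip the brackets around the rank
        print rank, $(NF - 1)     # the byte count precedes "bytes"
    }' "$@"
}

# Sample lines copied from Example #2 above.
malloc_hwm <<'EOF'
# [1] Max memory allocated by malloc:    2898408 bytes
# [1] Max memory allocated by mmap:      10485760 bytes
EOF
```

Assuming MPICH_MEMORY_REPORT_FILE was set to a prefix such as hwm/report (a made-up path), running malloc_hwm hwm/report.* | sort -k2,2n would list the ranks in order of increasing malloc usage.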

MPICH_NO_BUFFER_ALIAS_CHECK

If enabled, the buffer alias error check for collectives is disabled. The MPI standard does not allow aliasing of type OUT or INOUT parameters on the same collective function call. The use of MPI_IN_PLACE is required in these scenarios. A check is in place to detect this condition and report the error. To bypass this check, set MPICH_NO_BUFFER_ALIAS_CHECK to any value.

Default: Not enabled.

MPICH_OPT_THREAD_SYNC

Controls the mechanism used to implement thread-synchronization inside the MPI library. If set to 1, an optimized synchronization implementation is used. If set to 0, the library falls back to using the pthread mutex based thread-synchronization implementation. This variable is applicable only if the MPI_THREAD_MULTIPLE threading level is requested by the application during MPI initialization.

Default: 1

MPICH_OPTIMIZED_MEMCPY

Specifies which version of memcpy to use. Valid values are:

0

Use the system (glibc) version of memcpy.

1

Use an optimized version of memcpy if one is available for the processor being used.

2

Use a highly optimized version of memcpy that provides better performance in some areas but may have performance regressions in other areas, if one is available for the processor being used.

Default: 1

MPICH_RANK_REORDER_DISPLAY

If enabled, causes rank 0 to display which node each MPI rank resides in. The rank order can be manipulated via the MPICH_RANK_REORDER_METHOD environment variable.

Default: Not enabled.

MPICH_RANK_REORDER_FILE

If MPICH_RANK_REORDER_METHOD is set to 3 and this variable is set, then the value of this variable is the file name that MPI will check for rank reordering information. If this variable is not set, then MPI will check the default file name, MPICH_RANK_ORDER.

Default: Not set

MPICH_RANK_REORDER_METHOD

Overrides the default MPI rank placement scheme. To display the MPI rank placement information, enable MPICH_RANK_REORDER_DISPLAY.

MPICH_RANK_REORDER_METHOD accepts the following values:

0
Specifies round-robin placement. Sequential MPI ranks are placed on the next node in the list. When every node has been used, the rank placement starts over again with the first node.

For example, an 8-process job launched on 4 dual-core nodes would be placed as
NODE   RANK
  0    0&4
  1    1&5
  2    2&6
  3    3&7
For example, an 8-process job launched on 2 quad-core nodes would be placed as
NODE   RANK
  0    0&2&4&6
  1    1&3&5&7
A 24-process job launched on three 8-core nodes would be placed as
NODE   RANK
  0    0&3&6&9&12&15&18&21
  1    1&4&7&10&13&16&19&22
  2    2&5&8&11&14&17&20&23
If the last node is not fully populated with MPI ranks using the default placement, no additional ranks can be placed on that node. A 20-process job launched on three 8-core nodes with round-robin placement would be placed as
NODE   RANK
  0    0&3&6&9&12&14&16&18
  1    1&4&7&10&13&15&17&19
  2    2&5&8&11
1
Specifies SMP-style placement. For a multi-core node, sequential MPI ranks are placed on the same node.

For example, an 8-process job launched on 4 dual-core nodes would be placed as
NODE   RANK
  0    0&1
  1    2&3
  2    4&5
  3    6&7
An 8-process job launched on 2 quad-core nodes would be placed as
NODE   RANK
  0    0&1&2&3
  1    4&5&6&7
A 24-process job launched on three 8-core nodes would be placed as
NODE   RANK
  0    0&1&2&3&4&5&6&7
  1    8&9&10&11&12&13&14&15
  2    16&17&18&19&20&21&22&23
2
Specifies folded-rank placement. Sequential MPI ranks are placed on the next node in the list. When every node has been used, instead of starting over with the first node again, the rank placement starts at the last node, going back to the first. For quad-core or larger nodes, this fold is repeated.

For example, an 8-process job on 4 dual-core nodes would be placed as
NODE   RANK
  0    0&7
  1    1&6
  2    2&5
  3    3&4
An 8-process job on 2 quad-core nodes would be placed as
NODE   RANK
  0    0&3&4&7
  1    1&2&5&6
A 24-process job launched on three 8-core nodes would be placed as
NODE   RANK
  0    0&5&6&11&12&17&18&23
  1    1&4&7&10&13&16&19&22
  2    2&3&8&9&14&15&20&21
3

Specifies a custom rank placement defined in the file named MPICH_RANK_ORDER. The MPICH_RANK_ORDER file must be readable by the first rank of the program, and reside in the current running directory. The order in which the ranks are listed in the file determines which ranks are placed closest to each other, starting with the first node in the list. To help with creating this file, consider using the grid_order tool from the Perftools package.

The PALS launcher forwards stdin to original rank 0 in MPI_COMM_WORLD only. If your application requires stdin to be available to rank 0, any rank reorder arrangement must not reorder the original rank 0. If your application does not read from stdin, rank 0 can be reordered.

For example:

0-15

Places the ranks in SMP-style order (see above).

15-0

For dual-core processors, places ranks 15&14 on the first node, ranks 13&12 on the next node, and so on. For quad-core processors, places ranks 15&14&13&12 on the first node, ranks 11&10&9&8 on the next node, and so on.

4,1,5,2,6,3,7,0,…

Places the first n ranks listed on the first node, the next n ranks on the next node, and so on, where n is the number of processes launched on each node.

You can use combinations of ranges (8-15) or individual rank numbers in the MPICH_RANK_ORDER file. The number of ranks listed in this file must match the number of processes launched.

A # denotes the beginning of a comment. A comment can start in the middle of a line and will continue to the end of the line.
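
As a rough illustration of how such a file could be interpreted, the sketch below expands ranges and single rank numbers, strips # comments, and groups the result by node. It is not the parser used by the MPI library. The function names are hypothetical, and the sketch assumes that entries may span several lines and that a descending range such as 15-0 counts downward, as in the example above.

```python
def parse_rank_order(text):
    """Expand the contents of a rank order file into a flat list of ranks."""
    ranks = []
    for line in text.splitlines():
        # Everything from '#' to the end of the line is a comment.
        line = line.split("#", 1)[0]
        for token in line.split(","):
            token = token.strip()
            if not token:
                continue
            if "-" in token:
                first, last = (int(x) for x in token.split("-"))
                step = 1 if last >= first else -1
                ranks.extend(range(first, last + step, step))
            else:
                ranks.append(int(token))
    return ranks


def group_by_node(ranks, per_node):
    """Place the first per_node ranks on node 0, the next on node 1, ..."""
    return [ranks[i:i + per_node] for i in range(0, len(ranks), per_node)]


# Dual-core example: ranks 15&14 land on the first node, 13&12 on the next.
print(group_by_node(parse_rank_order("15-0"), 2))
```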

MPICH_RMA_MAX_PENDING

Determines how many RMA network operations may be outstanding at any time. RMA operations beyond this maximum are queued and issued only as pending operations complete.

Default: 64

MPICH_RMA_SHM_ACCUMULATE

If set to 1, enables accumulate operations through shared memory (shm). If set to 0, disables the shm path, and accumulate operations use other implementations. This variable also sets the default for the window hint “disable_shm_accumulate”: true if MPICH_RMA_SHM_ACCUMULATE is 0, and false if MPICH_RMA_SHM_ACCUMULATE is 1.

Default: 1

MPICH_SINGLE_HOST_ENABLED

If enabled, prevents MPICH from using networking hardware when all ranks are on a single host. This avoids the unnecessary consumption of networking resources. This feature is only usable when MPI Spawn is not possible due to all possible ranks in MPI_UNIVERSE_SIZE already being part of MPI_COMM_WORLD.

Default: Enabled

MPICH_VERSION_DISPLAY

If enabled, causes MPICH to display the HPE Cray MPI version number as well as build date information. The version number can also be accessed through the attribute CRAY_MPICH_VERSION.

Default: Not enabled.

GPU-NIC ASYNC ENVIRONMENT VARIABLES

MPICH_GPU_USE_STREAM_TRIGGERED

If set, causes MPICH to allow using GPU-NIC Async Stream Triggered (ST) GPU communication operations.

Default: Not enabled.

MPICH_GPU_USE_KERNEL_TRIGGERED

If set, causes MPICH to allow using GPU-NIC Async Kernel Triggered (KT) GPU communication operations.

Default: Not enabled.

MPICH_GPU_USE_STREAM_TRIGGERED_SET_SIGNAL

If set, causes MPICH to allow using stream triggered GPU communication operations with atomic set operations for signaling purposes.

Default: Not enabled.

MPICH_MAX_TOPS_COUNTERS

Specifies the maximum number of hardware counters to open for the triggered operations required to support the ST and KT GPU-NIC communication operations. Triggered operations must be enabled for this variable to take effect.

Default: 64.

SMP ENVIRONMENT VARIABLES

MPICH_SHM_PROGRESS_MAX_BATCH_SIZE

Adjusts the maximum number of on-node requests that can be processed in a single batch. Higher values for the maximum batch size can lower the overhead due to entering the progress engine, but can also delay the processing of off-node message requests.

Default: 8

MPICH_SMP_SINGLE_COPY_MODE

If set, selects the on-node implementation for large messages. This variable can be set to XPMEM, CMA or NONE. By default, Cray MPICH will attempt to use the single-copy-based implementation via XPMEM. If XPMEM is not available, Cray MPICH will fall back to using CMA. If CMA is also not available, Cray MPICH will fall back to the two-copy-based shared-memory implementation. If this variable is set to NONE, it overrides the MPICH_SMP_SINGLE_COPY_SIZE setting.

On systems with compute nodes that offer CPUs and GPUs, users can request GPU-aware MPI communication capabilities by setting MPICH_GPU_SUPPORT_ENABLED to 1. By default, HPE Cray MPI uses the GPU Peer2Peer (IPC) technology to optimize intra-node, inter-GPU data movement operations. In this configuration, note that if MPICH_SMP_SINGLE_COPY_MODE is set to NONE or if MPICH_GPU_IPC_ENABLED is set to 0, the GPU IPC optimization is disabled. When IPC is disabled, HPE Cray MPI uses a two-copy-based shared-memory implementation that involves DeviceToHost and/or HostToDevice memcpy operations to implement the requested MPI operations. Disabling IPC often leads to high performance overheads and is intended only for debugging.

Default: XPMEM

MPICH_SMP_SINGLE_COPY_SIZE

Specifies the minimum message size in bytes to consider for single-copy transfers for on-node messages. This applies only to the SMP (on-node shared memory) device. The value is interpreted as bytes, unless the string ends in a K or M, which indicates kilobytes or megabytes, respectively.

Valid values are between 512 and approximately 65536 bytes.

Default: 8192 bytes
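
The K and M suffix convention can be sketched as follows. This is only an illustration: the helper name parse_size is hypothetical, and the sketch assumes binary multipliers (1 K = 1024 bytes, 1 M = 1024 * 1024 bytes), which this man page does not state explicitly.

```python
def parse_size(value):
    """Convert a size string such as '8192', '8K' or '1M' to bytes.

    Assumption: K and M are binary multipliers (1024 and 1024 * 1024).
    """
    text = value.strip()
    multipliers = {"K": 1024, "M": 1024 * 1024}
    if text and text[-1] in multipliers:
        return int(text[:-1]) * multipliers[text[-1]]
    return int(text)


# Under the stated assumption, "8K" equals the default of 8192 bytes.
print(parse_size("8K"))
```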

LIBFABRIC ENVIRONMENT VARIABLES FOR INDUSTRY STANDARD NIC (Slingshot 10)

FI_OFI_RXM_BUFFER_SIZE

This is a verbs;ofi_rxm libfabric ENV variable. It specifies the transmit buffer size/inject size in bytes. Messages of size less than this will be transmitted via an eager protocol and those above will be transmitted via a rendezvous or SAR (Segmentation And Reassembly) protocol. Note that user data is copied up to this size. Only applies to Slingshot 10.

Default: 16364

FI_OFI_RXM_SAR_LIMIT

This is a verbs;ofi_rxm libfabric ENV variable. Set this environment variable to control the SAR (Segmentation And Reassembly) protocol. The SAR protocol breaks a message into smaller units before transmission and reassembles them into the proper order at the receiving end. Messages of size greater than this (in bytes) are transmitted via rendezvous protocol. Setting this to 0 disables SAR protocol entirely. When set to 0, messages will be transferred by either the eager or rendezvous protocols. Only applies to Slingshot 10.

Default: 262144
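
The interaction between FI_OFI_RXM_BUFFER_SIZE and FI_OFI_RXM_SAR_LIMIT can be summarized with a small sketch. It is a simplified model of the descriptions above, not provider code. The function name rxm_protocol is hypothetical, and the treatment of a message exactly equal to the buffer size is an assumption, since the text only covers sizes below and above it.

```python
def rxm_protocol(msg_bytes, buffer_size=16364, sar_limit=262144):
    """Return the protocol the descriptions above imply for a message size.

    Defaults mirror the documented defaults for Slingshot 10.
    """
    # Assumption: a message of exactly buffer_size is not sent eagerly.
    if msg_bytes < buffer_size:
        return "eager"
    # A SAR limit of 0 disables SAR, leaving only eager and rendezvous.
    if sar_limit == 0 or msg_bytes > sar_limit:
        return "rendezvous"
    return "sar"


for size in (1024, 65536, 1048576):
    print(size, rxm_protocol(size))
```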

FI_OFI_RXM_RX_SIZE

This is a verbs;ofi_rxm libfabric ENV variable. Adjust this environment variable to control the size of the receive queue. The size of a receive queue dictates how many outstanding receives the application can post at a time, without processing any matching sends. In some applications, the number of outstanding receives might exceed the size of the receive queue and cause the application to deadlock if no matching sends are received. Increasing the queue size can mitigate the problem. Note the default queue size should satisfy most applications, and libfabric will consume more memory with a larger receive queue size. Only applies to Slingshot 10.

Default: 4096

FI_OFI_RXM_USE_SRX

This is a verbs;ofi_rxm libfabric ENV variable. Set this to 1 to instruct the provider to use shared receive queues. Using shared receive queues can reduce the overall memory usage significantly, but may cause latency to increase slightly. Only applies to Slingshot 10.

Default behavior if unset:
For job sizes of < 64 ranks, default is 0
For job sizes of 64 ranks or larger, default is 1

FI_VERBS_PREFER_XRC

This is a verbs;ofi_rxm libfabric ENV variable. Set this to 1 to request use of the XRC (eXtended Reliable Connection) protocol. Note FI_OFI_RXM_USE_SRX must also be set to 1 when requesting XRC. Using the XRC protocol reduces the number of connections, hardware resources, and memory footprint for large scaling jobs that require a demanding communication pattern. This environment variable is required when scaling jobs with an all-to-all communication pattern. Only applies to Slingshot 10.

Default behavior if unset:
For job sizes of < 64 ranks, default is 0
For job sizes of 64 ranks or larger, default is 1

FI_VERBS_MIN_RNR_TIMER

This is a verbs;ofi_rxm libfabric ENV variable. This sets the minimum backoff time used when the Mellanox NICs experience congestion. Allowable values are 0-31, with higher values corresponding to longer backoffs. Setting this to 0 is not recommended, however, as that translates into a very large backoff and will adversely affect performance. Optimal values for Slingshot 10 systems are likely between 3 and 6. Only applies to Slingshot 10.

Default: 6

LIBFABRIC ENVIRONMENT VARIABLES FOR HPE SLINGSHOT NIC (Slingshot 11)

FI_CXI_RDZV_THRESHOLD

This is a cxi libfabric ENV variable. It specifies the threshold in bytes above which the rendezvous protocol will be used. Messages of this size or smaller will use an eager protocol. Only applies to Slingshot 11.

Default: 16384

FI_CXI_RDZV_EAGER_SIZE

This is a cxi libfabric ENV variable. It specifies the portion of data in bytes to send eagerly when a message qualifies to use the rendezvous protocol. This small portion of data is sent eagerly along with the message header. The remainder of the payload data will be read from the source using a Get. Only applies to Slingshot 11.

Default: 2048
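
To make the relationship between FI_CXI_RDZV_THRESHOLD and FI_CXI_RDZV_EAGER_SIZE concrete, the sketch below models how a message of a given size would be split according to the descriptions above. It is a simplified model, not the cxi provider's implementation, and the function name cxi_send_plan and the dictionary keys are hypothetical.

```python
def cxi_send_plan(msg_bytes, rdzv_threshold=16384, rdzv_eager_size=2048):
    """Model how many bytes travel eagerly and how many are fetched by Get.

    Defaults mirror the documented defaults for Slingshot 11.
    """
    # Messages of the threshold size or smaller are sent eagerly in full.
    if msg_bytes <= rdzv_threshold:
        return {"protocol": "eager", "eager_bytes": msg_bytes, "get_bytes": 0}
    # Larger messages send a small eager portion with the header; the
    # remainder is read from the source using a Get.
    eager = min(rdzv_eager_size, msg_bytes)
    return {
        "protocol": "rendezvous",
        "eager_bytes": eager,
        "get_bytes": msg_bytes - eager,
    }


print(cxi_send_plan(16384))
print(cxi_send_plan(1048576))
```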

FI_CXI_OFLOW_BUF_SIZE

This is a cxi libfabric ENV variable. It specifies the size in bytes of each CXI overflow buffer. Overflow buffers are used to hold unexpected messages before they are matched with a posted receive buffer. Only applies to Slingshot 11.

Default: 12582912

FI_CXI_OFLOW_BUF_COUNT

This is a cxi libfabric ENV variable. It specifies the number of CXI overflow buffers allocated to hold unexpected messages. Each buffer holds FI_CXI_OFLOW_BUF_SIZE bytes. Only applies to Slingshot 11.

Default: 3

FI_CXI_DEFAULT_CQ_SIZE

This is a cxi libfabric ENV variable. It specifies the maximum number of entries in the CXI provider completion queue. A queue that is too small can result in “Cassini Event Queue overflow detected” errors. Only applies to Slingshot 11.

Default: 131072

FI_CXI_DEFAULT_TX_SIZE

This is a cxi libfabric ENV variable. It specifies the size of the transmit queue. The maximum number of outstanding rendezvous messages per rank is limited by this value. If this value is too small, it can result in delays or hangs when issuing MPI rendezvous sends. The cxi libfabric provider allows a maximum of 32768 outstanding MPI rendezvous messages. Only applies to Slingshot 11.

Default: 1024

FI_CXI_REQ_BUF_MAX_CACHED

This is a cxi libfabric ENV variable. It specifies the maximum number of request buffers that once allocated will be cached for reuse. Request buffers are only used when running in either software or hybrid endpoint mode. See FI_CXI_RX_MATCH_MODE. A value of zero indicates that once a request buffer is allocated it will be cached and used as needed. A non-zero value can be used with bursty traffic to shrink the number of allocated buffers to a maximum count when they are no longer needed. Only applies to Slingshot 11.

Default: 0

FI_CXI_REQ_BUF_MIN_POSTED

This is a cxi libfabric ENV variable. It specifies the number of CXI request buffers to keep posted to hold unexpected messages when running in either software or hybrid endpoint mode. See FI_CXI_RX_MATCH_MODE. Each buffer holds FI_CXI_REQ_BUF_SIZE bytes. Only applies to Slingshot 11.

Default: 6

FI_CXI_REQ_BUF_SIZE

This is a cxi libfabric ENV variable. It specifies the size in bytes of each CXI request buffer. These request buffers hold eager unexpected messages before they are matched with a posted receive buffer, when running in software or hybrid endpoint mode. See FI_CXI_RX_MATCH_MODE. Only applies to Slingshot 11.

Default: 12582912

FI_CXI_RX_MATCH_MODE

This is a cxi libfabric ENV variable. It specifies what message matching mode will be used by each endpoint. There are three endpoint mode options: [hardware | software | hybrid]. Only applies to Slingshot 11.

hardware

Message matching is fully offloaded to the NIC. If hardware resources become exhausted, flow control will be performed to help alleviate that condition. In cases where hardware resources are completely exhausted, the job will abort with an error message such as: “LE resources not recovered during flow control. FI_CXI_RX_MATCH_MODE=[hybrid|software] is required.”

software

Message matching is performed fully in software. When running in software endpoint mode the ENV variables FI_CXI_REQ_BUF_SIZE and FI_CXI_REQ_BUF_MIN_POSTED are used to control the size and number of the eager request buffers posted to handle incoming unmatched messages. See the man page documentation for these related environment variables.

hybrid

Message matching begins fully offloaded to the NIC, but if hardware resources become exhausted at any point, the message matching will transition to a “hybrid” of both hardware and software matching. This is done on a rank-by-rank basis. If a rank exhausts its hardware resources, that rank will transparently transition to software endpoint mode. When running in software endpoint mode the ENV variables FI_CXI_REQ_BUF_SIZE and FI_CXI_REQ_BUF_MIN_POSTED are used to control the size and number of the eager request buffers posted to handle incoming unmatched messages. See the man page documentation for these related environment variables.

Default: hardware

FI_MR_CACHE_MAX_SIZE

This is a cxi libfabric ENV variable. It specifies the total number of bytes for all memory regions that may be tracked by the CXI provider memory registration (MR) cache. A setting of -1 is unlimited. Only applies to Slingshot 11.

Default: -1

FI_MR_CACHE_MAX_COUNT

This is a cxi libfabric ENV variable. It specifies the total number of memory regions that may be stored in the MR cache at any one time, regardless of their size. Setting this to zero will disable MR caching. Only applies to Slingshot 11.

Default: 524288

MPICH OFI ENVIRONMENT VARIABLES

MPICH_CH4_OFI_ENABLE_CONTROL_AUTO_PROGRESS

If set to 1, MPICH will request asynchronous automatic control progress (FI_PROGRESS_AUTO) from the libfabric provider. FI_PROGRESS_AUTO is supported in the verbs;rxm provider on Slingshot 10 systems. However, this feature is currently not supported in the Cassini provider on Slingshot 11 systems.

When this feature is requested on Slingshot 10 systems, it may require an additional libfabric thread (per MPI process) to make asynchronous progress on all communication operations. This option may be beneficial for applications using RMA passive communication on Slingshot 10 systems. The user is encouraged to leave one or more hardware threads vacant (per MPI process) in order to prevent contention between the application threads and the progress thread(s). The impact of oversubscription is highly system dependent but may be substantial in some cases, hence this recommendation.

Default: not set

MPICH_CH4_OFI_ENABLE_DATA_AUTO_PROGRESS

If set to 1, MPICH will request asynchronous automatic data progress (FI_PROGRESS_AUTO) from the libfabric provider. FI_PROGRESS_AUTO is supported in the verbs;rxm provider on Slingshot 10 systems. However, this feature is currently not supported in the Cassini provider on Slingshot 11 systems.

When this feature is requested on Slingshot 10 systems, it may require an additional libfabric thread (per MPI process) to make asynchronous progress on all communication operations. This option may be beneficial for applications using RMA passive communication on Slingshot 10 systems. The user is encouraged to leave one or more hardware threads vacant (per MPI process) in order to prevent contention between the application threads and the progress thread(s). The impact of oversubscription is highly system dependent but may be substantial in some cases, hence this recommendation.

Default: not set

MPICH_OFI_CXI_COUNTER_REPORT

Determines if Cassini (CXI) counters are collected during the application and the verbosity of the counter data report displayed during MPI_Finalize. By default, HPE Cray MPI will track network timeouts during the application. The counter data is collected during MPI_Init and MPI_Finalize, with a report indicating the change in the selected counters over the duration of the application. There is no interference while the application is running. To obtain a valid counter report, run times must be at least a few seconds.

A network timeout is defined as an event on the Slingshot 11 network (such as a link flap) that causes a packet to be re-issued. Network timeouts are identified from a NIC perspective. A single link flap may affect multiple NICs, depending on the network traffic at the time, so it is likely a single flap will generate multiple timeouts. MPI queries all NICs in use for the application. Depending on the application traffic pattern and timing, network timeouts may or may not affect the application performance metric. If network timeouts affecting the application have occurred, a one-line message indicating the number of timeouts is sent to stdout, for example: “MPICH Slingshot Network Summary: 3 network timeouts”. If no timeouts were detected, this line is suppressed unless additional verbosity is requested.

Along with the network timeout counters, MPI can also collect and display an additional set of Cassini counters by setting MPICH_OFI_CXI_COUNTER_REPORT to a value of 2 or higher. MPI has a default set of counters it will collect for each application, which may be useful in debugging performance problems. This set can be overridden by specifying a file with a list of alternate counters. See more information on MPICH_OFI_CXI_COUNTER_FILE below.

When set to 2 or higher, the Cassini counter summary report displayed during MPI_Finalize will include a min/mean/max value for each counter selected, along with computed rates. When set to 3 or higher, NIC-specific detailed counter data is displayed. Counters that have recorded zero values are suppressed. Recognized values are between 0 and 5. The value affects the verbosity of the counter data displayed. Only applicable to Slingshot 11.

0 - no Cassini counters collected; feature is disabled
1 - network timeout counters collected, one-line display (default)
2 - option 1 + CXI counters summary report displayed
3 - option 2 + display counter data for any NIC that hit a network timeout
4 - option 2 + display counter data for all NICs, if any network timeout occurred
5 - option 2 + display counter data for all NICs

Default: 1

MPICH_OFI_CXI_COUNTER_FILE

Specifies a file containing an alternate list of Cassini counter names to collect. If this file is present, instead of collecting the default set of counters, MPI collects data for the counters specified in the file. Counter names must be listed one per line. For retry handler counters, prefix the counter name with “rh:”. When specifying this option, set MPICH_OFI_CXI_COUNTER_REPORT to 2 or higher. Setting MPICH_OFI_CXI_COUNTER_VERBOSE to 1 may be helpful for debugging. The default network timeout counters will be collected in addition to the file contents. Only applicable to Slingshot 11.

Default: not set

MPICH_OFI_CXI_COUNTER_REPORT_FILE

Specifies an optional output filename prefix for the counter report. By default, the counter report is written to stdout. When this variable is set to a filename prefix, the detailed counter data produced with MPICH_OFI_CXI_COUNTER_REPORT options 3, 4 and 5 is written to node-specific files named <prefix>.<hostname>, where <prefix> is the value of this variable. The user must have appropriate permission to create these files. This is useful when running on hundreds or thousands of nodes, where stdout can get jumbled or truncated by the launcher. Only applicable to Slingshot 11.

Default: not set

MPICH_OFI_CXI_COUNTER_VERBOSE

If set to a non-zero value, this enables more verbose output about the Cassini counters being collected. Can be helpful for debugging and/or identifying which counters are being collected. Only applicable to Slingshot 11.

Default: 0

MPICH_OFI_CXI_PID_BASE

This variable specifies a base value to add to the value of the local_rank to create a unique CXI PID identifier. Typically this variable should not be set by the user. The case where setting this might be necessary is when running a job with multiple programming models that have conflicting CXI PIDs. By default, programs using both MPI and SHMEM will not have conflicting PIDs. Set to -1 to cause MPICH to use dynamic PIDs. Only applicable to Slingshot 11.

Default: 0

MPICH_OFI_DEFAULT_TCLASS

This selects the default traffic class the job will run in. All MPI transfers will be sent using the selected traffic class. Valid traffic classes are: TC_BEST_EFFORT, TC_DEDICATED_ACCESS, TC_BULK_DATA, or TC_LOW_LATENCY. Note some traffic classes may not exist on all systems and some may require prior WLM authorization for use. See MPICH_OFI_TCLASS_ERRORS for options on handling these traffic class errors. If an invalid or unauthorized traffic class is requested, MPI sets a traffic class of TC_UNSPEC which directs the CXI provider to use the default traffic class for the domain. Only applicable to Slingshot 11.

Default: TC_BEST_EFFORT

MPICH_OFI_NIC_MAPPING

Specifies the precise rank-to-NIC mapping to use on each node. This is evaluated only if the MPICH_OFI_NIC_POLICY variable is set to USER. This mapping is based on the zero-based local rank value, not global rank value. Each local rank must have a NIC mapping assigned by this variable. If there are fewer MPI ranks on any node, that portion of the MPICH_OFI_NIC_MAPPING string will be ignored. Add quotes around the entire string to prevent the shell from interpreting the value incorrectly.

The format is as follows:
"nic_idx:local_mpi_ranks; nic_idx:local_mpi_ranks; nic_idx:local_mpi_ranks"

Examples assume 64 ranks placed per node, with each node having 2 or 3 NICs.
To assign local_rank 0 to NIC 0, and remaining ranks to NIC 1, use:

 MPICH_OFI_NIC_MAPPING="0:0; 1:1-63"

To assign local ranks 0,16,32,48 to NIC 0, and remaining ranks to NIC 1:

 MPICH_OFI_NIC_MAPPING="0:0,16,32,48; 1:1-15,17-31,33-47,49-63"

To assign local ranks 0-7 to NIC 0, 8-31 to NIC 2, and 32-63 to NIC 1:

 MPICH_OFI_NIC_MAPPING="0:0-7; 2:8-31; 1:32-63"

Default: not set
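
The mapping format can be checked offline with a short sketch before launching a job. This is not the library's parser. The function names expand_ranks and parse_nic_mapping are hypothetical, and the sketch simply follows the "nic_idx:local_mpi_ranks" format and the examples above.

```python
def expand_ranks(spec):
    """Expand a rank list such as '1-15,17-31' into individual local ranks."""
    ranks = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(x) for x in part.split("-"))
            ranks.extend(range(first, last + 1))
        else:
            ranks.append(int(part))
    return ranks


def parse_nic_mapping(value):
    """Return a dict mapping each local rank to its NIC index."""
    rank_to_nic = {}
    for segment in value.split(";"):
        segment = segment.strip()
        if not segment:
            continue
        nic_idx, rank_spec = segment.split(":")
        for rank in expand_ranks(rank_spec):
            rank_to_nic[rank] = int(nic_idx)
    return rank_to_nic


mapping = parse_nic_mapping("0:0,16,32,48; 1:1-15,17-31,33-47,49-63")
# Every local rank from 0 to 63 should appear exactly once.
print(sorted(mapping) == list(range(64)))
```

A check like the final line is a quick way to confirm that each local rank has a NIC assigned, as the variable requires.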

MPICH_OFI_NIC_POLICY

Selects the rank-to-NIC assignment policy used by Cray MPI. Each MPI rank will be assigned to exactly one NIC. There are five available options: [BLOCK | ROUND-ROBIN | NUMA | GPU | USER].

BLOCK

Selects a block distribution. Consecutive local ranks on a node are equally distributed among the available NICs on the node. The number of ranks on a node is divided by the number of NICs on that node (rounded up), with the first X local ranks assigned to NIC 0, the next X local ranks assigned to NIC 1, etc.

For example, with 22 ranks placed per node, and each node having 4 NICs:
ranks 0-5 are assigned to NIC 0
ranks 6-11 are assigned to NIC 1
ranks 12-17 are assigned to NIC 2
ranks 18-21 are assigned to NIC 3

ROUND-ROBIN

Selects a round-robin distribution. The first local rank on a node is assigned to NIC 0, the second rank is assigned NIC 1, the third rank is assigned NIC 2, etc. When all NICs on the node have been assigned once, the next available local rank will be assigned NIC 0, and so on.

For example, with 22 ranks placed per node, and each node having 4 NICs:
ranks 0,4,8,12,16,20 are assigned to NIC 0
ranks 1,5,9,13,17,21 are assigned to NIC 1
ranks 2,6,10,14,18 are assigned to NIC 2
ranks 3,7,11,15,19 are assigned to NIC 3

NUMA

Selects a NUMA-aware distribution. The local ranks are assigned to the NIC that is closest to the rank’s NUMA node affinity. If a rank is pinned to a core or subset of cores in NUMA node N, and a NIC is also mapped to NUMA node N, the rank will use that corresponding NIC. If a matching NUMA node between rank and NIC is not found, then the NIC in the closest NUMA node to the rank is selected. If multiple NICs are assigned to the same NUMA node, the local ranks will round-robin between them. NUMA distances are analyzed to select the closest NIC.

For the NUMA policy to be successful when multiple NICs per node are available, the affinity of the ranks must be constrained (pinned) to cores contained within a single NUMA node. A rank is not allowed to float among cores that span NUMA nodes when selecting the NUMA policy. If that condition exists, the job will abort with an error message.

GPU

Selects a GPU-aware distribution. The local ranks are assigned to the NIC that is closest to the GPU selected by the user via a vendor API (e.g. CUDA or HIP). If multiple NICs are assigned to the same GPU/NUMA node, the local ranks will round-robin between them. NUMA distances are analyzed to select the NIC closest to the user’s selected GPU.

USER

Supports a custom user-selection for NIC assignment. This selection requires the MPICH_OFI_NIC_MAPPING variable to also be set to indicate the precise rank-to-NIC assignment requested. See MPICH_OFI_NIC_MAPPING.

Default: BLOCK
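
The BLOCK and ROUND-ROBIN examples above follow directly from simple arithmetic, sketched below. The function names are hypothetical and the code only models the documented distributions. It is not the library's NIC selection code and does not cover the NUMA, GPU or USER policies.

```python
import math


def block_policy(ranks_per_node, num_nics):
    """Return the NIC index for each local rank under a block distribution."""
    # Ranks per NIC is the rank count divided by the NIC count, rounded up.
    per_nic = math.ceil(ranks_per_node / num_nics)
    return [rank // per_nic for rank in range(ranks_per_node)]


def round_robin_policy(ranks_per_node, num_nics):
    """Return the NIC index for each local rank under round-robin."""
    return [rank % num_nics for rank in range(ranks_per_node)]


# 22 ranks per node with 4 NICs, as in the examples above.
print(block_policy(22, 4))
print(round_robin_policy(22, 4))
```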

MPICH_OFI_NIC_VERBOSE

If set to 1, verbose information pertaining to NIC selection is printed at the start of the job. All available NIC domain names, addresses and index values are displayed. Setting this variable to 2 displays additional details, including the specific NIC each rank has been assigned, which is based on MPICH_OFI_NIC_POLICY.

Default: 0

MPICH_OFI_NUM_NICS

Specifies the number of NICs the job can use on a per-node basis. By default, when multiple NICs per node are available, MPI attempts to use them all. If fewer NICs are desired, this variable can be set to indicate the maximum number of NICs per node MPI will use. By default, MPI uses consecutive NIC indices, starting with index 0.

To request MPI to use alternative NIC index values, an optional segment can be added to this variable by adding a colon followed by the desired nic index values. Add quotes around the entire string to prevent the shell from interpreting the value incorrectly.

For example:

To use 1 NIC per node, index 0, specify:

   export MPICH_OFI_NUM_NICS=1   (equivalent to MPICH_OFI_NUM_NICS="1:0")

To use 1 NIC per node, index 1, specify:

   export MPICH_OFI_NUM_NICS="1:1"

To use 2 NICs per node, index 0 and 1, specify:

   export MPICH_OFI_NUM_NICS=2   (equivalent to MPICH_OFI_NUM_NICS="2:0,1")

To use 2 NICs per node, index 1 and 3, specify:

   export MPICH_OFI_NUM_NICS="2:1,3"

Default: not set (MPI uses all available NICs by default)
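
The value format, including the optional index segment, can be modeled with a short sketch. The function name parse_num_nics is hypothetical. The check that the number of listed indices equals the NIC count is an assumption made for illustration, since this man page does not say how a mismatch is handled.

```python
def parse_num_nics(value):
    """Split a value such as '2' or '2:1,3' into (count, nic_indices)."""
    count_text, _, index_text = value.partition(":")
    count = int(count_text)
    if index_text:
        indices = [int(i) for i in index_text.split(",")]
    else:
        # Without an index segment, consecutive indices starting at 0 are used.
        indices = list(range(count))
    # Assumption: the index list must name exactly 'count' NICs.
    if len(indices) != count:
        raise ValueError("index list does not match the requested NIC count")
    return count, indices


print(parse_num_nics("2"))
print(parse_num_nics("2:1,3"))
```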

MPICH_OFI_RMA_STARTUP_CONNECT

By default, OFI connections between ranks are set up on demand. This allows for optimal performance while minimizing memory requirements. However, for Slingshot-10/InfiniBand RMA jobs requiring an all-to-all on-node communication pattern, it may be beneficial to create OFI connections between PEs on a node in a coordinated manner at startup. If set to 1, Cray MPI will create connections between all ranks on each node in the job during MPI_Init. This option also automatically enables MPICH_OFI_STARTUP_CONNECT.

This option is not beneficial on a Slingshot-11 system.

Default: 0

MPICH_OFI_SKIP_NIC_SYMMETRY_TEST

If set to 1, the check for NIC symmetry performed during MPI_Init will be bypassed. By default, a symmetry check is run to make sure all the nodes in the job have the same number of NICs available. An asymmetric NIC layout can pose significant performance implications, especially if the user is unaware of this condition.

Default: 0

MPICH_OFI_STARTUP_CONNECT

By default, OFI connections between ranks are set up on demand. This allows for optimal performance while minimizing memory requirements. However, for Slingshot-10/InfiniBand jobs requiring an all-to-all communication pattern, it may be beneficial to create all OFI connections in a coordinated manner at startup. If set to 1, Cray MPI will create connections between all ranks in the job during MPI_Init.

This option is not beneficial on a Slingshot-11 system.

Default: 0

MPICH_OFI_TCLASS_ERRORS

Determines how MPI will handle invalid traffic class selections specified in MPICH_OFI_DEFAULT_TCLASS. Some traffic classes may not be supported on certain systems, and some traffic classes will require prior WLM authorization before use. Valid options are: [WARN | SILENT | ERROR]. Options are not case-sensitive. Only applicable to Slingshot 11.

WARN

Displays an appropriate warning message and lets the job continue. If an unsupported/unauthorized traffic class is selected, the provider will use the default traffic class for the domain.

SILENT

Silently allow the job to run, falling back to the default traffic class for the domain.

ERROR

Displays an appropriate error message and terminates the job.

Default: WARN

MPICH_OFI_USE_PROVIDER

Specifies the libfabric provider to use. By default, the “verbs;ofi_rxm” provider is selected for Slingshot 10 systems, as this is the supported and optimized provider. On Slingshot 11 systems the default is the “cxi” provider. For debugging purposes, other libfabric providers may be requested by setting this variable to the desired provider name (e.g. sockets or “tcp;ofi_rxm”).

Default: "verbs;ofi_rxm" on Slingshot 10 systems
Default: "cxi" on Slingshot 11 systems

MPICH_OFI_USE_SCALABLE_STARTUP

If enabled, MPI will attempt to use the Scalable Startup optimization for launching, provided the launch requirements are met and the necessary data is supplied by the launcher and PMI. To disable use of Scalable Startup, set this to 0. Only applicable to Slingshot 11.

Default: 1

MPICH_OFI_VERBOSE

If set, more verbose output will be displayed during MPI_Init to verify which libfabric provider has been selected, along with the name and address of the NIC being used. This may be helpful for debugging errors encountered during MPI_Init.

Default: not set

COLLECTIVE ENVIRONMENT VARIABLES

MPICH_ALLGATHER_VSHORT_MSG

Adjusts the cutoff point at and below which the architecture-specific optimized gather/bcast algorithm is used instead of the optimized ring algorithm for MPI_Allgather. The gather/bcast algorithm is better suited for small messages.

Defaults:
For communicator sizes of <= 512 ranks, 1024 bytes.
For communicator sizes of > 512 ranks, 4096 bytes.

MPICH_ALLGATHERV_VSHORT_MSG

Adjusts the cutoff point at and below which the architecture-specific optimized gatherv/bcast algorithm is used instead of the optimized ring algorithm for MPI_Allgatherv. The gatherv/bcast algorithm is better suited for small messages.

Defaults:
For communicator sizes of <= 512 ranks, 1024 bytes.
For communicator sizes of > 512 ranks, 4096 bytes.

MPICH_ALLREDUCE_BLK_SIZE

Specifies the block size (in bytes) to use when dividing very large Allreduce messages into smaller blocks for better performance. The value is interpreted as bytes, unless the string ends in a K, which indicates kilobytes, or M, which indicates megabytes. Valid values are between 8192 and MAX_INT.

Default: 716800 bytes

MPICH_ALLREDUCE_GPU_MAX_SMP_SIZE

When GPU support is enabled, this variable specifies the maximum message size (in bytes) for which an SMP-aware allreduce algorithm is used. Larger allreduce messages will use a reduce-scatter-allgather algorithm. A value of 0 specifies an SMP-aware allreduce algorithm for all message sizes. The value is interpreted as bytes, unless the string ends in a K, which indicates kilobytes, or M, which indicates megabytes.

Default: 1024 bytes

MPICH_ALLREDUCE_MAX_SMP_SIZE

Specifies the maximum message size (in bytes) for which an SMP-aware allreduce algorithm is used. Larger allreduce messages will use a reduce-scatter-allgather algorithm. A value of 0 specifies an SMP-aware allreduce algorithm for all message sizes. The value is interpreted as bytes, unless the string ends in a K, which indicates kilobytes, or M, which indicates megabytes.

Default: 262144 bytes

MPICH_ALLREDUCE_NO_SMP

If set, MPI_Allreduce uses an algorithm that is not SMP-aware. This provides a consistent ordering of the specified allreduce operation regardless of system configuration.

Note: This algorithm may not perform as well as the default SMP-aware algorithms as it does not take advantage of rank topology.

Default: not set

MPICH_ALLTOALL_SHORT_MSG

Adjusts the cut-off points at and below which the store and forward Alltoall algorithm is used for short messages. The default value is dependent upon the total number of ranks in the MPI communicator used for the MPI_Alltoall call.

Defaults:
if communicator size <= 1024, 512 bytes
if communicator size > 1024 and <= 65536, 256 bytes
if communicator size > 65536 and <= 131072, 128 bytes
if communicator size > 131072, 64 bytes
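
The default cutoff table above can be expressed as a small lookup, shown here as a sketch. The function name alltoall_short_msg_default is hypothetical. It only restates the documented defaults and does not reflect any override set through the environment variable.

```python
def alltoall_short_msg_default(comm_size):
    """Return the documented default cutoff in bytes for a communicator size."""
    if comm_size <= 1024:
        return 512
    if comm_size <= 65536:
        return 256
    if comm_size <= 131072:
        return 128
    return 64


for size in (512, 4096, 100000, 200000):
    print(size, alltoall_short_msg_default(size))
```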

MPICH_ALLTOALL_SYNC_FREQ

Adjusts the number of outstanding messages (the synchronization frequency) each rank participating in the Alltoall algorithm will allow. The defaults vary for each call, depending on several factors, including number of ranks on a node participating in the collective, and the message size.

Default: Varies from 1 to 24

MPICH_ALLTOALLV_THROTTLE

Sets the per-process maximum number of outstanding Isends and Irecvs that can be posted concurrently for the MPI_Alltoallv and MPI_Alltoallw algorithms. This setting also applies to the non-blocking MPI_Ialltoallv and MPI_Ialltoallw algorithms that use throttling. For sparsely-populated or small message Alltoallv/w data, setting this to a higher value may improve performance. For heavily-populated large message Alltoallv/w data, or when running at high process-per-node counts, consider decreasing this value to improve performance.

Default: 8

MPICH_BCAST_INTERNODE_RADIX

Used to set the radix of the inter-node tree. This can be set to any integer value greater than or equal to 2.

Default: 4

MPICH_BCAST_INTRANODE_RADIX

Used to set the radix of the intra-node tree. This can be set to any integer value greater than or equal to 2.

Default: 4

MPICH_BCAST_ONLY_TREE

If set to 1, MPI_Bcast uses an smp-aware tree algorithm regardless of data size. The tree algorithm generally scales well to high processor counts.

If set to 0, MPI_Bcast uses a variety of algorithms (tree, scatter, or ring) depending on message size and other factors.

Default: 1

MPICH_COLL_OPT_OFF

If set, disables collective optimizations which use nondefault, architecture-specific algorithms for some MPI collective operations. By default, all collective optimized algorithms are enabled.

To disable all collective optimized algorithms, set MPICH_COLL_OPT_OFF to 1.

To disable optimized algorithms for selected MPI collectives, set the value to a comma-separated list of the desired collective names. Names are not case-sensitive. Any unrecognizable name is flagged with a warning message and ignored. For example, to disable the MPI_Allgather optimized collective algorithm, set MPICH_COLL_OPT_OFF=mpi_allgather.

The following collective names are recognized: MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Bcast, MPI_Gatherv, MPI_Scatterv, MPI_Igatherv, and MPI_Iallreduce.

Default: Not enabled.
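
The value syntax described above (either 1, or a comma-separated, case-insensitive list of collective names) could be interpreted roughly as in the following sketch. The function name and the warning text are hypothetical; this is a model of the documented rules, not the library's parser.

```python
RECOGNIZED = {
    "mpi_allgather", "mpi_allgatherv", "mpi_allreduce", "mpi_alltoall",
    "mpi_alltoallv", "mpi_bcast", "mpi_gatherv", "mpi_scatterv",
    "mpi_igatherv", "mpi_iallreduce",
}


def parse_coll_opt_off(value):
    """Return (set of disabled collectives, list of warnings)."""
    if value is None:
        return set(), []  # variable not set: nothing is disabled
    if value.strip() == "1":
        return set(RECOGNIZED), []  # disable every optimized collective
    disabled, warnings = set(), []
    for name in value.split(","):
        key = name.strip().lower()  # names are not case-sensitive
        if not key:
            continue
        if key in RECOGNIZED:
            disabled.add(key)
        else:
            # Unrecognized names are reported and otherwise ignored.
            warnings.append("unrecognized collective name ignored: " + name.strip())
    return disabled, warnings
```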

MPICH_ENABLE_HCOLL

This enables the use of Mellanox’s HCOLL collectives offload feature when the UCX netmod is being used. The HCOLL libraries must be in the library search path, and HCOLL must be configured on the system. HCOLL will give optimized performance for some collectives at the cost of higher MPI communicator creation time. This feature will not be available if optimizations are disabled through MPICH_COLL_OPT_OFF.

Default: Not enabled

MPICH_GATHERV_MAX_TMP_SIZE

Only applicable to the Gatherv tree algorithm. Sets the maximum amount of temporary memory Gatherv will allow a rank to allocate when using the tree-based algorithm. Each rank allocates a different amount, with many allocating no extra memory. If any rank requires more than this amount of temporary buffer space, a different algorithm is used.

Default: 512M

MPICH_GATHERV_MIN_COMM_SIZE

Cray MPI offers two optimized Gatherv algorithms: a tree algorithm for small messages and a permission-to-send algorithm for larger messages. Set this value to the minimum communicator size to attempt use of either of the Cray optimized Gatherv algorithms. Smaller communicator sizes will use the ANL MPI_Gatherv algorithm.

Default: 64

MPICH_GATHERV_SHORT_MSG

Adjusts the cutoff point at and below which the optimized tree MPI_Gatherv algorithm is used instead of the optimized permission-to-send algorithm. The cutoff is in bytes, based on the average size of the variable MPI_Gatherv message sizes.

Default: 131072
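
Taken together, MPICH_GATHERV_MIN_COMM_SIZE and MPICH_GATHERV_SHORT_MSG describe a two-step choice, sketched below. The function name and return labels are hypothetical, and the sketch deliberately omits the MPICH_GATHERV_MAX_TMP_SIZE fallback, whose replacement algorithm is not specified here.

```python
def select_gatherv_algorithm(comm_size, avg_msg_bytes,
                             min_comm_size=64, short_msg=131072):
    """Pick a Gatherv algorithm label from the documented cutoffs."""
    if comm_size < min_comm_size:
        return "anl"  # small communicators use the ANL algorithm
    if avg_msg_bytes <= short_msg:
        return "tree"  # at and below the cutoff: optimized tree
    return "permission-to-send"
```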

MPICH_GPU_ALLGATHER_VSHORT_MSG_ALGORITHM

If set to 1, enables optimizations for small message MPI_Allgather operations with GPU-attached payloads. This variable is only relevant if MPICH_GPU_SUPPORT_ENABLED is set to 1 and MPICH_GPU_COLL_STAGING_AREA_OPT is also set to 1.

Default: 1

MPICH_GPU_ALLREDUCE_BLK_SIZE

Controls the size of the GPU-attached staging buffer used for GPU-kernel-based optimizations for MPI_Allreduce and MPI_Reduce_scatter_block. The default is 8 MB per process. There is some evidence that larger values (about 64 MB) can improve Allreduce performance for very large payloads (hundreds of MB). The default is set conservatively, leaving room for additional tuning for specific use cases. This variable is relevant only if MPICH_GPU_ALLREDUCE_USE_KERNEL and MPICH_GPU_SUPPORT_ENABLED are also set.

Default: 8388608

MPICH_GPU_ALLREDUCE_KERNEL_THRESHOLD

Allreduce operations with payloads larger than this threshold can utilize the GPU kernel-based optimization. This variable is relevant only if MPICH_GPU_ALLREDUCE_USE_KERNEL and MPICH_GPU_SUPPORT_ENABLED are also set.

Default: 131072

MPICH_GPU_ALLREDUCE_USE_KERNEL

If set, adds a hint that the use of device kernels for reduction operations is desired. MPI is not guaranteed to use a device kernel for all reduction operations. This variable is relevant only if MPICH_GPU_SUPPORT_ENABLED is set to 1. GPU kernel-based optimizations are currently disabled for reductions that involve non-contiguous MPI datatypes. This feature is also used only when user buffers are on GPU-attached memory regions. This optimization is applicable to MPI_Allreduce and MPI_Reduce_scatter_block.

Default: 1

MPICH_GATHERV_SYNC_FREQ

Only applicable to the Gatherv permission-to-send algorithm. Adjusts the number of outstanding receives the root for Gatherv will allow.

Default: 16

MPICH_IALLGATHERV_THROTTLE

Sets the per-process maximum number of outstanding Isends and Irecvs that can be posted concurrently for the throttled MPI_Iallgatherv algorithm. This only applies if the throttled MPI_Iallgatherv algorithm is explicitly requested by setting MPICH_IALLGATHERV_INTRA_ALGORITHM=throttled. This algorithm may be beneficial when using a small number of ranks per node. By default a recursive_doubling, brucks or ring algorithm is chosen based on data size and other parameters.

Default: 6

MPICH_IGATHERV_MIN_COMM_SIZE

Set this value to the minimum communicator size to trigger use of the Cray optimized Igatherv permission-to-send algorithm. Smaller communicator sizes send without permission.

Default: 1000

MPICH_IGATHERV_SYNC_FREQ

Adjusts the maximum number of receives the root rank of the Cray optimized Igatherv algorithm can have outstanding.

Default: 100

MPICH_REDUCE_NO_SMP

If set, MPI_Reduce uses an algorithm that is not smp-aware. This provides a consistent ordering of the specified reduce operation regardless of system configuration.

Note: This algorithm may not perform as well as the default smp-aware algorithms as it does not take advantage of rank topology.

Default: not set

MPICH_REDUCE_SCATTER_COMMUTATIVE_LONG_MSG_SIZE

This environment variable applies to MPI_Reduce_scatter and MPI_Reduce_scatter_block. For the reduce_scatter functions, this variable specifies the cutoff size of the send buffer (in bytes) at and above which a pairwise exchange algorithm is attempted. In addition, the op must be commutative and the communicator size less than or equal to MPICH_REDUCE_SCATTER_MAX_COMMSIZE for the pairwise exchange algorithm to be used. For smaller send buffers, a recursive halving algorithm is used.

Default value: 524288

MPICH_REDUCE_SCATTER_MAX_COMMSIZE

This environment variable applies to MPI_Reduce_scatter and MPI_Reduce_scatter_block. For the reduce_scatter functions, this variable specifies the maximum communicator size that triggers use of the pairwise exchange algorithm, provided the op is commutative. The pairwise exchange algorithm is not well-suited for scaling to high process counts, so for larger communicators, a recursive halving algorithm is used by default instead.

Default value: 1000
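
The three conditions that MPICH_REDUCE_SCATTER_COMMUTATIVE_LONG_MSG_SIZE and MPICH_REDUCE_SCATTER_MAX_COMMSIZE describe can be combined into one predicate. This is a sketch of the documented rule with a hypothetical function name; it says nothing about which algorithm is chosen when the predicate is false beyond what the text above states.

```python
def uses_pairwise_exchange(sendbuf_bytes, comm_size, commutative,
                           long_msg_size=524288, max_commsize=1000):
    """True when all documented conditions for pairwise exchange hold."""
    return (commutative
            and sendbuf_bytes >= long_msg_size   # at and above the cutoff
            and comm_size <= max_commsize)       # communicator small enough
```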

MPICH_SCATTERV_MAX_TMP_SIZE

Only applicable to the Scatterv tree algorithm. Sets the maximum amount of temporary memory Scatterv will allow a rank to allocate when using the tree-based algorithm. Each rank allocates a different amount, with many allocating no extra memory. If any rank requires more than this amount of temporary buffer space, a different algorithm is used.

Default: 512M

MPICH_SCATTERV_MIN_COMM_SIZE

Cray MPI offers two optimized Scatterv algorithms: a tree algorithm for small messages and a staggered send algorithm for larger messages. Set this value to the minimum communicator size to attempt use of either of the Cray optimized Scatterv algorithms. Smaller communicator sizes will use the ANL MPI_Scatterv algorithm.

Default: 64

MPICH_SCATTERV_SHORT_MSG

Adjusts the cutoff point at and below which the optimized tree MPI_Scatterv algorithm is used instead of the optimized staggered send algorithm. The cutoff is in bytes, based on the average size of the variable MPI_Scatterv message sizes.

Defaults:
if communicator size <= 512, 2048 bytes
if communicator size > 512, 8192 bytes

MPICH_SCATTERV_SYNCHRONOUS

Only applicable to the ANL non-optimized Scatterv algorithm. The ANL MPI_Scatterv algorithm uses asynchronous sends for communicator sizes less than 200,000 ranks. If set, this environment variable causes the ANL MPI_Scatterv algorithm to switch to using blocking sends, which may be beneficial with large data sizes or high process counts.

For communicator sizes equal to or greater than 200,000 ranks, the blocking send algorithm is used by default.

Default: not enabled

MPICH_SCATTERV_SYNC_FREQ

Only applicable to the Scatterv staggered send algorithm. Adjusts the number of outstanding sends the root for Scatterv will use.

Default: 16

MPICH_SHARED_MEM_COLL_OPT

By default, the MPICH library will use the optimized shared-memory based design for collective operations. The supported collective operations are: MPI_Allreduce, MPI_Barrier, and MPI_Bcast.

To disable all available shared-memory optimizations, set MPICH_SHARED_MEM_COLL_OPT to 0.

To enable this feature for a specific set of collective operations, set MPICH_SHARED_MEM_COLL_OPT to a comma-separated list of collective names. For example, to enable this optimization for MPI_Bcast only, set MPICH_SHARED_MEM_COLL_OPT=MPI_Bcast. To enable this optimization for MPI_Allreduce only, set MPICH_SHARED_MEM_COLL_OPT=MPI_Allreduce. Unsupported names are flagged with a warning message and ignored.

Default: set

MPI-IO ENVIRONMENT VARIABLES

MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY

If enabled, displays the assignment of MPIIO collective buffering aggregators for reads/writes of a shared file, showing rank and node ID (nid). For example:

Aggregator Placement for /lus/scratch/myfile
RankReorderMethod=3  AggPlacementStride=-1
 AGG    Rank       nid
 ----  ------  --------
    0       0  nid00578
    1       4  nid00579
    2       1  nid00606
    3       5  nid00607
    4       2  nid00578
    5       6  nid00579
    6       3  nid00606
    7       7  nid00607

Default: Not enabled.

MPICH_MPIIO_AGGREGATOR_PLACEMENT_STRIDE

Partially controls to which nodes MPIIO collective buffering aggregators are assigned. See the notes below on the order of nodes. Network traffic and resulting I/O performance may be affected by the assignments.

If set to 1, consecutive nodes are used. The number of aggregators assigned per node is controlled by the cb_config_list hint. By default, no more than one aggregator per node will be assigned if there are at least as many nodes as aggregators.

If set to a value greater than 1, node selection is strided across the available nodes by this value. If the stride times the number of aggregators exceeds the number of nodes, the assignments will wrap around, which is usually not optimal for performance.

If set to -1, node selection is strided across available nodes by the value of the number of nodes divided by the number of aggregators (integer division, minimum value of 1). The purpose is to spread out the nodes to reduce network congestion.

Note: The order of nodes can be shown by setting the MPICH_RANK_REORDER_DISPLAY environment variable. This lists, in rank order (rank in MPI_COMM_WORLD), the node on which each rank resides. When there is more than one rank per node, the node ID is repeated. When MPI has not done any rank reordering, all the ranks for the first node are listed first, then all the ranks for the second node, and so on. When rank reordering has been done (see the MPICH_RANK_REORDER_METHOD environment variable), the order of the nodes can be very different. To spread the aggregators across the nodes when MPICH_RANK_REORDER_METHOD=3, MPIIO sorts the list by nid and then by rank on that node to obtain the node order used when assigning aggregators. This has the desired effect of spreading the aggregators across the nodes assigned to the job.

The current implementation is not file-specific. That is, the environment variable applies to all files opened with MPI_File_open().

Default: -1
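
The stride rules above can be modeled with a few lines of arithmetic. The sketch below is a simplified illustration with hypothetical names: it takes an ordered node list, ignores the per-node limits set through cb_config_list, and treats wrap-around as a simple modulo, which may differ from the actual implementation.

```python
def aggregator_nodes(node_list, num_aggregators, stride=-1):
    """Return the node chosen for each aggregator under the stride rules."""
    num_nodes = len(node_list)
    if stride == -1:
        # Nodes divided by aggregators, integer division, minimum of 1.
        stride = max(1, num_nodes // num_aggregators)
    # Walk the node list by the stride, wrapping around at the end.
    return [node_list[(i * stride) % num_nodes] for i in range(num_aggregators)]
```

With four nodes and eight aggregators, the default of -1 yields a stride of 1 and the list wraps once, which gives the same node sequence as the MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY example shown earlier.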

MPICH_MPIIO_CB_ALIGN

Sets the default value for the cb_align hint. Files opened with MPI_File_open will have this value for the cb_align hint unless the hint is set on a per-file basis with either the MPICH_MPIIO_HINTS environment variable or from within a program with the MPI_Info_set() call.

Note: Only MPICH_MPIIO_CB_ALIGN == 2 is fully supported. Other values are for internal testing only.

Default: 2

MPICH_MPIIO_DVS_MAXNODES

Note: This environment variable is relevant only for file systems accessed from HPE system compute nodes via DVS server nodes; e.g. GPFS or PANFS.

As described in the dvs(5) man page, the environment variable DVS_MAXNODES can be used to set the stripe width—that is, the number of DVS server nodes—used to access a file in “stripe parallel mode.” For most files, and especially for small files, setting DVS_MAXNODES to 1 (“cluster parallel mode”) is preferred.

The MPICH_MPIIO_DVS_MAXNODES environment variable enables you to leave DVS_MAXNODES set to 1 and then use MPICH_MPIIO_DVS_MAXNODES to temporarily override DVS_MAXNODES when it is advantageous to specify wider striping for files being opened by the MPI_File_open() call. The range of values accepted by MPICH_MPIIO_DVS_MAXNODES goes from 1 to the number of server nodes specified on the mount with the nnodes mount option.

DVS_MAXNODES is not set by default. Therefore, for MPICH_MPIIO_DVS_MAXNODES to have any effect, DVS_MAXNODES must be defined before program startup, and its value must be exactly three characters long: the leading characters specify the decimal value and any remaining characters are underscores. For example, DVS_MAXNODES=12_. If DVS_MAXNODES is not defined or is defined incorrectly, MPI-IO ignores MPICH_MPIIO_DVS_MAXNODES. A warning message is issued if the value requested by the user does not match the value actually used by DVS.

MPICH_MPIIO_DVS_MAXNODES interacts with MPICH_MPIIO_HINTS. To determine the striping actually used, the order of precedence is:

striping_factor set using MPICH_MPIIO_HINTS, if set

striping_factor set using MPI_Info_set(), if set

MPICH_MPIIO_DVS_MAXNODES value, if set

DVS_MAXNODES, if set

DVS maxnodes=n mount option, if specified

Default: unset
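
The three-character DVS_MAXNODES format and the order of precedence listed above can be sketched as follows. Both function names are hypothetical, all inputs are plain integers or None, and the model assumes that an undefined DVS_MAXNODES simply removes MPICH_MPIIO_DVS_MAXNODES from consideration, as the text describes.

```python
def format_dvs_maxnodes(n):
    """Pad a decimal value with underscores to exactly three characters."""
    if not 1 <= n <= 999:
        raise ValueError("value must fit in three characters")
    return str(n).ljust(3, "_")


def effective_stripe_width(hints_env=None, info_set=None,
                           mpiio_dvs_maxnodes=None, dvs_maxnodes=None,
                           mount_maxnodes=None):
    """Return the first value that is set, in the documented order."""
    if dvs_maxnodes is None:
        # MPICH_MPIIO_DVS_MAXNODES is ignored unless DVS_MAXNODES is defined.
        mpiio_dvs_maxnodes = None
    for value in (hints_env, info_set, mpiio_dvs_maxnodes,
                  dvs_maxnodes, mount_maxnodes):
        if value is not None:
            return value
    return None
```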

MPICH_MPIIO_HINTS

If set, overrides the default value of one or more MPI I/O hints. This also overrides any values that were set by using calls to MPI_Info_set in the application code. The new values apply to the file the next time it is opened using an MPI_File_open() call.

After the MPI_File_open() call, subsequent MPI_Info_set calls can be used to pass new MPI I/O hints that take precedence over some of the environment variable values. Other MPI I/O hints such as striping_factor, striping_unit, cb_nodes, and cb_config_list cannot be changed after the MPI_File_open() call, as these are evaluated and applied only during the file open process.

An MPI_File_close call followed by an MPI_File_open call can be used to restart the MPI I/O hint evaluation process.

The syntax for this environment variable is a comma-separated list of specifications. Each individual specification is a pathname_pattern followed by a colon-separated list of one or more key=value pairs. In each key=value pair, the key is the MPI-IO hint name, and the value is its value as it would be coded for an MPI_Info_set library call.

For example:

MPICH_MPIIO_HINTS=spec1[,spec2,...]

Where each specification has the syntax:

pathname_pattern:key1=value1[:key2=value2:...]

The pathname_pattern can be an exact match with the filename argument used in the MPI_File_open() call or it can be a pattern as described below.

When a file is opened with MPI_File_open(), the list of hint specifications in the MPICH_MPIIO_HINTS environment variable is scanned. The first pathname_pattern matching the filename argument in the MPI_File_open() call is selected. Any hints associated with the selected pathname_pattern are applied to the file being opened. If no pattern matches, no hints from this specification are applied to the file.

The pathname_pattern follows standard shell pattern-matching rules with these meta-characters:

----------------------------------------------------------------
Pattern    Description
----------------------------------------------------------------
*          Match any number of characters
?          Match any single character
[a-b]      Match any single character between a and b, inclusive
\          Interpret the meta-character that follows literally
----------------------------------------------------------------
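
The specification syntax and the first-match rule described above can be approximated with Python's standard fnmatch module. This is a sketch with hypothetical function names, not the library's parser. It does not handle the #...# quoting that the cb_config_list hint requires, and Python's fnmatch does not treat a backslash as an escape character, so the last row of the table is not modeled.

```python
import fnmatch


def parse_hint_specs(env_value):
    """Split 'pattern:key=value[:key=value...]' specifications on commas."""
    specs = []
    for spec in env_value.split(","):
        pattern, _, pairs = spec.strip().partition(":")
        hints = dict(pair.split("=", 1) for pair in pairs.split(":") if pair)
        specs.append((pattern, hints))
    return specs


def hints_for(filename, env_value):
    """Return the hints of the first specification whose pattern matches."""
    for pattern, hints in parse_hint_specs(env_value):
        if fnmatch.fnmatchcase(filename, pattern):
            return hints
    return {}  # no pattern matched: no hints from the variable apply
```

For instance, with the value file1:direct_io=true,/scratch/user/me/dump.*:romio_cb_write=enable:cb_nodes=8, a file named /scratch/user/me/dump.0001 would receive the romio_cb_write and cb_nodes hints under this model.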

The simplest pathname_pattern is *. Using this results in the specified hints being applied to all files opened with MPI_File_open(). Use of this wildcard is discouraged because of the possibility that a library linked with the application may also open a file for which the hints are not appropriate.

The following example shows how to set hints for a set of files. The final specification in this example, for file /scratch/user/me/dump.*, has two key=value pairs.

MPICH_MPIIO_HINTS=file1:direct_io=true,file2:romio_ds_write=disable,\
/scratch/user/me/dump.*:romio_cb_write=enable:cb_nodes=8

The following MPI-IO key values are supported on HPE systems.

abort_on_rw_error

If set to enable, causes MPI-IO to abort immediately after issuing an error message if an I/O error occurs during a system read() or system write() call. The valid values are enable and disable. See the MPICH_MPIIO_ABORT_ON_RW_ERROR environment variable for more details.

Default: disable

cb_align

Specifies which alignment algorithm to use for collective buffering. If set to 2, an algorithm is used to divide the I/O workload into Lustre stripe-sized pieces and assigns these pieces to collective buffering nodes (aggregators) so that each aggregator always accesses the same set of stripes and no other aggregator accesses those stripes. This is generally the optimal collective buffering mode as it minimizes the Lustre file system extent lock contention and thus reduces system I/O time.

Historically there have been a few different collective buffering alignment algorithms used on HPE systems. Currently only one of them, algorithm 2, is supported. The alignment value of 1 is no longer supported. The alignment values of 0 and 3 are not fully supported but are for internal testing only. Other algorithms may be supported in the future.

Default: 2

cb_buffer_size

Sets the buffer size in bytes for collective buffering.

This hint is not used with the current default collective buffering algorithm.

cb_config_list

Specifies by name which nodes are to serve as aggregators. The syntax for the value is:

#name1:maxprocesses[,name2:maxprocesses,...]#

Where name is either * (match all node names) or the name returned by MPI_Get_processor_name, and maxprocesses specifies the maximum number of processes on that node to serve as aggregators. If the value of the cb_nodes hint is greater than the number of compute nodes, the value of maxprocesses must be greater than 1 in order to assign the required number of aggregators. When the cb_align hint is set to 2 (the default), the aggregators are assigned using a round-robin method across compute nodes.

The pair of # characters beginning and ending the list are not part of the normal MPIIO hint syntax but are required. Because colon (:) characters are used in both this list and in the MPICH_MPIIO_HINTS environment variable syntax, the # characters are needed to determine the meaning of each colon (:) character.

This value cannot be changed after the file is opened.

Default: *:*

cb_nodes

Specifies the number of aggregators used to perform the physical I/O for collective I/O operations when collective buffering is enabled. On multi-core nodes, all cores share the same node name.

With the current default collective buffering algorithm, the best value for cb_nodes is usually the same as striping_factor (in other words, the stripe count).

Default: striping_factor

cray_cb_nodes_multiplier

Specifies the number of collective buffering aggregators (cb_nodes) per OST for Lustre files. In other words, the number of aggregators is the stripe count (striping_factor) times the multiplier. This may improve or degrade I/O performance, depending on the file locking mode and other conditions. When the locking mode is 0, a multiplier of 1 is usually best for writing the file. When the locking mode is 1 or 2, a multiplier of 2 or more is usually best for writing the file. If a locking mode is specified and both cb_nodes and cray_cb_nodes_multiplier hints are set, the cb_nodes hint is ignored. See cray_cb_write_lock_mode.

When reading a file with collective buffering, a multiplier of 2 or more often improves read performance.

Note: If the number of aggregators exceeds the number of compute nodes, performance generally won’t improve over 1 aggregator per compute node.

Default: 1
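
The relationship between striping_factor, cray_cb_nodes_multiplier, and cb_nodes can be sketched as below. The function name is hypothetical. The case where both hints are set but no locking mode is specified is not covered by the text above; the sketch assumes cb_nodes is honored in that case, which is an assumption and not documented behavior.

```python
def effective_cb_nodes(striping_factor, multiplier=None,
                       cb_nodes=None, lock_mode=None):
    """Estimate the number of aggregators for a Lustre file."""
    if multiplier is not None and (cb_nodes is None or lock_mode is not None):
        # Aggregators = stripe count times the multiplier; cb_nodes is
        # ignored when a locking mode is specified and both hints are set.
        return striping_factor * multiplier
    if cb_nodes is not None:
        return cb_nodes  # assumption for the case the text does not cover
    return striping_factor  # default: cb_nodes equals striping_factor
```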

cray_cb_write_lock_mode

Specifies the file locking mode for accessing Lustre files. These modes do not apply when accessing other file systems. Valid values are:

0

Standard locking mode. Extent locks are held by each MPI rank accessing the file. The extent of each lock often exceeds the byte range needed by the rank. Locks are revoked and reissued when the extent of a lock held by one rank conflicts with the extent of a lock needed by another rank.

1

Shared lock locking mode. A single lock is shared by all MPI ranks that are writing the file. This lock mode is only applicable when collective buffering is enabled and is only valid if the only accesses to the file are writes and all the writes are done by the collective buffering aggregators. The romio_no_indep_rw hint must be set to true to use this locking mode. This is an explicit assertion that all file accesses will be with MPI collective I/O. Setting the romio_no_indep_rw hint to true also sets romio_cb_write and romio_cb_read to enable. Any other MPI I/O accesses will cause the program to abort and any non-MPI I/O access may cause the program to hang. Both HDF5 and netCDF do both collective and independent I/O so this locking mode is not appropriate for these APIs.

2

Lock ahead locking mode. Sets of non-overlapping extent locks are acquired ahead of time by all MPI ranks that are writing the file and the acquired locks are not expanded beyond the size requested by the ranks. This locking mode is only applicable when collective buffering is enabled but unlike locking mode 1, MPI independent I/O and non-MPI I/O are also allowed. However, to be a performance benefit, the majority of the I/O should be MPI collective I/O. This supports file access patterns such as that done by HDF5 and netCDF where some MPI independent I/O might occur in addition to MPI collective I/O. The romio_no_indep_rw hint does not need to be set to true. Also see the cray_cb_lock_ahead_num_extents hint.

Locking modes 1 and 2 reduce lock contention between multiple clients and therefore support greater parallelism by allowing multiple aggregators per OST to efficiently write to the file. Set the cray_cb_nodes_multiplier hint to 2 or more to get the increased parallelism. The optimal value depends on file system characteristics. Note that if lock mode 2 is not supported, a warning will be printed and the lock mode will be reset to 0.

Default: 0

direct_io

Enables the O_DIRECT mode for the specified file. The user is responsible for aligning the write or read buffer on a getpagesize() boundary. MPI-IO checks for alignment and aborts if it is not aligned. Valid values are true or false.

Default: false.
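
One way to obtain a buffer that satisfies the page-alignment requirement is an anonymous memory mapping, which the operating system places on a page boundary. The sketch below uses only the Python standard library; the helper names are hypothetical, and passing the buffer to MPI-IO (for example through a Python MPI binding) is not shown.

```python
import ctypes
import mmap


def page_aligned_buffer(nbytes):
    """Allocate a writable buffer aligned on a page boundary."""
    pagesize = mmap.PAGESIZE
    rounded = -(-nbytes // pagesize) * pagesize  # round up to whole pages
    return mmap.mmap(-1, rounded)  # anonymous mapping, page-aligned


def buffer_address(buf):
    """Return the starting address of a writable buffer object."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buf))
```

Checking buffer_address(buf) % mmap.PAGESIZE == 0 before issuing I/O mirrors the alignment check that MPI-IO performs.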

ind_rd_buffer_size

Specifies in bytes the size of the buffer to be used for data sieving on read.

Default: 4194304

ind_wr_buffer_size

Specifies in bytes the size of the buffer to be used for data sieving on write.

Default: 524288

romio_cb_read

Enables collective buffering on read when collective IO operations are used. Valid values are enable, disable, and automatic. In automatic mode, whether or not collective buffering is done is based on runtime heuristics. When MPICH_MPIIO_CB_ALIGN is set to 2, the heuristics favor collective buffering.

Default: automatic.

romio_cb_write

Enables collective buffering on write when collective IO operations are used. Valid values are enable, disable, and automatic. In automatic mode, whether or not collective buffering is done is based on runtime heuristics. When MPICH_MPIIO_CB_ALIGN is set to 2, the heuristics favor collective buffering.

Default: automatic.

romio_ds_read

Specifies if data sieving is to be done on read. Valid values are enable, disable, and automatic.

Default: disable

romio_ds_write

Specifies if data sieving is to be done on write. Valid values are enable, disable, and automatic. When set to automatic, data sieving on write is turned off if the MPI library has been initialized with MPI_THREAD_MULTIPLE. Setting the value enable will turn on data sieving on write irrespective of the thread environment and is safe as long as MPI-IO routines aren’t called concurrently from threads in a rank. Additionally, in order to avoid data corruption, it is necessary that data sieving is disabled if single-threaded applications write to a file using multiple communicators.

Default: automatic

romio_no_indep_rw

Enables an optimization in which only the aggregators open the file, thus limiting the number of system open calls. For this hint to be valid, all I/O on the file must be done by MPI collective I/O calls (that is, no independent I/O) and collective buffering must not be disabled. Valid values are true or false.

Default: false.

striping_factor

Specifies the number of Lustre file system stripes (stripe count) to assign to the file. This has no effect if the file already exists when the MPI_File_open() call is made. File striping cannot be changed after a file is created. Currently this hint applies only when MPICH_MPIIO_CB_ALIGN is set to 2.

The value 0 denotes the default value for the Lustre file system, or the value for the directory in which the file is created if the lfs setstripe command was used to set the stripe count of the directory to a value other than the system default. The value -1 means using all available OSTs for striping.

Default: 0

striping_unit

Specifies in bytes the size of the Lustre file system stripes (stripe size) assigned to the file. This has no effect if the file already exists when the MPI_File_open() call is made. File striping cannot be changed after a file is created. Currently this hint applies only when MPICH_MPIIO_CB_ALIGN is set to 2.

Default: the default value for the Lustre file system, or the value for the directory in which the file is created if the lfs setstripe command was used to set the stripe size of the directory to a value other than the system default.

overstriping_factor

Specifies the number of Lustre file system stripes (stripe count) to assign to the file when more stripes than the available OSTs are needed. This has no effect if the file already exists when the MPI_File_open() call is made. File overstriping cannot be changed after a file is created. Currently this hint applies only when MPICH_MPIIO_CB_ALIGN is set to 2. This hint will take precedence when used along with striping_factor.

The value 0 denotes the default value for the Lustre file system, or the value for the directory in which the file is created if the lfs setstripe command was used to set the overstripe count of the directory to a value other than the system default. The value -1 means using all available OSTs for overstriping.

Default: 0

MPICH_MPIIO_HINTS_DISPLAY

If enabled, causes rank 0 in the participating communicator to display the names and values of all MPI-IO hints that are set for the file being opened with the MPI_File_open call.

Default: not enabled.

MPICH_MPIIO_OFI_STARTUP_CONNECT

By default, OFI connections between ranks are set up on demand. This allows for optimal performance while minimizing memory requirements in Slingshot-10 and InfiniBand systems. However, for MPIIO jobs requiring a large number of PEs and IO aggregators, it may be beneficial to create OFI connections between PEs and IO aggregators in a coordinated manner at file open. If enabled, this feature will create connections between all ranks on each node and the IO aggregators in the job during MPI_File_open.

This option is not beneficial on a Slingshot-11 system.

Default: not enabled.

MPICH_MPIIO_OFI_STARTUP_NODES_AGGREGATOR

If MPICH_MPIIO_OFI_STARTUP_CONNECT is enabled, this specifies the number of nodes that will concurrently attempt to connect to each aggregator during MPI_File_open. Increasing the value may improve the performance of MPI_File_open in some configurations.

Default: 2

MPICH_MPIIO_STATS

If set to 1, a summary of file write and read access patterns is written by rank 0 to stderr. This information provides some insight into how I/O performance may be improved. The information is provided on a per-file basis and is written when the file is closed. It does not provide any timing information.

If set to 2, a set of data files are written to the working directory, one file for each rank, with the filename prefix specified by the MPICH_MPIIO_STATS_FILE environment variable. The data is in comma-separated values (CSV) format, which can be summarized with the cray_mpiio_summary script in the /opt/cray/pe/mpich/version/ofi/mpich-cray/version*/bin* directory. Additional example scripts are provided in that directory to further process and display the data.

Default: not set

MPICH_MPIIO_STATS_FILE

Specifies the filename prefix for the set of data files written when MPICH_MPIIO_STATS is set to 2. The filename prefix may be a full absolute pathname or a relative pathname.

Summary plots of these files can be generated using the cray_mpiio_summary script from the /opt/cray/pe/mpich/version/ofi/mpich-cray/version*/bin* directory. Other example scripts for post-processing this data can also be found in /opt/cray/pe/mpich/version/ofi/mpich-cray/version*/bin*.

Default: cray_mpiio_stats

MPICH_MPIIO_STATS_INTERVAL_MSEC

Specifies the time interval in milliseconds for each MPICH_MPIIO_STATS data point.

Default: 250

MPICH_MPIIO_TIMERS

If set to 0 or not set at all, no timing data is collected.

If set to 1, timing data for different phases in MPI_IO is collected locally by each MPI process. During MPI_File_close the data is consolidated and printed. Some timing data is displayed in seconds, other data is displayed in clock ticks, possibly scaled down. The relative values of the reported times are more important to the analysis than the absolute time. See also MPICH_MPIIO_TIMERS_SCALE.

More detailed information about MPI_IO performance can be obtained by using the MPICH_MPIIO_STATS feature and by using the CrayPat and Apprentice2 Timeline Report of I/O bandwidth.

Default: 0

MPICH_MPIIO_TIMERS_SCALE

Specifies the power of 2 to use to scale the times reported by MPICH_MPIIO_TIMERS. The raw times are collected in clock ticks. This generally is a very large number and reducing all the times by the same scaling factor makes for a more compact display.

If set to 0, or not set at all, MPI-IO automatically determines a scaling factor to limit the report times to 9 or fewer digits. This auto-determined value is displayed. To make run to run comparisons, you can set the scaling factor to your preferred value.

Default: 0
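
The power-of-2 scaling can be illustrated with integer shifts. This is a sketch of one plausible way to pick the automatic scaling factor (the smallest power of 2 that brings the largest value down to 9 or fewer digits); the function names are hypothetical and the library's actual selection rule is not documented here.

```python
def auto_scale_power(max_ticks, max_digits=9):
    """Smallest power of 2 that reduces max_ticks to at most max_digits digits."""
    power = 0
    limit = 10 ** max_digits
    while (max_ticks >> power) >= limit:
        power += 1
    return power


def scale_ticks(ticks, power):
    """Divide a raw tick count by 2**power, discarding the remainder."""
    return ticks >> power
```

Fixing the power explicitly, rather than relying on the automatic choice, keeps the reported numbers comparable from run to run.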

MPICH_MPIIO_TIME_WAITS

If set to a non-zero value, times how long this rank has to wait for other ranks to catch up. This separates true metadata time from imbalance time.

This is disabled when MPICH_MPIIO_TIMERS is not set. Otherwise it defaults to 1.

Default: 1

DYNAMIC PROCESS MANAGEMENT (DPM) ENVIRONMENT VARIABLES

HPE Cray MPICH supports Dynamic Process Management (DPM). This allows MPI applications to spawn additional ranks. Some special settings are required on different platforms and launchers in order to enable DPM. This document describes environment variables and command line options for enabling and configuring DPM on different platforms and launchers.

PMI uses the PMI_UNIVERSE_SIZE environment variable to indicate to MPI the maximum number of ranks that it can have running in the same job. This directly translates to the MPI_UNIVERSE_SIZE attribute. Slurm currently sets this variable to the number of ranks in the base job-step. You can override this environment variable in the launched environment, though the launcher may still limit the number of spawned processes.
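
As a rough illustration of how the universe size bounds spawning, the sketch below computes how many more ranks could be started for a given current rank count. The helpers parse_universe_size and spawn_headroom are hypothetical and not part of MPI or PMI. Within an MPI program the same value is normally read from the MPI_UNIVERSE_SIZE attribute of MPI_COMM_WORLD with MPI_Comm_get_attr, and the launcher may impose a lower limit than this calculation suggests.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a PMI_UNIVERSE_SIZE-style string; returns -1 if unset or invalid. */
int parse_universe_size(const char *value)
{
    if (value == NULL || *value == '\0')
        return -1;
    char *end = NULL;
    long n = strtol(value, &end, 10);
    if (*end != '\0' || n <= 0 || n > 2147483647L)
        return -1;
    return (int)n;
}

/*
 * Number of additional ranks that fit in the universe, given the number
 * of ranks already running (hypothetical helper). Returns 0 when the
 * universe size is unknown or already reached.
 */
int spawn_headroom(const char *universe_value, int world_size)
{
    assert(world_size > 0);
    int universe = parse_universe_size(universe_value);
    if (universe < world_size)
        return 0;
    return universe - world_size;
}
```

A caller might pass getenv("PMI_UNIVERSE_SIZE") as the first argument and the result of MPI_Comm_size as the second.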

In order to save resources, the MPI library does not enable support for the fabric if it was launched onto a single host. Fabric support is needed for DPM. If you intend to use DPM or intercommunicators with single-host applications, then set MPICH_SINGLE_HOST_ENABLED=0 to force the MPI library to initialize fabric support.

Slingshot Fabric with Cassini NICs

MPICH uses the Process Management Interface (PMI) to request the launcher to start additional processes. On Slurm systems, the PMI_SPAWN_SRUN_ARGS environment variable controls how PMI interacts with Slurm. By default it is set to "--mpi=cray_shasta --exclusive". You may wish to add a --cpu-bind= option to the environment variable. See Slurm’s srun(1) man page for more options.

On Cassini systems, add --network to both the original srun command line and the PMI_SPAWN_SRUN_ARGS environment variable. It should be set to "--network=single_node_vni,job_vni,def_tles=0". This enables Slurm to subdivide the job’s slots into separate job-steps. The single_node_vni option ensures that Slingshot security tokens are provided even if the application runs on only a single host. The job_vni option ensures that an intra-job security token is provided. The def_tles option prevents a limited, rarely used Cassini resource from being exhausted by DPM.

With HPE’s Parallel Application Launch Service (PALS), PBS needs to be told that the job will need an intra-job security token. PBS ignores this option by default if the job only includes a single node. Therefore your qsub command line may look like “qsub -v HPE_SLINGSHOT_OPTS=get_job_vni:single_node_vni …”. This requires that the HPE Slingshot PBS hook be installed and configured.

With PALS, applications must be launched through mpiexec with the --job-aware option. This causes PALS to track the compute slots reserved through PBS. On Cassini systems, add "--network def_tles=0". The def_tles option prevents a limited, rarely used Cassini resource from being exhausted by DPM. If you wish to use DPM on a single host, add --single-node-vni to cause PALS to provide HPE Slingshot VNI security tokens, which are not normally used on single hosts. See the mpiexec(1) man page for more information.

Slingshot Fabric with Mellanox NICs

Do not use Cassini-specific --network options with Mellanox NICs.

MPICH_DPM_DIR

Specifies the file path used for the dynamic process management directory. The directory must be cross-mounted on the compute nodes in order to be visible to MPI during an application’s execution. If MPICH_DPM_DIR is not set, MPI will attempt to create the directory in the user’s home directory. Note that the directory and any files in it are managed by MPI and should not be directly modified by the user.

Default: not set

CONFORMING TO

This release of MPI derives from Argonne National Laboratory MPICH and implements the MPI-3.1 standard as documented by the MPI Forum in MPI: A Message Passing Interface Standard, Version 3.1, except as described in the NOTES section of this man page.

NOTES

MPI Singleton support

A singleton is an MPI program with a single rank (having rank number 0). A singleton launch is the execution of a one-rank MPI program without a launcher. For example:

$> cat ./mpi_hello.c
/* MPI hello world example */

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank = -1;
    int size = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello world! I am rank %d of %d.\n", rank, size);

    MPI_Finalize();
    return 0;
}
$>
$> cc -o ./mpi_hello ./mpi_hello.c
$>
$> ./mpi_hello
Hello world! I am rank 0 of 1.
$>

Using MPICH ABI Compatibility

The primary focus of MPICH ABI Compatibility support is to support ISV (Independent Software Vendor) applications. Most ISV applications come with a set of MPI libraries. These MPI libraries may not work on an HPE system, but if they are ABI compatible with HPE Cray MPI, one may be able to run the application using the high performance HPE libraries already installed on the system. Additional work may be required to get the ISV application to run, depending on the particular launching scripts supplied with the ISV application. For more information please see the MPICH ABI Compatibility Status for Shasta white paper.

To use MPICH ABI compatibility one must have a dynamically linked MPI program built with:

1. A compiler that was supported by HPE at the time of compilation and that is compatible with a currently supported compiler.

2. An OS that is compatible with the OS on the HPE system on which the program will be run.

3. An MPICH ABI compatible MPI implementation. See https://wiki.mpich.org/mpich/index.php/ABI_Compatibility_Initiative for a list of MPICH ABI compatible MPI implementations.

Given all of the above requirements, one can run the application by doing the following:

1. Make sure you have the programming environment module (PrgEnv-*) loaded for the compiler that was used to build the application. If necessary, change the compiler module to the version used to build the application or a version compatible with the compiler used to build the application.

2. Load the cray-mpich-abi module. Make sure to swap with the normal cray-mpich module or unload the cray-mpich module before loading the cray-mpich-abi module.

3. Launch the program with a launcher supported by your HPE system. Note that one may have to translate launcher options, set (or not set) various environment variables, etc.

A number of older ISV applications (built with Intel MPI versions earlier than Intel MPI 5.0) may work with HPE Cray MPI. Some earlier Intel MPI versions are ABI compatible with the MPI library but have different library names or version numbers. These applications might run by loading the cray-mpich-abi-pre-intel5.0 module instead of the cray-mpich-abi module. All other previously mentioned caveats about running ISV applications apply.

Using MPIxlate

MPIxlate is a utility provided by the HPE Cray Programming Environment. It uses HPE Cray MPI to run applications that were built with other supported MPI shared library implementations that do not conform to MPICH ABI compatibility requirements.

For more information, please refer to the mpixlate(1) man page.

Maximum Tag Value Varies with Network Interconnect

The maximum tag value supported by HPE Cray MPI varies depending on the network interconnect and libraries in use. These variations are dictated by the underlying networking layers and hardware. The MPI standard requires only that the maximum tag value (the MPI_TAG_UB attribute) be at least 32767 and that it not change during the execution of an MPI program.

If you are using UCX libraries the maximum tag value is always 16777215.

If you are using OFI libraries and do not use the interconnect (run on a UAN/UAI via ./<prog> or run on a single node, even multiple ranks on a single node) the maximum tag value will be 536870911.

If you are using OFI libraries and multiple nodes:

1. Using industry standard NICs (SS-10) - maximum tag value is 268435455

2. Using the HPE Slingshot NIC (SS-11) - maximum tag value is 33554431
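
The limits listed above can be collected into constants for a quick portability check, as in the sketch below. The constant names and the tag_is_valid helper are hypothetical; only the numeric values come from this man page. An application should normally query the actual limit at run time through the MPI_TAG_UB attribute of MPI_COMM_WORLD (using MPI_Comm_get_attr) rather than hard-code any of these values.

```c
#include <assert.h>

/* Maximum tag values listed in this man page (names are illustrative). */
enum {
    TAG_MAX_UCX            = 16777215,   /* 2^24 - 1 */
    TAG_MAX_OFI_NO_FABRIC  = 536870911,  /* 2^29 - 1, interconnect not used */
    TAG_MAX_OFI_SS10       = 268435455,  /* 2^28 - 1 */
    TAG_MAX_OFI_SS11       = 33554431,   /* 2^25 - 1 */
    TAG_MAX_MPI_GUARANTEED = 32767       /* 2^15 - 1, minimum the standard allows */
};

/* Returns 1 if tag lies in the range 0..tag_ub, otherwise 0. */
int tag_is_valid(int tag, int tag_ub)
{
    assert(tag_ub >= TAG_MAX_MPI_GUARANTEED);
    return tag >= 0 && tag <= tag_ub;
}
```

For example, a tag of 20000000 passes the check against the SS-11 limit but fails against the UCX limit, so a program that relies on it would not be portable between the two netmods. Tags no larger than 32767 are valid on any conforming MPI implementation.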

Processes Per Node Limitations

If you are using HPE Slingshot NICs (SS-11), there is a limit on the number of MPI ranks per NIC. Each NIC has the resources to support a maximum of 254 ranks. If you try to use more than 254 ranks per NIC, MPI will fail to initialize. We recommend running with multiple threads per process as an alternative to very high process counts.
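
A planned job layout can be checked against this limit before launch. The sketch below assumes that the ranks on a node are spread as evenly as possible across its NICs; the actual assignment of ranks to NICs depends on the system configuration and is not modeled here. The function names are hypothetical.

```c
#include <assert.h>

enum { MAX_RANKS_PER_NIC = 254 };  /* SS-11 limit stated in this man page */

/* Ranks on the busiest NIC, assuming an even spread across the node's NICs. */
int ranks_on_busiest_nic(int ranks_per_node, int nics_per_node)
{
    assert(ranks_per_node > 0 && nics_per_node > 0);
    /* Ceiling division: a remainder puts one extra rank on some NICs. */
    return (ranks_per_node + nics_per_node - 1) / nics_per_node;
}

/* Returns 1 if the layout stays within the per-NIC limit, otherwise 0. */
int fits_nic_limit(int ranks_per_node, int nics_per_node)
{
    return ranks_on_busiest_nic(ranks_per_node, nics_per_node)
           <= MAX_RANKS_PER_NIC;
}
```

Under this assumption, 256 ranks on a node with one NIC exceed the limit, while the same 256 ranks on a node with two NICs (128 per NIC) fit. Running 128 ranks with 2 threads each is another way to use the same 256 hardware threads while staying under the limit.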