compiler_interop

Date: 05-31-2024

NAME

compiler_interop - Overview of CCE compiler interoperability

DESCRIPTION

This man page describes CCE’s cross-language and cross-vendor interoperability capabilities and limitations. Successfully supporting large HPC applications requires several different forms of compiler and programming language interoperability. First, applications may require interoperability across two or more base languages (e.g., C, C++, or Fortran) and programming models (e.g., OpenMP, HIP, etc.). This often occurs when an application comprises several independent subcomponents, which may be developed by different teams, with different development philosophies and goals. Second, applications may require interoperability across two or more compiler vendors. This can occur when one or more vendor compilers exhibit functional or performance issues when used to compile source for a particular subcomponent, limiting compiler choice for that subcomponent. Therefore, it is critical to support linking object files generated by a variety of vendor compilers and written in a variety of base languages and programming models.

C and C++ Interoperability

The CCE C and C++ compilers are based on the Clang/LLVM compiler project and generally provide the same language interoperability guarantees as Clang, which supports strong interoperability with the GNU compiler toolchain. By convention, the CCE major version number matches the LLVM major version number on which a particular CCE release is based. The CCE release notes also indicate the LLVM base for each CCE release.

CCE adheres to the standard system ABI for x86-64 and aarch64 targets, ensuring C interoperability with other C compilers supporting the system ABI, including GNU- and Clang-based compilers.
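
For example, the following commands (file names are hypothetical) compile one object with GCC and link it into a CCE-compiled program; the shared system ABI makes this combination well defined:

# Compile one translation unit with GCC and one with CCE, then link with CCE
gcc -c util.c
craycc main.c util.o -o app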

C++ interoperability with other compilers depends on the C++ header files and libraries used by each compiler. CCE defaults to using standard system-provided GNU C++ headers and libraries, ensuring that C++ source files compiled by CCE are compatible with the system-provided GNU C++ compiler. Users may direct CCE to use headers and libraries from a specific version of GCC. See GCC Header and Library Selection for details.

Fortran Interoperability

The CCE Fortran compiler supports interoperability between Fortran and C through Fortran language standard C interoperability mechanisms. Cross-vendor Fortran interoperability is not generally supported because the Fortran language specification does not define a standard ABI or runtime library interface. All Fortran code contained in an application must typically be compiled by a single Fortran compiler.
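
For example, the following commands (file names are hypothetical) compile a C routine with GCC and link it into a CCE Fortran program, assuming the Fortran source declares the C routine in an interface block with BIND(C):

# C side compiled with GCC; Fortran side compiled and linked with CCE
gcc -c c_kernels.c
crayftn main.f90 c_kernels.o -o app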

GCC Header and Library Selection

The CCE C, C++, and Fortran compilers automatically use required GCC header directories and link with required GCC libraries. The compilers search the system for installed GCC headers and libraries, and select the highest GCC version found. Users may direct the compilers to use specific GCC headers and libraries via the mechanisms described in this section. Users may inspect the compiler’s GCC selection decisions using the debugging mechanisms described in this section.

Command Line Options

When the **--gcc-install-dir=***lib-dir* option is provided, the compiler uses the GCC libraries in lib-dir. The compiler automatically uses the appropriate header file directories for the libraries in lib-dir. The **--gcc-install-dir=***lib-dir* option is recognized by the C, C++, and Fortran compilers. For example, the following option directs the compiler to use the GCC libraries in /usr/lib64/gcc/x86_64-suse-linux/13:

--gcc-install-dir=/usr/lib64/gcc/x86_64-suse-linux/13

When the **--gcc-toolchain=***prefix* option is provided, the compiler uses the GCC headers and libraries from the highest GCC version found in prefix. The **--gcc-toolchain=***prefix* option is recognized by the C, C++, and Fortran compilers. For example, the following option directs the compiler to use the highest GCC version found in /usr:

--gcc-toolchain=/usr

The --gcc-install-dir= option (with nothing after the =) sets lib-dir to the empty string. The --gcc-toolchain= option (with nothing after the =) sets prefix to the empty string.

Precedence Rules

When multiple **--gcc-install-dir=***lib-dir* options are provided, only the last lib-dir takes effect. When multiple **--gcc-toolchain=***prefix* options are provided, only the last prefix takes effect. A **--gcc-install-dir=***lib-dir* option on the command line overrides any lib-dir provided via a configuration file. A **--gcc-toolchain=***prefix* option on the command line overrides any prefix provided via a configuration file. These precedence rules apply equally to empty strings and nonempty strings.

The compiler considers all configuration files and command line options, and computes a final lib-dir and prefix. If lib-dir is nonempty, the compiler uses only lib-dir. If lib-dir is empty and prefix is nonempty, the compiler uses only prefix. If both lib-dir and prefix are empty, the compiler uses its default GCC search algorithm. As a result, a prefix provided on the command line does not override a nonempty lib-dir provided via a configuration file.

Clearing lib-dir and prefix directs the compiler to search the system for the highest GCC version installed.
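
The following commands (the source file name is hypothetical) illustrate these rules:

# lib-dir takes precedence over prefix; the GCC 13 libraries are used
craycc --gcc-toolchain=/usr \
       --gcc-install-dir=/usr/lib64/gcc/x86_64-suse-linux/13 -c main.c

# Clear a lib-dir provided via a configuration file, falling back to
# the default search for the highest installed GCC version
craycc --gcc-install-dir= -c main.c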

Configuration Files

Administrators may set systemwide GCC selection defaults by creating and editing compiler configuration files. There are separate configuration files for the C, C++, and Fortran compilers.

Configuration Files for C and C++

The C compiler reads the clang.cfg configuration file located in the same directory as the clang executable. The C compiler reads options from clang.cfg and processes them before any options provided on the command line. Providing a **--gcc-install-dir=***lib-dir* or **--gcc-toolchain=***prefix* option in clang.cfg directs the C compiler to use the specified GCC selection option by default.

The C++ compiler reads the clang++.cfg configuration file located in the same directory as the clang++ executable. The C++ compiler reads options from clang++.cfg and processes them before any options provided on the command line. Providing a **--gcc-install-dir=***lib-dir* or **--gcc-toolchain=***prefix* option in clang++.cfg directs the C++ compiler to use the specified GCC selection option by default.
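
For example, a clang++.cfg file containing the following single line (the GCC install path is illustrative) makes that GCC version the C++ compiler's default:

--gcc-install-dir=/usr/lib64/gcc/x86_64-suse-linux/13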

For more information about C and C++ compiler configuration files, see the Clang Compiler User’s Manual (UsersManual.html) installed with CCE.

Configuration Files for Fortran

The Fortran compiler reads two configuration files that affect GCC selection: gcc_install_dir and gcc_toolchain. These files are in the share directory adjacent to the bin directory that contains the ftnfe executable. For example, in CCE 18.0.0 on x86_64, the gcc_install_dir and gcc_toolchain files are in the following directory:

/opt/cray/pe/cce/18.0.0/cce/x86_64/bin/../share

The Fortran compiler interprets the first line of gcc_install_dir as the lib-dir for the --gcc-install-dir option. The Fortran compiler interprets the first line of gcc_toolchain as the prefix for the --gcc-toolchain option. Providing a nonempty gcc_install_dir or gcc_toolchain file directs the Fortran compiler to use the specified GCC selection option by default.
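
For example, the following command (paths are illustrative, using the CCE 18.0.0 layout shown above) sets a default lib-dir for the Fortran compiler:

echo "/usr/lib64/gcc/x86_64-suse-linux/13" > \
    /opt/cray/pe/cce/18.0.0/cce/x86_64/share/gcc_install_dir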

Debugging

When the -v option is provided, the C and C++ compilers print verbose messages to standard error (stderr), including GCC selection messages. When the CCE_DEBUG_GCC_SELECTION environment variable is set to 1, the Fortran compiler prints GCC selection messages to standard error (stderr).
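
For example (source file names are hypothetical):

# Print C compiler GCC selection messages, among other verbose output
craycc -v -c hello.c

# Print Fortran compiler GCC selection messages
CCE_DEBUG_GCC_SELECTION=1 crayftn -c hello.f90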

OpenMP Interoperability

CCE supports full OpenMP interoperability across all languages (C, C++, and Fortran) within the CCE compiler suite, and limited OpenMP interoperability between CCE and other GNU- and Clang-based compilers. This interoperability is enabled by CCE’s OpenMP runtime libraries and custom offload linker tool.

CCE provides two OpenMP runtime libraries: libcraymp, the primary CPU OpenMP runtime library that is needed by all OpenMP programs; and libcrayacc, the offloading library that is needed only for GPU offloading. CCE targets these OpenMP libraries for C, C++, and Fortran, ensuring cross-language OpenMP interoperability within the CCE compiler suite.

While CCE internally uses a non-standard compiler-runtime interface, CCE’s OpenMP runtime libraries also implement entry points for the GNU and Clang compiler-runtime interfaces. As a result, libcraymp can act as a drop-in replacement for GNU’s libgomp runtime library and Clang’s libomp runtime library, and libcrayacc can act as a drop-in replacement for Clang’s libomptarget runtime library. This allows compiling OpenMP source files with CCE and other GNU- or Clang-based compilers and then linking against the CCE OpenMP runtime libraries, enabling cross-vendor OpenMP interoperability within a single application (a build sketch follows the list below). Constraints for cross-vendor OpenMP offloading are as follows:

  • OpenMP interoperability is supported for Clang-based compilers when the LLVM version is less than or equal to CCE’s LLVM version; interoperability with objects compiled by newer versions of LLVM may work, but cannot be guaranteed. All OpenMP constructs and APIs are supported for Clang-based compilers.

  • OpenMP interoperability is supported for GNU-based compilers for OpenMP 3.1 constructs and APIs.

  • OpenMP interoperability for GPU offloading is only supported for Clang-based compilers; only OpenMP CPU interoperability is supported for GNU-based compilers.

  • GPU offloading interoperability does not support cross-vendor function calls or references to global variables (i.e., accesses to shared global variables or function calls in offload regions need to be contained in files compiled by the same compiler). An application can contain offload regions that are compiled by different compilers as long as those regions are independent.

  • Applications must be linked against the CCE OpenMP runtime libraries, libcraymp and libcrayacc, rather than the GNU or Clang OpenMP runtime libraries. This is most easily accomplished by linking the application with the CCE compiler driver.

  • Applications that contain GPU offloading regions must be linked with the CCE OpenMP Offload Linker tool, which handles all of the device linking steps needed for “bundled” offload objects. The offload linker tool is invoked automatically when CCE is used to perform a final link command. The offload linker tool can also be invoked directly in a pre-link step when linking an application with another compiler driver. See the CCE OpenMP Offload Linker section in the intro_openmp man page for more details.
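
The following sketch (file names are hypothetical) applies these constraints to a CPU-only OpenMP case: each source file is compiled by a different vendor’s compiler, and the final link is performed with the CCE driver so that the application links against libcraymp:

# Compile OpenMP source files with GNU and CCE, then link with CCE
g++ -c -fopenmp gnu_part.cc
crayCC -c -fopenmp cce_part.cc
crayCC -fopenmp gnu_part.o cce_part.o -o app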

CCE provides distinct libcrayacc library builds for each supported offload target; the GPU-targeted builds have a direct dependence on the offload libraries provided by the corresponding GPU vendor.

  • libcrayacc is CCE’s NVIDIA offloading library, and it has a dependence on NVIDIA’s libcuda library.

  • libcrayacc_amdgpu is CCE’s AMD GPU offloading library, and it has a dependence on AMD’s libamdhip64 library.

  • libcrayacc_x86_64 and libcrayacc_aarch64 are CCE’s CPU offloading libraries; these libraries support “host offloading”, and therefore do not have a dependence on any third-party GPU offloading libraries.

For interoperability with native offloading models (e.g., CUDA or HIP), CCE supports the standard OpenMP interop construct and associated API functions.
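
The following is a minimal sketch of combining the interop construct with HIP, assuming an AMD GPU target (the function, variable, and buffer names are illustrative). It obtains the OpenMP runtime’s foreign synchronization object, which on AMD GPU targets is typically a HIP stream, and enqueues a native asynchronous copy on it:

#include <cstddef>
#include <omp.h>
#include <hip/hip_runtime.h>

void copy_with_hip(void *dst, const void *src, std::size_t bytes) {
    omp_interop_t iobj = omp_interop_none;
    int rc;

    /* Ask the OpenMP runtime for a foreign synchronization object. */
    #pragma omp interop init(targetsync: iobj)

    hipStream_t stream =
        (hipStream_t)omp_get_interop_ptr(iobj, omp_ipr_targetsync, &rc);

    /* Enqueue native HIP work on the stream obtained from OpenMP,
       then wait for it to complete. */
    hipMemcpyAsync(dst, src, bytes, hipMemcpyDefault, stream);
    hipStreamSynchronize(stream);

    /* Release the interop object. */
    #pragma omp interop destroy(iobj)
}

Such a file would be compiled as OpenMP offload code (not with -x hip), with the HIP runtime headers made available via -D__HIP_PLATFORM_AMD__ and the ROCm include path, as shown in the Example section below.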

OpenACC Interoperability

The CCE Fortran OpenACC implementation uses the same offloading runtime library as OpenMP, libcrayacc, and therefore natively supports mixing OpenACC and OpenMP offload in a single application. CCE does not support OpenACC interoperability with other compilers.

HIP Interoperability

CCE supports AMD GPU offloading using AMD’s Heterogeneous-Compute Interface for Portability (HIP) programming model. This support is provided through the open source HIP implementation available in Clang, and therefore CCE’s HIP implementation is interoperable with other HIP compiler implementations based on the same LLVM version.

CCE’s HIP implementation relies upon HIP header files and device libraries provided by an external AMD ROCm install. The rocm module or --rocm-path compiler flag can be used to specify which version of ROCm should be used with CCE. The CCE release notes indicate which versions of ROCm are validated and supported with a particular CCE release.
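
For example, the following command (the source file name is hypothetical) compiles HIP source with CCE against a specific ROCm install:

crayCC -x hip --offload-arch=gfx90a --rocm-path=/opt/rocm-6.0.0 -c kernels.cc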

Object files that contain both HIP and OpenMP offloading may be linked together in the same application, but the two models may not be used together in the same source file.

Cray Programming Environment Implications

Building a single executable or shared library with multiple compilers can be difficult when using the standard Cray Programming Environment (CPE) module environment, since the CrayPE compiler driver can only target one compiler vendor at a time (controlled by the top-level PrgEnv modules). This is not a major challenge if compiler interoperability occurs only at shared library granularity (i.e., each shared library is compiled entirely with one vendor’s compiler), since the appropriate PrgEnv module only needs to be loaded or swapped before building each shared library component. However, if different source files within a single executable or shared library must be built with different compilers, the PrgEnv module must be swapped before building those individual source files.

This module toggling can be hidden in a build system or in user-written compiler wrapper scripts, but it may be more straightforward to bypass the module environment and CrayPE compiler driver entirely and invoke each compiler driver directly. CCE supports direct invocation of the compiler through the craycc, crayCC, and crayftn commands. Note that currently loaded modules have no effect when a compiler driver is invoked directly; users must explicitly specify the appropriate compiler targeting flags, include paths, and linker flags, including flags needed for MPI. Alternatively, Cray MPI provides a set of MPI compiler wrappers (mpicc, mpicxx, and mpifort) that add the compiler and linker flags needed for compiling with Cray MPI.
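
For example, the following commands (file names are hypothetical) build a two-compiler executable by invoking the compiler drivers directly through the Cray MPI wrappers, with no PrgEnv module involvement:

# Direct invocation; loaded PrgEnv modules have no effect
env MPICH_CXX=$(which g++) mpicxx -c -fopenmp gnu_part.cc
env MPICH_CXX=$(which crayCC) mpicxx -fopenmp main.cc gnu_part.o -o app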

Example

This section provides a demonstration of heterogeneous compiler interoperability using the open-source “Goulash” project authored by developers at Lawrence Livermore National Laboratory (https://github.com/LLNL/goulash). This demonstration focuses on the “rush_larsen_interop2.5_gpu_omp_hip_mpi” test case, which exercises MPI, HIP GPU offload, OpenMP GPU offload, and OpenMP CPU threading from several base languages (C, C++, and Fortran) and several different compiler toolchains (CCE, AMD ROCm, and GNU). The target hardware is an HPE Cray EX supercomputer containing compute nodes populated with AMD EPYC “Trento” CPUs and AMD MI250X GPUs. HIP code is compiled with relocatable device code (-fgpu-rdc) because that is consistent with how many HPC applications are built, even though it is not strictly required for this test case. The configure and build setup is as follows:

# Specify common compiler flags
export OMP_CPU_CXXFLAGS="-O3 -fopenmp"
export OMP_GPU_CXXFLAGS="-O3 -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a --rocm-path=${ROCM_PATH}"
export HIP_CPPFLAGS="-D__HIP_PLATFORM_AMD__ -I${ROCM_PATH}/include"
export HIP_CXXFLAGS="-O3 --offload-arch=gfx90a -std=c++11 --rocm-path=${ROCM_PATH} -x hip -fgpu-rdc"

# Set up MPI wrappers for each compiler vendor, using an "env" command
# to set the MPICH_CXX or MPICH_FC environment variable so that each
# MPI wrapper invokes the appropriate compiler driver
export CCE_MPICXX="env MPICH_CXX=$(which crayCC) $(which mpicxx)"
export CCE_MPIFORT="env MPICH_FC=$(which crayftn) $(which mpifort)"
export ROCM_MPICXX="env MPICH_CXX=$(which amdclang) $(which mpicxx)"
export GNU_MPICXX="env MPICH_CXX=$(which g++) $(which mpicxx)"

# Capture library search paths
export INTEROP_LD_LIBRARY_PATH=${ROCM_PATH}/lib:$(pkg-config libfabric --variable=libdir)

# Define variables consumed by Goulash makefile
export INTEROP_CXX="${CCE_MPICXX}"
export INTEROP_CPPFLAGS="${HIP_CPPFLAGS}"
export INTEROP_CXXFLAGS="${OMP_GPU_CXXFLAGS}"
export INTEROP_COMPILERID="-DCOMPILERID=cce-${CRAY_CC_VERSION}"
export INTEROP_LDFLAGS="--hip-link --offload-arch=gfx90a -L${ROCM_PATH}/lib"
export INTEROP_LDLIBS="-lamdhip64 -lmpifort_cray"
export COMPILER1_GPU_OMP_HIP_MPI_CXX="${CCE_MPICXX}"
export COMPILER1_GPU_OMP_HIP_MPI_CPPFLAGS="${HIP_CPPFLAGS}"
export COMPILER1_GPU_OMP_HIP_MPI_CXXFLAGS="${OMP_GPU_CXXFLAGS}"
export COMPILER1_GPU_OMP_HIP_MPI_COMPILERID="-DCOMPILERID=cce-${CRAY_CC_VERSION}"
export COMPILER1_GPU_OMP_HIP_MPI_LDFLAGS="${INTEROP_LDFLAGS}"
export COMPILER1_GPU_OMP_HIP_MPI_LDLIBS="${INTEROP_LDLIBS}"
export COMPILER1_CPU_OMP_MPI_CXX="${CCE_MPICXX}"
export COMPILER1_CPU_OMP_MPI_CPPFLAGS=""
export COMPILER1_CPU_OMP_MPI_CXXFLAGS="${OMP_CPU_CXXFLAGS}"
export COMPILER1_CPU_OMP_MPI_COMPILERID="-DCOMPILERID=cce-${CRAY_CC_VERSION}"
export COMPILER1_CPU_OMP_MPI_LDFLAGS=""
export COMPILER1_CPU_OMP_MPI_LDLIBS=""
export COMPILER2_CPU_OMP_MPI_CXX="${ROCM_MPICXX}"
export COMPILER2_CPU_OMP_MPI_CPPFLAGS=""
export COMPILER2_CPU_OMP_MPI_CXXFLAGS="${OMP_CPU_CXXFLAGS}"
export COMPILER2_CPU_OMP_MPI_COMPILERID="-DCOMPILERID=$(basename ${ROCM_PATH})"
export COMPILER2_CPU_OMP_MPI_LDFLAGS=""
export COMPILER2_CPU_OMP_MPI_LDLIBS=""
export COMPILER3_CPU_OMP_MPI_CXX="${GNU_MPICXX}"
export COMPILER3_CPU_OMP_MPI_CPPFLAGS=""
export COMPILER3_CPU_OMP_MPI_CXXFLAGS="${OMP_CPU_CXXFLAGS} -fPIC"
export COMPILER3_CPU_OMP_MPI_COMPILERID="-DCOMPILERID=\"$(${COMPILER3_CPU_OMP_MPI_CXX} --version | head -n1)\""
export COMPILER3_CPU_OMP_MPI_LDFLAGS=""
export COMPILER3_CPU_OMP_MPI_LDLIBS=""
export COMPILER1_GPU_OMP_MPI_CXX="${CCE_MPICXX}"
export COMPILER1_GPU_OMP_MPI_CPPFLAGS=""
export COMPILER1_GPU_OMP_MPI_CXXFLAGS="${OMP_GPU_CXXFLAGS}"
export COMPILER1_GPU_OMP_MPI_COMPILERID="-DCOMPILERID=cce-${CRAY_CC_VERSION}"
export COMPILER1_GPU_OMP_MPI_LDFLAGS=""
export COMPILER1_GPU_OMP_MPI_LDLIBS=""
export COMPILER2_GPU_OMP_MPI_CXX="${ROCM_MPICXX}"
export COMPILER2_GPU_OMP_MPI_CPPFLAGS=""
export COMPILER2_GPU_OMP_MPI_CXXFLAGS="${OMP_GPU_CXXFLAGS}"
export COMPILER2_GPU_OMP_MPI_COMPILERID="-DCOMPILERID=$(basename ${ROCM_PATH})"
export COMPILER2_GPU_OMP_MPI_LDFLAGS=""
export COMPILER2_GPU_OMP_MPI_LDLIBS=""
export COMPILER1_GPU_HIP_MPI_CXX="${CCE_MPICXX}"
export COMPILER1_GPU_HIP_MPI_CPPFLAGS="${HIP_CPPFLAGS}"
export COMPILER1_GPU_HIP_MPI_CXXFLAGS="${HIP_CXXFLAGS}"
export COMPILER1_GPU_HIP_MPI_COMPILERID="-DCOMPILERID=cce-${CRAY_CC_VERSION}"
export COMPILER1_GPU_HIP_MPI_LDFLAGS=""
export COMPILER1_GPU_HIP_MPI_LDLIBS=""
export COMPILER2_GPU_HIP_MPI_CXX="${ROCM_MPICXX}"
export COMPILER2_GPU_HIP_MPI_CPPFLAGS="${HIP_CPPFLAGS}"
export COMPILER2_GPU_HIP_MPI_CXXFLAGS="${HIP_CXXFLAGS}"
export COMPILER2_GPU_HIP_MPI_COMPILERID="-DCOMPILERID=$(basename ${ROCM_PATH})"
export COMPILER2_GPU_HIP_MPI_LDFLAGS=""
export COMPILER2_GPU_HIP_MPI_LDLIBS=""
export COMPILER1_GPU_LAMBDA_HIP_MPI_CXX="${CCE_MPICXX}"
export COMPILER1_GPU_LAMBDA_HIP_MPI_CPPFLAGS="${HIP_CPPFLAGS}"
export COMPILER1_GPU_LAMBDA_HIP_MPI_CXXFLAGS="${HIP_CXXFLAGS}"
export COMPILER1_GPU_LAMBDA_HIP_MPI_COMPILERID="-DCOMPILERID=cce-${CRAY_CC_VERSION}"
export COMPILER1_GPU_LAMBDA_HIP_MPI_LDFLAGS=""
export COMPILER1_GPU_LAMBDA_HIP_MPI_LDLIBS=""
export COMPILER2_GPU_LAMBDA_HIP_MPI_CXX="${ROCM_MPICXX}"
export COMPILER2_GPU_LAMBDA_HIP_MPI_CPPFLAGS="${HIP_CPPFLAGS}"
export COMPILER2_GPU_LAMBDA_HIP_MPI_CXXFLAGS="${HIP_CXXFLAGS}"
export COMPILER2_GPU_LAMBDA_HIP_MPI_COMPILERID="-DCOMPILERID=$(basename ${ROCM_PATH})"
export COMPILER2_GPU_LAMBDA_HIP_MPI_LDFLAGS=""
export COMPILER2_GPU_LAMBDA_HIP_MPI_LDLIBS=""
export COMPILER1_CPU_OMP_MPI_FORT_FC="${CCE_MPIFORT}"
export COMPILER1_CPU_OMP_MPI_FORT_CPPFLAGS=""
export COMPILER1_CPU_OMP_MPI_FORT_FCFLAGS="${OMP_CPU_CXXFLAGS}"
export COMPILER1_CPU_OMP_MPI_FORT_COMPILERID="-DCOMPILERID=\\\"cce-${CRAY_FTN_VERSION}\\\""
export COMPILER1_CPU_OMP_MPI_FORT_LDFLAGS=""
export COMPILER1_CPU_OMP_MPI_FORT_LDLIBS=""
export COMPILER1_GPU_OMP_MPI_FORT_FC="${CCE_MPIFORT}"
export COMPILER1_GPU_OMP_MPI_FORT_CPPFLAGS=""
export COMPILER1_GPU_OMP_MPI_FORT_FCFLAGS="${OMP_CPU_CXXFLAGS} -haccel=amd_gfx90a -gno-heterogeneous-dwarf"
export COMPILER1_GPU_OMP_MPI_FORT_COMPILERID="-DCOMPILERID=\\\"cce-${CRAY_FTN_VERSION}\\\""
export COMPILER1_GPU_OMP_MPI_FORT_LDFLAGS=""
export COMPILER1_GPU_OMP_MPI_FORT_LDLIBS=""

# Purge modules after capturing all necessary paths
module purge

git clone https://github.com/LLNL/goulash

# Build test case
make -C goulash/tests/rush_larsen/rush_larsen_interop2.5_gpu_omp_hip_mpi -e

The individual compile and link commands from an example build are shown below:

env MPICH_FC=/opt/cray/pe/cce/17.0.1/bin/crayftn /opt/cray/pe/mpich/8.1.29/ofi/crayclang/17.0/bin/mpifort \
    -c -DCOMPILERID=\"cce-17.0.1\" -O3 -fopenmp -haccel=amd_gfx90a -gno-heterogeneous-dwarf \
    rush_larsen_gpu_omp_mpi_fort_compiler1.F90 -o rush_larsen_gpu_omp_fort_compiler1_mpi.o

env MPICH_FC=/opt/cray/pe/cce/17.0.1/bin/crayftn /opt/cray/pe/mpich/8.1.29/ofi/crayclang/17.0/bin/mpifort \
    -c -DCOMPILERID=\"cce-17.0.1\" -O3 -fopenmp \
    rush_larsen_cpu_omp_mpi_fort_compiler1.F90 -o rush_larsen_cpu_omp_fort_compiler1_mpi.o

env MPICH_CXX=/opt/rocm-6.0.0/bin/amdclang /opt/cray/pe/mpich/8.1.29/ofi/crayclang/17.0/bin/mpicxx \
    -c -DCOMPILERID=rocm-6.0.0 -D__HIP_PLATFORM_AMD__ -I/opt/rocm-6.0.0/include -O3 --offload-arch=gfx90a \
    -std=c++11 --rocm-path=/opt/rocm-6.0.0 -x hip -fgpu-rdc \
    rush_larsen_gpu_lambda_hip_mpi_compiler2.cc -o rush_larsen_gpu_lambda_hip_mpi_compiler2.o

env MPICH_CXX=/opt/cray/pe/cce/17.0.1/bin/crayCC /opt/cray/pe/mpich/8.1.29/ofi/crayclang/17.0/bin/mpicxx \
    -c -DCOMPILERID=cce-17.0.1 -D__HIP_PLATFORM_AMD__ -I/opt/rocm-6.0.0/include -O3 --offload-arch=gfx90a \
    -std=c++11 --rocm-path=/opt/rocm-6.0.0 -x hip -fgpu-rdc \
    rush_larsen_gpu_lambda_hip_mpi_compiler1.cc -o rush_larsen_gpu_lambda_hip_mpi_compiler1.o

env MPICH_CXX=/opt/rocm-6.0.0/bin/amdclang /opt/cray/pe/mpich/8.1.29/ofi/crayclang/17.0/bin/mpicxx \
    -c -DCOMPILERID=rocm-6.0.0 -D__HIP_PLATFORM_AMD__ -I/opt/rocm-6.0.0/include -O3 --offload-arch=gfx90a \
    -std=c++11 --rocm-path=/opt/rocm-6.0.0 -x hip -fgpu-rdc \
    rush_larsen_gpu_hip_mpi_compiler2.cc -o rush_larsen_gpu_hip_mpi_compiler2.o

env MPICH_CXX=/opt/cray/pe/cce/17.0.1/bin/crayCC /opt/cray/pe/mpich/8.1.29/ofi/crayclang/17.0/bin/mpicxx \
    -c -DCOMPILERID=cce-17.0.1 -D__HIP_PLATFORM_AMD__ -I/opt/rocm-6.0.0/include -O3 --offload-arch=gfx90a \
    -std=c++11 --rocm-path=/opt/rocm-6.0.0 -x hip -fgpu-rdc \
    rush_larsen_gpu_hip_mpi_compiler1.cc -o rush_larsen_gpu_hip_mpi_compiler1.o

env MPICH_CXX=/opt/rocm-6.0.0/bin/amdclang /opt/cray/pe/mpich/8.1.29/ofi/crayclang/17.0/bin/mpicxx \
    -c -DCOMPILERID=rocm-6.0.0 -O3 -fopenmp \
    -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a \
    --rocm-path=/opt/rocm-6.0.0 \
    rush_larsen_gpu_omp_mpi_compiler2.cc -o rush_larsen_gpu_omp_mpi_compiler2.o

env MPICH_CXX=/opt/cray/pe/cce/17.0.1/bin/crayCC /opt/cray/pe/mpich/8.1.29/ofi/crayclang/17.0/bin/mpicxx \
    -c -DCOMPILERID=cce-17.0.1 -O3 -fopenmp \
    -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a \
    --rocm-path=/opt/rocm-6.0.0 \
    rush_larsen_gpu_omp_mpi_compiler1.cc -o rush_larsen_gpu_omp_mpi_compiler1.o

env MPICH_CXX=/usr/bin/g++ /opt/cray/pe/mpich/8.1.29/ofi/crayclang/17.0/bin/mpicxx \
    -c -DCOMPILERID="g++ (SUSE Linux) 7.5.0" -O3 -fopenmp -fPIC \
    rush_larsen_cpu_omp_mpi_compiler3.cc -o rush_larsen_cpu_omp_mpi_compiler3.o

env MPICH_CXX=/opt/rocm-6.0.0/bin/amdclang /opt/cray/pe/mpich/8.1.29/ofi/crayclang/17.0/bin/mpicxx \
    -c -DCOMPILERID=rocm-6.0.0 -O3 -fopenmp \
    rush_larsen_cpu_omp_mpi_compiler2.cc -o rush_larsen_cpu_omp_mpi_compiler2.o

env MPICH_CXX=/opt/cray/pe/cce/17.0.1/bin/crayCC /opt/cray/pe/mpich/8.1.29/ofi/crayclang/17.0/bin/mpicxx \
    -c -DCOMPILERID=cce-17.0.1 -O3 -fopenmp \
    rush_larsen_cpu_omp_mpi_compiler1.cc -o rush_larsen_cpu_omp_mpi_compiler1.o

env MPICH_CXX=/opt/cray/pe/cce/17.0.1/bin/crayCC /opt/cray/pe/mpich/8.1.29/ofi/crayclang/17.0/bin/mpicxx \
    -c -DCOMPILERID=cce-17.0.1 -D__HIP_PLATFORM_AMD__ -I/opt/rocm-6.0.0/include -O3 -fopenmp \
    -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a \
    --rocm-path=/opt/rocm-6.0.0 \
    rush_larsen_interop2.5_gpu_omp_hip_mpi.cc -o rush_larsen_interop2.5_gpu_omp_hip_mpi.o

env MPICH_CXX=/opt/cray/pe/cce/17.0.1/bin/crayCC /opt/cray/pe/mpich/8.1.29/ofi/crayclang/17.0/bin/mpicxx \
    -DCOMPILERID=cce-17.0.1 -D__HIP_PLATFORM_AMD__ -I/opt/rocm-6.0.0/include -O3 -fopenmp \
    -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a \
    --rocm-path=/opt/rocm-6.0.0 \
    rush_larsen_gpu_omp_fort_compiler1_mpi.o rush_larsen_cpu_omp_fort_compiler1_mpi.o \
    rush_larsen_gpu_lambda_hip_mpi_compiler2.o rush_larsen_gpu_lambda_hip_mpi_compiler1.o \
    rush_larsen_gpu_hip_mpi_compiler2.o rush_larsen_gpu_hip_mpi_compiler1.o \
    rush_larsen_gpu_omp_mpi_compiler2.o rush_larsen_gpu_omp_mpi_compiler1.o \
    rush_larsen_cpu_omp_mpi_compiler3.o rush_larsen_cpu_omp_mpi_compiler2.o \
    rush_larsen_cpu_omp_mpi_compiler1.o rush_larsen_interop2.5_gpu_omp_hip_mpi.o \
    --hip-link --offload-arch=gfx90a -L/opt/rocm-6.0.0/lib -lamdhip64 -lmpifort_cray \
    -o rush_larsen_interop2.5_gpu_omp_hip_mpi

The resulting executable was launched with two MPI ranks across two compute nodes using 32 OpenMP threads each, performing 10 iterations of each variant, with each iteration running 10 trials on a 10 GB data size:

# Run test case
env LD_LIBRARY_PATH=${INTEROP_LD_LIBRARY_PATH} \
    OMP_NUM_THREADS=32 \
    ${SRUN} -n 2 -N 2 -c 32 \
    goulash/tests/rush_larsen/rush_larsen_interop2.5_gpu_omp_hip_mpi/rush_larsen_interop2.5_gpu_omp_hip_mpi 10 10 10

The executable emits frequent progress messages as it runs, indicating that the internal validation checks have passed at each step. The data checks for the last iteration appear as follows:

0:  41.113 (0.416s): PASSED Data check 10 10.00000000  m_gate[0]=0.976324219401755 cpu_omp_mpi_compiler1 [cce-17.0.1]
0:  41.197 (0.427s): PASSED Data check 10 10.00000000  m_gate[0]=0.976324219401755 cpu_omp_mpi_compiler2 [rocm-6.0.0]
0:  45.830 (0.418s): PASSED Data check 10 10.00000000  m_gate[0]=0.976324219401755 cpu_omp_mpi_compiler3 [g++ (SUSE Linux) 7.5.0]
0:   1.767 (0.409s): PASSED Data check 10 10.00000000  m_gate[0]=0.976324219401755 gpu_omp_mpi_compiler1 [cce-17.0.1]
0:   1.964 (0.409s): PASSED Data check 10 10.00000000  m_gate[0]=0.976324219401755 gpu_omp_mpi_compiler2 [rocm-6.0.0]
0:   1.771 (0.406s): PASSED Data check 10 10.00000000  m_gate[0]=0.976324219401755 gpu_hip_mpi_compiler1 [cce-17.0.1]
0:   1.779 (0.408s): PASSED Data check 10 10.00000000  m_gate[0]=0.976324219401755 gpu_hip_mpi_compiler2 [rocm-6.0.0]
0:   1.767 (0.408s): PASSED Data check 10 10.00000000  m_gate[0]=0.976324219401755 gpu_lambda_hip_mpi_compiler1 [cce-17.0.1]
0:   1.769 (0.407s): PASSED Data check 10 10.00000000  m_gate[0]=0.976324219401755 gpu_lambda_hip_mpi_compiler2 [rocm-6.0.0]
0:  23.648 (0.808s): PASSED Data check 10 10.00000000  m_gate[0]=0.976324219401755 cpu_omp_mpi_fort_compiler1 [cce-17.0.1]
0:   2.140 (0.394s): PASSED Data check 10 10.00000000  m_gate[0]=0.976324219401755 gpu_omp_mpi_fort_compiler1 [cce-17.0.1]

The final output messages indicate that the test completed successfully:

IOP Rank 0: 1593.631 (2.162s): Measuring free GPU memory (all tasks) after calling rush_larsen_gpu_omp_mpi_fort_compiler1()  Iter 10 of 10
IOP Rank 0: 1593.645 (0.013s): VARGPUMEM:    0.00% GPU mem variation after calling rush_larsen_gpu_omp_mpi_fort_compiler1()  Iter 10 of 10
IOP Rank 0: 1593.645 (0.000s): MINGPUMEM:   53.77 gb free GPU memory after calling rush_larsen_gpu_omp_mpi_fort_compiler1()  Iter 10 of 10
IOP Rank 0: 1593.645 (0.000s): AVGGPUMEM:   53.77 gb free GPU memory after calling rush_larsen_gpu_omp_mpi_fort_compiler1()  Iter 10 of 10
IOP Rank 0: 1593.645 (0.000s): MAXGPUMEM:   53.77 gb free GPU memory after calling rush_larsen_gpu_omp_mpi_fort_compiler1()  Iter 10 of 10
IOP Rank 0: 1593.645 (0.000s): GPUMEMFREE:  53.77 gb free GPU memory after calling rush_larsen_gpu_omp_mpi_fort_compiler1()  Iter 10 of 10
IOP Rank 0: 1593.645 (0.000s): INTEROP_PASS: ========== Completed interoperability tests 10 10 10.00000000 (interop_gpu_omp_hip_mpi [cce-17.0.1]) ==========

SEE ALSO

crayftn(1), craycc(1), crayCC(1), ftn(1), cc(1), CC(1), intro_openmp(7), intro_mpi(7)