About the HPE CPE Installation Guide: CSM on HPE Cray Supercomputing EX Systems

The HPE Cray Programming Environment Installation Guide: CSM on HPE Cray Supercomputing EX Systems (S-8003) contains procedures for installing the HPE Cray Programming Environment (CPE) and third-party programming environment components, including TotalView, Forge, and AMD and Intel compilers.

This publication is intended for system administrators who want to:

  • Install or reinstall all CPE components,

  • Install additional licensed components, or

  • Customize the programming environment files before use.

This guide assumes that users have:

  • Familiarity with standard Linux and open source tools, including Ansible, YAML, and (optionally) Kubernetes, and

  • Access to a method of obtaining the most current CPE tar files.

See the HPE Cray Programming Environment User Guide: CSM on HPE Cray Supercomputing EX Systems (S-8005) for a complete list of components and modules installed as part of CPE.

Release information

This publication supports the installation of CPE 24.07 on HPE Cray Supercomputing EX systems with:

  • HPE Cray System Management (CSM) software version 1.5,

  • HPE Cray Supercomputing Operating System Software (COS) 24.7 (COS Base 3.1.0/USS 1.1.0), and

  • SUSE Linux Enterprise Server (SLES) 15 SP5.
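As a rough pre-check of the operating system level, the service pack can be read from a node's os-release file. The sketch below assumes the standard os-release VERSION_ID field (15.5 corresponds to SLES 15 SP5). The function name sles_version_id is hypothetical and is not part of CPE or CSM.

```shell
# Print the VERSION_ID field from an os-release style file.
# Usage: sles_version_id [FILE]   (FILE defaults to /etc/os-release)
# Pass an absolute path; the file is sourced in a subshell so that its
# variables do not leak into the calling shell.
sles_version_id() {
    ( . "${1:-/etc/os-release}" && printf '%s\n' "${VERSION_ID}" )
}
```

On a node that matches this release, the function would be expected to print 15.5.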

COS 23.11 (and later) comprises the following components:

  • COS Base

  • HPE Cray Supercomputing User Services Software (USS)

  • HPE SUSE Linux Enterprise Server

Variable substitutions

Use the following variable substitutions throughout the included procedures.

  • <CPE_RELEASE> = 24.07

  • <CPE_VERSION> = 24.07

  • <spX> or <SPX> = SP5
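One way to apply these substitutions consistently is to hold them in shell variables and build file names from them. The following is a minimal sketch, not part of the CPE tooling. The variable name SPX and the lowercase sp5 form are assumptions based on how the service pack appears in file and directory names elsewhere in this guide.

```shell
# Values from the variable substitution list for this release.
CPE_RELEASE="24.07"
CPE_VERSION="24.07"
SPX="sp5"   # assumption: lowercase form, as used in file and directory names

# Release tar file name pattern used in the IUF procedure of this guide.
cpe_tarball="cpe-${CPE_RELEASE}-sles15-${SPX}-csm-${CPE_VERSION}.tar.gz"
printf '%s\n' "${cpe_tarball}"
```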

Supporting documentation

References to specific and related HPE Cray Supercomputing EX documentation are found throughout this guide. Direct links to these references (for supported COS/CSM releases defined in CPE Installation Prerequisites) are provided in HPE Cray Supercomputing EX documentation links.

Record of revision

New in the CPE 24.07 publication

New in the CPE 24.03 publication

New in the CPE 23.12 publication

New in the CPE 23.09 publication

New in the CPE 23.05 publication

  • Added the Installing and upgrading CPE using the Install and Upgrade Framework procedure section.

  • Added new note about nid reassignments for PBS in the Working with PBS configurations and nodes section.

  • Added the Configuring PBS for systems with Slingshot and HPE 200GB NICs section.

  • Added the Configuring Slingshot Traffic Classes in PBS section.

  • Added the Troubleshooting failing Slurm PXC pods section.

  • Added the Troubleshooting Slurm Database Lost Connection Errors section.

  • Combined the formerly separate procedures for installing and upgrading PBS into one. The procedures for these tasks are now located in the Installing or upgrading the PBS Professional workload manager section.

  • Simplified and updated the Slurm configuration procedure in the Update Slurm configuration section. The procedure now requires you to restart slurmctld.

  • Moved the Enable PBS to use Low Noise mode section into the Configure PBS during or post installation section.

  • Updated the procedure in the Installing or upgrading the Slurm workload manager section.

  • Updated the procedure in the Installing or upgrading PBS Professional workload manager section.

  • Updated the procedure in the Configuring CPE using CFS section.

  • Updated the Update Slurm configuration section to indicate the configuration update is also used to reassign compute node hostnames.

  • Updated the procedure in the Update Slurm Ansible configuration during or post installation section.

  • Updated the procedure in the Update PBS Ansible configuration during or post installation section.

  • Updated instructions on how to use crypkg-gen to create a modulefile for Intel oneAPI in the Create modulefiles for third-party products section.

  • Updated the Back up Slurm Spool Directory section.

  • Updated the Restore Slurm Spool Directory from Backup section.

  • Updated the AMD AOCC Compiler and Intel oneAPI product examples in the Installing or upgrading CPE section.

  • Updated the HPE Cray Supercomputing EX software documentation links table to support both the COS 2.4.X/CSM 1.3.X and COS 2.5.X/CSM 1.4.X releases, and added links to Ceph latency issues documentation.

  • Updated the compatible compiler version table in the Module path aliases and current compatibility versions section. Updated the cce and gcc compatible compiler versions, and added the amd compatible compiler version.

New in the CPE 23.02 (Revision A) publication

  • Added instructions for adding a master branch in git in the Update Slurm Ansible configuration during or post installation and Update PBS Ansible configuration during or post installation sections.

  • Updated the Create modulefiles for third-party products section to include a note about how to use the crypkg-gen utility to create an Intel modulefile.

  • Moved the Enable PBS to use Low Noise mode section into the Configure PBS during or post installation section.

New in the CPE 23.02 publication

  • Updated previously optional procedures to mandatory procedures. Mandatory procedures now include:

    • Updating settings for Slurm installation

    • Configuring/Updating settings

    • Configuring CSM software for HSN connectivity during PBS installation/upgrade

  • Updated the SLUMBLOB_VERSION variable from 1.2.8 to 1.2.9 in the Release information section.

  • Added a step in the Updating settings for Slurm installation section that details how to check whether LDAP is configured.

  • Updated the HSN network subnet setting information in the Configuring CSM software for HSN connectivity during Slurm installation or upgrade section.

  • Added a step in the Updating settings for Slurm installation and Configuring CSM software for HSN connectivity during PBS installation/upgrade sections detailing how to restart all UAI pods.

  • Updated the SwitchParameters option listing and resources listing in the Configure Slurm for systems with Slingshot and HPE 200GB NICs section.

  • Added a step in the Configure Slurm for systems with Slingshot and HPE 200GB NICs section that details how to enable Instant On in Slurm.

  • Added the Resolving Slurm Pods that are Stuck in a ContainerCreating State section.

  • Added a step in the Update PBS Ansible configuration during or post installation section that details how to configure PBS to copy output files directly to a file system instead of using scp.

  • Added the Configuring PBS for high availability section.

  • Moved the Enable PBS to use Low Noise mode section under the PBS troubleshooting and administrative tasks section.

  • Changed the COS 2.3.X/CSM 1.2.0 HTML link to a PDF link for the Getting Started Guide (S-8000) line item that is referenced in the table shown in the HPE Cray Supercomputing EX software documentation links section.

  • Obsoleted former release sections:

    • Run the Slurm bringup script

    • Run the PBS Bringup Script

    • Run the Upgrade PBS Bringup Script

    • Recover after UAN Image Corruption during Slurm Installation or Upgrade

  • Incorporated minor editorial updates.

Publication Title | Date

HPE Cray Programming Environment Installation Guide: CSM on HPE Cray Supercomputing EX Systems (24.07) S-8003 | August 2024
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (24.03) S-8003 | May 2024
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (23.12) S-8003 | December 2023
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (23.09) S-8003 | September 2023
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (23.05) S-8003 | June 2023
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (23.02 Rev A) S-8003 | March 2023
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (23.02) S-8003 | February 2023
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.12) S-8003 | December 2022
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.11) S-8003 | November 2022
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.10) S-8003 | October 2022
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.09) S-8003 | September 2022
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.08 Rev A) S-8003 | August 2022
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.08) S-8003 | August 2022
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.06 Rev A) S-8003 | July 2022
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.06) S-8003 | June 2022
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.05) S-8003 | May 2022
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.04) S-8003 | April 2022
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.03) S-8003 | March 2022
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.02) S-8003 | February 2022
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (21.12) S-8003 | December 2021
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (21.11) S-8003 Rev A | November 2021
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (21.11) S-8003 | November 2021
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.10) S-8003 | October 2021
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.09) S-8003 | September 2021
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.08) S-8003 | August 2021
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.07) S-8003 Rev A | July 2021
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.07) S-8003 | July 2021
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.06) S-8003 | June 2021
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.05) S-8003 | May 2021
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.04) S-8003 | April 2021
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.03) S-8003 | March 2021
HPE Cray Asynchronous Installer Guide (21.03) S-8003 | March 2021
HPE Cray Asynchronous Installer Guide (20.11) S-8003 | November 2020
HPE Cray Asynchronous Installer Guide (20.10) S-8003 | October 2020
HPE Cray Asynchronous Installer Guide (20.09) S-8003 | September 2020
HPE Cray Asynchronous Installer Guide (20.08) S-8003 | August 2020
Cray Asynchronous Installer Guide (20.06) S-8003 | June 2020
Cray Asynchronous Installer Guide (20.05) S-8003 | May 2020
Cray Asynchronous Installer Guide (20.04) S-8003 | April 2020
Cray Asynchronous Installer Guide (20.03) S-8003 | March 2020
Cray Asynchronous Installer Guide (20.02) S-8003 | February 2020
Cray Shasta Asynchronous Installer Guide (20.01) S-8003 | January 2020

Typographic conventions

  • This style - Indicates program code, reserved words, library functions, command-line prompts, screen output, file/path names, variables, and other software constructs.

  • \ (backslash) - At the end of a command line, indicates the Linux shell line continuation character (lines joined by a backslash are parsed as a single line).
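The line continuation behavior can be seen in a small, self-contained sketch. The command below is spread over three physical lines, yet the shell parses it as one. The variable name joined is illustrative only.

```shell
# The three physical lines below are parsed as a single command:
#   printf '%s %s %s\n' one two three
joined="$(printf '%s %s %s\n' one \
    two \
    three)"
printf '%s\n' "${joined}"
```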

Command prompt conventions

Host name and account in command prompts: The host name in a command prompt indicates where the command must be run. The account that must run the command is also indicated in the prompt.

  • The root or super-user account always has the # character at the end of the prompt.

  • Any non-root account is indicated with account@hostname>. A nonprivileged account is referred to as user.

Node abbreviations: The following list contains abbreviations for nodes used in command prompts.

  • CN - Compute Nodes

  • NCN - Non-Compute Nodes

  • AN - Application Node (special type of NCN)

  • UAN - User Access Node (special type of AN)

Command prompts: The following list contains command prompts used in this guide.

  • ncn-m001# - Run the command as root on the specific NCN-M (NCN that is a Kubernetes master node) with hostname ncn-m001.

  • ncn-w001# - Run the command as root on the specific NCN-W (NCN that is a Kubernetes worker node) with hostname ncn-w001.

  • uan01# - Run the command as root on a specific UAN.

  • cn# - Run the command as root on any CN. Note that a CN has a hostname of the form nid123456 (that is, “nid” and a six-digit, zero-padded number).

  • pod# - Run the command as root within a Kubernetes pod.
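The CN hostname form described in the list above (“nid” followed by a six-digit, zero-padded number) can be checked or generated with a short sketch. The function names is_cn_hostname and nid_hostname are hypothetical helpers, not commands provided by CSM or CPE.

```shell
# Return success if the argument has the CN hostname form: "nid" + six digits.
is_cn_hostname() {
    printf '%s\n' "$1" | grep -Eq '^nid[0-9]{6}$'
}

# Build a CN hostname from a plain decimal node number.
# Pass the number without leading zeros (printf may read them as octal).
nid_hostname() {
    printf 'nid%06d\n' "$1"
}
```

For example, nid_hostname 42 prints nid000042, and is_cn_hostname nid42 fails because the number is not padded to six digits.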

Copying and Pasting from a PDF

Using the Copy and Paste functions from a PDF is unreliable. Although copying and pasting a command line typically works, copying and pasting formatted file content (for example, JSON, YAML) typically fails. To ensure that file content is copied and pasted correctly while performing the procedures in this guide:

  1. Copy the content from the PDF.

  2. Paste it to a neutral editing form and add the necessary formatting.

  3. Copy the content from the neutral form and paste it into the console.

Tip: It is always a good idea to double-check copied/pasted commands for correctness, as some commands may not render correctly in the PDF.
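One possible way to double-check pasted content is to scan it for characters that PDFs commonly introduce, such as typographic quotes or non-breaking spaces. The sketch below flags any byte outside printable ASCII (tabs are allowed, so carriage returns are also reported). The function names are hypothetical, and the check is only a heuristic; it does not validate JSON or YAML structure.

```shell
# Return success if FILE contains bytes outside printable ASCII (tabs allowed),
# such as typographic quotes or non-breaking spaces picked up from a PDF.
has_pdf_artifacts() {
    tab="$(printf '\t')"
    LC_ALL=C grep -q "[^${tab} -~]" "$1"
}

# Print the offending lines with line numbers for manual review.
show_pdf_artifacts() {
    tab="$(printf '\t')"
    LC_ALL=C grep -an "[^${tab} -~]" "$1"
}
```

Running show_pdf_artifacts on the file saved from the neutral editing form lists the lines that need to be retyped by hand.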

Downloading HPE Cray Supercomputing EX software

To download HPE Cray Supercomputing EX software, refer to the HPE Support Center or download it directly from My HPE Software Center. The HPE Support Center contains a wealth of documentation, training videos, knowledge articles, and alerts for HPE Cray Supercomputing EX systems. It provides the most detailed information about a release as well as direct links to product firmware, software, and patches available through My HPE Software Center.

Downloading the software through the HPE Support Center

HPE recommends downloading software through the HPE Support Center because of the many other resources available on the website.

  1. Visit the HPE Cray Supercomputing EX product page on the HPE Support Center.

  2. Search for specific product info, such as the full software name or recipe name and version.

    For example, search for “Slingshot 2.1” or “Cray System Software with CSM 24.3.0.”

  3. Find the desired software in the search results and select it to review details.

  4. Select Obtain Software and select Sign in Now when prompted.

    If a customer’s Entitlement Order Number (EON) is tied to specific hardware rather than software, the software is available without providing account credentials. Access the software instead by selecting Download Software and skip the next step in this procedure.

  5. Enter account credentials when prompted and accept the HPE License Terms.

    To download software, customers must ensure their Entitlement Order Number (EON) is active under My Contracts & Warranties on My HPE Software Center. If customers have trouble with the EON or are not entitled to a product, they must contact their HPE contract administrator or sales representative for assistance.

  6. Choose the needed software and documentation files to download and select curl Copy to access the files.

    Just like the software files, the documentation files change with each release. In addition to the official documentation, valuable information for a release is often available in files that include the phrase README in their name. Be sure to select and review these files in detail.

    HPE recommends the curl Copy option, which downloads a single text file with curl commands to use on the desired system. You must run the curl commands within 24 hours of downloading them or download new commands if more than 24 hours have passed.

    To validate the integrity of the downloads, you can later compare the files on the desired system against the checksums provided by HPE underneath each selected download.

  7. Save the text file to a central location.

  8. On the system where the software will be downloaded, run the text file as a shell script to execute the curl commands it contains.

    For example:

    ncn-m001# bash -x <TEXT_FILE_PATH>
    

    The -x option in this example makes bash print each curl command as it is executed, which helps track download progress through the text file.
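The checksum comparison mentioned in step 6 can be scripted. The following is a minimal sketch that assumes the published checksums are SHA-256 digests; confirm the algorithm shown on the download page before relying on it. The function name verify_sha256 is hypothetical.

```shell
# Compare a file's SHA-256 digest with an expected value.
# Usage: verify_sha256 FILE EXPECTED_HEX_DIGEST
verify_sha256() {
    # Normalize the expected digest to lowercase, as printed by sha256sum.
    expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
    actual="$(sha256sum "$1" | awk '{print $1}')"
    if [ "${actual}" = "${expected}" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1" >&2
        return 1
    fi
}
```

A nonzero return value indicates that the file should be downloaded again before it is used.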

Downloading the software directly from the My HPE Software Center

Users already familiar with a release can save time by downloading software directly from My HPE Software Center.

  1. Visit My HPE Software Center and select Sign in.

  2. Enter account credentials when prompted and select Software in the left navigation bar.

  3. Search for specific product info, such as the full software name or recipe name and version.

    For example, search for “Slingshot 2.1” or “Cray System Software with CSM 24.3.0.”

  4. Find the desired software in the search results and review details by selecting Product Details under the Action column.

    [Image: Product Details option]

  5. Select Go To Downloads Page and accept the HPE License Terms.

    To download software, customers must ensure their Entitlement Order Number (EON) is active under My Contracts & Warranties. If customers have trouble with the EON or are not entitled to a product, they must contact their HPE contract administrator or sales representative for assistance.

  6. Choose the needed software and documentation files to download and select curl Copy to access the files.

    Just like the software files, the documentation files change with each release. In addition to the official documentation, valuable information for a release is often available in files that include the phrase README in their name. Be sure to select and review these files in detail.

    HPE recommends the curl Copy option, which downloads a single text file with curl commands to use on the desired system. You must run the curl commands within 24 hours of downloading them or download new commands if more than 24 hours have passed.

    To validate the integrity of the downloads, you can later compare the files on the desired system against the checksums provided by HPE underneath each selected download.

  7. Save the text file to a central location.

  8. On the system where the software will be downloaded, run the text file as a shell script to execute the curl commands it contains.

    For example:

    ncn-m001# bash -x <TEXT_FILE_PATH>
    

    The -x option in this example makes bash print each curl command as it is executed, which helps track download progress through the text file.
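Because the curl commands must be run within 24 hours of being downloaded, it can help to check the age of the text file before running it. The sketch below is one way to do that. It assumes that the file's modification time approximates the download time, and the function name curl_file_expired is hypothetical.

```shell
# Return success if FILE was last modified more than 24 hours (1440 minutes) ago.
# Assumption: the file's modification time approximates when it was downloaded.
curl_file_expired() {
    [ -n "$(find "$1" -mmin +1440 2>/dev/null)" ]
}
```

If the function returns success for the saved text file, download a fresh set of curl commands instead of running the old ones.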

About Ansible

Ansible is an open-source software provisioning and configuration management tool. The CPE Installer leverages Ansible playbooks and roles to install CPE components. Below is an example of the pe_deploy.yml playbook:

---
- hosts: uai:Application_UAN:Application:Compute
  any_errors_fatal: true
  gather_facts: no
  remote_user: root
 
  pre_tasks:
    - name: Unmount any overlays first
      command: bash /etc/cray-pe.d/pe_overlay.sh cleanup
      when:
        - not cray_cfs_image
        - forcecleanup | default(false)
      ignore_errors: yes
 
  roles:
    - { role: cray.pe_deploy, cray_pe_pkg: aocc, when: not cray_cfs_image }
    - { role: cray.pe_deploy, cray_pe_pkg: intel, when: not cray_cfs_image }
    - { role: cray.pe_deploy, when: not cray_cfs_image }
 
  post_tasks:
    - name: Run mount overlay setup script
      command: bash /etc/cray-pe.d/pe_overlay.sh
      when:
        - not cray_cfs_image
        - not forcecleanup | default(false)

IMPORTANT: You must update Ansible .yml files when performing custom installations. Update these files with great caution. YAML does not allow tab characters for indentation; use spaces only. See the Ansible Documentation for more information and details about Ansible syntax.
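Since stray tabs are easy to introduce while editing, a quick scan before committing a modified playbook can catch them. This is a minimal sketch using grep; the function name yaml_has_tabs is hypothetical.

```shell
# Print every line of FILE that contains a tab character, with line numbers.
# Returns success if at least one tab was found, failure otherwise.
yaml_has_tabs() {
    grep -n "$(printf '\t')" "$1"
}
```

For example, running yaml_has_tabs pe_deploy.yml after an edit prints nothing and returns failure when the file is indented with spaces only.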

Installation prerequisites

Before installing HPE Cray Programming Environment on HPE Cray Supercomputing EX systems running HPE Cray System Management (CSM), make sure that your system matches a supported configuration. See the Release information section of this guide for more details. You must also have:

  • Root administrator access permissions to properly run the CPE Installer. Ansible needs these permissions to create the directory structure and install various elements of the CPE. Root access is not required to run the CPE; root access is required only to install or upgrade CPE.

  • Familiarity with:

    • Linux - To properly run the CPE Installer, an understanding of Linux file system basics is necessary.

    • Ansible - Knowledge of running Ansible and using Ansible playbooks is required. See Ansible Documentation for more information.

    • YAML - YAML is a human-readable data-serialization language. Ansible playbooks are stored in .yml format. Knowledge of YAML is not necessary to run Ansible playbooks but is useful for image customization.

    • Kubernetes (optional) - If you are installing CPE on containerized User Access Instance (UAI) nodes, an understanding of Kubernetes could be helpful but is not necessary to install or use the nodes.

Installing and upgrading CPE using the Install and Upgrade Framework procedure

The Install and Upgrade Framework (IUF) is a CLI- and API-based process used to install, upgrade, and deploy CPE. The IUF process offers advantages for installing CPE onto compatible HPE Cray Supercomputing EX systems. These benefits include minimized user intervention, reduced time constraints, and a more automated and simplified method for installing CPE. The IUF method can be used with CSM 1.4 or higher.

The instructions in this chapter provide detailed steps and information for using the IUF process to install or upgrade CPE.

Installing or Upgrading CPE with IUF

PREREQUISITES

Be sure to:

  • Review Installation Prerequisites before proceeding with these installation and upgrade procedures.

  • Download third-party compilers from their respective websites (for example, AOCC, Intel). CPE does not distribute third-party compilers.

  • Use CSM 1.4 or higher for installing or upgrading CPE.
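The CSM 1.4 minimum can be checked with a simple version comparison. The sketch below relies on the version-sort option (-V) of GNU sort. The function name version_ge is hypothetical, and the csm_version value is a placeholder to be replaced with the version installed on the system.

```shell
# Return success if version $1 is greater than or equal to version $2.
# Relies on the version-sort option (-V) of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example: check a CSM version string against the 1.4 minimum for IUF.
csm_version="1.5"   # placeholder; substitute the version installed on the system
if version_ge "${csm_version}" "1.4"; then
    echo "CSM ${csm_version} meets the IUF minimum (1.4)"
fi
```

Version sorting compares each numeric field separately, so 1.10 is correctly treated as newer than 1.4.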

OBJECTIVE

This procedure details how to install or upgrade the base HPE Cray Programming Environment on an HPE Cray Supercomputing EX system using IUF. The same instructions are followed whether installing CPE using IUF for the first time or upgrading CPE using IUF on a previously installed system.

PROCEDURE

  1. SSH into the management node:

    user@hostname> ssh root@<system>-ncn-m001
    
  2. Create a directory for the activity media. For example:

    ncn-m001# mkdir -p /etc/cray/upgrade/csm/<activity_name>
    
  3. Copy media to your activity directory:

    ncn-m001# cd /etc/cray/upgrade/csm/<activity_name>
    ncn-m001# cp ../reference_media/cpe-<CPE_RELEASE>-sles15-<spX>-\
    csm-<CPE_VERSION>.tar.gz .
    
  4. Copy in reference bootprep config files:

    ncn-m001# cp -a /etc/cray/upgrade/csm/hpc-csm-software-recipe-23.1.18/vcs/* .
    
  5. If this is a CPE upgrade only, make a copy of /etc/cray/upgrade/admin/site_vars.yaml, and update the suffix and note values so that any artifacts created can be easily associated with the CPE upgrade.

    Example site_vars.yaml:

    default:
       network_type: "cassini"
       suffix: "cpe-23.5.4.upgrade"
       note: "bob-"
    
  6. Run the iuf CLI command:

    ncn-m001# iuf -a cpemedia -m /etc/cray/upgrade/csm/cpemedia run --site-vars \
    /etc/cray/upgrade/csm/site_vars.yaml --recipe-vars product_vars.yaml --bootprep-config-managed \
    bootprep/compute-and-uan-bootprep.yaml --bootprep-config-management \
    bootprep/management-bootprep.yaml -b process-media -e update-vcs-config
    

    The above example uses cpemedia as the activity_name.

    See the IUF section of the Cray System Management Documentation for details on iuf command line options.

  7. Verify that CPE installed successfully. Use kubectl to print all CPE versions in the product catalog configmap, and double-check that the latest version is also in the list. Note that the latest CPE version is likely not at the end of the output; scroll up through the output to locate the latest CPE version.

    ncn-m001# kubectl get cm cray-product-catalog -n services -o json | jq -r .data.cpe
    ...
    <CPE_VERSION>:
      configuration:
        clone_url: https://vcs.hostname.com/vcs/cray/cpe-config-management.git
        commit: 341017e953c3c57dd46ddbccec168ca28af9199a
        import_branch: cray/cpe/<CPE_VERSION>
        import_date: 2023-01-24 20:10:42.950742
        ssh_url: git@host.com:cray/cpe-config-management.git
    
  8. (Optional) Upload third-party artifact(s) to a Nexus repository.

    Change directory to the expanded CPE artifacts that exist in the media folder. In this case, use /etc/cray/upgrade/csm/cpemedia. Third-party packages can be copied over to the /etc/cray/upgrade/csm/cpemedia folder.

    ncn-m001# cd /etc/cray/upgrade/csm/cpemedia/cpe-<CPE_VERSION>-sles15-<spX>
    

    The CPE release tar file contains a script, install-3p.sh, that uploads third-party packages to Nexus repositories. New repositories are automatically created if they do not already exist. The script has two modes of operation:

    • Uploading a file:

      ncn-m001# install-3p.sh <FILE> <REPO_NAME>
      
    • Uploading RPM files, where <RPM_DIR> is a directory of RPMs.

      ncn-m001# install-3p.sh <RPM_DIR> <REPO_NAME>
      

    Uploading RPM files automatically generates RPM repository metadata required for installation using Zypper.

    Specific product examples:

    AMD AOCC Compiler

    ncn-m001# cpe-<CPE_RELEASE>-sles15-<spX>/install-3p.sh \
    aocc-compiler-3.2.0.tar aocc-compiler-3.2.0-linux-x86_64-raw
    

    ARM Forge

    ncn-m001# cpe-<CPE_RELEASE>-sles15-<spX>/install-3p.sh \
    arm-forge-21.1.2-linux-x86_64.tar arm-forge-21.1.2-linux-x86_64-raw
    

    Intel oneAPI

    Note that the oneAPI 2022.2.0 release uses the version string “2022.2.0” for RPM versions and installation paths. Therefore, it is the version number needed for installation scripts.

    ncn-m001# tar xf intel-oneapi-2022.2.0.tar
    ncn-m001# cpe-<CPE_RELEASE>-sles15-<spX>/install-3p.sh \
    intel-oneapi-2022.2.0/ intel-oneapi-2022.2.0
    

    TotalView

    ncn-m001# cpe-<CPE_RELEASE>-sles15-<spX>/install-3p.sh \
    totalview-2022.1.11-0.x86_64.rpm totalview-2022.1.11-linux-x86_64-yum
    

Installation of CPE is now complete. If other HPE Cray Supercomputing EX software products are being installed or upgraded in conjunction with CPE, refer to the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM S-8052 to determine which step to execute next; see HPE Cray Supercomputing EX software documentation links for a direct link. Otherwise, continue to the next sections of this document for operations to configure and deploy new CPE images.
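Because the latest CPE version is often not at the end of the product catalog output (step 7 of the procedure above), searching saved output can be easier than scrolling. The following is a minimal sketch that assumes the output was redirected to a file and that each version appears on its own unindented line followed by a colon, as in the example output. The function name catalog_has_version is hypothetical.

```shell
# Return success if VERSION appears as a top-level key ("VERSION:") in FILE,
# where FILE holds saved output of the product catalog query.
# Usage: catalog_has_version VERSION FILE
catalog_has_version() {
    grep -Fxq "$1:" "$2"
}
```

The match is exact and applies to the whole line, so a partial version string does not match a longer one.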

Optional third-party product image customization

PREREQUISITES

The CPE package must be installed, and third-party artifacts must be available in a Nexus repository.

OBJECTIVE

Configure third-party products (for example, AOCC, Forge, and Intel oneAPI) into a new CPE image for deployment with CPE.

HPE provides Ansible customization roles for the AMD AOCC Compiler, Intel oneAPI, and Forge. Some steps in this procedure use the AOCC customization as an example; however, the procedure is similar for the other products.

Product Ansible roles:

  • AMD AOCC Compiler: cray.pe_aocc_customize

  • Intel oneAPI: cray.pe_intel_customize

  • Forge: cray.pe_forge_customize

  • TotalView: cray.pe_totalview_customize

  • NVIDIA: cray.pe_nvidia_customize
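The role names in the list above follow a regular pattern built from the short product name. The sketch below maps the short names accepted by the image customization script in this procedure (aocc, intel, nvidia, forge, totalview) to the listed roles. The function name role_for_product is a hypothetical helper for scripting, not part of CPE.

```shell
# Map a short product name to its Ansible customization role name.
role_for_product() {
    case "$1" in
        aocc|intel|forge|totalview|nvidia)
            printf 'cray.pe_%s_customize\n' "$1" ;;
        *)
            echo "unknown product: $1" >&2
            return 1 ;;
    esac
}
```

For example, role_for_product aocc prints cray.pe_aocc_customize.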

For products supporting GPUs (NVIDIA HPC SDK, AMD ROCm), refer to the GPU Support section of HPE Cray Supercomputing User Services Software Administration Guide: CSM on HPE Cray Supercomputing EX Systems for installation instructions; see HPE Cray Supercomputing EX software documentation links for a direct link to the appropriate guide.

PROCEDURE

  1. (IUF Installation) Clone and create the local VCS repository:

    ncn-m001# mkdir -p /var/tmp/cpe
    ncn-m001# cd /var/tmp/cpe
    ncn-m001# git clone https://crayvcs@api-gw-service-nmn.local/vcs/cray/cpe-config-management.git
    Enter password: <password obtained from secret credentials>
    ncn-m001# cd /var/tmp/cpe/cpe-config-management
    ncn-m001# git checkout integration-<CPE_RELEASE>
    
  2. (Non-IUF Installation) The CPE install.sh script executed earlier cloned a local VCS repository. Change directory into the new path and continue:

    ncn-m001# cd /var/tmp/cpe/cpe-config-management
    
  3. Verify that the default variables for the image customization role match the values used earlier for uploading to Nexus:

    ncn-m001# vi roles/cray.pe_aocc_customize/defaults/main.yml
    

    For the cray.pe_intel_customize role, intel_pkgs can be modified to install a different set of oneAPI components.

  4. (For systems with COS 2.5) Add the pre_tasks code to the image customization playbook:

    ncn-m001# vi pe_aocc_customize.yaml
    - hosts: Application:Compute
      any_errors_fatal: true
      remote_user: root
      pre_tasks:
        - name: Remove PTF repository
          zypper_repository:
            name: "SUSE-SLE-Module-Basesystem-15-SP4-x86_64-PTF"
            state: absent
      roles:
        - { role: ca-cert, when: cray_cfs_image | default(false) }
        - { role: cray.pe_aocc_customize, when: cray_cfs_image | default(false) }
    
    
  5. (Forge only) Copy the license file to roles/cray.pe_forge_customize/files/License.dat or populate the existing empty License.dat file with the license information.

  6. (TotalView only) Copy the License.dat or tv_license_file license file to roles/cray.pe_totalview/files/, and update roles/cray.pe_totalview/defaults/main.yml based on the type of license.

    Example using the FNP license:

    totalview:
    ...
    license_path: "/opt/toolworks/FNP_license"
    license_file: "License.dat"
    

    Example using the FNE license:

    totalview:
    ...  
    license_path: "/opt/toolworks/FNE_license"
    license_file: "tv_license_file"
    
  7. Add, commit, and push changes:

    ncn-m001# git commit -am "Add customizations to install the AMD AOCC compiler"
    ncn-m001# git push -u origin integration-<CPE_RELEASE>
    
  8. Run the CPE image customization script with a parameter (aocc, intel, nvidia, forge, totalview), specifying which built-in playbook to use:

    If a previous version of CPE is installed:

    1. Determine the IMS image ID of the cpe-barebones image:

      ncn-m001# kubectl get cm cray-product-catalog -n services -o json | jq -r .data.cpe
      
    2. Set the environment variable BASE_IMG_ID to the IMS image ID:

      ncn-m001# export BASE_IMG_ID=<IMS_image_ID>
      
    3. Run the CPE image customization script:

      ncn-m001# ./cpe-custom-img.sh aocc
      

      A new image (deployable with the provided CPE image) is created after the CFS session completes.

  9. Record the result_id for use when preparing the CPE deployment:

    ncn-m001# cray cfs sessions describe cpe-aocc-customization \
    --format json | jq -r .status.artifacts[].result_id
    
    0e54050a-c43c-4534-ba38-7191838e348d
    

Repeat the steps above for each third-party product image that needs CPE support customization. Then, continue to the Configuring CPE Using CFS section to prepare the CPE deployment.
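Before a recorded result_id is reused as an image ID, a quick format check can catch copy-and-paste mistakes. The sketch below assumes that result IDs are lowercase UUIDs, as in the example output of step 9. The function name is_uuid is hypothetical.

```shell
# Return success if the argument looks like a lowercase UUID
# (8-4-4-4-12 hexadecimal digits separated by hyphens).
is_uuid() {
    printf '%s\n' "$1" | grep -Eq \
        '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}
```

The check validates the format only; it does not confirm that the image exists in IMS.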

Configuring CPE using CFS

PREREQUISITES

Make sure you have:

OBJECTIVE

This section provides details for preparing a CPE CFS layer for product integration.

PROCEDURE

Some of the following steps are specific to IUF installations. Steps annotated with Non-IUF Installation are for non-IUF environments. Steps annotated with IUF Installation are specific only to IUF environments.

  1. Clone the CPE configuration management repository, and check out the integration branch. The CPE install.sh or iuf run command executed earlier created a new local integration branch.

    ncn-m001# mkdir -p /var/tmp/cpe
    ncn-m001# cd /var/tmp/cpe
    ncn-m001# git clone https://crayvcs@api-gw-service-nmn.local/vcs/cray/cpe-config-management.git
    Enter password: <password obtained from secret credentials>
    ncn-m001# cd /var/tmp/cpe/cpe-config-management
    ncn-m001# git checkout integration-<CPE_RELEASE>
    
  2. Configure images to deploy. The order of roles is important: The first is the top-most layer and also the default image; lower layers and non-default images must follow.

    The cray_pe_pkg parameter values include:

    • base: Contains the base CPE content, including PrgEnv-cray and PrgEnv-gnu. This value is the default value if cray_pe_pkg is not set.

    • intel: Contains PrgEnv-intel for Intel oneAPI support.

    • aocc: Contains PrgEnv-aocc for AMD Optimizing C/C++ Compiler support.

    • amd: Contains PrgEnv-amd for AMD ROCm support. Deploy this image for AMD GPU-enabled systems.

    • nvidia: Contains PrgEnv-nvhpc for NVIDIA HPC SDK support. Deploy this image for NVIDIA GPU-enabled systems.

    To deploy a customized image, set img_id to the IMS image ID of the customized image (recorded during Optional third-party product image customization), and give the image a unique name with img_name.

    The following is an example of how to deploy CPE base and aocc images with the current and a previous version of CPE, along with two versions of the AOCC compiler.

    Example:

    To deploy supported third-party CPE images for both x86 and ARM/AArch64 nodes, in addition to hybrid environments:

      ncn-m001# vim pe_deploy.yml
      roles:
        - { role: cray.pe_deploy, when: not cray_cfs_image }
        - { role: cray.pe_deploy, cray_pe_version: "21.10", when: not cray_cfs_image }
        - { role: cray.pe_deploy, cray_pe_pkg: aocc,
            when: (not cray_cfs_image) and (ansible_architecture == 'x86_64') }
        - { role: cray.pe_deploy, cray_pe_pkg: aocc, cray_pe_version: "21.10",
            when: (not cray_cfs_image) and (ansible_architecture == 'x86_64') }
        - { role: cray.pe_deploy, img_name: "aocc-compiler-3.1.0",
            img_id: "1f506586-e447-4c2a-b38d-1158cb29e4f8",
            when: (not cray_cfs_image) and (ansible_architecture == 'x86_64') }
        - { role: cray.pe_deploy, img_name: "aocc-compiler-3.0.0",
            img_id: "0e54050a-c43c-4534-ba38-7191838e348d",
            when: (not cray_cfs_image) and (ansible_architecture == 'x86_64') }
    

    If a role entry does not reference the ansible_architecture variable, the system automatically determines the x86_64 or AArch64 nodes on which to deploy, as in the first two example entries above (base-latest and base-21.10). If the variable is referenced, the deployment applies only to the targeted set of nodes. In the above example, aocc-latest, aocc-21.10, aocc-compiler-3.1.0, and aocc-compiler-3.0.0 are limited to x86_64 nodes because they are not supported on AArch64.

    You can use:

    • git diff origin/integration-<prev_release>.. to check differences between latest and previous integration branches.

    • git checkout origin/integration-<prev_release> -- pe_deploy.yml to pick up previously customized files, as needed.
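The role entries in the example above follow a regular pattern, so they can be generated rather than typed by hand. The following is a minimal sketch under that assumption; pe_role_entry is a hypothetical helper, not part of CPE, and it covers only the cray_pe_pkg, cray_pe_version, and architecture-limited cases shown in the example.

```shell
# Hypothetical helper (not part of CPE): print one pe_deploy.yml role entry.
#   $1 = cray_pe_pkg value (empty for the default, base)
#   $2 = cray_pe_version (empty for the latest installed release)
#   $3 = architecture to limit the entry to, such as x86_64 (empty for no limit)
pe_role_entry() {
  local pkg="$1" version="$2" arch="$3"
  local entry="- { role: cray.pe_deploy"
  if [ -n "$pkg" ]; then
    entry="$entry, cray_pe_pkg: $pkg"
  fi
  if [ -n "$version" ]; then
    # Quote the version so YAML keeps 21.10 as a string, not the float 21.1.
    entry="$entry, cray_pe_version: \"$version\""
  fi
  if [ -n "$arch" ]; then
    entry="$entry, when: (not cray_cfs_image) and (ansible_architecture == '$arch')"
  else
    entry="$entry, when: not cray_cfs_image"
  fi
  printf '%s }\n' "$entry"
}

# Example: the default base entry, then aocc for CPE 21.10 on x86_64 only.
pe_role_entry "" "" ""
pe_role_entry aocc 21.10 x86_64
```

The generated lines still need to be placed in pe_deploy.yml in the intended order, because the first role becomes the top-most layer and the default image.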

  3. (Optional) Customize site modules. To meet site-specific needs, you can set custom values in roles/cray.pe_deploy/defaults/main.yml for:

    • cray-pe-configuration.csh.j2

    • cray-pe-configuration.sh.j2

    The above templates are in roles/cray.pe_deploy/templates.

    Acceptable variables include:

    • cray_pe_module_prog: Defines the default module handling system, either Lmod (Lua) or Environment Modules (TCL).

    • cray_pe_default_prgenv: Defines the default programming environment.

    • cray_pe_mpaths: Defines any site-specific paths to be added to MODULEPATH to make site modules available.

    • cray_pe_init_module_list: Defines the modules to be loaded on login.

    • cray_pe_site_module_list: Defines additional site modules to be loaded upon login.

    • cray_pe_prgenv_module_list: Defines modules to be swapped as part of the PrgEnv module.

    • cray_pe_one_off_set_defaults: Defines a list of paths to set_default scripts to be run at deploy time. This variable enables you to set default versions at the component level.

    For example, to set Lmod as the default module handling system in the image, set cray_pe_module_prog: lmod in roles/cray.pe_deploy/defaults/main.yml.
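Setting one of these variables can also be scripted. The following is a minimal sketch, assuming each variable occupies a single top-level "key: value" line in the defaults file (the actual file layout is not shown in this guide); set_default_var is a hypothetical helper, not part of CPE, and the example runs against a scratch file rather than the real defaults file.

```shell
# Hypothetical helper (not part of CPE): set a top-level "key: value" line
# in a YAML file, replacing the line if the key is already present and
# appending it otherwise. Assumes simple scalar values on one line.
set_default_var() {
  local file="$1" key="$2" value="$3"
  if grep -q "^${key}:" "$file"; then
    sed "s|^${key}:.*|${key}: ${value}|" "$file" > "${file}.tmp" &&
      mv "${file}.tmp" "$file"
  else
    printf '%s: %s\n' "$key" "$value" >> "$file"
  fi
}

# Example against a scratch file; "placeholder" stands in for any old value.
scratch=$(mktemp)
printf 'cray_pe_module_prog: placeholder\n' > "$scratch"
set_default_var "$scratch" cray_pe_module_prog lmod
cat "$scratch"
rm -f "$scratch"
```

Review the result with git diff before committing, because list-valued variables such as cray_pe_mpaths do not fit this single-line assumption.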

  4. Commit and push the changed files in git:

    ncn-m001# git commit -am "Update CPE packages and image layers"
    ncn-m001# git push -u origin integration-<CPE_RELEASE>
    
  5. (IUF installation) IUF creates the CFS configuration layer for CPE, provides a bootprep file that will be used to customize images/personalize nodes, and creates BOS session templates.

    Example:

    The example specifies activity_name == cpemedia and shows other example paths. The administrator sets up IUF-related files in Installing or Upgrading CPE with IUF.

    Note: On systems with only AArch64 nodes:

    1. Edit /etc/cray/upgrade/csm/site_vars.yaml, and add an entry that uses the latest CPE version. For example:

      cpe-aarch64:
        version: 23.12.3
        working_branch: "{{ working_branch }}"
      
    2. Edit both the /etc/cray/upgrade/csm/cpemedia/bootprep/compute-and-uan-bootprep.yaml file and the management-bootprep.yaml file by updating them from:

      name: cpe
      version: "{{cpe.version}}"
      branch: "{{cpe.working_branch}}"
      

      To:

      name: cpe-aarch64
      version: "{{cpe_aarch64.version}}"
      branch: "{{cpe_aarch64.working_branch}}"
      
      ncn-m001# cd /etc/cray/upgrade/csm/cpemedia
      ncn-m001# iuf -a cpemedia -m /etc/cray/upgrade/csm/cpemedia run --site-vars \
        /etc/cray/upgrade/csm/site_vars.yaml --recipe-vars product_vars.yaml \
        --bootprep-config-managed bootprep/compute-and-uan-bootprep.yaml \
        --bootprep-config-management bootprep/management-bootprep.yaml \
        -b update-cfs-config -e prepare-images
      
  6. (Non-IUF installation) CPE includes an operation automation script that creates a new CFS configuration with the latest CPE version and commit ID. The script has two optional parameters (CFS_name and apply):

    ncn-m001# cpe-cfs.sh [CFS_name] [apply]
    

    Note that:

    • If no parameters are specified, the script uses the latest cpe-yy.mm-integration branch in CFS.

    • If SAT (later than 2.2.15) is installed, the script also outputs a section of YAML code for use in a sat bootprep input file for integration with other products. Refer to the SAT bootprep section of the SAT product stream documentation for more information. See HPE Cray Supercomputing EX software documentation links for a direct link.

    • If the CFS_name parameter is specified, the script proposes a new .json file that adds or replaces any existing CPE layer.

    • If the apply parameter is specified, the script modifies the CFS configuration using the proposed .json file. HPE recommends a trial run without the apply parameter to verify the results, followed by a rerun with the apply parameter to incorporate the changes. For example:

      ncn-m001# ./cpe-cfs.sh cos-config-2.0.27 [apply]
      ...
        ________________________________________
        Updating new CPE CFS configuration ...
        {
          "lastUpdated": "2021-10-13T21:05:40Z",
          "layers": [
            {
              "cloneUrl": "https://api-gw-service-nmn.local/vcs/cray/cpe-config-management.git",
              "commit": "4194bd87979f876400fa9159a60985dacee06a3b",
              "name": "cpe-21.11-integration",
              "playbook": "pe_deploy.yml"
            }
          ],
          "name": "cpe-21.11-integration"
        }
        ________________________________________
        Generating new layers for cos-config-2.0.27 ...
        Proposed new layers for cos-config-2.0.27:
        {
          "layers": }
      ...
      
  7. (Non-IUF installation) If sat bootprep is not used (for example, for NCN-personalization of UAI hosts), then the cpe-cfs.sh [CFS_name] parameter must be specified:

    1. (HPE recommended) Run the script without the apply parameter to verify the results, then rerun it with the apply parameter to incorporate the changes. Do not use the apply parameter if sat bootprep is used.

    2. Rerun the script with the apply parameter for COS, UAN, and NCN personalization CFS configurations as necessary.

    3. Update BOS session templates to ensure the latest CPE CFS configs are included on all nodes after reboots. Refer to Configuration Management in the Cray System Management Administration Guide for details; see HPE Cray Supercomputing EX software documentation links for a direct link.

    4. After CFS completes on a compute or UAN node, run module list on that node to check that CPE is ready to use.

      Example Output:

      Note that module versions below are examples only, and may differ from those currently loaded on the system. For current CPE release product versions, see the release announcement.

      nid000001# module list
      Currently Loaded Modulefiles:
      1) craype-x86-rome          5) xpmem/2.6.2-2.5_2.27--gd067c3f.shasta   9) cray-mpich/8.1.28
      2) libfabric/1.15.2.0       6) cce/17.0.0                             10) cray-libsci/23.12.5
      3) perftools-base/23.12.0   7) craype/2.7.30                          11) PrgEnv-cray/8.5.0
      4) craype-network-ofi       8) cray-dsmml/0.2.2
      

Enabling CPE in UAIs

PREREQUISITES

  • HPE CPE must be installed on an HPE Cray Supercomputing EX system running CSM. See Installation Prerequisites for version requirements.

  • A WLM must be customized into a compute image.

OBJECTIVE

This procedure ensures that UAIs run CPE after CPE is installed.

PROCEDURE

  1. Check out cpe-config-management from VCS using git:

    1. Acquire the VCS username and password for git operations:

      ncn-m001# kubectl get secret -n services vcs-user-credentials \
      --template='{{.data.vcs_username}}' | base64 -d
      ncn-m001# kubectl get secret -n services vcs-user-credentials \
      --template='{{.data.vcs_password}}' | base64 -d
      
    2. Clone the VCS repository:

      ncn-m001# git clone https://api-gw-service-nmn.local/vcs/cray/cpe-config-management.git
      ncn-m001# cd cpe-config-management
      
    3. List available branches:

      ncn-m001# git branch -r

  2. If CPE was previously installed, clear out any UAS projection paths that CPE previously created by running a script from the earlier CPE branch:

    ncn-m001# git checkout integration-<prev_cpe_version>
    ncn-m001# bash roles/cray.pe_deploy/files/uas_setup_pe.sh clean
    
  3. To set up a new or changed compute image for CPE, run the uas_compute_init.sh script, where <BOS_session_template> corresponds to a BOS session template that includes COS and a WLM. Running the script without this parameter shows a list of available choices.

    ncn-m001# git checkout integration-<latest_cpe_version>
    ncn-m001# bash roles/cray.pe_deploy/files/uas_compute_init.sh \
    <BOS_session_template>
    
  4. Do one of the following:

    • If the uas_compute_init.sh script runs to completion with no errors, add CPE to the management-x.y CFS layer, and run it again on the worker (UAI) nodes. This UAI procedure is then complete, and the remaining steps in this procedure can be skipped.

    • If CPE needs to be added to the management-x.y CFS layer:

      1. Locate the management-x.y.z CFS configuration that was created during the upgrade (for example, management-23.11.1):

        ncn-m001# cray cfs configurations list --format json | jq -r '.[] | .name' | sort | grep -E "^management"

      2. Verify that the CPE layer exists in the CFS configuration:

        ncn-m001# cray cfs configurations describe management-23.11.1 --format json
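When several management-x.y.z configurations exist, a plain sort can misorder them (for example, 23.4.0 sorts after 23.11.1 as text). The following is a minimal sketch for picking the highest version from a list of configuration names; latest_management_config is a hypothetical helper, not part of CSM or CPE, and it relies on the version sort (sort -V) provided by GNU coreutils. The names in the example are sample input only.

```shell
# Hypothetical helper (not part of CSM or CPE): read CFS configuration names
# on stdin, one per line, and print the highest-versioned management-* name.
# Relies on "sort -V" (version sort) from GNU coreutils.
latest_management_config() {
  grep -E '^management-[0-9]' | sort -V | tail -n 1
}

# Example with sample names; on a system, pipe in the list of names instead.
printf '%s\n' management-23.11.1 cos-config-2.0.27 management-23.4.0 |
  latest_management_config
```

The highest version is not necessarily the configuration created during the most recent upgrade, so confirm the choice before modifying it.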

  5. If errors exist, continue to the next step, which breaks down the script and references the documentation on which it is based.

  6. To update the UAS projection paths for CPE, run the uas_setup_pe.sh script. This script can also be run anytime the CPE paths are reset in the UAS.

    ncn-m001# bash roles/cray.pe_deploy/files/uas_setup_pe.sh
    
  7. Check that the uai HSM group exists. The group should contain all worker nodes designated as UAI hosts.

    ncn-m001# cray hsm groups describe uai
    
  8. If the uai group label does not exist, run the helper script:

    ncn-m001# /opt/cray/csm/scripts/node_management/make_node_groups -u
    

    See User Access Service in the Cray System Management Administration Guide for further information; see HPE Cray Supercomputing EX software documentation links for a direct link.

  9. Check for a UAI image based on compute nodes:

    ncn-m001# cray uas admin config images list
    [[results]]
    default = true
    image_id = "85c7fd74-c410-4920-a452-bd84d27d238e"
    imagename = "registry.local/cray/cray-uai-compute:latest"
    
  10. If no similar image name exists, see User Access Service in the Cray System Management Administration Guide for details on creating and registering a custom UAI image. See HPE Cray Supercomputing EX software documentation links for a direct link. Note that the BOS session template-specific name may vary. CPE requires an image built for compute nodes (COS) with a workload manager included.

  11. If the compute image is not set as default, set it:

    ncn-m001# cray uas admin config images update --default yes \
    85c7fd74-c410-4920-a452-bd84d27d238e
    

    CPE is now enabled for UAIs.

  12. Use the cpe-cfs.sh script (described in Configuring CPE Using CFS), or add CPE to the management-x.y CFS layer manually, and run it again on the worker (UAI) nodes. See the Cray System Management Administration Guide for details; see HPE Cray Supercomputing EX software documentation links for a direct link.

After successfully adding CPE to the management CFS layer and running it again on the worker (UAI) nodes, configuration of the HPE Cray Programming Environment is complete. See the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM S-8052 for further installation instructions; see HPE Cray Supercomputing EX software documentation links for a direct link to the appropriate document.

Install previously released CPE packages for CSM on HPE Cray Supercomputing EX systems

PREREQUISITES

HPE CPE must be installed on an HPE Cray Supercomputing EX system running CSM; for version requirements, see Installation Prerequisites.

OBJECTIVE

Install a previously released CPE package, <PREV_RELEASE>, after installing the latest CPE.

Previously released CPE packages must use the installer that comes with the latest CPE package. This procedure installs the <PREV_RELEASE> package along with the latest release.

IMPORTANT: Throughout this procedure, replace instances of:

  • <PREV_RELEASE> with the desired previous release’s YY.MM value

PROCEDURE

  1. Download the old CPE tar file, extract it into a path <untar_path>, and then run the following commands.

    ncn-m001# SQFS=CPE-base.x86_64-<PREV_RELEASE>.squashfs
    ncn-m001# cray artifacts create boot-images PE/$SQFS <untar_path>/squashfs/$SQFS
    
  2. Check out the integration branch from the VCS git repo, and update two files to include the <PREV_RELEASE> package for deployment.

    ncn-m001# git checkout integration-<PREV_RELEASE>
    ncn-m001# vi pe_deploy.yml
    roles:
        - { role: cray.pe_deploy, when: not cray_cfs_image }
        - { role: cray.pe_deploy, cray_pe_version: "<PREV_RELEASE>", when: not cray_cfs_image }
    
  3. Update the CFS configuration layers (compute/COS, management-x.y, and UAN) to point to the new integration branch commit ID.

Create modulefiles for third-party products

PREREQUISITES

Third-party packages must be downloaded and installed.

OBJECTIVE

These instructions use craypkg-gen to create a modulefile for a specific version of a supported third-party product. This setup allows a site to set a specific version as default.

The following tasks are necessary and can be embedded in a script where a third-party product is being installed.

PROCEDURE

  1. Load the craypkg-gen module.

    ncn-w001# source /opt/cray/pe/modules/default/init/bash
    ncn-w001# module use /opt/cray/pe/modulefiles
    ncn-w001# module load craypkg-gen
    
  2. Generate the modulefile and set default scripts for each product:

    AMD Optimizing C/C++ Compiler (requires craypkg-gen >= 1.3.16)

    ncn-w001# craypkg-gen -m /opt/AMD/aocc-compiler-<MODULE_VERSION>/
    

    NVIDIA HPC SDK (requires craypkg-gen >= 1.3.16)

    ncn-w001# craypkg-gen -m /opt/nvidia/hpc_sdk/Linux_x86_64/<MODULE_VERSION>/
    

    Intel oneAPI

    The Intel compiler must be installed in a directory or a symbolic link that follows the <PREFIX>/oneapi/compiler/<VERSION> format before craypkg-gen can create an Intel modulefile. The craypkg-gen utility creates the intel, intel-classic, and intel-oneapi modulefiles after the process completes successfully.

    ncn-w001# craypkg-gen -m /opt/intel/oneapi/compiler/<MODULE_VERSION>/
    
  3. Run a set default script.

    ncn-w001# /opt/admin-pe/set_default_craypkg/set_default_<MODULE_NAME>_<MODULE_VERSION>
    

Lmod custom dynamic hierarchy

Lmod enables a user to dynamically modify their user environment through Lua modules. The CPE implementation of Lmod capitalizes on its hierarchical structure, including the Lmod module auto-swapping function. This structure allows module dependencies to determine the branches of the tree-like hierarchy. Lmod allows static and dynamic hierarchical module paths. Lmod provides full support for static paths, which build the hierarchy based on the current set of modules loaded. Alongside static paths, CPE implemented dynamic paths for a subset of the Lmod hierarchy (compilers, networks, CPUs, and MPIs). Dynamic paths give an advanced level of flexibility for detecting multiple dependency paths and allow custom paths to join an existing CPE-designated Lmod hierarchy without modifying customer modulefiles.

Static Lmod hierarchy

Modules that depend on one or more other modules are not visible to a user until their prerequisite modules are loaded. Loading the prerequisite modules adds the static paths of the dependent modules to the MODULEPATH environment variable, thereby exposing the dependent modules to the user. For more detailed information on Lmod static module hierarchies, see the User Guide for Lmod.

Dynamic Lmod hierarchy

The CPE custom dynamic Lmod hierarchy abbreviates the overall Lmod hierarchy tree by relying on compatibility and not directly on a prerequisite version. Therefore, dependent modules do not need to exist in a new branch every time their prerequisite modules change versions. Instead, dynamic paths use a compatibility version that increases when a new prerequisite module version breaks compatibility in some way. The number following the module path alias (for example, 1.0 in x86-rome/1.0 and ofi/1.0) identifies the compatible version.

Module path aliases and current compatibility versions

Compatible versions listed in the following tables include the minimum supported versions.

Compiler            SLES Module Alias/Compatible Version
amd                 amd/4.0
cce                 crayclang/17.0
gcc                 gnu/12.0
aocc                aocc/4.1
intel               intel/2023.2
nvidia (x86)        nvidia/20
nvidia (aarch64)    nvidia/23.11

Network                Module Alias/Compatible Version
craype-network-none    none/1.0
craype-network-ofi     ofi/1.0
craype-network-ucx     ucx/1.0

CPU                  Module Alias/Compatible Version
craype-x86-milan     x86-milan/1.0
craype-x86-rome      x86-rome/1.0
craype-x86-trento    x86-trento/1.0

MPI                                 Module Alias/Compatible Version
cray-mpich                          cray-mpich/8.0
cray-mpich-abi                      cray-mpich/8.0
cray-mpich-abi-pre-intel-5.0        cray-mpich/8.0
cray-mpich-ucx                      cray-mpich/8.0
cray-mpich-ucx-abi                  cray-mpich/8.0
cray-mpich-ucx-abi-pre-intel-5.0    cray-mpich/8.0

Custom dynamic hierarchy

The CPE custom dynamic hierarchy extension allows custom module paths to join an existing Lmod hierarchy implementation within CPE without modifying customer modulefiles. Custom dynamic module types that CPE supports include:

  • Compiler

  • Network

  • CPU

  • MPI

  • Compiler/Network

  • Compiler/CPU

  • Compiler/Network/CPU/MPI

As each custom dynamic module type loads, a handshake occurs using special pre-defined environment variables. When all hierarchical prerequisites are met, the paths of the dependent modulefiles are added to the MODULEPATH environment variable, thereby exposing the dependent modules to the user.

Tip: For Lmod to assist a user optimally, HPE recommends that a compiler, network, CPU, and MPI module are loaded. Lmod cannot detect modules hidden in dynamic paths without one of each type of module being loaded.

Create a custom dynamic hierarchy

PREREQUISITES

Set Lmod as the default module handling system before initiating this procedure.

OBJECTIVE

For the CPE custom dynamic hierarchy to detect the desired Lmod module path, one or more custom dynamic environment variables must be created according to the requirements defined within this procedure.

PROCEDURE

To create a custom dynamic environment variable:

  1. Begin the environment variable name with LMOD_CUSTOM_.

  2. Append the descriptor of the module type that the environment variable will represent. The module types and descriptors are:

    Module Type                 Descriptor
    Compiler                    COMPILER_
    Network                     NETWORK_
    CPU                         CPU_
    MPI                         MPI_
    Compiler/Network            COMNET_
    Compiler/CPU                COMCPU_
    Compiler/Network/CPU/MPI    CNCM_

    Example: The custom dynamic environment variable for the combined compiler and CPU module begins with LMOD_CUSTOM_COMCPU_.

  3. Following the descriptor, append all prerequisite module aliases along with their respective compatible versions. See Module Path Aliases and Current Compatibility Versions for more information. The format of the module path alias/compatible version string for each module type is shown below. Note that due to publishing issues, long module alias/compatible version strings are split across two lines as indicated below.

    Module Type: Module Path Alias/Compatible Version String

    Compiler: <compiler_name>/<compatible_version>

    Network: <network_name>/<compatible_version>

    CPU: <cpu_name>/<compatible_version>

    MPI:

    <compiler_name>/<compatible_version>/<network_name>/<compatible_version>/

    <mpi_name>/<compatible_version>

    Compiler/Network: <compiler_name>/<compatible_version>/<network_name>/<compatible_version>

    Compiler/CPU: <compiler_name>/<compatible_version>/<cpu_name>/<compatible_version>

    Compiler/Network/CPU/MPI:

    <compiler_name>/<compatible_version>/<network_name>/<compatible_version>/

    <cpu_name>/<compatible_version>/<mpi_name>/<compatible_version>

    To create an acceptably formatted environment variable name, replace all slashes, dots, and hyphens in the module alias/compatible version string with underscores, and convert all letters to uppercase.

    Example Module Path Alias/Compatible Version Strings:

  • Compiler = cce

    The path alias/compatible version string (values found in Module Path Aliases and Current Compatibility Versions) is crayclang/17.0; therefore, the text added to the environment variable name is:

    CRAYCLANG_17_0

  • Network = craype-network-ofi

    The path alias/compatible version string is ofi/1.0; therefore, the environment variable text is:

    OFI_1_0

  • CPU = craype-x86-rome

    The path alias/compatible version string is x86-rome/1.0; therefore, the environment variable text is:

    X86_ROME_1_0

  • MPI = cray-mpich

    cray-mpich has two prerequisite module types (compiler and network). Therefore, the environment variable must include the alias/compatible version for the desired compiler, network, and MPI. For a cray-mpich module dependent on cce and craype-network-ofi, the path alias/compatible version string is crayclang/17.0/ofi/1.0/cray-mpich/8.0; therefore, the environment variable text is:

    CRAYCLANG_17_0_OFI_1_0_CRAY_MPICH_8_0

  • Compiler/Network = cce with craype-network-ofi

    The path alias/compatible version string is crayclang/17.0/ofi/1.0; therefore, the environment variable text is:

    CRAYCLANG_17_0_OFI_1_0

  • Compiler/CPU = cce with craype-x86-rome

    The path alias/compatible version string is crayclang/17.0/x86-rome/1.0; therefore, the environment variable text is:

    CRAYCLANG_17_0_X86_ROME_1_0

  • Compiler/Network/CPU/MPI = cce, craype-network-ofi, craype-x86-rome, and cray-mpich

    The path alias/compatible version string is crayclang/17.0/ofi/1.0/x86-rome/1.0/cray-mpich/8.0; therefore, the environment variable text is:

    CRAYCLANG_17_0_OFI_1_0_X86_ROME_1_0_CRAY_MPICH_8_0

  4. Append _PREFIX following the final module/compatibility text instance:

    Example: Network = craype-network-ofi

    The custom dynamic environment variable is LMOD_CUSTOM_NETWORK_OFI_1_0_PREFIX.

    Creation of the custom dynamic environment variable is now complete.

  5. Add the custom dynamic environment variable to the user environment by exporting it with its value set to the Lmod module path:

    # export LMOD_CUSTOM_NETWORK_OFI_1_0_PREFIX=<lmod_module_path>
    

    Example: Network = craype-network-ofi

    All modulefiles in <lmod_module_path> are shown to users whenever craype-network-ofi is loaded.
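The naming rules in this procedure can be sketched as a small shell function. This is an illustration only; lmod_custom_var is a hypothetical helper, not part of CPE or Lmod. It takes the descriptor without its trailing underscore and the module path alias/compatible version string, converts slashes, dots, and hyphens to underscores, uppercases the result, and adds the LMOD_CUSTOM_ prefix and _PREFIX suffix.

```shell
# Hypothetical helper (not part of CPE or Lmod): build a custom dynamic
# environment variable name from a descriptor and an alias/version string.
#   $1 = descriptor without the trailing underscore (for example, NETWORK)
#   $2 = module path alias/compatible version string (for example, ofi/1.0)
lmod_custom_var() {
  local descriptor="$1" alias_string="$2" text
  # Replace slashes, dots, and hyphens with underscores, then uppercase.
  text=$(printf '%s' "$alias_string" | sed 's|[/.-]|_|g' | tr '[:lower:]' '[:upper:]')
  printf 'LMOD_CUSTOM_%s_%s_PREFIX\n' "$descriptor" "$text"
}

lmod_custom_var NETWORK ofi/1.0        # prints LMOD_CUSTOM_NETWORK_OFI_1_0_PREFIX
lmod_custom_var CPU x86-rome/1.0       # prints LMOD_CUSTOM_CPU_X86_ROME_1_0_PREFIX
lmod_custom_var MPI gnu/12.0/ofi/1.0/cray-mpich/8.0
```

The printed name can then be exported with its value set to the Lmod module path, as shown in the final step of the procedure.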

Troubleshooting common issues

This section covers troubleshooting topics for common issues. More topics will be added as necessary.

Some nodes see errors when CFS configurations are applied while updating CPE, and the logs show pe_overlay.sh failed

This issue typically occurs if a process, or interactive shell, has a lock on a path within CPE overlay-mounted paths. A reboot of the affected nodes should mitigate the issue. Afterwards, a re-run of CFS should work on affected nodes.

Before rebooting:

  1. If the affected node is an NCN worker node, make sure no UAIs are still running on it:

    ncn-m001# cray uas uais list
    
  2. Run the following commands manually on the affected node(s) (for example, a UAI host node, UAN, or compute node). The example below uses uan01.

    uan01# bash /etc/cray-pe.d/pe_overlay.sh cleanup
    uan01# find /var/opt/cray/pe/pe_images -maxdepth 1 -exec umount -f {} \;
    uan01# find /var/opt/cray/pe -maxdepth 1 -exec umount -f {} \;
    uan01# mount | grep pe_image
    

    The mount command should list no mounts; otherwise, lsof pe_overlay_path might help narrow down which process may need to be terminated to free up the path.

If no previous CPE mounts are active, a re-run of CFS on the affected node(s) should be successful.
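To see at a glance which CPE mounts remain after the cleanup commands above, the mount output can be reduced to just the mount points. The following is a minimal sketch; remaining_pe_mounts is a hypothetical helper, not part of CPE, and it assumes the usual Linux "SOURCE on TARGET type FSTYPE (OPTIONS)" line format with no spaces in the mount point. The sample lines are illustrative input only.

```shell
# Hypothetical helper (not part of CPE): read "mount" output on stdin and
# print the mount point (third field) of each line that mentions pe_image.
remaining_pe_mounts() {
  awk '/pe_image/ { print $3 }'
}

# Example with sample lines; on an affected node, run: mount | remaining_pe_mounts
printf '%s\n' \
  'proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)' \
  '/dev/loop0 on /var/opt/cray/pe/pe_images/example type squashfs (ro,relatime)' |
  remaining_pe_mounts
```

Each path printed is a candidate for lsof to identify the process that still holds it; no output means no matching mounts remain.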

Replace an installed CPE release squashfs file for redeployment

If an incorrect CPE image is installed for a release (e.g., service pack 2 instead of service pack 3), follow this procedure to delete both the CPE base image in S3 storage and the CPS cache, and then redeploy the correct image. Repeat as necessary for optional packages such as amd, aocc, intel, or nvidia.

  1. Delete the CPE base image in S3 storage and the CPS cache.

    ncn-m001# cray artifacts delete boot-images PE/CPE-base.x86_64-<CPE_RELEASE>.squashfs
    ncn-m001# cray cps contents delete \
    --s3path s3://boot-images/PE/CPE-base.x86_64-<CPE_RELEASE>.squashfs
    
  2. Rerun CPE install.sh from the correct .tar package.

    ncn-m001# cpe-<CPE_RELEASE>-sles15-<spX>/install.sh
    
  3. Finally, do one of the following:

    • Rerun CFS on affected nodes.

      ncn-m001# cray cfs components update --enabled true --state '[]' --error-count 0 <xnode>

    • Reboot all or a limited number (using the --limit parameter) of nodes.

      ncn-m001# cray bos session create --template-uuid cos-sessionTemplate-x.y.z \
      --operation reboot [--limit xnode]