Copyright and Version
© Copyright 2022-2024 Hewlett Packard Enterprise Development LP. All third-party marks are the property of their respective owners.
: -LocalBuild
Doc git hash: 898e74b1bcdba046cce65e32fc5aa4391548bc4d
Generated: Thu Aug 29 2024
About the HPE CPE Installation Guide: CSM on HPE Cray Supercomputing EX Systems
The HPE Cray Programming Environment Installation Guide: CSM on HPE Cray Supercomputing EX Systems (S-8003) contains procedures for installing the HPE Cray Programming Environment (CPE) and third-party programming environment components, including TotalView, Forge, and AMD and Intel compilers.
This publication is intended for system administrators who want to:
Install or reinstall all CPE components,
Install additional licensed components, or
Customize the programming environment files before use.
This guide assumes that users have:
Familiarity with standard Linux and open source tools, including Ansible, YAML, and (optionally) Kubernetes, and
Access to a method of obtaining the most current CPE tar files.
See the HPE Cray Programming Environment User Guide: CSM on HPE Cray Supercomputing EX Systems (S-8005) for a complete list of components and modules installed as part of CPE.
Release information
This publication supports the installation of CPE 24.07 on HPE Cray Supercomputing EX systems with:
HPE Cray System Management (CSM) software version 1.5,
HPE Cray Supercomputing Operating System Software (COS) 24.7 (COS Base 3.1.0/USS 1.1.0), and
SUSE Linux Enterprise Server (SLES) 15 SP5.
COS 23.11 (and later) components comprise:
COS Base
HPE Cray Supercomputing User Services Software (USS)
HPE SUSE Linux Enterprise Server
Variable substitutions
Use the following variable substitutions throughout the included procedures.
<CPE_RELEASE> = 24.07
<CPE_VERSION> = 24.07
<spX> or <SPX> = SP5
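As an illustration of how these substitutions are applied, the following sketch expands the CPE media file name pattern that appears later in this guide. The shell variable names, and the lowercase sp5 form used in the file name, are assumptions made for this example only.

```shell
# Assumed shell variables mirroring the substitutions listed above.
CPE_RELEASE="24.07"
CPE_VERSION="24.07"
SPX="sp5"   # assumption: lowercase form, as it appears in file names

# Expand the media file name pattern
# cpe-<CPE_RELEASE>-sles15-<spX>-csm-<CPE_VERSION>.tar.gz
CPE_TARBALL="cpe-${CPE_RELEASE}-sles15-${SPX}-csm-${CPE_VERSION}.tar.gz"
echo "${CPE_TARBALL}"
```

Always confirm the expanded name against the file actually downloaded for the release.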
Supporting documentation
References to specific and related HPE Cray Supercomputing EX documentation are found throughout this guide. Direct links to these references (for supported COS/CSM releases defined in CPE Installation Prerequisites) are provided in HPE Cray Supercomputing EX documentation links.
Record of revision
New in the CPE 24.07 publication
Added the Downloading HPE Cray Supercomputing EX software section.
Added the HPE Cray Supercomputing EX documentation links chapter.
Updated the introduction to this chapter.
Updated the Release information section.
Updated the Variable substitutions section.
Updated the Supporting documentation section.
Updated the title of this guide to HPE Cray Programming Environment Installation Guide: CSM on HPE Cray Supercomputing EX Systems.
Updated the Installing or Upgrading CPE with IUF section.
Updated the Objective and Procedure in the Optional third-party product image customization section.
Updated the Procedure in the Configuring CPE using CFS chapter.
Updated the Introduction and Procedure in the Enabling CPE in UAIs chapter.
Updated the Prerequisite section of the Install previously released CPE packages for CSM or HPE Cray Supercomputing EX Systems chapter.
Incorporated minor editorial updates.
New in the CPE 24.03 publication
Updated the Release information section.
Updated the Variable substitutions section.
Removed the Installing or upgrading CPE chapter.
Updated the Installation prerequisites section.
Updated the Installing or Upgrading CPE with IUF section.
Updated the Prerequisites section of the Optional third-party product image customization chapter.
Updated the Enabling CPE in UAIs chapter.
Updated the Configuring CPE using CFS chapter.
Updated the Module path aliases and current compatibility versions section.
Updated the table in the HPE Cray Supercomputing EX documentation links chapter.
New in the CPE 23.12 publication
Updated the Release information section.
Updated the Objectives in the Optional third-party product image customization section.
Updated the procedure in the Configuring CPE using CFS section.
Updated the compiler versions in the Module path aliases and current compatibility versions section.
Updated the table in the HPE Cray Supercomputing EX documentation links section.
New in the CPE 23.09 publication
Updated the Release information section.
Updated the Intel oneAPI example of the procedure in the Installing or upgrading CPE section.
Updated the procedure in the Enabling CPE in UAIs section.
Updated the procedure in the Install previously released CPE packages for CSM on HPE Cray Supercomputing EX systems section.
Updated the procedure in the Optional third-party product image customization section.
Removed the:
About the Slurm Install and Upgrade Framework Usage procedure section.
About the PBS install and upgrade framework usage procedure section.
Installing a workload manager chapter.
New in the CPE 23.05 publication
Added the Installing and upgrading CPE using the Install and Upgrade Framework procedure section.
Added new note about nid reassignments for PBS in the Working with PBS configurations and nodes section.
Added the Configuring PBS for systems with Slingshot and HPE 200GB NICs section.
Added the Configuring Slingshot Traffic Classes in PBS section.
Added the Troubleshooting failing Slurm PXC pods section.
Added the Troubleshooting Slurm Database Lost Connection Errors section.
Combined the formerly separate procedures for installing and upgrading PBS into one. The procedures for these tasks are now located in the Installing or upgrading the PBS Professional workload manager section.
Simplified and updated the Slurm configuration procedure in the Update Slurm configuration section. The procedure now requires you to restart slurmctld.
Moved the Enable PBS to use Low Noise mode section into the Configure PBS during or post installation section.
Updated the procedure in the Installing or upgrading the Slurm workload manager section.
Updated the procedure in the Installing or upgrading PBS Professional workload manager section.
Updated the procedure in the Configuring CPE using CFS section.
Updated the Update Slurm configuration section to indicate the configuration update is also used to reassign compute node hostnames.
Updated the procedure in the Update Slurm Ansible configuration during or post installation section.
Updated the procedure in the Update PBS Ansible configuration during or post installation section.
Updated instructions on how to use crypkg-gen to create a modulefile for Intel oneAPI in the Create modulefiles for third-party products section.
Updated the Back up Slurm Spool Directory section.
Updated the Restore Slurm Spool Directory from Backup section.
Updated the AMD AOCC Compiler and Intel oneAPI product examples in the Installing or upgrading CPE section.
Updated the HPE Cray Supercomputing EX software documentation links table to support both the COS 2.4.X/CSM 1.3.X and COS 2.5.X/CSM 1.4.X releases, and added links to Ceph latency issues documentation.
Updated the compatible compiler version table in the Module path aliases and current compatibility versions section. Updated the cce and gcc compatible compiler versions, and added the amd compatible compiler version.
New in the CPE 23.02 (Revision A) publication
Added instructions for adding a master branch in git in the Update Slurm Ansible configuration during or post installation and Update PBS Ansible configuration during or post installation sections.
Updated the Create modulefiles for third-party products section to include a note about how to use the crypkg-gen utility to create an Intel modulefile.
Moved the Enable PBS to use Low Noise mode section into the Configure PBS during or post installation section.
New in the CPE 23.02 publication
Updated previously optional procedures to mandatory procedures. Mandatory procedures now include:
Updating settings for Slurm installation
Configuring/Updating settings
Configuring CSM software for HSN connectivity during PBS installation/upgrade
Updated the SLUMBLOB_VERSION variable from 1.2.8 to 1.2.9 in the Release information section.
Added a step in the Updating settings for Slurm installation section that details how to check whether LDAP is configured.
Updated the HSN network subnet setting information in the Configuring CSM software for HSN connectivity during Slurm installation or upgrade section.
Added a step in the Updating settings for Slurm installation and Configuring CSM software for HSN connectivity during PBS installation/upgrade sections detailing how to restart all UAI pods.
Updated the SwitchParameters option listing and resources listing in the Configure Slurm for systems with Slingshot and HPE 200GB NICs section.
Added a step in the Configure Slurm for systems with Slingshot and HPE 200GB NICs section that details how to enable Instant On in Slurm.
Added the Resolving Slurm Pods that are Stuck in a ContainerCreating State section.
Added a step in the Update PBS Ansible configuration during or post installation section that details how to configure PBS to copy output files directly to a file system instead of using scp.
Added the Configuring PBS for high availability section.
Moved the Enable PBS to use Low Noise mode section under the PBS troubleshooting and administrative tasks section.
Changed the COS 2.3.X/CSM 1.2.0 HTML link to a PDF link for the Getting Started Guide (S-8000) line item that is referenced in the table shown in the HPE Cray Supercomputing EX software documentation links section.
Obsoleted former release sections:
Run the Slurm bringup script
Run the PBS Bringup Script
Run the Upgrade PBS Bringup Script
Recover after UAN Image Corruption during Slurm Installation or Upgrade
Incorporated minor editorial updates.
Publication Title | Date |
---|---|
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray Supercomputing EX Systems (24.07) S-8003 | August 2024 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (24.03) S-8003 | May 2024 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (23.12) S-8003 | December 2023 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (23.09) S-8003 | September 2023 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (23.05) S-8003 | June 2023 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (23.02 Rev A) S-8003 | March 2023 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (23.02) S-8003 | February 2023 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.12) S-8003 | December 2022 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.11) S-8003 | November 2022 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.10) S-8003 | October 2022 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.09) S-8003 | September 2022 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.08 Rev A) S-8003 | August 2022 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.08) S-8003 | August 2022 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.06 Rev A) S-8003 | July 2022 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.06) S-8003 | June 2022 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.05) S-8003 | May 2022 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.04) S-8003 | April 2022 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.03) S-8003 | March 2022 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (22.02) S-8003 | February 2022 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (21.12) S-8003 | December 2021 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (21.11) S-8003 Rev A | November 2021 |
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX (21.11) S-8003 | November 2021 |
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.10) S-8003 | October 2021 |
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.09) S-8003 | September 2021 |
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.08) S-8003 | August 2021 |
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.07) S-8003 Rev A | July 2021 |
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.07) S-8003 | July 2021 |
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.06) S-8003 | June 2021 |
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.05) S-8003 | May 2021 |
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.04) S-8003 | April 2021 |
HPE Cray Programming Environment Installation Guide: CSM 1.4 on HPE Cray EX (21.03) S-8003 | March 2021 |
HPE Cray Asynchronous Installer Guide (21.03) S-8003 | March 2021 |
HPE Cray Asynchronous Installer Guide (20.11) S-8003 | November 2020 |
HPE Cray Asynchronous Installer Guide (20.10) S-8003 | October 2020 |
HPE Cray Asynchronous Installer Guide (20.09) S-8003 | September 2020 |
HPE Cray Asynchronous Installer Guide (20.08) S-8003 | August 2020 |
Cray Asynchronous Installer Guide (20.06) S-8003 | June 2020 |
Cray Asynchronous Installer Guide (20.05) S-8003 | May 2020 |
Cray Asynchronous Installer Guide (20.04) S-8003 | April 2020 |
Cray Asynchronous Installer Guide (20.03) S-8003 | March 2020 |
Cray Asynchronous Installer Guide (20.02) S-8003 | February 2020 |
Cray Shasta Asynchronous Installer Guide (20.01) S-8003 | January 2020 |
Typographic conventions
This style indicates program code, reserved words, library functions, command-line prompts, screen output, file/path names, variables, and other software constructs.
\ (backslash) At the end of a command line, indicates the Linux shell line continuation character (lines joined by a backslash are parsed as a single line).
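The line continuation convention can be seen in a small sketch. The variable name JOINED and the echoed words are placeholders chosen for this example and do not appear elsewhere in this guide.

```shell
# The three physical lines below are joined by trailing backslashes,
# so the shell parses them as the single command: echo one two three
JOINED=$(echo one \
  two \
  three)
echo "${JOINED}"
```

When a backslash splits a file name or path in this guide, the continued line must begin without leading spaces so that the pieces rejoin into one word.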
Command prompt conventions
Host name and account in command prompts: The host name in a command prompt indicates where the command must be run. The account that must run the command is also indicated in the prompt.
The root or super-user account always has the # character at the end of the prompt.
Any non-root account is indicated with account@hostname>. A nonprivileged account is referred to as user.
Node abbreviations: The following list contains abbreviations for nodes used in command prompts.
CN - Compute Nodes
NCN - Non-Compute Nodes
AN - Application Node (special type of NCN)
UAN - User Access Node (special type of AN)
Command prompts: The following list contains command prompts used in this guide.
ncn-m001# - Run the command as root on the specific NCN-M (NCN that is a Kubernetes master node) with hostname ncn-m001.
ncn-w001# - Run the command as root on the specific NCN-W (NCN that is a Kubernetes worker node) with hostname ncn-w001.
uan01# - Run the command on a specific UAN.
cn# - Run the command as root on any CN. Note that a CN has a hostname of the form nid123456 (that is, "nid" and a six-digit, zero-padded number).
pod# - Run the command as root within a Kubernetes pod.
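The CN hostname form described above ("nid" plus a six-digit, zero-padded number) can be sketched as follows. The helper names cn_hostname and is_cn_hostname are hypothetical and exist only for this illustration.

```shell
# Format a node ID as a CN hostname: "nid" plus a six-digit, zero-padded number.
# Pass the number without leading zeros (printf may read a leading 0 as octal).
cn_hostname() {
  printf 'nid%06d\n' "$1"
}

# Return success (0) if the argument matches the nidNNNNNN form.
is_cn_hostname() {
  printf '%s\n' "$1" | grep -Eq '^nid[0-9]{6}$'
}

cn_hostname 42
is_cn_hostname nid000042 && echo "nid000042 has the CN hostname form"
is_cn_hostname ncn-m001 || echo "ncn-m001 does not have the CN hostname form"
```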
Copying and Pasting from a PDF
Using the Copy and Paste functions from a PDF is unreliable. Although copying and pasting a command line typically works, copying and pasting formatted file content (for example, JSON, YAML) typically fails. To ensure that file content is copied and pasted correctly while performing the procedures in this guide:
Copy the content from the PDF.
Paste it to a neutral editing form and add the necessary formatting.
Copy the content from the neutral form and paste it into the console.
Tip: It is always a good idea to double-check copied/pasted commands for correctness, as some commands may not render correctly in the PDF.
Downloading HPE Cray Supercomputing EX software
To download HPE Cray Supercomputing EX software, refer to the HPE Support Center or download it directly from My HPE Software Center. The HPE Support Center contains a wealth of documentation, training videos, knowledge articles, and alerts for HPE Cray Supercomputing EX systems. It provides the most detailed information about a release as well as direct links to product firmware, software, and patches available through My HPE Software Center.
Downloading the software through the HPE Support Center
HPE recommends downloading software through the HPE Support Center because of the many other resources available on the website.
Visit the HPE Cray Supercomputing EX product page on the HPE Support Center.
Search for specific product info, such as the full software name or recipe name and version.
For example, search for “Slingshot 2.1” or “Cray System Software with CSM 24.3.0.”
Find the desired software in the search results and select it to review details.
Select Obtain Software and select Sign in Now when prompted.
If a customer’s Entitlement Order Number (EON) is tied to specific hardware rather than software, the software is available without providing account credentials. Access the software instead by selecting Download Software and skip the next step in this procedure.
Enter account credentials when prompted and accept the HPE License Terms.
To download software, customers must ensure their Entitlement Order Number (EON) is active under My Contracts & Warranties on My HPE Software Center. If customers have trouble with the EON or are not entitled to a product, they must contact their HPE contract administrator or sales representative for assistance.
Choose the needed software and documentation files to download and select curl Copy to access the files.
Just like the software files, the documentation files change with each release. In addition to the official documentation, valuable information for a release is often available in files that include the phrase README in their name. Be sure to select and review these files in detail.
HPE recommends the curl Copy option, which downloads a single text file with curl commands to use on the desired system. You must run the curl commands within 24 hours of downloading them or download new commands if more than 24 hours have passed.
To validate the security of the downloads, you can later compare the files on the desired system against the checksums provided by HPE underneath each selected download.
Save the text file to a central location.
On the system where the software will be downloaded, run a shell script to execute the text file that includes the curl commands.
For example:
ncn-m001# bash -x <TEXT_FILE_PATH>
The -x option in this example prints each curl command in the text file as it runs, which makes the download progress easy to follow.
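The checksum comparison mentioned in the procedure above could be scripted along the following lines. This is a sketch only: the function name verify_sha256 is hypothetical, and it assumes the published checksum is a SHA-256 digest, which this guide does not state. Substitute the matching tool if a different algorithm is listed with the download.

```shell
# verify_sha256 FILE EXPECTED
# Returns 0 if the SHA-256 digest of FILE equals EXPECTED, non-zero otherwise.
# Assumption: the checksum published with the download is SHA-256.
verify_sha256() {
  actual=$(sha256sum "$1" | awk '{print $1}')
  [ "${actual}" = "$2" ]
}

# Usage (placeholders, not real values):
#   verify_sha256 <DOWNLOADED_FILE> <CHECKSUM_FROM_HPE> && echo "checksum OK"
```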
Downloading the software directly from the My HPE Software Center
Users already familiar with a release can save time by downloading software directly from My HPE Software Center.
Visit My HPE Software Center and select Sign in.
Enter account credentials when prompted and select Software in the left navigation bar.
Search for specific product info, such as the full software name or recipe name and version.
For example, search for “Slingshot 2.1” or “Cray System Software with CSM 24.3.0.”
Find the desired software in the search results and review details by selecting Product Details under the Action column.
Select Go To Downloads Page and accept the HPE License Terms.
To download software, customers must ensure their Entitlement Order Number (EON) is active under My Contracts & Warranties. If customers have trouble with the EON or are not entitled to a product, they must contact their HPE contract administrator or sales representative for assistance.
Choose the needed software and documentation files to download and select curl Copy to access the files.
Just like the software files, the documentation files change with each release. In addition to the official documentation, valuable information for a release is often available in files that include the phrase README in their name. Be sure to select and review these files in detail.
HPE recommends the curl Copy option, which downloads a single text file with curl commands to use on the desired system. You must run the curl commands within 24 hours of downloading them or download new commands if more than 24 hours have passed.
To validate the security of the downloads, you can later compare the files on the desired system against the checksums provided by HPE underneath each selected download.
Save the text file to a central location.
On the system where the software will be downloaded, run a shell script to execute the text file that includes the curl commands.
For example:
ncn-m001# bash -x <TEXT_FILE_PATH>
The -x option in this example prints each curl command in the text file as it runs, which makes the download progress easy to follow.
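Because the curl commands must be run within 24 hours of being downloaded, it can help to check the age of the text file before executing it. The sketch below assumes that the file's modification time approximates its download time, which may not hold if the file was copied or edited afterward. The function name curl_file_is_stale is hypothetical.

```shell
# curl_file_is_stale FILE
# Returns 0 if FILE was last modified more than 24 hours (1440 minutes) ago.
# Assumption: the modification time approximates when the file was downloaded.
curl_file_is_stale() {
  [ -n "$(find "$1" -mmin +1440 2>/dev/null)" ]
}

# Usage (placeholder path):
#   curl_file_is_stale <TEXT_FILE_PATH> && echo "Download a fresh set of curl commands"
```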
About Ansible
Ansible is an open-source software provisioning and configuration
management tool. The CPE Installer leverages Ansible playbooks and roles
to install CPE components. Below is an example of the pe_deploy.yml
playbook:
---
- hosts: uai:Application_UAN:Application:Compute
  any_errors_fatal: true
  gather_facts: no
  remote_user: root
  pre_tasks:
    - name: Unmount any overlays first
      command: bash /etc/cray-pe.d/pe_overlay.sh cleanup
      when:
        - not cray_cfs_image
        - forcecleanup | default(false)
      ignore_errors: yes
  roles:
    - { role: cray.pe_deploy, cray_pe_pkg: aocc, when: not cray_cfs_image }
    - { role: cray.pe_deploy, cray_pe_pkg: intel, when: not cray_cfs_image }
    - { role: cray.pe_deploy, when: not cray_cfs_image }
  post_tasks:
    - name: Run mount overlay setup script
      command: bash /etc/cray-pe.d/pe_overlay.sh
      when:
        - not cray_cfs_image
        - not forcecleanup | default(false)
IMPORTANT: You must update Ansible .yml files when performing custom installations. Update these files with great caution. YAML does not allow tab characters for indentation; use spaces only. See the Ansible Documentation for more information and details about Ansible syntax.
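One way to catch stray tab characters before running a playbook is a quick search of the edited file. The following is a minimal sketch; the function name has_tabs is hypothetical, and the check does not validate YAML syntax in any other respect.

```shell
# has_tabs FILE
# Returns 0 if FILE contains at least one tab character, non-zero otherwise.
has_tabs() {
  grep -q "$(printf '\t')" "$1"
}

# Usage (placeholder file name):
#   has_tabs <PLAYBOOK>.yml && echo "Replace tabs with spaces before running Ansible"
```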
Installation prerequisites
Before installing HPE Cray Programming Environment on HPE Cray Supercomputing EX systems running HPE Cray System Management (CSM), make sure that your system is a supported configuration. See the Release information section of this guide for more details. You must also have:
Root administrator access permissions to properly run the CPE Installer. Ansible needs these permissions to create the directory structure and install various elements of the CPE. Root access is not required to run the CPE; root access is required only to install or upgrade CPE.
Familiarity with:
Linux - To properly run the CPE Installer, an understanding of Linux file system basics is necessary.
Ansible - Knowledge of running Ansible and using Ansible playbooks is required. See Ansible Documentation for more information.
YAML - YAML is a human-readable data-serialization language. Ansible playbooks are stored in .yml format. Knowledge of YAML is not necessary to run Ansible playbooks but is useful for image customization.
Kubernetes (optional) - If you are installing CPE on containerized User Access Instance (UAI) nodes, an understanding of Kubernetes could be helpful but is not necessary to install or use the nodes.
Installing and upgrading CPE using the Install and Upgrade Framework procedure
The Install and Upgrade Framework (IUF) is a CLI- and API-based process used to install, upgrade, and deploy CPE. The IUF process offers advantages for installing CPE onto compatible HPE Cray Supercomputing EX systems. These benefits include minimized user intervention, reduced time constraints, and a more automated and simplified method for installing CPE. The IUF method can be used with CSM 1.4 or higher.
The instructions in this chapter provide detailed steps and information for using the IUF process to install or upgrade CPE.
Installing or Upgrading CPE with IUF
PREREQUISITES
Be sure to:
Review Installation Prerequisites before proceeding with these installation and upgrade procedures.
Download third-party compilers from their respective websites (for example, AOCC, Intel). CPE does not distribute third-party compilers.
Use CSM 1.4 or higher for installing or upgrading CPE.
OBJECTIVE
This procedure details how to install or upgrade the base HPE Cray Programming Environment on an HPE Cray Supercomputing EX system using IUF. The same instructions are followed whether installing CPE using IUF for the first time or upgrading CPE using IUF on a previously installed system.
PROCEDURE
SSH into the management node:
user@hostname> ssh root@<system>-ncn-m001
Create a directory for the activity media. For example:
ncn-m001# mkdir -p /etc/cray/upgrade/csm/<activity_name>
Copy media to your activity directory:
ncn-m001# cd /etc/cray/upgrade/csm/<activity_name>
ncn-m001# cp ../reference_media/cpe-<CPE_RELEASE>-sles15-<spX>-\
csm-<CPE_VERSION>.tar.gz .
Copy in reference bootprep config files:
ncn-m001# cp -a /etc/cray/upgrade/csm/hpc-csm-software-recipe-23.1.18/vcs/* .
If this is a CPE upgrade only, make a copy of /etc/cray/upgrade/admin/site_vars.yaml, and update the suffix and note values so that any artifacts created can be easily associated with the CPE upgrade.
Example site_vars.yaml:
default:
  network_type: "cassini"
  suffix: "cpe-23.5.4.upgrade"
  note: "bob-"
Run the iuf CLI command:
ncn-m001# iuf -a cpemedia -m /etc/cray/upgrade/csm/cpemedia run --site-vars \
/etc/cray/upgrade/csm/site_vars.yaml --recipe-vars product_vars.yaml --bootprep-config-managed \
bootprep/compute-and-uan-bootprep.yaml --bootprep-config-management \
bootprep/management-bootprep.yaml -b process-media -e update-vcs-config
The above example uses cpemedia as the activity_name.
See the IUF section of the Cray System Management Documentation for details on iuf command line options.
Verify that CPE installed successfully. Use kubectl to print all CPE versions in the product catalog configmap, and double-check that the latest version is also in the list. Note that the latest CPE version is likely not at the end of the output; scroll up through the output to locate the latest CPE version.
ncn-m001# kubectl get cm cray-product-catalog -n services -o json | jq -r .data.cpe
...
<CPE_VERSION>:
  configuration:
    clone_url: https://vcs.hostname.com/vcs/cray/cpe-config-management.git
    commit: 341017e953c3c57dd46ddbccec168ca28af9199a
    import_branch: cray/cpe/<CPE_VERSION>
    import_date: 2023-01-24 20:10:42.950742
    ssh_url: git@host.com:cray/cpe-config-management.git
(Optional) Upload third-party artifact(s) to a Nexus repository.
Change directory to the expanded CPE artifacts that exist in the media folder. In this case, use /etc/cray/upgrade/csm/cpemedia. Third-party packages can be copied over to the /etc/cray/upgrade/csm/cpemedia folder.
ncn-m001# cd /etc/cray/upgrade/csm/cpemedia/cpe-<CPE_VERSION>-sles15-<spX>
The CPE release tar file contains a script, install-3p.sh, that uploads third-party packages to Nexus repositories. New repositories are automatically created if they do not already exist. The script has two modes of operation:
Uploading a file:
ncn-m001# install-3p.sh <FILE> <REPO_NAME>
Uploading RPM files, where <RPM_DIR> is a directory of RPMs:
ncn-m001# install-3p.sh <RPM_DIR> <REPO_NAME>
Uploading RPM files automatically generates RPM repository metadata required for installation using Zypper.
Specific product examples:
AMD AOCC Compiler
ncn-m001# cpe-<CPE_RELEASE>-sles15-<spX>/install-3p.sh \
aocc-compiler-3.2.0.tar aocc-compiler-3.2.0-linux-x86_64-raw
ARM Forge
ncn-m001# cpe-<CPE_RELEASE>-sles15-<spX>/install-3p.sh \
arm-forge-21.1.2-linux-x86_64.tar arm-forge-21.1.2-linux-x86_64-raw
Intel oneAPI
Note that the oneAPI 2022.2.0 release uses the version string "2022.2.0" for RPM versions and installation paths. Therefore, it is the version number needed for installation scripts.
ncn-m001# tar xf intel-oneapi-2022.2.0.tar
ncn-m001# cpe-<CPE_RELEASE>-sles15-<spX>/install-3p.sh \
intel-oneapi-2022.2.0/ intel-oneapi-2022.2.0
TotalView
ncn-m001# cpe-<CPE_RELEASE>-sles15-<spX>/install-3p.sh \
totalview-2022.1.11-0.x86_64.rpm totalview-2022.1.11-linux-x86_64-yum
Installation of CPE is now complete. If other HPE Cray Supercomputing EX software products are being installed or upgraded in conjunction with CPE, refer to the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM S-8052 to determine which step to execute next; see HPE Cray Supercomputing EX software documentation links for a direct link. Otherwise, continue to the next sections of this document for operations to configure and deploy new CPE images.
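The verification step in the procedure above asks you to scan the product catalog output for the newly installed version. A minimal sketch of automating that scan follows. The function name cpe_version_listed is hypothetical, and the check assumes that each version appears as an unquoted, unindented key on its own line, as in the sample output shown in the procedure.

```shell
# cpe_version_listed VERSION
# Reads product catalog text on standard input (for example, the output of
# the kubectl | jq command in the procedure) and returns 0 if a top-level
# "VERSION:" line is present.
# Assumption: version keys are unquoted and unindented, as in the sample output.
cpe_version_listed() {
  grep -Fxq "$1:"
}

# Usage on a system with kubectl and jq available (placeholder version):
#   kubectl get cm cray-product-catalog -n services -o json \
#     | jq -r .data.cpe | cpe_version_listed <CPE_VERSION> && echo "found"
```

If the check reports nothing, review the full output by eye before concluding that the installation failed, since the key format on a given system may differ from the sample.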
Optional third-party product image customization
PREREQUISITES
The CPE package must be installed, and third-party artifacts must be available in a Nexus repository.
OBJECTIVE
Configure the third-party products AOCC, Forge, and Intel oneAPI into a new CPE image that is deployed together with CPE.
HPE provides Ansible customization roles for the AMD AOCC Compiler, Intel oneAPI, and Forge. Some steps in this procedure use the AOCC customization as an example; however, the procedure is similar for the other products.
Product Ansible roles:
AMD AOCC Compiler: cray.pe_aocc_customize
Intel oneAPI: cray.pe_intel_customize
Forge: cray.pe_forge_customize
TotalView: cray.pe_totalview_customize
NVIDIA: cray.pe_nvidia_customize
For products supporting GPUs (NVIDIA HPC SDK, AMD ROCm), refer to the GPU Support section of HPE Cray Supercomputing User Services Software Administration Guide: CSM on HPE Cray Supercomputing EX Systems for installation instructions; see HPE Cray Supercomputing EX software documentation links for a direct link to the appropriate guide.
PROCEDURE
(IUF Installation) Clone and create the local VCS repository:
ncn-m001# mkdir -p /var/tmp/cpe
ncn-m001# cd /var/tmp/cpe
ncn-m001# git clone https://crayvcs@api-gw-service-nmn.local/vcs/cray/cpe-config-management.git
Enter password: <password obtained from secret credentials>
ncn-m001# cd /var/tmp/cpe/cpe-config-management
ncn-m001# git checkout integration-<CPE_RELEASE>
(Non-IUF Installation) The CPE install.sh script executed earlier cloned a local VCS repository. Change directory into the new path and continue:
ncn-m001# cd /var/tmp/cpe/cpe-config-management
Verify that the default variables for the image customization role match the values used earlier for uploading to Nexus:
ncn-m001# vi roles/cray.pe_aocc_customize/defaults/main.yml
For the cray.pe_intel_customize role, intel_pkgs can be modified to install a different set of oneAPI components.
(For systems with COS 2.5) Add the pre_tasks code to the image customization playbook:
ncn-m001# vi pe_aocc_customize.yaml
- hosts: Application:Compute
  any_errors_fatal: true
  remote_user: root
  pre_tasks:
    - name: Remove PTF repository
      zypper_repository:
        name: "SUSE-SLE-Module-Basesystem-15-SP4-x86_64-PTF"
        state: absent
  roles:
    - { role: ca-cert, when: cray_cfs_image | default(false) }
    - { role: cray.pe_aocc_customize, when: cray_cfs_image | default(false) }
(Forge only) Copy the license file to roles/cray.pe_forge_customization/files/License.dat or populate the existing empty License.dat file with the license information.
(TotalView only) Copy the License.dat or tv_license_file license file to roles/cray.pe_totalview/files/, and update roles/cray.pe_totalview/defaults/main.yml based on the type of license.
Example using the FNP license:
totalview:
  ...
  license_path: "/opt/toolworks/FNP_license"
  license_file: "License.dat"
Example using the FNE license:
totalview:
  ...
  license_path: "/opt/toolworks/FNE_license"
  license_file: "tv_license_file"
Add, commit, and push changes:
ncn-m001# git commit -am "Add customizations to install the AMD AOCC compiler"
ncn-m001# git push -u origin cpe-<CPE_RELEASE>-integration
Run the CPE image customization script with a parameter (aocc, intel, nvidia, forge, or totalview) that specifies which built-in playbook to use:
If a previous version of CPE is installed:
Determine the IMS image ID of the cpe-barebones image:
ncn-m001# kubectl get cm cray-product-catalog -n services -o json | jq -r .data.cpe
Set the environment variable BASE_IMG_ID to the IMS image ID:
ncn-m001# export BASE_IMG_ID=<IMS_image_ID>
Run the CPE image customization script:
ncn-m001# ./cpe-custom-img.sh aocc
A new image (deployable with the provided CPE image) is created after the CFS session completes.
Record the result_id for use when preparing the CPE deployment:
ncn-m001# cray cfs sessions describe cpe-aocc-customization \
    --format json | jq -r .status.artifacts[].result_id
0e54050a-c43c-4534-ba38-7191838e348d
Repeat the steps above for each third-party product image that needs CPE support customization. Then, continue to the Configuring CPE Using CFS section to prepare the CPE deployment.
Configuring CPE using CFS
PREREQUISITES
Make sure you have:
Installed the CPE package for systems running CSM 1.4.X/COS 2.5.X or later. See Installing or Upgrading CPE with IUF for installation information.
Completed any optional image customization or third-party product installations. See Optional third-party product image customization for more information.
OBJECTIVE
This section provides details for preparing a CPE CFS layer for product integration.
PROCEDURE
Some of the following steps are specific to IUF installations. Steps annotated with Non-IUF Installation are for non-IUF environments. Steps annotated with IUF Installation are specific only to IUF environments.
Clone the CPE configuration management repository, and check out the integration branch. The CPE install.sh or iuf run command executed earlier created a new local integration branch.
ncn-m001# mkdir -p /var/tmp/cpe
ncn-m001# cd /var/tmp/cpe
ncn-m001# git clone https://crayvcs@api-gw-service-nmn.local/vcs/cray/cpe-config-management.git
Enter password: <password obtained from secret credentials>
ncn-m001# cd /var/tmp/cpe/cpe-config-management
ncn-m001# git checkout integration-<CPE_RELEASE>
Configure images to deploy. The order of roles is important: The first is the top-most layer and also the default image; lower layers and non-default images must follow.
The cray_pe_pkg parameter values include:
base: Contains the base CPE content, including PrgEnv-cray and PrgEnv-gnu. This value is the default if cray_pe_pkg is not set.
intel: Contains PrgEnv-intel for Intel oneAPI support.
aocc: Contains PrgEnv-aocc for AMD Optimizing C/C++ Compiler support.
amd: Contains PrgEnv-amd for AMD ROCm support. Deploy this image for AMD GPU-enabled systems.
nvidia: Contains PrgEnv-nvhpc for NVIDIA HPC SDK support. Deploy this image for NVIDIA GPU-enabled systems.
To deploy a customized image, set img_id to the IMS image ID of the customized image (recorded during Optional third-party product image customization), and give the image a unique name with img_name.
The following example deploys CPE base and aocc images with the current and a previous version of CPE, along with two versions of the AOCC compiler.
Example:
To deploy supported third-party CPE images for both x86 and ARM/AArch64 nodes, in addition to hybrid environments:
ncn-m001# vim pe_deploy.yml
roles:
  - { role: cray.pe_deploy, when: not cray_cfs_image }
  - { role: cray.pe_deploy, cray_pe_version: "21.10", when: not cray_cfs_image }
  - { role: cray.pe_deploy, cray_pe_pkg: aocc,
      when: (not cray_cfs_image) and (ansible_architecture == 'x86_64') }
  - { role: cray.pe_deploy, cray_pe_pkg: aocc, cray_pe_version: "21.10",
      when: (not cray_cfs_image) and (ansible_architecture == 'x86_64') }
  - { role: cray.pe_deploy, img_name: "aocc-compiler-3.1.0",
      img_id: "1f506586-e447-4c2a-b38d-1158cb29e4f8",
      when: (not cray_cfs_image) and (ansible_architecture == 'x86_64') }
  - { role: cray.pe_deploy, img_name: "aocc-compiler-3.0.0",
      img_id: "0e54050a-c43c-4534-ba38-7191838e348d",
      when: (not cray_cfs_image) and (ansible_architecture == 'x86_64') }
If the ansible_architecture variable is undefined, the system automatically determines the x86_64 or AArch64 nodes on which to deploy, as in the first two example lines above (base-latest and base-21.10). If the variable is defined, the update applies only to the target set of nodes. In the above example, aocc-latest, aocc-21.10, aocc-compiler-3.1.0, and aocc-compiler-3.0.0 are limited to x86_64 nodes because they are not supported on AArch64.
You can use:
git diff origin/integration-<prev_release>.. to check differences between the latest and previous integration branches.
git checkout origin/integration-<prev_release> -- pe_deploy.yml to pick up previously customized files, as needed.
(Optional) Customize site modules. In roles/cray.pe_deploy/defaults/main.yml, you can set custom values to meet site-specific needs for the following templates:
cray-pe-configuration.csh.j2
cray-pe-configuration.sh.j2
The above templates are in roles/cray.pe_deploy/templates.
Acceptable variables include:
cray_pe_module_prog: Defines the default module handling system, either Lmod (Lua) or Environment Modules (TCL).
cray_pe_default_prgenv: Defines the default programming environment.
cray_pe_mpaths: Defines any site-specific paths to be added to MODULEPATH to make site modules available.
cray_pe_init_module_list: Defines the modules to be loaded on login.
cray_pe_site_module_list: Defines additional site modules to be loaded upon login.
cray_pe_prgenv_module_list: Defines modules to be swapped as part of the PrgEnv module.
cray_pe_one_off_set_defaults: Defines a list of paths to set_default scripts to be run at deploy time. This variable enables you to set default versions at the component level.
For example, to set Lmod as the default module handling system in the image, set cray_pe_module_prog: lmod in roles/cray.pe_deploy/defaults/main.yml.
Commit and push the changed files in git:
ncn-m001# git commit -am "Update CPE packages and image layers"
ncn-m001# git push -u origin integration-<CPE_RELEASE>
(IUF installation) IUF creates the CFS configuration layer for CPE, provides a bootprep file that will be used to customize images/personalize nodes, and creates BOS session templates.
Example:
The example specifies activity_name == cpemedia and shows other example paths. The administrator sets up IUF-related files in Installing or Upgrading CPE with IUF.
Note: On systems with only AArch64 nodes:
Edit the /etc/cray/upgrade/csm/site_vars.yaml file and, using the latest CPE version, add, for example:
cpe-aarch64:
  version: 23.12.3
  working_branch: "{{ working_branch }}"
Edit both the /etc/cray/upgrade/csm/cpemedia/bootprep/compute-and-uan-bootprep.yaml file and the management-bootprep.yaml file by updating them from:
name: cpe
version: "{{cpe.version}}"
branch: "{{cpe.working_branch}}"
To:
name: cpe-aarch64
version: "{{cpe_aarch64.version}}"
branch: "{{cpe_aarch64.working_branch}}"
ncn-m001# cd /etc/cray/upgrade/csm/cpemedia
ncn-m001# iuf -a cpemedia -m /etc/cray/upgrade/csm/cpemedia run --site-vars \
    /etc/cray/upgrade/csm/site_vars.yaml --recipe-vars product_vars.yaml \
    --bootprep-config-managed bootprep/compute-and-uan-bootprep.yaml \
    --bootprep-config-management bootprep/management-bootprep.yaml \
    -b update-cfs-config -e prepare-images
(Non-IUF installation) CPE includes an automation script that creates a new CFS configuration with the latest CPE version and commit ID. The script has two optional parameters (CFS_name and apply):
ncn-m001# cpe-cfs.sh [CFS_name] [apply]
Note, however, that if:
No parameters are specified, the script uses the latest cpe-yy.mm-integration branch in CFS.
SAT (excluding SAT 2.2.15 or earlier) is installed, the script also outputs a section of YAML code for use in a sat bootprep input file for integration with other products. Refer to the SAT bootprep section of the SAT product stream documentation for more information. See HPE Cray Supercomputing EX software documentation links for a direct link.
The CFS_name parameter is specified, the script proposes a new .json file that adds or replaces any existing CPE layer.
The apply parameter is specified, the script modifies the CFS configuration using the proposed .json file. HPE recommends a trial run without the apply parameter to verify the results, followed by a rerun with the apply parameter to incorporate the changes. For example:
ncn-m001# ./cpe-cfs.sh cos-config-2.1.27 [apply]
...
________________________________________
Updating new CPE CFS configuration ...
{
  "lastUpdated": "2021-10-13T21:05:40Z",
  "layers": [
    {
      "cloneUrl": "https://api-gw-service-nmn.local/vcs/cray/cpe-config-management.git",
      "commit": "4194bd87979f876400fa9159a60985dacee06a3b",
      "name": "cpe-21.11-integration",
      "playbook": "pe_deploy.yml"
    }
  ],
  "name": "cpe-21.11-integration"
}
________________________________________
Generating new layers for cos-config-2.0.27 ...
Proposed new layers for cos-config-2.0.27:
{
  "layers":
}
...
(Non-IUF installation) If sat bootprep is not used (for example, for NCN personalization of UAI hosts), then the CFS_name parameter of cpe-cfs.sh must be specified:
(HPE recommended) Run the script without the apply parameter to verify the results, then rerun it with the apply parameter to incorporate the changes. Do not use the apply parameter if sat bootprep is used.
Rerun the script with the apply parameter for COS, UAN, and NCN personalization CFS configurations as necessary.
Update BOS session templates to ensure the latest CPE CFS configurations are included on all nodes after reboots. Refer to Configuration Management in the Cray System Management Administration Guide for details; see HPE Cray Supercomputing EX software documentation links for a direct link.
After CFS completes on a compute or UAN node, run module list to check whether PE is ready to use.
Example output:
Note that module versions below are examples only, and may differ from those currently loaded on the system. For current CPE release product versions, see the release announcement.
nid000001# module list
Currently Loaded Modulefiles:
  1) craype-x86-rome          5) xpmem/2.6.2-2.5_2.27--gd067c3f.shasta   9) cray-mpich/8.1.28
  2) libfabric/1.15.2.0       6) cce/17.0.0                             10) cray-libsci/23.12.5
  3) perftools-base/23.12.0   8) craype/2.7.30                          11) PrgEnv-cray/8.5.0
  4) craype-network-ofi       7) cray-dsmml/0.2.2
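The readiness check above can be scripted. The following is a minimal sketch, assuming that a loaded PrgEnv-* module in the module list output is a sufficient sign that PE is usable. The function name pe_ready is hypothetical and is not part of CPE.

```shell
#!/bin/bash
# pe_ready: hypothetical helper (not part of CPE).
# Reads `module list` output on stdin and succeeds if a versioned
# PrgEnv-* module (for example, PrgEnv-cray/8.5.0) appears in it.
pe_ready() {
    grep -Eq '(^|[[:space:]])PrgEnv-[A-Za-z0-9]+/[0-9]'
}

# Example using a line captured from the sample output above.
sample='  3) perftools-base/23.12.0   8) craype/2.7.30   11) PrgEnv-cray/8.5.0'
if printf '%s\n' "$sample" | pe_ready; then
    echo "PE is ready"
else
    echo "PE is not ready"
fi
```

On a node, the live command would be piped in instead, for example `module list 2>&1 | pe_ready`. The redirection is included because module commands commonly write their listing to standard error.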
Enabling CPE in UAIs
PREREQUISITES
HPE CPE must be installed on an HPE Cray Supercomputing EX system running CSM. See Installation Prerequisites for version requirements.
A WLM must be customized into a compute image.
OBJECTIVE
This procedure ensures that UAIs run CPE after CPE is installed.
PROCEDURE
Check out cpe-config-management from VCS using git:
Acquire the VCS username and password for git operations:
ncn-m001# kubectl get secret -n services vcs-user-credentials \
    --template={{.data.vcs_username}} | base64 -d
ncn-m001# kubectl get secret -n services vcs-user-credentials \
    --template={{.data.vcs_password}} | base64 -d
Clone the VCS repository:
git clone https://api-gw-service-nmn.local/vcs/cray/cpe-config-management.git
cd cpe-config-management
List available branches:
git branch -r
If CPE was previously installed, clear out any UAS projection paths that CPE previously created by running a script from the earlier CPE branch:
ncn-m001# git checkout integration-<prev_cpe_version>
ncn-m001# bash roles/cray.pe_deploy/files/uas_setup_pe.sh clean
To set up a new or changed compute image for CPE, run the uas_compute_init.sh script, where <BOS_session_template> corresponds to a template that includes COS and WLM. Running the script without this parameter shows a list of available choices.
ncn-m001# git checkout integration-<latest_cpe_version>
ncn-m001# bash roles/cray.pe_deploy/files/uas_compute_init.sh \
    <BOS_session_template>
Do one of the following:
If the uas_compute_init.sh script runs to completion with no errors, add CPE to the management-x.y CFS layer, and run it again on the worker (UAI) nodes. This UAI procedure is then complete, and the remaining steps in this procedure can be skipped.
If CPE needs to be added to the management-x.y CFS layer:
Locate the management-x.y.z CFS configuration that was created during the upgrade (for example, management-23.11.1):
cray cfs configurations list | jq -r '.[] | .name' | sort | egrep "^management"
Verify that the CPE layer exists in the CFS configuration:
cray cfs configurations describe management-23.11.1 --format json
If errors exist, continue to the next step, which breaks down the script and references the documentation on which it is based.
To update the UAS projection paths for CPE, run the uas_setup_pe.sh script. This script can also be run anytime the CPE paths are reset in the UAS.
ncn-m001# bash roles/cray.pe_deploy/files/uas_setup_pe.sh
Check that the uai HSM group exists. The group should contain all worker nodes designated as UAI hosts.
ncn-m001# cray hsm groups describe uai
If the uai group label does not exist, run the helper script:
ncn-m001# /opt/cray/csm/scripts/node_management/make_node_groups -u
See User Access Service in the Cray System Management Administration Guide for further information; see HPE Cray Supercomputing EX software documentation links for a direct link.
Check for a UAI image based on compute nodes:
ncn-m001# cray uas admin config images list
[[results]]
default = true
image_id = "85c7fd74-c410-4920-a452-bd84d27d238e"
imagename = "registry.local/cray/cray-uai-compute:latest"
If no similar image name exists, see User Access Service in the Cray System Management Administration Guide for details on creating and registering a custom UAI image. See HPE Cray Supercomputing EX software documentation links for a direct link. Note that the BOS session template-specific name may vary. CPE requires one built for computes (COS) with a workload manager included.
If the compute image is not set as default, set it:
ncn-m001# cray uas admin config images update --default yes \
    85c7fd74-c410-4920-a452-bd84d27d238e
CPE is now enabled for UAIs.
Use the cpe-cfs.sh script (described in Configuring CPE Using CFS), or add CPE to the management-x.y CFS layer manually, and run it again on the worker (UAI) nodes. See the Cray System Management Administration Guide for details; see HPE Cray Supercomputing EX software documentation links for a direct link.
After successfully adding CPE to the management CFS layer and running it again on the worker (UAI) nodes, configuration of the HPE Cray Programming Environment is complete. See the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM S-8052 for further installation instructions; see HPE Cray Supercomputing EX software documentation links for a direct link to the appropriate document.
Install previously released CPE packages for CSM on HPE Cray Supercomputing EX systems
PREREQUISITES
HPE CPE must be installed on an HPE Cray Supercomputing EX system running CSM; for version requirements, see Installation Prerequisites.
OBJECTIVE
Install a previously released CPE package, <PREV_RELEASE>, after installing the latest CPE.
Previously released CPE packages must use the installer that comes with the latest CPE package. This procedure installs the <PREV_RELEASE> package along with the latest release.
IMPORTANT: Throughout this procedure, replace instances of <PREV_RELEASE> with the YY.MM value of the desired previous release.
PROCEDURE
Download the old CPE tar file, extract it into a path <untar_path>, and then run the following commands.
ncn-m001# SQFS=CPE-base.x86_64-<PREV_RELEASE>.squashfs
ncn-m001# cray artifacts create boot-images PE/$SQFS <untar_path>/squashfs/$SQFS
Check out the integration branch from the VCS git repo, and update two files to include the <PREV_RELEASE> package for deployment.
ncn-m001# git checkout integration-<PREV_RELEASE>
ncn-m001# vi pe_deploy.yml
roles:
  - { role: cray.pe_deploy, when: not cray_cfs_image }
  - { role: cray.pe_deploy, cray_pe_version: <PREV_RELEASE>, when: not cray_cfs_image }
Update the CFS configuration layers (compute/COS, management-x.y, and UAN) to point to the new integration branch commit ID.
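Because <PREV_RELEASE> must be a YY.MM value, a small guard can catch a mistyped release before the upload command runs. The following is a minimal sketch. The function name cpe_sqfs_name is hypothetical and is not part of CPE, and the check covers only the two-digit form, not whether the release exists.

```shell
#!/bin/bash
# cpe_sqfs_name: hypothetical helper (not part of CPE).
# Validates a release string in YY.MM form and prints the matching
# base squashfs file name used in the upload step of this procedure.
cpe_sqfs_name() {
    local release=$1
    if [[ ! $release =~ ^[0-9]{2}\.[0-9]{2}$ ]]; then
        echo "error: release must be in YY.MM form, got '$release'" >&2
        return 1
    fi
    printf 'CPE-base.x86_64-%s.squashfs\n' "$release"
}

# Example: 23.12 stands in for <PREV_RELEASE>.
SQFS=$(cpe_sqfs_name 23.12) && echo "$SQFS"
```

The resulting SQFS value can then be used in the cray artifacts create command shown in the first step.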
Create modulefiles for third-party products
PREREQUISITES
Third-party packages must be downloaded and installed.
OBJECTIVE
These instructions use craypkg-gen to create a modulefile for a specific version of a supported third-party product. This setup allows a site to set a specific version as default.
The following tasks are necessary and can be embedded in a script where a third-party product is being installed.
PROCEDURE
Load the craypkg-gen module.
ncn-w001# source /opt/cray/pe/modules/default/init/bash
ncn-w001# module use /opt/cray/pe/modulefiles
ncn-w001# module load craypkg-gen
Generate the modulefile and set default scripts for each product:
AMD Optimizing C/C++ Compiler (requires craypkg-gen >= 1.3.16):
ncn-w001# craypkg-gen -m /opt/AMD/aocc-compiler-<MODULE_VERSION>/
NVIDIA HPC SDK (requires craypkg-gen >= 1.3.16):
ncn-w001# craypkg-gen -m /opt/nvidia/hpc_sdk/Linux_x86_64/<MODULE_VERSION>/
Intel oneAPI:
The Intel compiler must be installed in a directory or a symbolic link that follows the <PREFIX>/oneapi/compiler/<VERSION> format before craypkg-gen can create an Intel modulefile. The craypkg-gen utility creates the intel, intel-classic, and intel-oneapi modulefiles after the process completes successfully.
ncn-w001# craypkg-gen -m /opt/intel/oneapi/compiler/<MODULE_VERSION>/
Run a set default script.
ncn-w001# /opt/admin-pe/set_default_craypkg/set_default_<MODULE_NAME>_<MODULE_VERSION>
Lmod custom dynamic hierarchy
Lmod enables a user to dynamically modify their user environment through Lua modules. The CPE implementation of Lmod capitalizes on its hierarchical structure, including the Lmod module auto-swapping function. This structure allows module dependencies to determine the branches of the tree-like hierarchy. Lmod allows static and dynamic hierarchical module paths. Lmod provides full support for static paths, which build the hierarchy based on the current set of modules loaded. Alongside static paths, CPE implemented dynamic paths for a subset of the Lmod hierarchy (compilers, networks, CPUs, and MPIs). Dynamic paths give an advanced level of flexibility for detecting multiple dependency paths and allow custom paths to join an existing CPE-designated Lmod hierarchy without modifying customer modulefiles.
Static Lmod hierarchy
Modules that depend on one or more other modules are not visible to a user until their prerequisite modules are loaded. When the prerequisite modules are loaded, Lmod adds the static paths of the dependent modules to the MODULEPATH environment variable, thereby exposing the dependent modules to the user. For more detailed information on Lmod static module hierarchies, consult the User Guide for Lmod.
Dynamic Lmod hierarchy
The CPE custom dynamic Lmod hierarchy abbreviates the overall Lmod hierarchy tree by relying on compatibility and not directly on a prerequisite version. Therefore, dependent modules do not need to exist in a new branch every time their prerequisite modules change versions. Instead, dynamic paths use a compatibility version that increases when a new prerequisite module version breaks compatibility in some way. The number following the module path alias (for example, 1.0 in x86-rome/1.0 and ofi/1.0) identifies the compatible version.
Module path aliases and current compatibility versions
Compatible versions listed in the following tables include the minimum supported versions.
Compiler | SLES Module Alias/Compatible Version |
---|---|
 | amd/4.0 |
 | crayclang/17.0 |
 | gnu/12.0 |
 | aocc/4.1 |
 | intel/2023.2 |
 | nvidia/20 |
 | nvidia/23.11 |
Network | Module Alias/Compatible Version |
---|---|
 | none/1.0 |
 | ofi/1.0 |
 | ucx/1.0 |
CPU | Module Alias/Compatible Version |
---|---|
 | x86-milan/1.0 |
 | x86-rome/1.0 |
 | x86-trento/1.0 |
MPI | Module Alias/Compatible Version |
---|---|
 | cray-mpich/8.0 |
 | cray-mpich/8.0 |
 | cray-mpich/8.0 |
 | cray-mpich/8.0 |
 | cray-mpich/8.0 |
 | cray-mpich/8.0 |
Custom dynamic hierarchy
The CPE custom dynamic hierarchy extension allows custom module paths to join an existing Lmod hierarchy implementation within CPE without modifying customer modulefiles. Custom dynamic module types that CPE supports include:
Compiler
Network
CPU
MPI
Compiler/Network
Compiler/CPU
Compiler/Network/CPU/MPI
As each custom dynamic module type loads, a handshake occurs using special pre-defined environment variables. When all hierarchical prerequisites are met, the paths of the dependent modulefiles are added to the MODULEPATH environment variable, thereby exposing the dependent modules to the user.
Tip: For Lmod to assist a user optimally, HPE recommends that a compiler, network, CPU, and MPI module are loaded. Lmod cannot detect modules hidden in dynamic paths without one of each type of module being loaded.
Create a custom dynamic hierarchy
PREREQUISITES
Set Lmod as the default module handling system before initiating this procedure.
OBJECTIVE
For the CPE custom dynamic hierarchy to detect the desired Lmod module path, one or more custom dynamic environment variables must be created according to the requirements defined within this procedure.
PROCEDURE
To create a custom dynamic environment variable:
Begin the environment variable name with LMOD_CUSTOM_.
Append the descriptor of the module type that the environment variable will represent. The module types and descriptors are:
Module Type | Descriptor |
---|---|
Compiler | COMPILER_ |
Network | NETWORK_ |
CPU | CPU_ |
MPI | MPI_ |
Compiler/Network | COMNET_ |
Compiler/CPU | COMCPU_ |
Compiler/Network/CPU/MPI | CNCM_ |
Example: The custom dynamic environment variable for the combined compiler and CPU module begins with LMOD_CUSTOM_COMCPU_.
Following the descriptor, append all prerequisite module aliases along with their respective compatible versions. See Module Path Aliases and Current Compatibility Versions for more information. The format of the module path alias/compatible version string for each module type is shown below.
Module Type: Module Path Alias/Compatible Version String
Compiler: <compiler_name>/<compatible_version>
Network: <network_name>/<compatible_version>
CPU: <cpu_name>/<compatible_version>
MPI: <compiler_name>/<compatible_version>/<network_name>/<compatible_version>/<mpi_name>/<compatible_version>
Compiler/Network: <compiler_name>/<compatible_version>/<network_name>/<compatible_version>
Compiler/CPU: <compiler_name>/<compatible_version>/<cpu_name>/<compatible_version>
Compiler/Network/CPU/MPI: <compiler_name>/<compatible_version>/<network_name>/<compatible_version>/<cpu_name>/<compatible_version>/<mpi_name>/<compatible_version>
To create an acceptably formatted environment variable name, replace all slashes, dots, and hyphens in the module alias/compatible version string with underscores, and convert all letters to uppercase.
Example Module Path Alias/Compatible Version Strings:
Compiler = cce
The path alias/compatible version string (values found in Module Path Aliases and Current Compatibility Versions) is crayclang/10.0; therefore, the text added to the environment variable name is: CRAYCLANG_10_0
Network = craype-network-ofi
The path alias/compatible version string is ofi/1.0; therefore, the environment variable text is: OFI_1_0
CPU = craype-x86-rome
The path alias/compatible version string is x86-rome/1.0; therefore, the environment variable text is: X86_ROME_1_0
MPI = cray-mpich
cray-mpich has two prerequisite module types (compiler and network). Therefore, the environment variable must include the alias/compatible version for the desired compiler, network, and MPI. For a cray-mpich module dependent on cce and craype-network-ofi, the path alias/compatible version string is crayclang/10.0/ofi/1.0/cray-mpich/8.0; therefore, the environment variable text is: CRAYCLANG_10_0_OFI_1_0_CRAY_MPICH_8_0
Compiler/Network = cce with craype-network-ofi
The path alias/compatible version string is crayclang/10.0/ofi/1.0; therefore, the environment variable text is: CRAYCLANG_10_0_OFI_1_0
Compiler/CPU = cce with craype-x86-rome
The path alias/compatible version string is crayclang/10.0/x86-rome/1.0; therefore, the environment variable text is: CRAYCLANG_10_0_X86_ROME_1_0
Compiler/Network/CPU/MPI = cce, craype-network-ofi, craype-x86-rome, and cray-mpich
The path alias/compatible version string is crayclang/10.0/ofi/1.0/x86-rome/1.0/cray-mpich/8.0; therefore, the environment variable text is: CRAYCLANG_10_0_OFI_1_0_X86_ROME_1_0_CRAY_MPICH_8_0
Append _PREFIX following the final module/compatibility text instance:
Example: Network = craype-network-ofi
The custom dynamic environment variable is LMOD_CUSTOM_NETWORK_OFI_1_0_PREFIX.
Creation of the custom dynamic environment variable is now complete.
Add the custom dynamic environment variable to the user environment by exporting it with its value set to the Lmod module path:
# export LMOD_CUSTOM_NETWORK_OFI_1_0_PREFIX=<lmod_module_path>
Example: Network = craype-network-ofi
All modulefiles in <lmod_module_path> are shown to users whenever craype-network-ofi is loaded.
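The naming rules in this procedure can be applied mechanically. The following is a minimal sketch that assembles a variable name from a descriptor and a module path alias/compatible version string. The function name lmod_custom_var is hypothetical and is not part of CPE. It also replaces hyphens with underscores, matching the X86_ROME and CRAY_MPICH examples above.

```shell
#!/bin/bash
# lmod_custom_var: hypothetical helper (not part of CPE).
# $1: descriptor without the trailing underscore
#     (COMPILER, NETWORK, CPU, MPI, COMNET, COMCPU, or CNCM)
# $2: module path alias/compatible version string, for example ofi/1.0
lmod_custom_var() {
    local descriptor=$1 alias_string=$2 text
    case $descriptor in
        COMPILER|NETWORK|CPU|MPI|COMNET|COMCPU|CNCM) ;;
        *) echo "error: unknown descriptor '$descriptor'" >&2; return 1 ;;
    esac
    # Replace slashes, dots, and hyphens with underscores, then uppercase.
    text=$(printf '%s' "$alias_string" | tr '/.-' '___' | tr '[:lower:]' '[:upper:]')
    printf 'LMOD_CUSTOM_%s_%s_PREFIX\n' "$descriptor" "$text"
}

# Example: the network variable from this procedure.
lmod_custom_var NETWORK ofi/1.0
```

The printed name can be exported in the same way as the example in the final step, with <lmod_module_path> as its value.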
Troubleshooting common issues
Check here for various troubleshooting topics, which will be added, as necessary.
Some nodes see errors when CFS configurations are applied while updating CPE, and the logs show pe_overlay.sh failed
This issue typically occurs if a process, or interactive shell, has a lock on a path within CPE overlay-mounted paths. A reboot of the affected nodes should mitigate the issue. Afterwards, a re-run of CFS should work on affected nodes.
Before rebooting:
If the affected node is an NCN worker node, make sure no UAIs are still running on it:
ncn-m001# cray uas uais list
Run the following commands manually on the affected node(s) (for example, a UAI host node, UAN, or compute node). The example below uses uan01.
uan01# bash /etc/cray-pe.d/pe_overlay.sh cleanup
uan01# find /var/opt/cray/pe/pe_images -maxdepth 1 -exec umount -f {} \;
uan01# find /var/opt/cray/pe -maxdepth 1 -exec umount -f {} \;
uan01# mount | grep pe_image
The mount command should list no mounts; otherwise, lsof pe_overlay_path might help identify which process needs to be terminated to free up the path.
If no previous CPE mounts are active, a re-run of CFS on the affected node(s) should be successful.
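The mount check above can be wrapped in a function so that a script re-runs CFS only when the node is clean. The following is a minimal sketch. The function name pe_mounts_clear is hypothetical and is not part of CPE, and the sample mount line is illustrative rather than captured from a system.

```shell
#!/bin/bash
# pe_mounts_clear: hypothetical helper (not part of CPE).
# Reads `mount` output on stdin and succeeds only if no line
# mentions pe_image, mirroring the `mount | grep pe_image` check.
pe_mounts_clear() {
    ! grep -q 'pe_image'
}

# Illustrative input only; on an affected node, use: mount | pe_mounts_clear
sample='/dev/loop3 on /var/opt/cray/pe/pe_images/example type squashfs (ro)'
if printf '%s\n' "$sample" | pe_mounts_clear; then
    echo "No CPE mounts remain; CFS can be re-run."
else
    echo "CPE mounts are still active."
fi
```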
Replace an installed CPE release squashfs file for redeployment
If an incorrect CPE image is installed for a release (e.g., service pack 2 instead of service pack 3), follow this procedure to delete both the CPE base image in S3 storage and the CPS cache, and then redeploy the correct image. Repeat as necessary for optional packages such as amd, aocc, intel, or nvidia.
Delete the CPE base image in S3 storage and the CPS cache.
ncn-m001# cray artifacts delete boot-images PE/CPE-base.x86_64-<CPE_RELEASE>.squashfs
ncn-m001# cray cps contents delete \
    --s3path s3://boot-images/PE/CPE-base.x86_64-<CPE_RELEASE>.squashfs
Rerun the CPE install.sh script from the correct .tar package.
ncn-m001# cpe-<CPE_RELEASE>-sles15-<spX>/install.sh
Finally, either:
Rerun CFS on affected nodes.
ncn-m001# cray cfs components update --enabled true --state '[]' --error-count 0 <xnode>
Or:
Reboot all or a limited number (using the --limit parameter) of nodes.
ncn-m001# cray bos session create --template-uuid cos-sessionTemplate-x.y.z \
    --operation reboot [--limit xnode]
HPE Cray Supercomputing EX documentation links
The following table includes release-specific links to related HPE Cray Supercomputing EX documentation referenced within this guide. If the link points to a PDF on the HPE Support Center website, use the reference string to search within the PDF for the applicable content.
Reference |
COS 24.7/CSM 1.5.0 links |
---|---|
Software Stack Installation and Upgrade Guide (S-8052) |
|
GPU Support |
|
SAT |
|
CSM User Access Service |
|
Use SAT to perform compute node operations |
|
Boot orchestration |
|
Boot UAN nodes |
|
CSM Administration Guide |
|
CSM Configuration Management |
|
Ceph Latency Issues |