HPE Cray PE DL Plugin

Name

craype-dl-plugin - introduces the HPE Cray PE DL Plugin for accelerating distributed deep learning

Description

The HPE Cray PE DL Plugin provides a highly tuned communication layer that can be easily added to any deep learning framework. Starting from a single-process version of a deep learning application, users can include the HPE Cray PE DL Plugin through the C or Python APIs. The provided routines include a high-performance gradient averaging operation. Other routines facilitate process identification, job size determination, and broadcasting of initial weights and biases. HPE Cray PE DL Plugin 23.09.1 supports TensorFlow v2.11 and PyTorch v1.12.

More information about the Python API is available from within Python. For example:

% python
>>> import dl_comm as cdl
>>> help(cdl)
>>> help(cdl.gradients)

Basic C API

   int dl_comm_init_mpi();

          Initializes MPI components of the Plugin.

   int dl_comm_init();

          Initializes the Plugin.

   int dl_comm_finalize();

          Cleans up the Plugin at the end of execution.

   int dl_comm_create_team(int teamID, int nthreads_in_team, int prec_level);

          Creates a thread team. teamID specifies the base-zero team index to create. nthreads_in_team defines how many threads to use for
          the team's communication. prec_level determines whether single (float) or double precision is used for math operations: single
          precision is specified via 0 and double via 1.

   int dl_comm_gradients(void *** all_tensors, int ** lengths, dlDataType_t ** dtypes,
                         int * ntensors_on_tower, int ntowers, int teamID);

          Computes the average values for each input buffer across every process. all_tensors is a list of model towers, where each tower
          contains its list of gradient tensor data. Results are stored in-place. lengths is each tower's list of tensor lengths. dtypes is
          each tower's list of tensor datatypes. ntensors_on_tower specifies the number of tensors stored in each tower's list. ntowers
          specifies the number of towers. teamID specifies which thread team to use for communication. (See the example at the end of this
          section.)

   int dl_comm_broadcast(void *** all_tensors, int ** lengths, dlDataType_t ** dtypes,
                         int * ntensors_on_tower, int ntowers, int root);

          Broadcasts the given set of tensors from the root rank to all other ranks. all_tensors is a list of model towers, where each tower
          contains its list of tensor data. Results are stored in-place. lengths is each tower's list of tensor lengths. dtypes is each
          tower's list of tensor datatypes. ntensors_on_tower specifies the number of tensors stored in each tower's list. ntowers specifies
          the number of towers. root specifies which process to broadcast from.

   int dl_comm_get_rank();

          Retrieves the calling process's rank.

   int dl_comm_localrank();

          Retrieves the node-local rank of a process (for example, on a node with eight processes, results are between 0 and 7).

   int dl_comm_get_nranks();

          Retrieves the total number of processes.
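
   The following minimal sketch shows how these routines fit together for a single tower holding one gradient tensor. It is illustrative
   only: the header name dl_comm.h and the dlDataType_t constant DL_FLOAT are assumptions (neither is named on this page), and the int
   return codes are ignored. Check the headers and examples shipped with the Plugin for the actual names.

   #include <stdio.h>
   #include <stdlib.h>

   #include "dl_comm.h"   /* assumed header name */

   int main(void) {
       /* Initialize the MPI components, then the Plugin itself. */
       dl_comm_init_mpi();
       dl_comm_init();

       /* Team 0 with 2 communication threads, using single (float) precision. */
       dl_comm_create_team(0, 2, 0);

       int rank   = dl_comm_get_rank();   /* global rank */
       int local  = dl_comm_localrank();  /* node-local rank, e.g. for GPU selection */
       int nranks = dl_comm_get_nranks(); /* total number of processes */

       /* One tower holding a single tensor of 1024 floats. */
       enum { NTOWERS = 1, NTENSORS = 1, LEN = 1024 };
       float *grad = malloc(LEN * sizeof *grad);
       for (int i = 0; i < LEN; i++)
           grad[i] = (float)rank;         /* dummy per-rank gradient values */

       /* Per-tower arrays: one entry per tensor on the tower. */
       void        *tower_tensors[NTENSORS] = { grad };
       int          tower_lengths[NTENSORS] = { LEN };
       dlDataType_t tower_dtypes[NTENSORS]  = { DL_FLOAT }; /* assumed constant name */

       /* Top-level arrays: one entry per tower. */
       void        **all_tensors[NTOWERS]      = { tower_tensors };
       int          *lengths[NTOWERS]          = { tower_lengths };
       dlDataType_t *dtypes[NTOWERS]           = { tower_dtypes };
       int           ntensors_on_tower[NTOWERS] = { NTENSORS };

       /* Broadcast rank 0's values so every process starts from the same state. */
       dl_comm_broadcast(all_tensors, lengths, dtypes, ntensors_on_tower, NTOWERS, 0);

       /* Average the buffers in-place across all processes, using team 0. */
       dl_comm_gradients(all_tensors, lengths, dtypes, ntensors_on_tower, NTOWERS, 0);

       if (rank == 0)
           printf("nranks=%d localrank=%d grad[0]=%f\n", nranks, local, grad[0]);

       free(grad);
       dl_comm_finalize();
       return 0;
   }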

Notes and Usage

The HPE Cray PE DL Plugin can be used with TensorFlow, PyTorch, and Keras client applications without recompiling those frameworks. To use the framework-specific components of the HPE Cray PE DL Plugin, refer to the examples in the $CRAYPE_ML_PLUGIN_BASEDIR/examples directory.

These components are built against TensorFlow v2.11, PyTorch v1.12, and Keras' TensorFlow backend, matching the versions listed above. If you use different versions of these frameworks, you may need to build your own versions of these components. In that case, a Python pip source distribution is included in $CRAYPE_ML_PLUGIN_BASEDIR/wheel.

A README found in $CRAYPE_ML_PLUGIN_BASEDIR/examples details how to install this source distribution into your Python environment along with usage rules and troubleshooting tips.

If the compute mode of the GPUs is set to "exclusive process", then nvidia-cuda-mps-control must be launched before using the Plugin. If you are using TensorFlow with the Plugin, the TensorFlow configuration should be modified such that config.gpu_options.per_process_gpu_memory_fraction = 0.7, where config = tf.ConfigProto(). For example usage, refer to the TensorFlow examples included with the Plugin installation.

Examples

The examples directory includes sample Python clients for TensorFlow, Keras, and PyTorch modified to use the plugin. The tf_cnn_benchmarks suite is a common benchmark code for TensorFlow that includes several CNN models. There are many options for running this benchmark, including single- and multiple-worker setups. With the modified version provided, comparisons can be made between the various gRPC-based parallel schemes and parallelization with the HPE Cray PE DL Plugin. The benchmark runs on both CPU and GPU versions of TensorFlow, including MKL optimizations.

To illustrate how to modify a serial training script to enable scalable training with the HPE Cray PE DL Plugin, MNIST training examples are included in the examples directory. For Keras, TensorFlow, and PyTorch, refer to the keras_mnist, tf_mnist, and torch_mnist examples, respectively. In addition to the required Plugin calls, these examples include other typical modifications made when extending a serial script.

Environment Variables

   DL_COMM_DEFAULT_NTHREADS

          Number of threads used when creating teams if dl_comm_create_team is not explicitly called. Defaults to 2.

   DL_COMM_PIPELINE_CHUNK_KB

          Chunk size in KB used to transfer data between the host and GPU. Defaults to 256.

   DL_COMM_NUM_CUDA_STREAMS

          Sets the number of CUDA streams each thread uses for data transfers between the host and GPU. Using more streams can improve
          performance. Defaults to 1.

   DL_COMM_DEFAULT_PREC_LEVEL

          Sets the default precision used for math operations: 0 specifies single (float) precision and 1 specifies double. Defaults to 0.
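
   These variables are normally set in the job launch environment. As a minimal sketch, assuming the Plugin reads them at initialization
   time (this page does not state when they are read), they can also be set programmatically with POSIX setenv(3) before dl_comm_init():

   #include <stdlib.h>

   #include "dl_comm.h"   /* assumed header name, as above */

   int main(void) {
       /* Request 4 default team threads and 2 CUDA streams per thread;
          assumes the Plugin reads these variables during initialization. */
       setenv("DL_COMM_DEFAULT_NTHREADS", "4", 1);
       setenv("DL_COMM_NUM_CUDA_STREAMS", "2", 1);

       dl_comm_init_mpi();
       dl_comm_init();
       /* ... build teams, train, average gradients ... */
       dl_comm_finalize();
       return 0;
   }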

Additional Information

The HPE Cray PE DL Plugin package includes an examples directory covering common use cases.