HPE Cray PE DL Plugin
Name
craype-dl-plugin - introduces the HPE Cray PE DL Plugin for accelerating distributed deep learning
Description
The HPE Cray PE DL Plugin provides a highly tuned communication layer that can be easily added to any deep learning framework. Starting from a single-process version of a deep learning application, users can include the HPE Cray PE DL Plugin through the C or Python APIs. The provided routines include a high performance gradient averaging operation. Other routines facilitate process identification, job size determination, and broadcasting of initial weights and biases. HPE Cray PE DL Plugin 23.09.1 supports TensorFlow v2.11 and PyTorch v1.12.
More information about the Python API is available from within Python. For example:
% python
>>> import dl_comm as cdl
>>> help(cdl)
>>> help(cdl.gradients)
Basic C API
int dl_comm_init_mpi();
Initializes MPI components of the Plugin.
int dl_comm_init();
Initializes the Plugin.
int dl_comm_finalize();
Cleans up the Plugin at the end of execution.
int dl_create_team(int teamID, int nthreads_in_team, int prec_level);
Creates a thread team. teamID specifies the base-zero team index to create. nthreads_in_team defines how many threads to use for the team's communication. prec_level determines whether single or double precision is used to complete math operations: single precision is specified via 0 and double via 1.
int dl_comm_gradients(void *** all_tensors, int ** lengths, dlDataType_t ** dtypes,
int * ntensors_on_tower, int ntowers, int teamID);
Computes the average values for each input buffer across every process. all_tensors is a list of model towers, where each tower contains its list of gradient tensor data. Results are stored in-place. lengths is each tower's list of tensor lengths. dtypes is each tower's list of tensor datatypes. ntensors_on_tower specifies the number of tensors stored in each tower's list. ntowers specifies the number of towers. teamID specifies which thread team to use for communication.
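The averaging that dl_comm_gradients performs can be sketched in plain Python, with no Plugin required: each tensor buffer ends up holding the elementwise mean of that buffer's values across all ranks. The per-rank buffers below are made up for illustration; this sketch models the arithmetic only, not the communication.

```python
# Illustrative sketch only: simulate what dl_comm_gradients computes,
# using plain Python lists in place of per-rank tensor buffers.

def average_gradients(per_rank_tensors):
    """per_rank_tensors[r][t] is rank r's buffer for tensor t.
    Returns the averaged buffers, as each rank would see them in-place."""
    nranks = len(per_rank_tensors)
    ntensors = len(per_rank_tensors[0])
    averaged = []
    for t in range(ntensors):
        length = len(per_rank_tensors[0][t])
        averaged.append([
            sum(per_rank_tensors[r][t][i] for r in range(nranks)) / nranks
            for i in range(length)
        ])
    return averaged

# Two ranks, each holding one 3-element gradient tensor:
ranks = [
    [[1.0, 2.0, 3.0]],   # rank 0
    [[3.0, 4.0, 5.0]],   # rank 1
]
print(average_gradients(ranks))  # [[2.0, 3.0, 4.0]]
```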
int dl_comm_broadcast(void *** all_tensors, int ** lengths, dlDataType_t ** dtypes,
int * ntensors_on_tower, int ntowers, int root);
Broadcasts the given set of tensors from the root rank to all other ranks. all_tensors is a list of model towers, where each tower contains its list of tensor data. Results are stored in-place. lengths is each tower's list of tensor lengths. dtypes is each tower's list of tensor datatypes. ntensors_on_tower specifies the number of tensors stored in each tower's list. ntowers specifies the number of towers. root specifies which process to broadcast from.
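The effect of dl_comm_broadcast can likewise be sketched in plain Python: after the call, every rank's buffers match the root rank's buffers, which is how initial weights and biases are typically synchronized. The buffer values below are illustrative only.

```python
# Illustrative sketch only: the post-condition of dl_comm_broadcast,
# with Python lists standing in for per-rank tensor buffers.
import copy

def broadcast_tensors(per_rank_tensors, root):
    """After the call, every rank's buffers equal the root rank's buffers."""
    return [copy.deepcopy(per_rank_tensors[root]) for _ in per_rank_tensors]

ranks = [
    [[0.0, 0.0]],   # rank 0 (uninitialized weights)
    [[0.5, -0.5]],  # rank 1 (root, holding the initial weights)
]
print(broadcast_tensors(ranks, root=1))  # every rank now holds [[0.5, -0.5]]
```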
int dl_comm_get_rank();
Retrieves the given process's rank.
int dl_comm_localrank();
Retrieves node-specific rank for a process (for example, for a node with eight processes, results are between 0 and 7).
int dl_comm_get_nranks();
Retrieves the total number of processes.
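A common use of the rank and size queries is to give each process a distinct shard of the training data. In the sketch below, rank and nranks stand in for the values dl_comm_get_rank() and dl_comm_get_nranks() would return; the strided sharding scheme is one conventional choice, not part of the Plugin API.

```python
# Illustrative sketch only: strided data sharding keyed off rank/nranks,
# where rank/nranks simulate dl_comm_get_rank() / dl_comm_get_nranks().

def shard_for_rank(samples, rank, nranks):
    """Strided partition: rank r takes samples r, r+nranks, r+2*nranks, ..."""
    return samples[rank::nranks]

samples = list(range(10))
print(shard_for_rank(samples, rank=0, nranks=4))  # [0, 4, 8]
print(shard_for_rank(samples, rank=3, nranks=4))  # [3, 7]
```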
Notes and Usage
The HPE Cray PE DL Plugin can be used with TensorFlow, PyTorch, and Keras client applications without recompiling said frameworks. To use the framework-specific components of the HPE Cray PE DL Plugin, refer to the examples in the $CRAYPE_ML_PLUGIN_BASEDIR/examples directory.
These components are built against TensorFlow v2.9, PyTorch v1.10, and the Keras TensorFlow backend. If you are using different versions of these frameworks, you may need to build your own versions of these components. In that case, a Python pip source distribution is included in $CRAYPE_ML_PLUGIN_BASEDIR/wheel. A README found in $CRAYPE_ML_PLUGIN_BASEDIR/examples details how to install this source distribution into your Python environment, along with usage rules and troubleshooting tips.
If the compute mode of the GPUs is set to “exclusive process”, then nvidia-cuda-mps-control must be launched before using the Plugin.
If you are using TensorFlow with the Plugin, the TensorFlow configuration should be modified such that config.gpu_options.per_process_gpu_memory_fraction = .7, where config = tf.ConfigProto().
For example usage, refer to the TensorFlow examples included with the Plugin installation.
Examples
The examples directory includes sample Python clients for TensorFlow, Keras, and PyTorch modified to use the Plugin. tf_cnn_benchmarks is a common benchmark code for TensorFlow that includes several CNN models. There are many options for running this benchmark, including single- and multiple-worker setups. With the modified version provided, comparisons can be made between the various gRPC-based parallel schemes and parallelization with the HPE Cray PE DL Plugin. The benchmark will run on both CPU and GPU versions of TensorFlow, including MKL optimizations.
To illustrate how to modify a serial training script to enable scalable training with the HPE Cray PE DL Plugin, MNIST training examples are included in the examples directory. For Keras, TensorFlow, and PyTorch, refer to the keras_mnist, tf_mnist, and torch_mnist examples, respectively. In addition to the required Plugin calls, there are other typical modifications included when extending a serial script.
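One such typical modification is scaling the learning rate with the number of workers so that the effective step size accounts for the larger global batch. Linear scaling, sketched below, is a common convention rather than a Plugin requirement, and the helper name is hypothetical.

```python
# Illustrative sketch only: linear learning-rate scaling by worker count,
# a common convention when converting a serial script to data-parallel
# training. nranks would come from dl_comm_get_nranks() in practice.

def scaled_lr(base_lr, nranks):
    """Scale the serial learning rate linearly with the number of workers."""
    return base_lr * nranks

print(scaled_lr(0.01, 8))  # 0.08
```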
Environment Variables
DL_COMM_DEFAULT_NTHREADS
Default number of threads to create teams with in the case that dl_create_team is not explicitly called. Defaults to 2.
DL_COMM_PIPELINE_CHUNK_KB
Size in KB used to transfer data between the host and GPU. Defaults to 256.
DL_COMM_NUM_CUDA_STREAMS
Integer that sets the number of CUDA streams each thread uses for data transfers between the host and GPU. Using more streams can improve performance. Defaults to 1.
DL_COMM_DEFAULT_PREC_LEVEL
Sets precision used for math operations. 0 is floating point. 1 is double. Defaults to 0.
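These variables are exported in the shell or batch script before launching the application. The variable names come from this section; the values below are illustrative only and should be tuned for the workload.

```shell
# Illustrative settings only; tune the values for your workload.
export DL_COMM_DEFAULT_NTHREADS=4      # threads per implicitly created team
export DL_COMM_PIPELINE_CHUNK_KB=512   # host<->GPU transfer chunk size, in KB
export DL_COMM_NUM_CUDA_STREAMS=2      # CUDA streams per thread
export DL_COMM_DEFAULT_PREC_LEVEL=0    # 0 = single precision, 1 = double
echo "$DL_COMM_DEFAULT_NTHREADS $DL_COMM_NUM_CUDA_STREAMS"
```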
Additional Information
The HPE Cray PE DL Plugin package includes an examples directory for common use cases.