Cray PE DL Plugin
Name
craype-dl-plugin - introduces the Cray PE DL Plugin for accelerating distributed deep learning
Description
The Cray PE DL Plugin provides a highly tuned communication layer that can be easily added to any deep learning framework. Starting from a single-process version of a deep learning application, users can include the Cray PE DL Plugin through the C or Python APIs. The provided routines include a high performance gradient averaging operation. Other routines facilitate process identification, job size determination, and broadcasting of initial weights and biases. CPE DL Plugin 23.09.1 supports TensorFlow v2.11 and PyTorch v1.12.
More information about the Python API is available from within Python. For example:
% python
>>> import dl_comm as cdl
>>> help(cdl)
>>> help(cdl.gradients)
Basic C API
int dl_comm_init_mpi();
Initialize the MPI components of the Plugin.
int dl_comm_init();
Initialize the plugin.
int dl_comm_finalize();
Clean up the plugin at the end of execution.
int dl_create_team(int teamID, int nthreads_in_team, int prec_level);
Create a thread team. teamID specifies the base-zero team index to create. nthreads_in_team defines how many threads to use for the
team's communication. prec_level determines whether single or double precision is used to complete math operations: single precision
is specified via 0 and double via 1.
int dl_comm_gradients(void *** all_tensors, int ** lengths, dlDataType_t ** dtypes,
int * ntensors_on_tower, int ntowers, int teamID);
Compute the average values for each input buffer across every process. all_tensors is a list of model towers, where each tower
contains its list of gradient tensor data. Results are stored in-place. lengths is each tower's list of tensor lengths. dtypes is
each tower's list of tensor datatypes. ntensors_on_tower specifies the number of tensors stored in each tower's list. ntowers
specifies the number of towers. teamID specifies which thread team to use for communication.
int dl_comm_broadcast(void *** all_tensors, int ** lengths, dlDataType_t ** dtypes,
int * ntensors_on_tower, int ntowers, int root);
Broadcast the given set of tensors from the root rank to all other ranks. all_tensors is a list of model towers, where each tower
contains its list of tensor data. Results are stored in-place. lengths is each tower's list of tensor lengths. dtypes is each
tower's list of tensor datatypes. ntensors_on_tower specifies the number of tensors stored in each tower's list. ntowers specifies
the number of towers. root specifies which process to broadcast from.
int dl_comm_get_rank();
Get the calling process's rank.
int dl_comm_localrank();
Get the process's node-local rank (e.g., for a node with 8 processes, the result will be between 0 and 7).
int dl_comm_get_nranks();
Get the total number of processes.
Notes and Usage
The DL Plugin can be used with TensorFlow, PyTorch, and Keras client applications without recompiling those frameworks. To use the
framework-specific components of the DL Plugin, refer to the specific examples in the $CRAYPE_ML_PLUGIN_BASEDIR/examples
directory.
These components are built against TensorFlow v2.9, PyTorch v1.10, and Keras' TensorFlow backend. If using different versions of
these frameworks, you may need to build your own versions of these components. In that case, a Python pip source distribution is included
in $CRAYPE_ML_PLUGIN_BASEDIR/wheel. A README found in $CRAYPE_ML_PLUGIN_BASEDIR/examples details how to install this source
distribution into a user's Python environment, along with usage rules and troubleshooting tips.
If the compute mode of the GPUs is set to "exclusive process", then nvidia-cuda-mps-control needs to be launched before using the Plugin.
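A minimal job-script fragment for the "exclusive process" case, using the standard NVIDIA MPS workflow (these are stock nvidia-cuda-mps-control invocations, not plugin-specific commands):

```shell
# Start the MPS control daemon before launching the training job
nvidia-cuda-mps-control -d

# ... launch the Plugin-enabled application here ...

# Shut the MPS daemon down once the job is finished
echo quit | nvidia-cuda-mps-control
```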
When using TensorFlow with the Plugin, the TensorFlow configuration should be modified such that
config.gpu_options.per_process_gpu_memory_fraction = .7, where config = tf.ConfigProto(). For example usage, refer to the
TensorFlow examples included with the Plugin installation.
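A sketch of applying that memory-fraction setting: tf.ConfigProto is the TF1-style API, so under TensorFlow 2.x the tf.compat.v1 names are needed (this is a config fragment; the exact integration point depends on your training script):

```python
import tensorflow as tf

# Cap TensorFlow's per-process GPU memory so the Plugin's CUDA
# buffers have room alongside the framework's allocations.
config = tf.compat.v1.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7
session = tf.compat.v1.Session(config=config)
```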
Examples
The examples directory includes sample Python clients for TensorFlow, Keras, and PyTorch modified to use the plugin. tf_cnn_benchmarks is a common benchmark code for TensorFlow that includes several CNN models. There are many options for running this benchmark, including single- and multiple-worker setups. With the modified version provided, comparisons can be made between the various gRPC-based parallel schemes and parallelization with the CPE DL Plugin. The benchmark will run on both CPU and GPU versions of TensorFlow, including MKL optimizations.
To illustrate how to modify a serial training script to enable scalable training with the CPE DL Plugin, MNIST training examples are included in the examples directory. For Keras, TensorFlow, and PyTorch, refer to the keras_mnist, tf_mnist, and torch_mnist examples, respectively. In addition to the required Plugin calls, these show other typical modifications made when extending a serial script.
Environment Variables
DL_COMM_DEFAULT_NTHREADS
Default number of threads used to create teams when dl_create_team is not explicitly called. Defaults to 2.
DL_COMM_PIPELINE_CHUNK_KB
Size in KB of the chunks used to transfer data between the host and GPU. Defaults to 256.
DL_COMM_NUM_CUDA_STREAMS
Sets the number of CUDA streams each thread uses for data transfers between the host and GPU. Using more streams can
improve performance. Defaults to 1.
DL_COMM_DEFAULT_PREC_LEVEL
Sets the precision used for math operations: 0 selects single precision, 1 selects double precision. Defaults to 0.
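A job-script fragment pulling these together; the values shown are illustrative tuning choices, not recommendations from this page (each variable's default is listed above):

```shell
# Example CPE DL Plugin tuning knobs (values are illustrative):
export DL_COMM_DEFAULT_NTHREADS=4      # threads per implicitly created team (default 2)
export DL_COMM_PIPELINE_CHUNK_KB=512   # host<->GPU transfer chunk size (default 256)
export DL_COMM_NUM_CUDA_STREAMS=2      # CUDA streams per thread (default 1)
export DL_COMM_DEFAULT_PREC_LEVEL=0    # 0 = single precision, 1 = double (default 0)
```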
Additional Information
The Cray DL Plugin package includes an examples directory covering common use cases.