Cray LibSci_ACC

Description

Cray LibSci_ACC is HPE Cray’s scientific and mathematics library for accelerators. It provides accelerated BLAS (Basic Linear Algebra Subprograms), LAPACK (Linear Algebra PACKage), PBLAS (Parallel Basic Linear Algebra Subprograms), ScaLAPACK (Scalable Linear Algebra PACKage), and batched BLAS and LAPACK routines that enhance user application performance by generating and executing auto-tuned GPU kernels on Cray compute nodes.

Invoking Cray LibSci_ACC Routines

We recommend enclosing any user program that accesses Cray LibSci_ACC routines between calls to libsci_acc_init() and libsci_acc_finalize() to ensure optimal performance.

Fortran API:

subroutine libsci_acc_init()
subroutine libsci_acc_finalize()

C/C++ API:

void libsci_acc_init(void);
void libsci_acc_finalize(void);
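
As an illustration, the following minimal C sketch wraps a computation between the initialization and finalization calls. The header name libsci_acc.h and the dgemm prototype shown here are assumptions based on the C interface conventions described later in this page.

#include <stdlib.h>
#include <libsci_acc.h>   /* assumed header providing the Cray LibSci_ACC prototypes */

int main(void)
{
    libsci_acc_init();                 /* initialize accelerator resources */

    int n = 1024;
    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    double *c = malloc((size_t)n * n * sizeof(double));
    /* ... fill a and b ... */

    /* Automatic mode: the library chooses CPU, GPU, or hybrid execution. */
    dgemm('N', 'N', n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    free(a); free(b); free(c);
    libsci_acc_finalize();             /* release accelerator resources */
    return 0;
}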

There are two ways to invoke Cray LibSci_ACC computational routines.

Automatic Mode

When standard BLAS, LAPACK, PBLAS or ScaLAPACK routines are called from applications built with Cray LibSci_ACC, the library will automatically offload computational tasks to the GPU at runtime if it determines performance will be enhanced by a nontrivial amount. If Cray LibSci_ACC determines that the overhead of moving data to the GPU is greater than the benefit of executing computations on the GPU, the traditional Cray LibSci routine will execute on the host CPU(s).

The automatic interface checks the memory locations of the matrix and vector pointers and then initiates the appropriate execution mode (CPU, GPU, or hybrid). This eliminates the burden of calling the manual API routines, especially for data located on the GPU.

Manual Mode

Advanced users may want to explicitly manage the accelerator resources used by their applications. Each accelerated routine (except for the batched routines) has three invocation methods:

  • <routine_name>

    This is the automatic mode described above.

  • <routine_name>_cpu

    Forces use of the standard Cray LibSci routine on host processors only, regardless of environment variable settings.

  • <routine_name>_acc

    Forces use of the accelerated routine on the accelerator only, regardless of environment variable settings. BLAS vectors/matrices and LAPACK matrices must be allocated and stored in the GPU memory before calling an _acc routine. For better performance, use pinned memory.

For example, the three variants that compute double precision general matrix-matrix multiplication are dgemm, dgemm_cpu, and dgemm_acc.
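
As a sketch of the manual interfaces, the following C fragment forces CPU-only and then GPU-only execution of DGEMM. The dgemm_cpu and dgemm_acc prototypes are assumed to mirror the dgemm C interface described below, and the libsci_acc_DeviceAlloc/libsci_acc_DeviceFree signatures are assumed to parallel the libsci_acc_HostAlloc/libsci_acc_HostFree prototypes given under Memory Management; the host-device copies are elided.

#include <stddef.h>
#include <libsci_acc.h>   /* assumed header name */

/* Hypothetical sketch: a, b, and c are host matrices of order n, already filled. */
void manual_mode_example(int n, double *a, double *b, double *c)
{
    /* CPU-only variant: always runs the standard Cray LibSci routine on the host. */
    dgemm_cpu('N', 'N', n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    /* GPU-only variant: operands must already reside in device memory. */
    double *da, *db, *dc;
    libsci_acc_DeviceAlloc((void **)&da, (size_t)n * n * sizeof(double));
    libsci_acc_DeviceAlloc((void **)&db, (size_t)n * n * sizeof(double));
    libsci_acc_DeviceAlloc((void **)&dc, (size_t)n * n * sizeof(double));
    /* ... copy a and b into da and db with libsci_acc_Memcpy ... */

    dgemm_acc('N', 'N', n, n, n, 1.0, da, n, db, n, 0.0, dc, n);

    /* ... copy dc back to c with libsci_acc_Memcpy, then release the buffers ... */
    libsci_acc_DeviceFree(da);
    libsci_acc_DeviceFree(db);
    libsci_acc_DeviceFree(dc);
}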

Available Routines

The following routines are implemented in Cray LibSci_ACC. Note that all BLAS, LAPACK, PBLAS and ScaLAPACK routines have the above three variants.

BLAS Routines

Level 1

sswap dswap cswap zswap
sscal dscal cscal csscal zscal zdscal
scopy dcopy ccopy zcopy
saxpy daxpy caxpy zaxpy
sdot ddot cdotu cdotc zdotu zdotc
snrm2 dnrm2 scnrm2 dznrm2
sasum dasum scasum dzasum
isamax idamax icamax izamax
srot drot csrot zdrot
srotm drotm
srotmg drotmg
srotg drotg

Level 2

sgemv dgemv cgemv zgemv
sgbmv dgbmv cgbmv zgbmv
ssymv dsymv chemv zhemv
ssbmv dsbmv chbmv zhbmv
sspmv dspmv chpmv zhpmv
strmv dtrmv ctrmv ztrmv
stbmv dtbmv ctbmv ztbmv
stpmv dtpmv ctpmv ztpmv
strsv dtrsv ctrsv ztrsv
stbsv dtbsv ctbsv ztbsv
stpsv dtpsv ctpsv ztpsv
sger dger cgerc cgeru zgerc zgeru
ssyr dsyr cher zher
sspr dspr chpr zhpr
ssyr2 dsyr2 cher2 zher2
sspr2 dspr2 chpr2 zhpr2

Level 3

sgemm dgemm cgemm zgemm
ssymm dsymm csymm zsymm chemm zhemm
ssyrk dsyrk csyrk zsyrk cherk zherk
ssyr2k dsyr2k csyr2k zsyr2k cher2k zher2k
strmm dtrmm ctrmm ztrmm
strsm dtrsm ctrsm ztrsm

LAPACK Routines

dgetrf zgetrf
dgetrs zgetrs
dpotrf zpotrf
dpotrs zpotrs
dgeqrf zgeqrf
dgelqf zgelqf
dgebrd zgebrd
dsyevd zheevd
dsyevr zheevr
dsyevx zheevx
dsygvx zhegvx
dsygvd zhegvd
dsyev zheev
dgeev zgeev
dgesdd zgesdd
dgesv zgesv

PBLAS Routines

pdamax pzamax
pdscal pzscal
pdswap pzswap
pdger pzgerc pzgeru
pdgemm pzgemm
pdsymm pzsymm pzhemm
pdsyrk pzsyrk pzherk
pdsyr2k pzsyr2k pzher2k
pdtrmm pztrmm
pdtrsm pztrsm

The environment variable setting MPICH_GPU_SUPPORT_ENABLED=1 is required to enable GPU-resident computation for PBLAS functions.

ScaLAPACK Routines

pdgetrf pzgetrf
pdpotrf pzpotrf

Batched Routines

sgemm_batch_acc dgemm_batch_acc cgemm_batch_acc zgemm_batch_acc
sgemm_vbatch_acc dgemm_vbatch_acc cgemm_vbatch_acc zgemm_vbatch_acc
sgemm_batch_strided dgemm_batch_strided cgemm_batch_strided zgemm_batch_strided
sgemm_batch_strided_acc dgemm_batch_strided_acc cgemm_batch_strided_acc zgemm_batch_strided_acc
strsm_batch_acc dtrsm_batch_acc ctrsm_batch_acc ztrsm_batch_acc

dgetrf_batch_acc zgetrf_batch_acc
dgetri_batch_acc zgetri_batch_acc

Interfaces

Fortran Interface

The Fortran interfaces to the Cray LibSci_ACC API follow the conventions of the Netlib implementations of BLAS and LAPACK, with a few exceptions such as the batched routines. Manual mode computational routines require appending _cpu or _acc to their original names.

The following are examples of the Fortran interface prototype declarations:

interface
subroutine zgemm( transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc )
character,       intent(in)    :: transa, transb
integer,         intent(in)    :: m, n, k, lda, ldb, ldc
complex(kind=8), intent(in)    :: alpha, beta, a( lda, * ), b( ldb, * )
complex(kind=8), intent(inout) :: c( ldc, * )
end subroutine
end interface

interface
subroutine dgetrf( m, n, a, lda, ipiv, info )
integer,      intent(in)    :: m, n, lda
real(kind=8), intent(inout) :: a( lda, * )
integer,      intent(out)   :: ipiv( * ), info
end subroutine
end interface

interface
subroutine dgemm_batch_acc( transa, transb, m, n, k, alpha, a, lda, b, ldb, &
  beta, c, ldc, batch ) bind( c, name = "dgemm_batch_acc" )
use iso_c_binding, only: c_ptr
character, intent(in), value :: transa, transb
integer, intent(in), value :: m, n, k, lda, ldb, ldc, batch
real(kind=8), intent(in), value :: alpha, beta
type(c_ptr), value :: a, b, c
end subroutine
end interface

C/C++ Interface

The C/C++ interfaces to the Cray LibSci_ACC API pass integer and real scalar parameters by value, while matrices and vectors are passed by reference. Complex and double complex data types are represented as arrays of two elements containing the real and imaginary components. Complex scalar, vector, and matrix inputs are passed by reference to the real data type component. Manual mode computational routines require appending _cpu or _acc to their original names.

The following are examples of the C interface prototype declarations:

void zgemm(char transa, char transb, int m, int n, int k, double *alpha, double *a, int lda,
           double *b, int ldb, double *beta, double *c, int ldc);

void dgetrf(int m, int n, double *a, int lda, int *ipiv, int *info);

void dgemm_batch_acc(char transa, char transb, int m, int n, int k, double alpha, double **a,
                     int lda, double **b, int ldb, double beta, double **c, int ldc, int batch);
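
As a hedged illustration of the complex convention above, the sketch below calls zgemm with the scalars alpha and beta passed as two-element double arrays; the matrix setup and the header name libsci_acc.h are assumptions.

#include <libsci_acc.h>   /* assumed header name */

/* Sketch: a, b, and c each hold 2*n*n doubles, i.e. interleaved
 * (real, imaginary) pairs of a column-major complex matrix of order n. */
void zgemm_example(int n, double *a, double *b, double *c)
{
    double alpha[2] = { 1.0, 0.0 };   /* 1 + 0i */
    double beta[2]  = { 0.0, 0.0 };   /* 0 + 0i */

    zgemm('N', 'N', n, n, n, alpha, a, n, b, n, beta, c, n);
}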

Memory Management

Cray LibSci_ACC provides several native memory management routines so that users can manage host and device memory without calling vendor runtime routines. The currently available routines are:

  • Allocate/free page-locked memory on the host: libsci_acc_HostAlloc and libsci_acc_HostFree

  • Allocate/free memory on the device: libsci_acc_DeviceAlloc and libsci_acc_DeviceFree

  • Register an existing host memory range so it can be accessed from device: libsci_acc_HostRegister and libsci_acc_HostUnregister

  • Memory copy between host and device: libsci_acc_Memcpy

All of these routines return 0 on normal exit and -1 otherwise. Both Fortran and C/C++ APIs are provided; the following is an example of libsci_acc_HostAlloc and libsci_acc_HostFree.

Fortran API:

interface
integer function libsci_acc_HostAlloc( ptr, size )
use iso_c_binding, only: c_ptr, c_size_t
type(c_ptr), value :: ptr
integer(c_size_t), value :: size
end function
end interface

interface
integer function libsci_acc_HostFree( ptr )
use iso_c_binding, only: c_ptr
type(c_ptr), value :: ptr
end function
end interface

C/C++ API:

int libsci_acc_HostAlloc(void **ptr, size_t size);
int libsci_acc_HostFree(void *ptr);
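
For example, the following minimal C sketch allocates a page-locked host buffer, uses it, and frees it; the header name libsci_acc.h is an assumption, while the two prototypes are those shown above.

#include <stddef.h>
#include <libsci_acc.h>   /* assumed header name */

/* Sketch: allocate a pinned host buffer for an n-by-n double matrix. */
int pinned_buffer_example(size_t n)
{
    double *a = NULL;
    if (libsci_acc_HostAlloc((void **)&a, n * n * sizeof(double)) != 0)
        return -1;                    /* allocation failed */

    /* ... use a as a pinned host matrix, e.g. as input to dgemm ... */

    return libsci_acc_HostFree(a);    /* returns 0 on success */
}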

CPU/GPU Switching

When a program makes numerous BLAS calls on small data, the automatic interfaces in Cray LibSci_ACC may incur some performance degradation because the constant per-call overhead becomes a proportionally larger fraction of the runtime. To avoid this, the manual CPU or GPU interfaces should be used as much as possible; however, some source code may be unmodifiable or excessively burdensome to modify. To minimize the degradation in these cases, each BLAS function (for all precisions and levels) can be individually directed to bypass the overhead of the automatic interface and go directly to the version indicated by the value of an environment variable.

The relevant environment variables for mode switching follow the pattern: LIBSCI_ACC_BYPASS_<FUNCTION>. For example:

  • If LIBSCI_ACC_BYPASS_DTBMV=0 (this is the default value), the automatic version of dtbmv is called.

  • If LIBSCI_ACC_BYPASS_SSWAP=1, the GPU version of sswap is directly called.

  • If LIBSCI_ACC_BYPASS_DAXPY=2, the CPU version of daxpy is directly called.

  • If LIBSCI_ACC_BYPASS_ZGEMM=3, the hybrid version of zgemm is directly called.

For convenience, environment variables LIBSCI_ACC_BYPASS_BLAS1, LIBSCI_ACC_BYPASS_BLAS2, and LIBSCI_ACC_BYPASS_BLAS3 are provided and will assign the defined value to all functions for the specified BLAS level. Additionally, bypass environment variables for specific functions can be provided in combination, and the value of the specific function overrides the generic BLAS level value. For example:

If LIBSCI_ACC_BYPASS_BLAS1=2 and LIBSCI_ACC_BYPASS_DDOT=0, all BLAS1 calls, with the exception of ddot, will execute the CPU version. ddot will use the automatic version.

Each function is independent, so any combination of bypass values can be handled by the library; however, valid execution is limited by the data layout of the running program. To achieve maximum performance when using this feature to accelerate a standard CPU application, set BLAS1 and BLAS2 functions to call the CPU directly while leaving BLAS3 functions potentially hybrid.

Caution: Explicitly setting the execution mode for automatic BLAS functions will affect all automatic BLAS calls that resolve to the definitions in Cray LibSci_ACC. This means that all calls to dynamically linked BLAS from the application, third party libraries, or even LAPACK routines not implemented in Cray LibSci_ACC will execute with the specified behavior. However, this typically only raises a concern with forcing automatic routines to execute on the GPU. If the execution mode for automatic BLAS is CPU, hybrid, or default, then most applications will result in expected behavior.

Thread Safety

Cray LibSci_ACC BLAS and LAPACK routines are thread safe. Both manual and automatic modes can be called concurrently from multiple threads, such as within an OpenMP parallel region. Other Cray LibSci_ACC routines, such as the PBLAS, ScaLAPACK, and batched routines, are not thread safe.

Note: Even though calling some routines from an OpenMP parallel region is supported, it is not recommended. If there is a need to compute multiple GEMMs, for example, the use of batched routines might be preferable.
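
As a hedged illustration, the following C sketch issues independent DGEMM calls from an OpenMP parallel loop; the dgemm prototype and the header name libsci_acc.h are the same assumptions as in the earlier examples.

#include <libsci_acc.h>   /* assumed header name */

/* Sketch: count independent n-by-n GEMMs issued from concurrent threads.
 * This is supported because the BLAS routines are thread safe, but for many
 * small GEMMs a batched routine such as dgemm_batch_acc may be preferable. */
void concurrent_gemms(int count, int n, double **a, double **b, double **c)
{
    #pragma omp parallel for
    for (int i = 0; i < count; ++i)
        dgemm('N', 'N', n, n, n, 1.0, a[i], n, b[i], n, 0.0, c[i], n);
}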