blockable

Date:: 09-08-2014

NAME

blockable - specifies that it is legal to cache block the subsequent loops

SYNOPSIS

!DIR$ BLOCKABLE (do_variable,do_variable [,do_variable]... )

#pragma _CRI blockable(num_loops)

IMPLEMENTATION

Cray Linux Environment (CLE)

DESCRIPTION

The BLOCKABLE directive specifies that it is legal and desirable to cache block the subsequent loop nest, even when the compiler has not made such a determination. To be legally blockable, the nest must be perfect (no code between the constituent loops), rectangular (the trip counts of the member loops are fixed over the lifetime of the nest), and fully permutable (loop interchange and unrolling are legal at all levels). This directive both permits and requests blocking of the indicated loop nest.
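
As a rough illustration of the "fully permutable" requirement, the following C sketch runs two small nests in both loop orders. All names and the array size M are hypothetical and chosen only for this illustration. The first nest carries no dependence between iterations, so interchange leaves the result unchanged. The second nest reads a value written by an earlier iteration, so interchange changes the result, and such a nest would not be a legal target for the directive.

```c
#include <assert.h>
#include <string.h>

#define M 6   /* hypothetical size, small enough to trace by hand */

/* Independent nest: a[i][j] depends only on b, so every loop order
   produces the same result. */
void indep_ij(int a[M][M], int b[M][M])
{
    for (int i = 1; i < M; ++i)
        for (int j = 0; j < M - 1; ++j)
            a[i][j] = b[i-1][j+1] + 1;
}

void indep_ji(int a[M][M], int b[M][M])
{
    for (int j = 0; j < M - 1; ++j)
        for (int i = 1; i < M; ++i)
            a[i][j] = b[i-1][j+1] + 1;
}

/* Dependent nest: a[i][j] reads a[i-1][j+1]. With i outermost, row i-1
   is complete before row i starts; with j outermost, column j+1 has
   not been written yet. */
void dep_ij(int a[M][M])
{
    for (int i = 1; i < M; ++i)
        for (int j = 0; j < M - 1; ++j)
            a[i][j] = a[i-1][j+1] + 1;
}

void dep_ji(int a[M][M])
{
    for (int j = 0; j < M - 1; ++j)
        for (int i = 1; i < M; ++i)
            a[i][j] = a[i-1][j+1] + 1;
}

/* Runs each nest in both loop orders and checks the outcome. */
void check_interchange(void)
{
    int b[M][M], a1[M][M] = {{0}}, a2[M][M] = {{0}};
    int d1[M][M] = {{0}}, d2[M][M] = {{0}};

    for (int i = 0; i < M; ++i)
        for (int j = 0; j < M; ++j)
            b[i][j] = i * M + j;

    indep_ij(a1, b);
    indep_ji(a2, b);
    assert(memcmp(a1, a2, sizeof a1) == 0);   /* interchange is legal */

    dep_ij(d1);
    dep_ji(d2);
    assert(memcmp(d1, d2, sizeof d1) != 0);   /* interchange changes the result */
}
```

Calling check_interchange() from a test program exercises both cases. Only nests that behave like the first pair satisfy the permutability condition.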

The Fortran directive arguments are a comma-delimited list of two or more loop control variables, do_variable.

The C directive argument is the number of subsequent loops to be blocked, num_loops.
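
A minimal sketch of how num_loops lines up with a nest follows, using a three-deep matrix multiply that is assumed to meet the legality conditions above. The names matmul_acc, check_matmul, and P are hypothetical. The _CRAYC guard is an assumption about the macro that identifies the Cray C compiler; it is used here only so that other compilers skip the directive.

```c
#include <assert.h>

#define P 4   /* hypothetical matrix order */

/* Computes c += a * b. The caller zeroes c beforehand, so no statement
   sits between the three loop headers and the nest stays perfect. */
void matmul_acc(float c[P][P], float a[P][P], float b[P][P])
{
    /* Assumption: _CRAYC identifies the Cray C compiler. */
#if defined(_CRAYC)
    /* num_loops = 3: the directive covers the i, j, and k loops. */
#pragma _CRI blockable(3)
#endif
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j)
            for (int k = 0; k < P; ++k)
                c[i][j] += a[i][k] * b[k][j];
}

/* Multiplies by twice the identity matrix and checks one element. */
void check_matmul(void)
{
    float a[P][P] = {{0}}, b[P][P], c[P][P] = {{0}};

    for (int i = 0; i < P; ++i) {
        a[i][i] = 2.0f;
        for (int j = 0; j < P; ++j)
            b[i][j] = (float)(i + j);
    }
    matmul_acc(c, a, b);
    assert(c[1][2] == 6.0f);   /* 2 * (1 + 2) */
}
```

Whether blocking actually pays off for a nest this small is a separate question; the sketch only shows where the directive goes and what its argument counts.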

If a BLOCKINGSIZE directive is also provided for the indicated loop, the following rules apply:

If blockingsize is at least two, the indicated blockingsize is used.
If blockingsize is zero, the loop itself is not blocked and it is treated as an inner loop (as part of the nest that traverses the cache block tile).
If blockingsize is one, the loop itself is not blocked and it is treated as an outer loop (as a loop in the nest that moves from tile to tile).

When no blockingsize directive is supplied, the compiler chooses the blockingsize according to its own heuristics.
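
For a block size of two or more, blocking a single loop amounts to splitting it into a loop that steps from tile to tile and a loop that runs inside one tile, as in the Fortran equivalents in the EXAMPLES section. A minimal C sketch of that split follows. The names strip_mine, min_int, and out are hypothetical and used only for illustration, and the code the compiler actually generates may differ.

```c
#include <assert.h>

/* Smaller of two ints; plays the role of Fortran's min(). */
static int min_int(int x, int y) { return x < y ? x : y; }

/* Visits lo..hi (inclusive) in tiles of bs iterations and records the
   visiting order in out[]. Returns the number of indices visited. */
int strip_mine(int lo, int hi, int bs, int *out)
{
    assert(bs >= 2);   /* sizes 0 and 1 leave the loop unblocked */
    int count = 0;

    for (int is = lo; is <= hi; is += bs)                       /* tile to tile */
        for (int i = is; i <= min_int(hi, is + bs - 1); ++i)    /* within a tile */
            out[count++] = i;
    return count;
}
```

With lo = 1, hi = 10, and bs = 4, the tiles cover 1-4, 5-8, and 9-10; the min_int() bound trims the last, partial tile.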

EXAMPLES

Example 1: blockable and blockingsize Directives

% cat blk.c

#define N 1000

float A[N][N];
float B[N+1][N+1];  /* the stencil reads B[i+1][j+1], up to index N */

void
func(int n)
{
#pragma _CRI blockable(2)
#pragma _CRI blockingsize( 32 )
    for (int i = 2; i <= N-1; ++i)  {
#pragma _CRI blockingsize( 128 )
        for (int j = 2; j <= N-1; ++j)  {
            A[i][j] = B[i-1][j-1]
                    + B[i-1][j+1]
                    + B[i+1][j-1]
                    + B[i+1][j+1];
        }
    }
}

% cc -c -hlist=md blk.c
% cat blk.lst

...
    7.              func(int n)
    8.              {
    9.              #pragma _CRI blockable(2)
   10.              #pragma _CRI blockingsize( 32 )
   11.  + b-------<     for (int i = 2; i <= N-1; ++i)  {
   12.    b         #pragma _CRI blockingsize( 128 )
   13.    b Vbr4--<  for (int j = 2; j <= N-1; ++j)  {
   14.    b Vbr4         A[i][j] = B[i-1][j-1]
   15.    b Vbr4                 + B[i-1][j+1]
   16.    b Vbr4                 + B[i+1][j-1]
   17.    b Vbr4                 + B[i+1][j+1];
   18.    b Vbr4-->  }
   19.    b------->     }
   20.              }

CC-6294 CC: VECTOR File = blk.c, Line = 11
  A loop was not vectorized because a better candidate was found at line 13.

CC-6051 CC: SCALAR File = blk.c, Line = 11
  A loop was blocked according to user directive with block size 32.

CC-6051 CC: SCALAR File = blk.c, Line = 13
  A loop was blocked according to user directive with block size 128.
...
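
The listing reports that blocking took place but does not show the transformed loops. The following hand-written C sketch shows what blocking this nest with sizes 32 and 128 could look like, in the style of the Fortran equivalents later in this section, and checks that it matches the unblocked nest. The names A_ref, A_blk, BI, BJ, min_int, and the three functions are hypothetical; B is declared with one extra row and column so that the B[i+1][j+1] reads stay in bounds. The compiler's actual output may be organized differently.

```c
#include <assert.h>
#include <string.h>

#define N  1000
#define BI 32     /* block size requested for the i loop */
#define BJ 128    /* block size requested for the j loop */

static float B[N+1][N+1];              /* extra row and column for index N */
static float A_ref[N][N], A_blk[N][N];

static int min_int(int x, int y) { return x < y ? x : y; }

/* The nest from blk.c, unblocked. */
void stencil_plain(void)
{
    for (int i = 2; i <= N-1; ++i)
        for (int j = 2; j <= N-1; ++j)
            A_ref[i][j] = B[i-1][j-1] + B[i-1][j+1]
                        + B[i+1][j-1] + B[i+1][j+1];
}

/* Hand-blocked form: two tile loops outside, two in-tile loops inside. */
void stencil_blocked(void)
{
    for (int is = 2; is <= N-1; is += BI)
        for (int js = 2; js <= N-1; js += BJ)
            for (int i = is; i <= min_int(N-1, is + BI - 1); ++i)
                for (int j = js; j <= min_int(N-1, js + BJ - 1); ++j)
                    A_blk[i][j] = B[i-1][j-1] + B[i-1][j+1]
                                + B[i+1][j-1] + B[i+1][j+1];
}

/* Fills B with a simple pattern, runs both forms, and compares them. */
void check_stencil_forms(void)
{
    for (int i = 0; i <= N; ++i)
        for (int j = 0; j <= N; ++j)
            B[i][j] = (float)((i * 7 + j * 3) % 11);

    stencil_plain();
    stencil_blocked();
    assert(memcmp(A_ref, A_blk, sizeof A_ref) == 0);
}
```

Because each element of A is computed from B alone, the order in which tiles are visited does not affect the result, which is why the two forms agree.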

Example 2: noblocking Directive

Change the value of N in the previous example from 1000 to 999999 and modify func() as shown below, then compile with -hlist=md to see whether the compiler blocks the nest automatically. In this example, blocking does not occur: noblocking removes the j loop from consideration, and it takes at least two loops to cache block.

Blocking sizes 0 and 1 allow loops to “participate” in blocking without being blocked themselves; Examples 3 and 4 show the resulting loop structure.

func(int n)
{
    for (int i = 2; i <= N-1; ++i)  {
#pragma _CRI noblocking
        for (int j = 2; j <= N-1; ++j)  {
...

Example 3: blockingsize 0 Directive Followed by its Equivalent

% cat ex0.f90

  subroutine EX0(A, B, n)
    real A(n,n), B(n,n)

!dir$ blockable(i,j)
!dir$ blockingsize(0)
    do j = 1, n-1
!dir$ blockingsize(512)
        do i = 1, n
            A(i,j) = B(i,j) + B(i,j+1)
        enddo
    enddo
  end subroutine EX0

% cat ex0m.f90

subroutine EX0m(A, B, n)
  real A(n,n), B(n,n)

  do is = 1, n, 512
      do j = 1, n-1
          do i = is, min( n, is+511 )
              A(i,j) = B(i,j) + B(i,j+1)
          enddo
      enddo
  enddo
end subroutine EX0m

Notice that the j-loop remains undivided as it traverses the tile, while the i-loop is split into an outer loop (over tiles) and an inner loop (within a tile).

Example 4: blockingsize 1 Directive Followed by its Equivalent

% cat ex1.f90

  subroutine EX1(A, B, n)
    real A(n,n), B(n,n)

!dir$ blockable(i,j)
!dir$ blockingsize(512)
    do j = 1, n
!dir$ blockingsize(1)
        do i = 1, n-1
            A(j,i) = B(j,i) + B(j,i+1)
        enddo
    enddo
  end subroutine EX1

% cat ex1m.f90

subroutine EX1m(A, B, n)
  real A(n,n), B(n,n)

  do js = 1, n, 512
      do i = 1, n-1
          do j = js, min( n, js+511 )
              A(j,i) = B(j,i) + B(j,i+1)
          enddo
      enddo
  enddo
end subroutine EX1m

Notice that blockingsize(1) is applied to an inner loop, while blockingsize(0) is typically used for outer loops.

Example 5: blockingsize >1 at Both Levels, Followed by Equivalent

% cat ex2.f90

  subroutine EX2(A, B, n)
    real A(n,n), B(n,n)

!dir$ blockable(i,j)
!dir$ blockingsize(32)
    do j = 1, n-1
!dir$ blockingsize(128)
        do i = 1, n-1
            A(i,j) = B(i,j) + B(i+1,j) + B(i,j+1)
        enddo
    enddo
  end subroutine EX2

% cat ex2m.f90

subroutine EX2m(A, B, n)
  real A(n,n), B(n,n)

  do js = 1, n-1, 32
      do is = 1, n-1, 128
          do j = js, min( n-1, js+31 )
              do i = is, min( n-1, is+127 )
                  A(i,j) = B(i,j) + B(i+1,j) + B(i,j+1)
              enddo
          enddo
      enddo
  enddo
end subroutine EX2m