Programming/OpenACC
Latest revision as of 11:07, 16 November 2022
Introduction to OpenACC
OpenACC is a user-driven, directive-based, performance-portable parallel programming model. It is designed for scientists and engineers interested in porting their codes to a wide variety of heterogeneous HPC hardware platforms and architectures with significantly less programming effort than a low-level model requires. The OpenACC specification supports the C, C++, and Fortran programming languages and multiple hardware architectures.
OpenACC programming
The first thing to realize when trying to port a code to a GPU is that the GPU does not share memory with the CPU. In other words, a GPU does not have direct access to the host memory. The host memory is generally larger, but slower, than the GPU memory. To use a GPU, data must therefore be transferred from the main program to the GPU through the PCI bus, which has a much lower bandwidth than either memory. This means that managing data transfer between the host and the GPU is of paramount importance. Transferring the data and the code onto the device is called offloading.
OpenACC directives are much like OpenMP directives. They take the form of pragmas in C/C++ and of comments in Fortran. There are several advantages to using directives. First, since they involve very minor modifications to the code, changes can be made incrementally, one pragma at a time. This is especially useful for debugging, since making a single change at a time makes it easy to identify which change introduced a bug. Second, OpenACC support can be disabled at compile time. When OpenACC support is disabled, the pragmas are treated as comments and ignored by the compiler, so a single source code can be used to build both an accelerated version and a normal version. Third, since all of the offloading work is done by the compiler, the same code can be compiled for various accelerator types: GPUs or multicore CPUs. It also means that supporting a new generation of devices only requires updating the compiler, not changing the code.
In the following example, we take a code comprising two loops:
Example OpenACC C/C++ code
#pragma acc kernels
{
  for (int i = 0; i < N; i++) {
    x[i] = 1.0;
    y[i] = 2.0;
  }
  for (int i = 0; i < N; i++) {
    y[i] = a * x[i] + y[i];
  }
}
Example OpenACC Fortran code
!$acc kernels
do i = 1, N
  x(i) = 1.0
  y(i) = 2.0
end do
y(:) = a*x(:) + y(:)
!$acc end kernels
SAXPY example
(adapted from NVIDIA tutorial)
Here is a really simple example using OpenACC. This loop performs a SAXPY operation. SAXPY stands for Single-precision A times X Plus Y. A is a scalar value and X and Y are vectors, so this is a vector scale and add operation. Here is a simple SAXPY in C with an OpenACC directive to parallelize it.
void saxpy_parallel(int n, float a, float *x, float *restrict y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}
Here’s the equivalent in Fortran.
subroutine saxpy(x, y, n, a)
  real :: a, x(:), y(:)
  integer :: n, i
  !$acc kernels
  do i = 1, n
    y(i) = a*x(i) + y(i)
  end do
  !$acc end kernels
end subroutine saxpy
The #pragma line in the C program and the !$acc lines in Fortran indicate that these are compiler directives: hints for the compiler. In this case, we are simply suggesting that this is a parallel loop and that the compiler should attempt to generate parallel kernel code for an accelerator (in our case, a GPU). Also notice that we don’t have to do anything else to get this onto the GPU. In contrast to CUDA, we don’t have to allocate or initialize arrays on the device; we don’t have to copy the host (CPU) data to the accelerator (GPU) or copy the accelerator results back to the host after the loop; we don’t have to write a CUDA kernel to execute the body of the loop in parallel, and we don’t have to explicitly launch the kernel on the accelerator. The OpenACC compiler and runtime do all this work for us behind the scenes.