Programming/Cuda

Programming Details

CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows you to program a CUDA-enabled graphics processing unit (GPU) for general-purpose processing.

The CUDA platform is designed to work with programming languages such as C, C++. Fortran CUDA is possible through the use of PGI-fortran, which is now available.

GPU hardware

Abstraction

GPU hardware consists of a number of key blocks:

Memory (global, constant, shared)
Streaming multiprocessors (SMs)
Streaming processors (SPs)

Specification

Viper has 4 K40m (GPU01-GPU04) and 1 P100 (GPU05)

NVidia K40m

Key features of the Tesla K40 GPU accelerator include:

12GB of ultra-fast GDDR5 memory allows users to process 2X larger datasets, enabling them to rapidly analyze massive volumes of data.
2,880 CUDA® parallel processing cores deliver application acceleration by up to 10X compared to using a CPU alone.
Dynamic Parallelism enables GPU threads to dynamically spawn new threads, enabling users to quickly and easily crunch through adaptive and dynamic data structures.
PCIe Gen-3 interconnect support accelerates data movement by 2X compared to PCIe Gen-2 technology.

NVidia P100

Key features of the P100 GPU accelerator include:

16GB HBM2 Memory with a Type PCI Express 3.0 x16 interface (bandwidth 720 Gbps)
3584 CUDA cores Graphics Engine NVIDIA Tesla P100
Bus Type PCI Express 3.0 x16, API Supported OpenCL, OpenACC

Compute capability

For applications that require this information:

K40m is 3.5
P100 is 6.0

Programming example

#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
        y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<31;

  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

        for (int i = 0; i < N; i++)
        {
                x[i] = 1.0f;
                y[i] = 2.0f;
        }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %fn", maxError);
}

Profiling

Nvidia provide a good profiling tool call nvprof; its invocation is shown below:

$ module load cuda/11.5.0
$ nvprof ./mycudaprogram

This can also be done with Python too

$ nvprof --print-gpu-trace python train_mnist.py

An example of nvprof output is shown below:

$ nvprof python examples/stream/cusolver.py                                                                                                                                              [10/1910]
==27986== NVPROF is profiling process 27986, command: python examples/stream/cusolver.py
==27986== Profiling application: python examples/stream/cusolver.py
==27986== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 41.70%  125.73us         4  31.431us  30.336us  33.312us  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>)
 21.94%  66.144us        36  1.8370us  1.7600us  2.1760us  [CUDA memcpy DtoH]
 13.77%  41.536us        48     865ns     800ns  1.4400us  [CUDA memcpy HtoD]
  3.02%  9.1200us         2  4.5600us  3.8720us  5.2480us  void syhemv_kernel<double, int=64, int=128, int=4, int=5, bool=1, bool=0>(cublasSyhemvParams<double>)
  2.65%  8.0000us         2  4.0000us  3.8720us  4.1280us  void gemv2T_kernel_val<double, double, double, int=128, int=16, int=2, int=2, bool=0>(int, int, double, double const *, int, double const *, i
nt, double, double*, int)
  2.63%  7.9360us         2  3.9680us  3.8720us  4.0640us  cupy_copy
  2.44%  7.3600us         2  3.6800us  3.1680us  4.1920us  void syr2_kernel<double, int=128, int=5, bool=1>(cublasSyher2Params<double>, int, double const *, double)
  2.23%  6.7200us         2  3.3600us  3.2960us  3.4240us  void dot_kernel<double, double, double, int=128, int=0, int=0>(cublasDotParams<double, double>)
  1.88%  5.6640us         2  2.8320us  2.7840us  2.8800us  void reduce_1Block_kernel<double, double, double, int=128, int=7>(double*, int, double*)
  1.74%  5.2480us         2  2.6240us  2.5600us  2.6880us  void ger_kernel<double, double, int=256, int=5, bool=0>(cublasGerParams<double, double>)
  1.57%  4.7360us         2  2.3680us  2.1760us  2.5600us  void axpy_kernel_val<double, double, int=0>(cublasAxpyParamsVal<double, double, double>)
  1.28%  3.8720us         2  1.9360us  1.7920us  2.0800us  void lacpy_kernel<double, int=5, int=3>(int, int, double const *, int, double*, int, int, int)
  1.19%  3.5840us         2  1.7920us  1.6960us  1.8880us  void scal_kernel_val<double, double, int=0>(cublasScalParamsVal<double, double>)
  0.98%  2.9440us         2  1.4720us  1.2160us  1.7280us  void reset_diagonal_real<double, int=8>(int, double*, int)
  0.98%  2.9440us         4     736ns     736ns     736ns  [CUDA memset]

==27986== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 60.34%  408.55ms         9  45.395ms  4.8480us  407.94ms  cudaMalloc
 37.60%  254.60ms         2  127.30ms     556ns  254.60ms  cudaFree
  0.94%  6.3542ms       712  8.9240us     119ns  428.32us  cuDeviceGetAttribute
  0.72%  4.8747ms         8  609.33us  320.37us  885.26us  cuDeviceTotalMem
  0.10%  693.60us        82  8.4580us  2.8370us  72.004us  cudaMemcpyAsync
  0.08%  511.79us         1  511.79us  511.79us  511.79us  cudaHostAlloc
  0.08%  511.75us         8  63.969us  41.317us  99.232us  cuDeviceGetName
  0.05%  310.04us         1  310.04us  310.04us  310.04us  cuModuleLoadData
  0.03%  234.87us        24  9.7860us  5.7190us  50.465us  cudaLaunch
  0.01%  50.874us         2  25.437us  16.898us  33.976us  cuLaunchKernel
  0.01%  49.923us         2  24.961us  15.602us  34.321us  cudaMemcpy
  0.01%  47.622us         4  11.905us  8.6190us  19.889us  cudaMemsetAsync
  0.01%  44.811us         2  22.405us  9.5590us  35.252us  cudaStreamDestroy
  0.01%  35.136us        27  1.3010us     289ns  5.8480us  cudaGetDevice
  0.00%  31.113us        24  1.2960us     972ns  3.2380us  cudaStreamSynchronize
  0.00%  30.736us         2  15.368us  4.4580us  26.278us  cudaStreamCreate
  0.00%  13.932us        17     819ns     414ns  3.7090us  cudaEventCreateWithFlags
  0.00%  13.678us        70     195ns     130ns     801ns  cudaSetupArgument
  0.00%  12.050us         4  3.0120us  2.1290us  4.5130us  cudaFuncGetAttributes
  0.00%  10.407us        22     473ns     268ns  1.9540us  cudaDeviceGetAttribute
  0.00%  10.370us        40     259ns     126ns  1.4100us  cudaGetLastError
  0.00%  9.9680us        16     623ns     185ns  2.9600us  cuDeviceGet

Modules Available

The following modules are available:

module add cuda/6.5.14 (to be retired)
module add cuda/7.5.18 (to be retired)
module add cuda/8.0.61
module add cuda/9.0.176
module add cuda/11.0.3 (to be removed - superseded)
module add cuda/11.5.0

Compilation

The program would be compiled using NVIDIA's own compiler:

[username@login01 ~]$ module add cuda/11.5.0
[username@login01 ~]$ nvcc -o testGPU testGPU.cu

Usage Examples

Batch example


#!/bin/bash

#SBATCH -J gpu-cuda
#SBATCH -N 1
#SBATCH --ntasks-per-node 1
#SBATCH -o %N.%j.%a.out
#SBATCH -e %N.%j.%a.err
#SBATCH -p gpu
#SBATCH --gres=gpu:tesla
#SBATCH --exclusive
#SBATCH --mail-user= your email address here

module add cuda/11.5.0

/home/user/CUDA/testGPU

[username@login01 ~]$ sbatch demoCUDA.job
Submitted batch job 290552

Alternatives to CUDA

OpenACC (part of later gcc compilers)
OpenCL (used for DSP, FPGAs too)
openMP to GPU pragmas ( >version 4, CPU and GPU)
MPI (CPU nodes only)

HPC