Programming/Cuda

Programming Details

CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows you to program a CUDA-enabled graphics processing unit (GPU) for general-purpose processing.

The CUDA platform is designed to work with programming languages such as C and C++. CUDA Fortran is also possible through the PGI Fortran compiler, which is now available.

GPU hardware

Abstraction

GPU hardware consists of a number of key blocks:

Memory (global, constant, shared)
Streaming multiprocessors (SMs)
Streaming processors (SPs)

Specification

Viper has 4 K40m (GPU01-GPU04) and 1 P100 (GPU05)

NVIDIA K40m

Key features of the Tesla K40 GPU accelerator include:

12GB of ultra-fast GDDR5 memory allows users to process 2X larger datasets, enabling them to rapidly analyze massive volumes of data.
2,880 CUDA® parallel processing cores deliver application acceleration by up to 10X compared to using a CPU alone.
Dynamic Parallelism enables GPU threads to dynamically spawn new threads, enabling users to quickly and easily crunch through adaptive and dynamic data structures.
PCIe Gen-3 interconnect support accelerates data movement by 2X compared to PCIe Gen-2 technology.

NVIDIA P100

Key features of the P100 GPU accelerator include:

16GB of HBM2 memory (bandwidth 720 GB/s)
3,584 CUDA cores (NVIDIA Tesla P100 graphics engine)
PCI Express 3.0 x16 bus interface
Supported APIs include OpenCL and OpenACC

Compute capability

For applications that require this information:

K40m is 3.5
P100 is 6.0

Programming example

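#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// SAXPY kernel: each thread computes one element of y = a*x + y
__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)  // guard the surplus threads in the last block
    y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;  // 1M elements

  float *x, *y, *d_x, *d_y;

  // Host arrays
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  // Device arrays
  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++)
  {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements, using 256 threads per block
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  // Copy the result back to the host
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  // Every element should be 2.0*1.0 + 2.0 = 4.0
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  // Release device and host memory
  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
  return 0;
}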

Modules Available

The following modules are available:

module add cuda/6.5.14 (to be retired)
module add cuda/7.5.18 (to be retired)
module add cuda/8.0.61
module add cuda/9.0.176

Compilation

The program is compiled with nvcc, NVIDIA's CUDA compiler:

[username@login01 ~]$ module add cuda/9.0.176
[username@login01 ~]$ nvcc -o testGPU testGPU.cu

Usage Examples

Batch example


#!/bin/bash

#SBATCH -J gpu-cuda
#SBATCH -N 1
#SBATCH --ntasks-per-node 1
#SBATCH -o %N.%j.%a.out
#SBATCH -e %N.%j.%a.err
#SBATCH -p gpu
#SBATCH --gres=gpu:tesla
#SBATCH --exclusive

module add cuda/9.0.176

/home/user/CUDA/testGPU

[username@login01 ~]$ sbatch demoCUDA.job
Submitted batch job 290552

Alternatives to CUDA

OpenACC (supported by later GCC compilers)
OpenCL (also used for DSPs and FPGAs)
OpenMP GPU offload pragmas (version 4.0 and later, CPU and GPU)
MPI (CPU nodes only)
