Programming Details

CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows you to program a CUDA-enabled graphics processing unit (GPU) for general-purpose processing.

The CUDA platform is designed to work with programming languages such as C and C++. CUDA Fortran is possible through the PGI Fortran compiler, which is now available.

GPU hardware


GPU hardware consists of a number of key blocks:

  • Memory (global, constant, shared)
  • Streaming multiprocessors (SMs)
  • Streaming processors (SPs)


Viper has 4 K40m (GPU01-GPU04) and 1 P100 (GPU05)

NVIDIA K40m

Key features of the Tesla K40 GPU accelerator include:

  • 12GB of ultra-fast GDDR5 memory allows users to process 2X larger datasets, enabling them to rapidly analyze massive volumes of data.
  • 2,880 CUDA® parallel processing cores deliver application acceleration by up to 10X compared to using a CPU alone.
  • Dynamic Parallelism enables GPU threads to dynamically spawn new threads, enabling users to quickly and easily crunch through adaptive and dynamic data structures.
  • PCIe Gen-3 interconnect support accelerates data movement by 2X compared to PCIe Gen-2 technology.

NVIDIA P100

Key features of the P100 GPU accelerator include:

  • 16GB of HBM2 memory (memory bandwidth 720 GB/s)
  • 3,584 CUDA cores (graphics engine: NVIDIA Tesla P100)
  • PCI Express 3.0 x16 bus; supported APIs include OpenCL and OpenACC

Compute capability

For applications or build options that require the compute capability of the GPU (for example, nvcc's -arch=sm_35 or -arch=sm_60 target):

  • K40m is 3.5
  • P100 is 6.0

Programming example

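#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// SAXPY kernel: each thread computes one element of y = a*x + y
__global__ void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)   // guard: the last block may have more threads than remaining elements
    y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;   // 1M elements

  float *x, *y, *d_x, *d_y;
  // Host arrays
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  // Device arrays
  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements, using 256 threads per block
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  // Copy the result back; this call also waits for the kernel to finish
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  // Every element should be 2.0*1.0 + 2.0 = 4.0
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
  return 0;
}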

Modules Available

The following modules are available:

  • module add cuda/6.5.14 (to be retired)
  • module add cuda/7.5.18 (to be retired)
  • module add cuda/8.0.61
  • module add cuda/9.0.176


The program is compiled using NVIDIA's own compiler, nvcc (the source file name testGPU.cu below is an assumed example):

[username@login01 ~]$ module add cuda/9.0.176
[username@login01 ~]$ nvcc -o testGPU testGPU.cu

Usage Examples

Batch example


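#!/bin/bash
#SBATCH -J gpu-cuda
#SBATCH --ntasks-per-node 1
#SBATCH -o %N.%j.%a.out
#SBATCH -e %N.%j.%a.err
#SBATCH -p gpu
#SBATCH --gres=gpu:tesla
#SBATCH --exclusive

module add cuda/10.1.168

# Run the CUDA executable (assumed here to be the testGPU binary built above)
./testGPU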


[username@login01 ~]$ sbatch demoCUDA.job
Submitted batch job 290552

Alternatives to CUDA

  • OpenACC (supported by later versions of the GCC compilers)
  • OpenCL (also used for DSPs and FPGAs)
  • OpenMP offloading to GPUs via pragmas (version 4.0 onwards, CPU and GPU)
  • MPI (CPU nodes only)

Further Information