Programming/Cuda

Programming Details

CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows you to program a CUDA-enabled graphics processing unit (GPU) for general-purpose processing.

The CUDA platform is designed to work with programming languages such as C and C++.

GPU hardware

Abstraction

GPU hardware consists of a number of key blocks (a short kernel sketch illustrating them follows the list):

  • Memory (global, constant, shared)
  • Streaming multiprocessors (SMs)
  • Streaming processors (SPs)
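
A minimal sketch (the kernel name and tile size are hypothetical) showing how these blocks appear to the programmer: a value held in constant memory, input and output arrays in global memory, a per-block staging tile in shared memory, and thread blocks that the hardware schedules onto the SMs:

__constant__ float scale;                // constant memory: small, read-only, cached
                                         // (set from the host with cudaMemcpyToSymbol)

__global__ void blockSums(const float *in, float *out, int n)  // in/out live in global memory
{
  __shared__ float tile[256];            // shared memory: on-chip, one copy per thread block

  int i = blockIdx.x*blockDim.x + threadIdx.x;
  tile[threadIdx.x] = (i < n) ? in[i]*scale : 0.0f;
  __syncthreads();                       // wait until the whole block has filled its tile

  if (threadIdx.x == 0)                  // one thread per block sums its tile
  {
    float s = 0.0f;
    for (int j = 0; j < blockDim.x; j++)
      s += tile[j];
    out[blockIdx.x] = s;
  }
}
// launch with at most 256 threads per block, e.g. blockSums<<<(n+255)/256, 256>>>(d_in, d_out, n);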


Specification

Viper has 4 A40 GPUs (GPU01-GPU04) and 1 P100 (GPU05).

Nvidia A40

Key features of the Ampere A40 GPU accelerator include:

  • 48 GB GDDR6 with ECC (696 GB/s bandwidth)
  • 10,752 CUDA cores
  • 84 second-generation RT cores
  • 336 third-generation Tensor cores

Nvidia P100

Key features of the P100 GPU accelerator include:

  • 16 GB HBM2 memory (720 GB/s bandwidth)
  • 3,584 CUDA cores
  • PCI Express 3.0 x16 bus; OpenCL and OpenACC APIs supported

Compute capability

For applications that require the GPU's CUDA compute capability (a runtime query sketch follows the list):

  • A40 is 8.6
  • P100 is 6.0
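
A minimal sketch for querying this at runtime, using the standard CUDA runtime call cudaGetDeviceProperties; the variable names are illustrative:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int d = 0; d < count; d++)
  {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, d);  // fills in name, SM count, memory size, ...
    printf("Device %d: %s, compute capability %d.%d\n",
           d, prop.name, prop.major, prop.minor);
  }
  return 0;
}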

Programming example

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;  // 1M elements

  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++)
  {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  // every element should now be 2.0f*1.0f + 2.0f = 4.0f
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}

Profiling

Nvidia provides a good profiling tool called nvprof; its invocation is shown below:

$ module load cuda/11.5.0
$ nvprof ./mycudaprogram

This can also be used with Python programs:

$ nvprof --print-gpu-trace python train_mnist.py

An example of nvprof output is shown below:

$ nvprof python examples/stream/cusolver.py
==27986== NVPROF is profiling process 27986, command: python examples/stream/cusolver.py
==27986== Profiling application: python examples/stream/cusolver.py
==27986== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 41.70%  125.73us         4  31.431us  30.336us  33.312us  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>)
 21.94%  66.144us        36  1.8370us  1.7600us  2.1760us  [CUDA memcpy DtoH]
 13.77%  41.536us        48     865ns     800ns  1.4400us  [CUDA memcpy HtoD]
  3.02%  9.1200us         2  4.5600us  3.8720us  5.2480us  void syhemv_kernel<double, int=64, int=128, int=4, int=5, bool=1, bool=0>(cublasSyhemvParams<double>)
  2.65%  8.0000us         2  4.0000us  3.8720us  4.1280us  void gemv2T_kernel_val<double, double, double, int=128, int=16, int=2, int=2, bool=0>(int, int, double, double const *, int, double const *, i
nt, double, double*, int)
  2.63%  7.9360us         2  3.9680us  3.8720us  4.0640us  cupy_copy
  2.44%  7.3600us         2  3.6800us  3.1680us  4.1920us  void syr2_kernel<double, int=128, int=5, bool=1>(cublasSyher2Params<double>, int, double const *, double)
  2.23%  6.7200us         2  3.3600us  3.2960us  3.4240us  void dot_kernel<double, double, double, int=128, int=0, int=0>(cublasDotParams<double, double>)
  1.88%  5.6640us         2  2.8320us  2.7840us  2.8800us  void reduce_1Block_kernel<double, double, double, int=128, int=7>(double*, int, double*)
  1.74%  5.2480us         2  2.6240us  2.5600us  2.6880us  void ger_kernel<double, double, int=256, int=5, bool=0>(cublasGerParams<double, double>)
  1.57%  4.7360us         2  2.3680us  2.1760us  2.5600us  void axpy_kernel_val<double, double, int=0>(cublasAxpyParamsVal<double, double, double>)
  1.28%  3.8720us         2  1.9360us  1.7920us  2.0800us  void lacpy_kernel<double, int=5, int=3>(int, int, double const *, int, double*, int, int, int)
  1.19%  3.5840us         2  1.7920us  1.6960us  1.8880us  void scal_kernel_val<double, double, int=0>(cublasScalParamsVal<double, double>)
  0.98%  2.9440us         2  1.4720us  1.2160us  1.7280us  void reset_diagonal_real<double, int=8>(int, double*, int)
  0.98%  2.9440us         4     736ns     736ns     736ns  [CUDA memset]

==27986== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 60.34%  408.55ms         9  45.395ms  4.8480us  407.94ms  cudaMalloc
 37.60%  254.60ms         2  127.30ms     556ns  254.60ms  cudaFree
  0.94%  6.3542ms       712  8.9240us     119ns  428.32us  cuDeviceGetAttribute
  0.72%  4.8747ms         8  609.33us  320.37us  885.26us  cuDeviceTotalMem
  0.10%  693.60us        82  8.4580us  2.8370us  72.004us  cudaMemcpyAsync
  0.08%  511.79us         1  511.79us  511.79us  511.79us  cudaHostAlloc
  0.08%  511.75us         8  63.969us  41.317us  99.232us  cuDeviceGetName
  0.05%  310.04us         1  310.04us  310.04us  310.04us  cuModuleLoadData
  0.03%  234.87us        24  9.7860us  5.7190us  50.465us  cudaLaunch
  0.01%  50.874us         2  25.437us  16.898us  33.976us  cuLaunchKernel
  0.01%  49.923us         2  24.961us  15.602us  34.321us  cudaMemcpy
  0.01%  47.622us         4  11.905us  8.6190us  19.889us  cudaMemsetAsync
  0.01%  44.811us         2  22.405us  9.5590us  35.252us  cudaStreamDestroy
  0.01%  35.136us        27  1.3010us     289ns  5.8480us  cudaGetDevice
  0.00%  31.113us        24  1.2960us     972ns  3.2380us  cudaStreamSynchronize
  0.00%  30.736us         2  15.368us  4.4580us  26.278us  cudaStreamCreate
  0.00%  13.932us        17     819ns     414ns  3.7090us  cudaEventCreateWithFlags
  0.00%  13.678us        70     195ns     130ns     801ns  cudaSetupArgument
  0.00%  12.050us         4  3.0120us  2.1290us  4.5130us  cudaFuncGetAttributes
  0.00%  10.407us        22     473ns     268ns  1.9540us  cudaDeviceGetAttribute
  0.00%  10.370us        40     259ns     126ns  1.4100us  cudaGetLastError
  0.00%  9.9680us        16     623ns     185ns  2.9600us  cuDeviceGet



Modules Available

The following modules are available:

  • module add cuda/8.0.61
  • module add cuda/9.0.176
  • module add cuda/10.1.168
  • module add cuda/11.5.0

Compilation

The program is compiled using NVIDIA's own compiler, nvcc:

[username@login01 ~]$ module add cuda/11.5.0
[username@login01 ~]$ nvcc -o testGPU testGPU.cu

Usage Examples

Batch example


#!/bin/bash

#SBATCH -J gpu-cuda
#SBATCH -N 1
#SBATCH --ntasks-per-node 1
#SBATCH -o %N.%j.%a.out
#SBATCH -e %N.%j.%a.err
#SBATCH -p gpu
#SBATCH --gres=gpu
#SBATCH --exclusive
#SBATCH --mail-user=<your email address here>

module add cuda/11.5.0

/home/user/CUDA/testGPU


[username@login01 ~]$ sbatch demoCUDA.job
Submitted batch job 290552

Alternatives to CUDA

  • OpenACC (supported by later GCC compilers; see the sketch after this list)
  • OpenCL (also used for DSPs and FPGAs)
  • OpenMP GPU offload pragmas (version 4.0 onwards; targets CPU and GPU)
  • MPI (CPU nodes only)
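
As an illustration, a minimal OpenACC sketch of the same SAXPY loop from the programming example above; the pragma uses standard OpenACC C clauses, and the function name is illustrative:

void saxpy(int n, float a, float *x, float *y)
{
  // offload the loop to the accelerator: copy x in, copy y in and back out
  #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}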

Next Steps

  • http://www.nvidia.com/object/cuda_home_new.html
  • Youtube - CUDA training video: https://www.youtube.com/watch?v=_41LCMFpsFs&t=6s