Programming/Cuda
Programming Details
CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows you to program a CUDA-enabled graphics processing unit (GPU) for general-purpose processing.
The CUDA platform is designed to work with programming languages such as C and C++.
GPU hardware
Abstraction
GPU hardware consists of a number of key blocks:
- Memory (global, constant, shared)
- Streaming multiprocessors (SMs)
- Streaming processors (SPs)
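These blocks are visible to programs through the CUDA runtime API. As a minimal sketch (device names and figures will vary by GPU), each device's properties can be queried with cudaGetDeviceProperties:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // Global/constant/shared memory sizes and the SM count correspond
        // to the hardware blocks listed above
        printf("Device %d: %s\n", d, prop.name);
        printf("  Global memory:    %zu MB\n", prop.totalGlobalMem >> 20);
        printf("  Constant memory:  %zu KB\n", prop.totalConstMem >> 10);
        printf("  Shared mem/block: %zu KB\n", prop.sharedMemPerBlock >> 10);
        printf("  SMs:              %d\n", prop.multiProcessorCount);
    }
    return 0;
}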
Specification
Viper has four A40 GPUs (GPU01-GPU04) and one P100 (GPU05).
NVidia A40
Key features of the Ampere A40 GPU accelerator include:
- 48 GB GDDR6 with ECC (696 GB/s bandwidth)
- 10,752 CUDA cores
- 84 second-generation RT cores
- 336 third-generation Tensor cores
NVidia P100
Key features of the P100 GPU accelerator include:
- 16 GB HBM2 memory (720 GB/s bandwidth)
- 3,584 CUDA cores
- PCI Express 3.0 x16 bus; OpenCL and OpenACC APIs supported
Compute capability
For applications that require this information:
- A40 is 8.6
- P100 is 6.0
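The compute capability can also be passed to the compiler so that code is generated for a specific GPU. A sketch of an nvcc invocation producing a binary covering both of Viper's GPU types (assuming a CUDA module is already loaded):

# Build a fat binary targeting both the P100 (sm_60) and the A40 (sm_86)
nvcc -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_86,code=sm_86 \
     -o testGPU testGPU.cu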
Programming example
#include <stdio.h>
#include <math.h>

__global__ void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
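Note that the CUDA runtime calls and the kernel launch in the example above are unchecked and fail silently. A minimal sketch of the usual checking pattern that could be wrapped around them:

// Runtime API calls return a cudaError_t that should be inspected
cudaError_t err = cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));

// Kernel launches return nothing; query launch and execution errors explicitly
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
err = cudaGetLastError();           // error from the launch itself
if (err == cudaSuccess)
    err = cudaDeviceSynchronize();  // error raised while the kernel ran
if (err != cudaSuccess)
    fprintf(stderr, "saxpy failed: %s\n", cudaGetErrorString(err));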
Profiling
Nvidia provides a good profiling tool called nvprof; its invocation is shown below:
$ module load cuda/11.5.0
$ nvprof ./mycudaprogram
This also works for Python programs that use the GPU:
$ nvprof --print-gpu-trace python train_mnist.py
An example of nvprof output is shown below:
$ nvprof python examples/stream/cusolver.py
==27986== NVPROF is profiling process 27986, command: python examples/stream/cusolver.py
==27986== Profiling application: python examples/stream/cusolver.py
==27986== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 41.70%  125.73us         4  31.431us  30.336us  33.312us  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>)
 21.94%  66.144us        36  1.8370us  1.7600us  2.1760us  [CUDA memcpy DtoH]
 13.77%  41.536us        48     865ns     800ns  1.4400us  [CUDA memcpy HtoD]
  3.02%  9.1200us         2  4.5600us  3.8720us  5.2480us  void syhemv_kernel<double, int=64, int=128, int=4, int=5, bool=1, bool=0>(cublasSyhemvParams<double>)
  2.65%  8.0000us         2  4.0000us  3.8720us  4.1280us  void gemv2T_kernel_val<double, double, double, int=128, int=16, int=2, int=2, bool=0>(int, int, double, double const *, int, double const *, int, double, double*, int)
  2.63%  7.9360us         2  3.9680us  3.8720us  4.0640us  cupy_copy
  2.44%  7.3600us         2  3.6800us  3.1680us  4.1920us  void syr2_kernel<double, int=128, int=5, bool=1>(cublasSyher2Params<double>, int, double const *, double)
  2.23%  6.7200us         2  3.3600us  3.2960us  3.4240us  void dot_kernel<double, double, double, int=128, int=0, int=0>(cublasDotParams<double, double>)
  1.88%  5.6640us         2  2.8320us  2.7840us  2.8800us  void reduce_1Block_kernel<double, double, double, int=128, int=7>(double*, int, double*)
  1.74%  5.2480us         2  2.6240us  2.5600us  2.6880us  void ger_kernel<double, double, int=256, int=5, bool=0>(cublasGerParams<double, double>)
  1.57%  4.7360us         2  2.3680us  2.1760us  2.5600us  void axpy_kernel_val<double, double, int=0>(cublasAxpyParamsVal<double, double, double>)
  1.28%  3.8720us         2  1.9360us  1.7920us  2.0800us  void lacpy_kernel<double, int=5, int=3>(int, int, double const *, int, double*, int, int, int)
  1.19%  3.5840us         2  1.7920us  1.6960us  1.8880us  void scal_kernel_val<double, double, int=0>(cublasScalParamsVal<double, double>)
  0.98%  2.9440us         2  1.4720us  1.2160us  1.7280us  void reset_diagonal_real<double, int=8>(int, double*, int)
  0.98%  2.9440us         4     736ns     736ns     736ns  [CUDA memset]

==27986== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 60.34%  408.55ms         9  45.395ms  4.8480us  407.94ms  cudaMalloc
 37.60%  254.60ms         2  127.30ms     556ns  254.60ms  cudaFree
  0.94%  6.3542ms       712  8.9240us     119ns  428.32us  cuDeviceGetAttribute
  0.72%  4.8747ms         8  609.33us  320.37us  885.26us  cuDeviceTotalMem
  0.10%  693.60us        82  8.4580us  2.8370us  72.004us  cudaMemcpyAsync
  0.08%  511.79us         1  511.79us  511.79us  511.79us  cudaHostAlloc
  0.08%  511.75us         8  63.969us  41.317us  99.232us  cuDeviceGetName
  0.05%  310.04us         1  310.04us  310.04us  310.04us  cuModuleLoadData
  0.03%  234.87us        24  9.7860us  5.7190us  50.465us  cudaLaunch
  0.01%  50.874us         2  25.437us  16.898us  33.976us  cuLaunchKernel
  0.01%  49.923us         2  24.961us  15.602us  34.321us  cudaMemcpy
  0.01%  47.622us         4  11.905us  8.6190us  19.889us  cudaMemsetAsync
  0.01%  44.811us         2  22.405us  9.5590us  35.252us  cudaStreamDestroy
  0.01%  35.136us        27  1.3010us     289ns  5.8480us  cudaGetDevice
  0.00%  31.113us        24  1.2960us     972ns  3.2380us  cudaStreamSynchronize
  0.00%  30.736us         2  15.368us  4.4580us  26.278us  cudaStreamCreate
  0.00%  13.932us        17     819ns     414ns  3.7090us  cudaEventCreateWithFlags
  0.00%  13.678us        70     195ns     130ns     801ns  cudaSetupArgument
  0.00%  12.050us         4  3.0120us  2.1290us  4.5130us  cudaFuncGetAttributes
  0.00%  10.407us        22     473ns     268ns  1.9540us  cudaDeviceGetAttribute
  0.00%  10.370us        40     259ns     126ns  1.4100us  cudaGetLastError
  0.00%  9.9680us        16     623ns     185ns  2.9600us  cuDeviceGet
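Note that nvprof is deprecated in recent CUDA releases and does not support devices of compute capability 8.0 or above, so it will not profile the A40 nodes. NVIDIA's replacement tools are Nsight Systems and Nsight Compute; assuming they are available alongside the CUDA toolkit, the equivalent invocations are:

$ nsys profile -o myreport ./mycudaprogram    # timeline profile (Nsight Systems)
$ ncu ./mycudaprogram                         # per-kernel metrics (Nsight Compute)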
Modules Available
The following modules are available:
- module add cuda/8.0.61
- module add cuda/9.0.176
- module add cuda/10.1.168
- module add cuda/11.5.0
Compilation
The program is compiled using NVIDIA's own compiler, nvcc:
[username@login01 ~]$ module add cuda/11.5.0
[username@login01 ~]$ nvcc -o testGPU testGPU.cu
Usage Examples
Batch example
#!/bin/bash
#SBATCH -J gpu-cuda
#SBATCH -N 1
#SBATCH --ntasks-per-node 1
#SBATCH -o %N.%j.%a.out
#SBATCH -e %N.%j.%a.err
#SBATCH -p gpu
#SBATCH --gres=gpu
#SBATCH --exclusive
#SBATCH --mail-user=<your email address>

module add cuda/11.5.0

/home/user/CUDA/testGPU
[username@login01 ~]$ sbatch demoCUDA.job
Submitted batch job 290552
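Once submitted, the job can be monitored or cancelled with the standard Slurm commands, for example:

$ squeue -u username    # show the state of your queued and running jobs
$ scancel 290552        # cancel the job by its job ID if needed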
Alternatives to CUDA
- OpenACC (supported by later GCC compilers; see the sketch after this list)
- OpenCL (also used for DSPs and FPGAs)
- OpenMP GPU offload pragmas (OpenMP 4.0 and later; targets both CPU and GPU)
- MPI (CPU nodes only)
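As an illustration of the directive-based alternatives, here is a minimal OpenACC sketch of the SAXPY kernel from the programming example above; this is an outline only, compilable with, e.g., gcc -fopenacc or NVIDIA's nvc -acc:

// SAXPY offloaded with OpenACC directives instead of CUDA
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    // copyin: x is only read on the device; copy: y is read and written back
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

Next Steps

- http://www.nvidia.com/object/cuda_home_new.html
- Youtube - CUDA training video: https://www.youtube.com/watch?v=_41LCMFpsFs&t=6s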