Difference between revisions of "Programming/Cuda"
From HPC
Revision as of 08:19, 17 June 2019
Programming Details
CUDA is a parallel computing platform and application programming interface (API) created by NVidia. It allows you to program a CUDA-enabled graphics processing unit (GPU) for general-purpose processing.
The CUDA platform is designed to work with programming languages such as C and C++. CUDA Fortran is possible through the PGI Fortran compiler, which is now available.
GPU hardware
Abstraction
GPU hardware consists of a number of key blocks:
- Memory (global, constant, shared)
- Streaming multiprocessors (SMs)
- Streaming processors (SPs)
Specification
Viper has 4 K40m (GPU01-GPU04) and 1 P100 (GPU05)
NVidia K40m
Key features of the Tesla K40 GPU accelerator include:
- 12GB of ultra-fast GDDR5 memory allows users to process 2X larger datasets, enabling them to rapidly analyze massive volumes of data.
- 2,880 CUDA® parallel processing cores deliver application acceleration by up to 10X compared to using a CPU alone.
- Dynamic Parallelism enables GPU threads to dynamically spawn new threads, enabling users to quickly and easily crunch through adaptive and dynamic data structures.
- PCIe Gen-3 interconnect support accelerates data movement by 2X compared to PCIe Gen-2 technology.
NVidia P100
Key features of the P100 GPU accelerator include:
- 16GB of HBM2 memory with a bandwidth of 720 GB/s.
- 3,584 CUDA parallel processing cores.
- PCI Express 3.0 x16 bus interface; supported APIs include OpenCL and OpenACC.
Compute capability
For applications that require this information, the compute capability of each card is:
- K40m: 3.5
- P100: 6.0
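The compute capability can be passed to NVIDIA's nvcc compiler to generate code for a specific card. A sketch using the standard -arch flag (exact flag support depends on the CUDA version loaded):

```
nvcc -o testGPU testGPU.cu -arch=sm_35   # target the K40m (compute capability 3.5)
nvcc -o testGPU testGPU.cu -arch=sm_60   # target the P100 (compute capability 6.0)
```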
Programming example
#include <stdio.h>
#include <math.h>

__global__ void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;  /* 1M elements (1<<31 would overflow a signed int) */
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
Modules Available
The following modules are available:
- module add cuda/6.5.14 (to be retired)
- module add cuda/7.5.18 (to be retired)
- module add cuda/8.0.61
- module add cuda/9.0.176
Compilation
The program would be compiled using NVIDIA's own compiler:
[username@login01 ~]$ module add cuda/9.0.176
[username@login01 ~]$ nvcc -o testGPU testGPU.cu
Usage Examples
Batch example
#!/bin/bash
#SBATCH -J gpu-cuda
#SBATCH -N 1
#SBATCH --ntasks-per-node 1
#SBATCH -o %N.%j.%a.out
#SBATCH -e %N.%j.%a.err
#SBATCH -p gpu
#SBATCH --gres=gpu:tesla
#SBATCH --exclusive

module add cuda/10.1.168

/home/user/CUDA/testGPU
[username@login01 ~]$ sbatch demoCUDA.job
Submitted batch job 290552
Alternatives to CUDA
- OpenACC (supported by later GCC compilers)
- OpenCL (also used for DSPs and FPGAs)
- OpenMP GPU-offload pragmas (version 4.0 and later; CPU and GPU)
- MPI (CPU nodes only)