Training/openMP

Introduction

OpenMP allows threaded programming across a shared memory system, so on our HPC, this means utilizing more than one processing core across one computing node.

A shared memory computer consists of a number of processing cores together with some memory. A shared memory system presents a single address space across the whole memory system.

  • Every processing core can read and write all memory locations in the system
  • One logical memory space
  • All cores refer to a memory location using the same address

Programming model

Within the shared memory model we use threads, which share memory with all the other threads. Threads have the following characteristics (a short sketch of shared and private data follows this list):

  • Private data can only be accessed by the thread owning it
  • Each thread can run simultaneously with other threads but also asynchronously, so we need to be careful of race conditions.
  • Usually we have one thread per processing core, although there may be hardware support for more (e.g. hyper-threading)
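
As a rough illustration of private versus shared data, here is a minimal sketch (not part of the original course material) using the shared clause and a critical section; the variable names are made up for the example:

#include <omp.h>
#include <stdio.h>

int main()
{
    int shared_total = 0;                        /* shared: visible to every thread */

    #pragma omp parallel shared(shared_total)
    {
        int private_id = omp_get_thread_num();   /* declared inside the region, so private to each thread */

        #pragma omp critical                     /* protect the update to the shared variable */
        shared_total += private_id;
    }

    printf("sum of thread IDs = %d\n", shared_total);
    return 0;
}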

Thread Synchronization

As previously mentioned, threads execute asynchronously, which means each thread proceeds through the program instructions independently of other threads.

Although this makes for a very flexible system, we must be very careful that actions on shared variables occur in the correct order.

  • e.g. if thread 1 reads a variable before thread 2 has written to it, the program will use the wrong value (and may even crash); likewise, if different threads update a shared variable at the same time, one of the updates may be overwritten.

To prevent this happening we must either use data that is independent between threads (i.e. different parts of an array) or perform some sort of synchronization within the code so that different threads reach the same point at the same time.
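
As a sketch of the first approach, each thread below writes only to its own element of an array, so no synchronization is needed; the array name is made up for the example:

#include <omp.h>
#include <stdio.h>

#define MAX_THREADS 64           /* assumed upper bound on the number of threads */

int main()
{
    int results[MAX_THREADS];

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        results[id] = id * id;   /* each thread writes only its own element, so there is no race */
    }

    printf("thread 0 computed %d\n", results[0]);
    return 0;
}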

First threaded program

The most basic C program looks like the following:

#include <stdio.h>
int main()
{
    printf("hello world\n");
    return 0;
}

To thread this we must tell the compiler which parts of the program to run as threads:

#include <omp.h>
#include <stdio.h>
int main()
{
    #pragma omp parallel
    {
        printf("hello ");
        printf("world\n");
    }
    return 0;
}

Let us look at the extra components that make this a parallel threaded program:

  • We have an OpenMP include file (#include <omp.h>)
  • We use #pragma omp parallel, which tells the compiler that the following region within the { } is going to be executed by a team of threads

To compile this we use the command:

$ gcc -fopenmp myprogram.c -o myprogram (for the gcc compiler), or

$ icc -fopenmp myprogram.c -o myprogram (for the Intel compiler)

And when we run this we would get something like the following:

$ ./myprogram
hello hello world
world
hello hello world
world
  • Not very coherent, but remember the threads all executed at different times, that is, asynchronously, and this is why we must be very careful about communicating between different threads of the same program.

Second threaded program

Although the previous program is threaded, it does not represent a real-world example.

#include <omp.h>
#include <stdio.h>

void pooh(int ID, double A[]);   /* user-supplied function that does work on A */

int main()
{
    double A[1000];
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        pooh(ID, A);
    }
    return 0;
}

Here each thread executes the same code independently; the only difference is that the OMP thread ID is passed to each call of pooh(ID,A).

All threads again wait at the closing brace of the parallel region before proceeding (i.e. an implicit synchronization barrier).

In this program we always expect 4 threads to be given to us by the underlying operating system. Unfortunately, this may not happen and we will be allocated what the scheduler is prepared to offer. This could cause us serious program difficulties if we rely on a fixed number of threads every time.

We must ask the OpenMP library (at runtime) how many threads we actually got; this is done with the following code:

#include <omp.h>
#include <stdio.h>

void pooh(int ID, double A[]);   /* user-supplied function that does work on A */

int main()
{
    double A[1000];
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        int nthrds = omp_get_num_threads();   /* how many threads we actually got */
        pooh(ID, A);
    }
    return 0;
}
  • Each thread calls pooh(ID,A) for ID = 0 to nthrds-1
  • This program hard-codes the number of threads requested to 4, which isn't good practice and would require re-compiling every time we changed it. A better way is to set the environment variable OMP_NUM_THREADS and remove the omp_set_num_threads(4) call:
$ export OMP_NUM_THREADS=4
$ ./myprogram
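
As a quick check, a minimal sketch that reports how many threads the runtime actually granted (so you can confirm OMP_NUM_THREADS took effect) might look like this:

#include <omp.h>
#include <stdio.h>

int main()
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();   /* how many threads we were actually given */
        if (id == 0)
            printf("running with %d threads\n", nthrds);
    }
    return 0;
}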

Parallel loops

Loops are the main source of parallelism in many applications. If the iterations of a loop are independent (can be done in any order) then we can share out the iterations between different threads. OpenMP has a native call to do this efficiently:


int loop;
#pragma omp parallel
{
    #pragma omp for
    for (loop = 0; loop < N; loop++)
    {
        do_threaded_task(loop);
    }
}

This is a much neater way and allows the compiler to perform optimizations automatically (unless otherwise stated). The variable loop is made 'private' to each thread by default. Also, all threads have to wait at the end of the parallel loop before proceeding past the end of this region.

One OpenMP shortcut is to put the parallel pragma in the parallel for part, which just makes the code more readable.

#define MAX 100

double data[MAX];
int loop;
#pragma omp parallel for
for (loop = 0; loop < MAX; loop++)
{
    data[loop] = process_data(loop);
}

A side effect of threading called false sharing can cause poor scaling. If independent data elements happen to sit on the same cache line, each update will cause the cache line to “slosh back and forth” between threads, and the variable has to be re-loaded from main memory each time.

One solution is to pad arrays so elements sit on distinct cache lines; another is to let a single thread operate on a whole cache line. A better way is to re-write the program so the effect is avoided altogether. A sketch of the padding approach is shown first, followed by the re-written program.
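
First, a rough sketch of the padding approach, assuming a 64-byte cache line (so 8 doubles of padding per thread); it is an illustration rather than the course's own example:

#include <omp.h>
#include <stdio.h>

#define NUM_THREADS 4
#define PAD 8                       /* assumed: 64-byte cache line / sizeof(double) */

static long num_steps = 100000;

int main()
{
    double sum[NUM_THREADS][PAD];   /* padding keeps each thread's partial sum on its own cache line */
    double step = 1.0/(double) num_steps;
    int nthreads = NUM_THREADS;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x;
        if (id == 0)
            nthreads = nthrds;       /* record how many threads we actually got */
        sum[id][0] = 0.0;
        for (i = id; i < num_steps; i += nthrds)
        {
            x = (i + 0.5)*step;
            sum[id][0] += 4.0/(1.0 + x*x);
        }
    }

    double pi = 0.0;
    for (int t = 0; t < nthreads; t++)   /* serial sum of the padded partial results */
        pi += sum[t][0] * step;
    printf("pi = %f\n", pi);
    return 0;
}

The re-written program that avoids the shared array altogether looks like this: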


#include <omp.h>
static long num_steps = 100000;
double step;

#define NUM_THREADS 2
int main()
{
    int nthreads;
    double pi = 0.0;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel
    {
        int i, id, nthrds;
        double x, sum;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();

        if (id == 0)
            nthreads = nthrds;
        for (i = id, sum = 0.0; i < num_steps; i = i + nthrds)
        {
            x = (i + 0.5)*step;
            sum += 4.0/(1.0 + x*x);
        }
        #pragma omp critical
        pi += sum * step;
    }
    return 0;
}
  • Here we create a scalar local to each thread to accumulate partial sums
  • No array, no false sharing
  • Within the parallel region we create a #pragma omp critical region to allow only one thread at a time to enter and add its partial sum into pi.

This critical-section solution will scale much better than the array-based design. Because the above pattern is so often seen in loops, OpenMP provides a reduction clause as well:

double ave = 0.0, A[MAX];
int i;

#pragma omp parallel for reduction(+:ave)
for (i = 0; i < MAX; i++)
{
    ave += A[i];
}
ave = ave/MAX;
  • Other reduction operators are provided as well, including subtraction, multiplication, min, max and the logical and bitwise operators (a sketch using max is given below).
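
For example, a minimal sketch of a max reduction (min and max reductions in C require OpenMP 3.1 or later; the array data and the function name are made up for the example):

#include <omp.h>

#define N 1000

double data[N];                  /* assumed to be filled elsewhere */

double find_max(void)
{
    double maxval = data[0];
    int i;
    #pragma omp parallel for reduction(max:maxval)
    for (i = 1; i < N; i++)
    {
        if (data[i] > maxval)
            maxval = data[i];
    }
    return maxval;
}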

Barriers

#pragma omp parallel
{
    somecode();
    // all threads wait until all threads get here
    #pragma omp barrier
    othercode();
}
  • These are very useful (and sometimes essential) to allow all threads to reach the same place. However, they are expensive in terms of efficiency, since every thread has to wait for the slowest one to reach the barrier before any can proceed.
  • Also, there are times when you don't want the implied barrier at the end of a parallel for loop:
#pragma omp parallel
{
    int id = omp_get_thread_num();
    #pragma omp for nowait
    for (i = 0; i < N; i++)
    {
        B[i] = big_calc2(C, i);
    }
    A[id] = big_calc4(id);    // each thread gets here without waiting, as we have indicated nowait
}

OpenMP pragmas

Each OMP construct below is listed with a description of the concept it covers:

  • #pragma omp parallel: parallel region, teams of threads, structured block, interleaved execution across threads
  • int omp_get_thread_num(), int omp_get_num_threads(): create threads with a parallel region and split up the work using the number of threads and the thread ID
  • double omp_get_wtime(): speedup and Amdahl's law; false sharing and other performance issues
  • setenv OMP_NUM_THREADS N: internal control variables; setting the default number of threads with an environment variable
  • #pragma omp barrier, #pragma omp critical: synchronization and race conditions; revisit interleaved execution
  • #pragma omp for, #pragma omp parallel for: worksharing, parallel loops, loop-carried dependencies
  • reduction(op:list): reductions of values across a team of threads
  • schedule(dynamic [,chunk]), schedule(static [,chunk]): loop schedules, loop overheads and load balance
  • private(list), firstprivate(list), shared(list): data environment
  • nowait: disabling implied barriers on workshare constructs, the high cost of barriers; the flush concept (but not the flush construct)
  • #pragma omp single: worksharing with a single thread
  • #pragma omp task, #pragma omp taskwait: tasks, including the data environment for tasks
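
The table mentions double omp_get_wtime(); as a rough sketch, it can be used to time a parallel region and so estimate speedup:

#include <omp.h>
#include <stdio.h>

int main()
{
    double start = omp_get_wtime();      /* wall-clock time in seconds */

    #pragma omp parallel
    {
        /* ... threaded work goes here ... */
    }

    double elapsed = omp_get_wtime() - start;
    printf("parallel region took %f seconds\n", elapsed);
    return 0;
}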

Performance threading

  • Avoid false sharing; this causes the greatest limitation to scaling
  • Consider whether your program justifies the overhead of threads; if you're parallelising only a small for loop, the overhead cannot be justified and will increase processing time
  • The best choice of schedule might change with the system (a sketch of the schedule clause is given after this list)
  • Minimize synchronisation; use nowait where practical
  • Locality: most systems are NUMA, so modify your loop nest or change loop order to get better cache behaviour
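
As a sketch of the schedule clause mentioned above (the chunk size of 4 and the function work are arbitrary choices for illustration):

#include <omp.h>

#define N 10000

double work(int i);              /* assumed: iterations take varying amounts of time */
double results[N];

void run(void)
{
    int i;
    /* dynamic scheduling hands out chunks of 4 iterations to whichever thread is free,
       which helps load balance when iteration costs vary; static scheduling has lower overhead */
    #pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < N; i++)
    {
        results[i] = work(i);
    }
}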

