General/Batch

Introduction

Example batch scripts

Basic batch job

#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -n 1                # Number of cores 
#SBATCH -o %N.%j.out        # Standard output will be written here
#SBATCH -e %N.%j.err        # Standard error will be written here
#SBATCH -p compute          # Slurm partition, where you want the job to be queued 
 
module purge                # Clear any modules inherited from the submitting environment
module add modulename       # Load the module(s) required by your application
 
command                     # Replace with the command you want to run
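
To run any of the example scripts on this page, save the directives and commands to a file and submit it to Slurm with sbatch. A minimal sketch, assuming the script has been saved under the hypothetical name jobscript.sh:

sbatch jobscript.sh         # Submit the script to the Slurm scheduler; prints the new job ID
squeue -u $USER             # List your queued and running jobs

Once the job starts, the files named by the -o and -e directives (here %N.%j.out and %N.%j.err, i.e. node name and job ID) appear in the directory the job was submitted from.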

Exclusive batch job

#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -o %N.%j.out        # Standard output will be written here
#SBATCH -e %N.%j.err        # Standard error will be written here
#SBATCH -p compute          # Slurm partition, where you want the job to be queued 
#SBATCH --exclusive         # Request exclusive access to a node (all 28 cores, 128GB of RAM) 

module purge
module add modulename
 
command

Parallel batch jobs

Intel MPI parallel batch job

#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -n 1                # Number of cores 
#SBATCH -N 1                # Number of nodes 
#SBATCH -o %N.%j.out        # Standard output will be written here
#SBATCH -e %N.%j.err        # Standard error will be written here
#SBATCH -p compute          # Slurm partition, where you want the job to be queued 
 
module purge
module add modulename
 
command

MVAPICH parallel batch job

#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -n 1                # Number of cores 
#SBATCH -N 1                # Number of nodes 
#SBATCH -o %N.%j.out        # Standard output will be written here
#SBATCH -e %N.%j.err        # Standard error will be written here
#SBATCH -p compute          # Slurm partition, where you want the job to be queued 
 
module purge
module add modulename
 
command

OpenMPI parallel batch job

#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -n 1                # Number of cores 
#SBATCH -N 1                # Number of nodes 
#SBATCH -o %N.%j.out        # Standard output will be written here
#SBATCH -e %N.%j.err        # Standard error will be written here
#SBATCH -p compute          # Slurm partition, where you want the job to be queued 
 
module purge
module add modulename
 
command
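
In the parallel examples above, command is a placeholder for the MPI executable. As a general sketch (the exact launcher depends on the MPI module loaded), an MPI program is normally started through a parallel launcher rather than invoked directly, so the final line of the script would typically look like one of the following, where ./my_mpi_program is a hypothetical executable name:

srun ./my_mpi_program       # Launch the tasks through Slurm's own process manager
mpirun ./my_mpi_program     # Launch through the MPI distribution's mpirun

The number of processes started corresponds to the resources requested with the -n (tasks) and -N (nodes) directives.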

High memory batch job

If your task requires more memory than the standard provision (approximately 4GB), then you need to include a directive in your submission script to request the appropriate resource. The standard compute nodes have 128GB of RAM available, and there are dedicated high memory nodes with a total of 1TB of RAM. If your job requires more than 128GB of RAM, submit it to the highmem partition.

The following job submission script runs on the highmem partition and uses the #SBATCH --mem flag to request 500GB of RAM, which has three effects:

  • The job will only be allocated to a node with this much memory available
  • No other jobs will be allocated to this node unless their memory requirements fit within the remaining available memory
  • If the job exceeds this requested value, the task will terminate
#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -n 1                # Number of cores 
#SBATCH -o %N.%j.out        # Standard output will be written here
#SBATCH -e %N.%j.err        # Standard error will be written here
#SBATCH -p highmem          # Slurm partition, where you want the job to be queued 
#SBATCH --mem=500G          # Request 500GB of RAM for the job
 
module purge
module add modulename
 
command

If a job exceeds the requested amount of memory, it will terminate with an error message similar to the following (this example is from a job which ran with a memory limit of 2GB):

slurmstepd: Step 307110.0 exceeded memory limit (23933492 > 2097152), being killed
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: got SIGCONT
slurmstepd: Exceeded job memory limit
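
To check how much memory a finished job actually used, the Slurm accounting database can be queried with sacct (assuming job accounting is enabled on this cluster). Using the job ID from the message above as an example:

sacct -j 307110 --format=JobID,MaxRSS,ReqMem,State

The MaxRSS column reports the peak resident memory of each job step, which is a useful guide for choosing a --mem value for future submissions.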

GPU batch job

#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -n 1                # Number of cores 
#SBATCH -o %N.%j.out        # Standard output will be written here
#SBATCH -e %N.%j.err        # Standard error will be written here
#SBATCH -p compute          # Slurm partition, where you want the job to be queued 
 
module purge
module add modulename
 
command
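
Note that the script above does not itself request a GPU. On Slurm systems, GPUs are normally requested as a generic resource with the --gres directive. The sketch below assumes one GPU per job and a partition named gpu; both the partition name and the GPU count should be checked against this cluster's configuration:

#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -n 1                # Number of cores 
#SBATCH -o %N.%j.out        # Standard output will be written here
#SBATCH -e %N.%j.err        # Standard error will be written here
#SBATCH -p gpu              # Assumed name of the GPU partition; confirm locally
#SBATCH --gres=gpu:1        # Request one GPU as a generic resource
 
module purge
module add modulename
 
command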

Array batch job

An array batch job allows multiple jobs with identical parameters to be executed from a single job submission. By using the directive #SBATCH --array=1-10, the same job will be run 10 times. The index specification identifies which array index values should be used. Multiple values may be specified using a comma-separated list and/or a range of values with a "-" separator. For example, "--array=0-15" or "--array=0,6,16-32".

A step function can also be specified with a suffix containing a colon and number. For example, "--array=0-15:4" is equivalent to "--array=0,4,8,12". A maximum number of simultaneously running tasks from the job array may be specified using a "%" separator. For example "--array=0-15%4" will limit the number of simultaneously running tasks from this job array to 4.

The variable $SLURM_ARRAY_TASK_ID can be used within the batch script; it is replaced by the array index of each task and can be used, for example, to construct input or data filenames.

When the batch script below is submitted, 10 jobs will run, resulting in the command being run with its first argument set to the array index of each task, for instance: command 1, command 2, through to command 10. The output of each task is logged to its own out and err files, named <node job ran on>.<job ID>.<array index>.out and <node job ran on>.<job ID>.<array index>.err

#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -n 1                # Number of cores 
#SBATCH -o %N.%A.%a.out     # Standard output will be written here
#SBATCH -e %N.%A.%a.err     # Standard error will be written here
#SBATCH -p compute          # Slurm partition, where you want the job to be queued 
#SBATCH --array=1-10        # Run 10 copies of this job with array indices 1 to 10
 
module purge
module add modulename
 
command $SLURM_ARRAY_TASK_ID
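
As noted above, the array index is often used to select a per-task input file rather than being passed directly as an argument. A brief sketch, assuming hypothetical input files named input_1.dat through input_10.dat in the submission directory:

command input_${SLURM_ARRAY_TASK_ID}.dat

Each of the 10 array tasks then processes its own input file while sharing the same batch script and Slurm directives.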