FurtherTopics/Advanced Batch Jobs

Other Directives

Mail When Done

Including the directive #SBATCH --mail-user=<your email address> means you will receive an email when your job has finished.

#!/bin/bash
#SBATCH -J jobname                # Job name, you can change it to whatever you want
#SBATCH -n 4                      # Number of cores 
#SBATCH -o %N.%j.out              # Standard output will be written here
#SBATCH -e %N.%j.err              # Standard error will be written here
#SBATCH -p compute                # Slurm partition, where you want the job to be queued 
#SBATCH -t 20:00:00               # Run for 20 hours
#SBATCH --mail-user=<your email address>    # Mail to this address when finished


module purge
module add modulename
 
command
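
Depending on how Slurm is configured on the cluster, you may also need to state which events should trigger an email using the standard --mail-type directive (whether this is required locally is an assumption):

#SBATCH --mail-user=<your email address>    # Mail to this address
#SBATCH --mail-type=END,FAIL                # Send mail when the job finishes or fails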

Time >1 Day

Use #SBATCH -t 0-00:00:00 (D-HH:MM:SS).

#!/bin/bash
#SBATCH -J jobname                # Job name, you can change it to whatever you want
#SBATCH -n 4                      # Number of cores 
#SBATCH -o %N.%j.out              # Standard output will be written here
#SBATCH -e %N.%j.err              # Standard error will be written here
#SBATCH -p compute                # Slurm partition, where you want the job to be queued 
#SBATCH -t 3-20:00:00             # Run for 3 days and 20 hours

 
module purge
module add modulename
 
command
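
Once the job is submitted, the requested time limit can be checked with standard Slurm commands; the filename and job ID below are illustrative:

sbatch longjob.sh                            # Submit the batch script
squeue -l -u $USER                           # Long format output includes the TIME_LIMIT column
scontrol show job 123456 | grep TimeLimit    # Show the time limit of a specific job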

Increase resources

Additional Memory

Use #SBATCH --mem=<amount>G to specify the amount of memory. For jobs needing more than 128GB, you will need to use a high memory node (see the highmem partition below).

#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -n 1                # Number of cores 
#SBATCH -o %N.%j.out        # Standard output will be written here
#SBATCH -e %N.%j.err        # Standard error will be written here
#SBATCH -p compute          # Slurm partition, where you want the job to be queued 
#SBATCH --mem=50G           # 50GB of memory
#SBATCH -t 20:00:00         # Run for 20 hours

 
module purge
module add modulename
 
command
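
To pick a sensible value for --mem, it can help to check how much memory a previous run actually used. Assuming job accounting is enabled on the cluster, sacct can report this (the job ID is illustrative):

sacct -j 123456 --format=JobID,JobName,MaxRSS,Elapsed,State    # MaxRSS is the peak memory used per task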

Additional Nodes

Some parallel jobs require multiple nodes; the number of nodes can be specified using #SBATCH -N <number>.

#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -N 4                # Number of nodes
#SBATCH -o %N.%j.out        # Standard output will be written here
#SBATCH -e %N.%j.err        # Standard error will be written here
#SBATCH -p compute          # Slurm partition, where you want the job to be queued 
#SBATCH -t 20:00:00         # Run for 20 hours


module purge
module add modulename
 
command
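
For the extra nodes to be useful, the work has to be launched across them, typically with MPI. The sketch below assumes an MPI-capable command and 28 cores per node; both are assumptions about the local setup, not requirements:

#!/bin/bash
#SBATCH -J jobname                # Job name, you can change it to whatever you want
#SBATCH -N 4                      # Number of nodes
#SBATCH --ntasks-per-node=28      # MPI tasks per node (28 cores per node is an assumption)
#SBATCH -o %N.%j.out              # Standard output will be written here
#SBATCH -e %N.%j.err              # Standard error will be written here
#SBATCH -p compute                # Slurm partition, where you want the job to be queued
#SBATCH -t 20:00:00               # Run for 20 hours

module purge
module add modulename

srun command                      # srun starts one copy of command per task, across all allocated nodes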

Other Partition Queues

highmem

Standard compute nodes have 128GB of RAM available, while the dedicated high memory nodes have a total of 1TB of RAM. If your job requires more than 128GB of RAM, submit it to the highmem partition.

Use #SBATCH --exclusive for the full 1TB, or #SBATCH --mem=<amount>G for a specific amount of RAM.

#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -n 1                # Number of cores 
#SBATCH -o %N.%j.out        # Standard output will be written here
#SBATCH -e %N.%j.err        # Standard error will be written here
#SBATCH -p highmem          # Slurm partition, where you want the job to be queued 
#SBATCH --mem=500G          # 500GB of memory
#SBATCH -t 3-00:00:00       # Run for 3 days

 
module purge
module add modulename
 
command
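
To see how much memory the nodes in a partition actually provide (and therefore what --mem values are feasible), sinfo can list it per node:

sinfo -p highmem -o "%n %m"       # Node name and memory (in MB) for each node in the highmem partition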

gpu

Use #SBATCH --gres=gpu to use the GPU resource on the node instead of the CPU.

#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -n 1                # Number of cores 
#SBATCH -o %N.%j.out        # Standard output will be written here
#SBATCH -e %N.%j.err        # Standard error will be written here
#SBATCH --gres=gpu          # Use the GPU resource, not the CPU
#SBATCH -p gpu              # Slurm partition, where you want the job to be queued 
#SBATCH -t 00:40:00         # Run for 40 minutes

module purge
module add modulename
 
command
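
If the GPU nodes have more than one GPU, a count can be added to the request, and nvidia-smi can be run inside the job script to confirm what was allocated; the count of 1 below is just an example:

#SBATCH --gres=gpu:1        # Request exactly one GPU on the node

nvidia-smi                  # Lists the GPUs visible to the job, useful for checking the allocation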

Array batch job

Array batch jobs offer a mechanism for submitting and managing collections of similar jobs quickly and easily with one batch file.

  • All tasks are run concurrently by the scheduler and are identified by an environment variable ($SLURM_ARRAY_TASK_ID).
  • All jobs must have the same initial options (e.g. size, time limit, etc.)
  • It is implemented by adding the line #SBATCH --array=... to the job submission file; see the different examples below:


#SBATCH --array=1-10 (typical usage) the same job will be run 10 times, or
#SBATCH --array=1,6,16 specific elements, i.e. 1, 6 and 16, or
#SBATCH --array=1,6,16,51-60 specific elements and a range, or
#SBATCH --array=1-15%4 a range, but limiting the number of simultaneously running tasks to 4.


  • As mentioned, the variable $SLURM_ARRAY_TASK_ID can be used within the batch script; it is replaced by the index of the task, for example as part of the input or data filename.


The following batch file shows how this is implemented and will execute calcFFTs db1.data to calcFFTs db10.data concurrently.

#!/bin/bash
#SBATCH -J jobname          # Job name, you can change it to whatever you want
#SBATCH -n 1                # Number of cores 
#SBATCH -o %N.%A.%a.out     # Standard output will be written here
#SBATCH -e %N.%A.%a.err     # Standard error will be written here
#SBATCH -p compute          # Slurm partition, where you want the job to be queued 
#SBATCH --array=1-10        # Run the job as an array of tasks with indices 1 to 10
 
module purge
module add modulename
 
calcFFTs db$SLURM_ARRAY_TASK_ID.data
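
Submitting this script once creates all ten tasks; in the output filenames, %A expands to the array job ID and %a to the task index. If the inputs are not numbered consecutively, a common pattern is to read the filename for each task from a list (arrayjob.sh and filelist.txt are hypothetical names):

sbatch arrayjob.sh                                      # Prints a single array job ID
squeue -u $USER                                         # Each task appears as <jobid>_<taskid>

INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" filelist.txt)  # Inside the batch script: take line $SLURM_ARRAY_TASK_ID
calcFFTs "$INPUT"                                       # filelist.txt holds one input filename per line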