OpenHPC


Introduction

As part of the Viper upgrade, several parts of the software stack are being upgraded:

  • The Operating System is being upgraded, which will provide improvements in terms of security but will also allow improved functionality (e.g. container support, see Using Containers below)
  • Slurm is having a version upgrade to provide functionality, stability and security improvements

Resource Requirements

In order to reduce issues with tasks using more resource (CPU or RAM) than they should, Slurm now uses Linux cgroups to monitor the resource use of tasks more closely and will limit (CPU) or terminate (RAM) tasks exceeding what they should use. In particular, this means it is important to request the correct amount of RAM needed for a task.

Example memory requests (standard compute):

  • Default (no specific memory request): approx 4GB (i.e. 128 GB RAM / 28 cores)
  • #SBATCH --mem=40G: 40GB (i.e. a specific amount)
  • #SBATCH --exclusive: 128GB (i.e. the whole node's 128GB for exclusive use)
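
As an illustration, a minimal job script requesting a specific amount of memory might look like the following (the job name and application here are placeholders):

#!/bin/bash
#SBATCH -J memory-example
#SBATCH -p compute
#SBATCH -n 1
#SBATCH --mem=40G

# Placeholder application; if it uses more than the 40GB requested
# above, it will be terminated by the cgroup out-of-memory handler
# (see the error message below).
./my_application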

If a task is terminated due to exceeding the requested amount of memory, you should see a message in your Slurm error log file, such as:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=319.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
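
To see how much memory a finished job actually used, and so pick a sensible --mem value, sacct can report the peak memory of each job step. A sketch, using the job number from the message above and standard sacct format fields:

$ sacct -j 319 --format=JobID,JobName,ReqMem,MaxRSS,State

Here ReqMem is the memory that was requested and MaxRSS is the peak memory actually used by each step.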

Job Emails

It is now possible to get email alerts when certain event types occur, using Slurm's built-in --mail-type SBATCH directive.

The most commonly used valid type values are as follows (multiple type values may be specified in a comma separated list):

  • NONE (the default if you don't set --mail-type)
  • BEGIN
  • END
  • FAIL
  • REQUEUE
  • ALL (equivalent to BEGIN, END, FAIL, INVALID_DEPEND, REQUEUE, and STAGE_OUT)
  • INVALID_DEPEND (dependency never satisfied)
  • TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80 percent of time limit), TIME_LIMIT_50 (reached 50 percent of time limit)
  • ARRAY_TASKS (sends emails for each array task; otherwise job BEGIN, END and FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array)

The user to be notified is indicated with --mail-user; however, only @hull.ac.uk email addresses are valid.

If you want to be alerted when your task completes, it is advised to use #SBATCH --mail-type=END,FAIL so that you are notified whether the job finishes cleanly or ends due to an error, as in the example directives below.
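
For example, the following directives (the email address is a placeholder) will send a single message when the job ends, whether it completed successfully or failed:

#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<your username>@hull.ac.uk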

An example of a completion email is shown below. [Image: SlurmEmail.png]

Slurm Information

It is now possible to view the job submission script that was used to submit a job. This is done using sacct -B -j <jobnumber>, e.g.:

$ sacct -B -j 317
Batch Script for 317
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -J jobsubmissionfile
#SBATCH -n 1
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.out
#SBATCH -p compute
#SBATCH --exclusive
#SBATCH --time=1-00:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<your Hull email address>

echo "This is my submission script"
sleep 10

Using Containers

Viper now supports the use of Containers, via Apptainer (previously called Singularity).

Containers can be built to contain everything needed to run an application or code, independent of the Operating System. This means application/code, libraries, scripts or even data. This makes them ideal for reproducible science, as an experiment and its data can be distributed as a single file. Generally there is negligible, if any, performance impact when running in a container compared to bare metal. Containers are becoming an increasingly popular way of distributing applications and workflows.

Docker is perhaps the best known container platform, but Docker has security implications that mean it isn't suitable for a shared user environment like HPC. Apptainer doesn't have the same security issues; indeed "[Singularity] is optimized for compute focused enterprise and HPC workloads, allowing untrusted users to run untrusted containers in a trusted way". Containers are built from a recipe file which includes all the required components to support the workflow, and the recipe file can be based on an existing Docker container, allowing many existing Docker containers to be used with Apptainer (Singularity).
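
As a sketch of how this works in practice (the python:3.9 image is just an illustrative example), Apptainer/Singularity can pull an image directly from Docker Hub, converting it to a local .sif container file which can then be used like any other container:

$ singularity pull python_3.9.sif docker://python:3.9
$ singularity exec python_3.9.sif python3 --version

Below is an example Slurm job script which runs the MaxQuant application from an existing container available on Viper: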

#!/bin/bash
#SBATCH -J Sing-maxquant
#SBATCH -p gpu
#SBATCH -o %J.out
#SBATCH -e %J.err
#SBATCH --time=20:00:00
#SBATCH --exclusive

module add test-modules singularity/3.5.3/gcc-8.2.0

singularity exec /home/ViperAppsFiles/singularity/containers/maxquant.sif mono MaxQuant/bin/MaxQuantCmd.exe mqpar.xml

Singularity/Apptainer command example (breakdown of the singularity line above):

  • singularity: run Apptainer (Singularity)
  • exec: tell Apptainer (Singularity) to execute a command (rather than, for example, run a shell)
  • /home/ViperAppsFiles/singularity/containers/maxquant.sif: the container Apptainer (Singularity) should use
  • mono MaxQuant/bin/MaxQuantCmd.exe mqpar.xml: the command to run; in this case the command is mono, with the dotnet application MaxQuantCmd.exe and the experiment configuration file mqpar.xml
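
If you want to explore a container interactively rather than run a single command, the shell subcommand can be used in place of exec, for example (using the same container as above):

$ singularity shell /home/ViperAppsFiles/singularity/containers/maxquant.sif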