Quickstart/Slurm
Introduction
One of the core components of HPC clusters such as Viper is the job scheduler. The basic task of the job scheduler is to manage the allocation of tasks to compute nodes. On Viper we use the SLURM (Simple Linux Utility for Resource Management) workload manager, which is one of the most common schedulers used on supercomputers across the world.
The key things with the Slurm scheduler are:
- It has queues (or partitions as they are referred to) for each dedicated resource type, i.e. standard compute, high memory or GPU
- Manages the queue of pending jobs for efficient and fair scheduling, and allocates compute resources (i.e. nodes and cores) to jobs
- Manages the execution and monitoring of tasks on the compute nodes
- Slurm works on a (mainly) first-come, first-served basis – accurate job requests help with efficient scheduling
- When you want to run a task you need to tell the scheduler what resources you need and what you want to do (see the example request below)
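Resource requests are normally made in a batch script submitted with sbatch. The sketch below is a minimal illustration only; the job name, module and application command are hypothetical and would be replaced with your own:

#!/bin/bash
#SBATCH --job-name=example_job      # name shown in squeue (hypothetical)
#SBATCH --partition=compute         # queue to use: compute, highmem or gpu
#SBATCH --nodes=1                   # number of nodes requested
#SBATCH --ntasks-per-node=28        # cores per node (a standard compute node has 28)
#SBATCH --time=02:00:00             # wall-time limit (hh:mm:ss)

module load example/1.0             # hypothetical module name
example_application input.dat       # hypothetical command to run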
Slurm Queues
As mentioned above, we use Slurm queues as a way of managing access to the different node types. On Viper we have the following main queues:
Queue Name | Description
compute | The standard compute nodes that make up the majority of Viper's compute resource. Each standard compute node has 28 compute cores and 128GB of memory. Most standard use cases will make use of this queue.
highmem | High memory nodes for any task that is more memory intensive, with nodes that have 40 compute cores and 1TB of memory.
gpu | Nodes with GPU accelerators, useful for specific use cases such as machine learning where the tasks can see a significant performance benefit over using standard CPU.
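To see the queues (partitions) available and the state of their nodes, the standard Slurm sinfo command can be used; these are general Slurm options rather than anything Viper-specific:

[username@login01 ~]$ sinfo -s        # summary of each partition and its node counts
[username@login01 ~]$ sinfo -p gpu    # show only the gpu partition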
Introduction to Slurm Commands
squeue
One of the most common Slurm commands you will run is squeue, which shows information about the jobs in the scheduling queue – a list of all jobs or tasks running or waiting to run on Viper, with details of each.
[username@login01 ~]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 306414   compute  clasDFT   465449  R      16:36      1 c006
 306413   compute mpi_benc   465449  R      31:02      2 c[005,007]
 306411   compute  orca_1n   465449  R    1:00:32      1 c004
 306410   compute  orca_1n   465449  R    1:04:17      1 c003
 306409   highmem cnv_obit   465449  R   11:37:17      1 c232
 306407   compute  20M4_20   465449  R   11:45:54      1 c012
 306406   compute 20_ML_20   465449  R   11:55:40      1 c012
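squeue output can also be filtered, which is often more useful than viewing the full list. For example (standard Slurm options; replace the username with your own):

[username@login01 ~]$ squeue -u username       # show only jobs belonging to a given user
[username@login01 ~]$ squeue -p gpu            # show only jobs in the gpu partition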
Heading | Description
JOBID | A unique identifier assigned to a job. If you have an issue with a task on Viper, this unique number will help us identify and get information on the task.
PARTITION | The queue or partition the task is running on, which indicates the type of node being used, e.g. compute, highmem, gpu
NAME | Name of the job
USER | User ID of the job owner
ST | Job state code, e.g. R stands for 'Running', PD stands for 'Pending' (waiting to run)
TIME | Length of time the job has been running
NODES | Number of nodes the job is running on
NODELIST(REASON) | List of nodes the job is running on, or, for a job that has not started, the reason it is not yet running (e.g. waiting for resources or a dependency)
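For example, to see only pending jobs, or to get a longer listing that also includes each job's time limit, the following standard squeue options can be used:

[username@login01 ~]$ squeue --states=PENDING   # list only jobs in the Pending state
[username@login01 ~]$ squeue -l                 # long format, includes each job's time limit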
squeueme
The squeueme command provides similar information to squeue, but lists only your own jobs, making it easier to see what you have running, and it provides more specific information about those jobs:
  JOBID PARTITION       NAME    STATE      START_TIME   TIME_LEFT CPUS/NODES NODELIST(REASON)
3619304   compute KHIrestart  PENDING             N/A  2-00:00:00       28/1 (Resources)
Heading | Description
JOBID | The unique identifier assigned to a job.
PARTITION | The queue or partition the task is running on, which indicates the type of node being used, e.g. compute, highmem, gpu
NAME | Name of the job
STATE | The job state, e.g. Running, Pending (waiting to run)
START_TIME | An estimate of when a pending job is likely to start
TIME_LEFT | How long the job has left before it reaches its requested time limit (based on the Slurm allocation, not on when the work will actually finish)
CPUS/NODES | The number of CPU cores and the number of nodes allocated to the job
NODELIST(REASON) | List of nodes the job is running on, or, for a job that has not started, the reason it is not yet running (e.g. waiting for resources or a dependency)
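squeueme is a convenience command provided on Viper; similar output can be produced with standard squeue options, for example with the --format option. The exact columns squeueme uses are an assumption here, but a comparable listing of your own jobs would be:

[username@login01 ~]$ squeue -u $USER --format="%.8i %.9P %.10j %.8T %.19S %.11L %C/%D %R"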
scancel
scancel is used to cancel jobs that are currently queued or running. Only jobs running under your own user ID may be cancelled. The command gives no output, but you can check squeueme to confirm the job has gone.
[username@login01 ~]$ scancel 289535
[username@login01 ~]$
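scancel also accepts standard Slurm filters, which can be handy when you have several jobs to cancel (the job name below is hypothetical):

[username@login01 ~]$ scancel -u $USER                  # cancel all of your jobs
[username@login01 ~]$ scancel --name=example_job        # cancel jobs with a given name
[username@login01 ~]$ scancel -t PENDING -u $USER       # cancel only your pending jobs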
interactive
The interactive command, available on the Viper login node, will start an interactive session on a compute node, allowing you to start work.
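If you prefer to use standard Slurm directly, an interactive shell on a compute node can also be requested with srun; the resource values below are illustrative only, and the interactive command may use different defaults:

[username@login01 ~]$ srun -p compute --nodes=1 --ntasks=1 --time=01:00:00 --pty /bin/bash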