Difference between revisions of "Quickstart/Slurm"

Revision as of 15:24, 9 November 2022

Introduction

One of the core components of an HPC cluster such as Viper is the job scheduler. The basic task of the job scheduler is to manage the allocation of tasks to compute nodes. On Viper we use the Slurm workload manager (the name originally stood for Simple Linux Utility for Resource Management), which is one of the most common schedulers used on supercomputers across the world.

The key things to know about the Slurm scheduler are:

It has queues (or partitions, as Slurm calls them) for each dedicated resource type, i.e. standard compute, high memory or GPU.

It manages the queue of pending jobs for efficient and fair scheduling, and allocates compute resources (i.e. nodes and cores) to jobs.

It manages the execution and monitoring of tasks on the compute nodes.

It works on a (mainly) first come, first served basis, so accurate job requests help with efficient scheduling.

When you want to run a task, you need to tell the scheduler what resources you need and what you want to do.
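As a rough illustration of such a request, the sketch below is a minimal batch script. The #SBATCH options are standard Slurm directives, but the job name, the one-hour time limit and the choice of 28 tasks (one per core of a standard compute node) are assumptions made for this example, not Viper recommendations.

```shell
#!/bin/bash
# What resources you need (illustrative values, adjust to your own job):
#SBATCH --job-name=example
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks=28
#SBATCH --time=01:00:00

# What you want to do: here we only report where the job is running.
# SLURM_JOB_ID and SLURM_JOB_PARTITION are set by Slurm inside a job;
# the fallback "unknown" is used when the script is run outside Slurm.
summary="Job ${SLURM_JOB_ID:-unknown} in partition ${SLURM_JOB_PARTITION:-unknown}"
echo "$summary"
```

A script of this kind is normally handed to the scheduler with the sbatch command rather than run directly.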

Slurm Queues

As mentioned above, we use Slurm queues to manage access to the different node types. On Viper we have the following main queues:

Queue Name	Description
compute	The standard compute nodes that make up the majority of Viper's compute resource. Each standard compute node has 28 compute cores and 128GB of memory. Most standard use cases will make use of this queue.
highmem	High memory nodes for memory-intensive tasks. Each high memory node has 40 compute cores and 1TB of memory.
gpu	Nodes with GPU accelerators, useful for specific use cases such as machine learning, where tasks can see a significant performance benefit over standard CPUs.
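To see the current state of these queues, the standard Slurm sinfo command can be used. The sketch below assumes the three queue names from the table above; the helper name list_viper_queues is made up for this example and is not a Viper command.

```shell
# Print a one-line summary (node counts by state) for each main queue.
# The queue names are taken from the table above.
list_viper_queues() {
    for queue in compute highmem gpu; do
        sinfo --summarize --partition="$queue"
    done
}

# Only call it where Slurm is actually installed (e.g. on the login node).
if command -v sinfo >/dev/null 2>&1; then
    list_viper_queues
fi
```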

Introduction to Slurm Commands

squeue

One of the most common Slurm commands you will run is squeue, which shows information about jobs in the scheduling queue: a list of all jobs or tasks that are running or waiting to run on Viper, with details of each.

[username@login01 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            306414   compute  clasDFT   465449  R      16:36      1 c006
            306413   compute mpi_benc   465449  R      31:02      2 c[005,007]
            306411   compute  orca_1n   465449  R    1:00:32      1 c004
            306410   compute  orca_1n   465449  R    1:04:17      1 c003
            306409   highmem cnv_obit   465449  R   11:37:17      1 c232
            306407   compute  20M4_20   465449  R   11:45:54      1 c012
            306406   compute 20_ML_20   465449  R   11:55:40      1 c012

Heading	Description
JOBID	A unique identifier assigned to a job. If you have an issue with a task on Viper, this unique number will help us identify and get information on the task.
PARTITION	The queue or partition the task is running on, which indicates the type of node being used, e.g. compute, highmem, gpu
NAME	Name of job
USER	User ID of job owner
ST	Job state code e.g. R stands for 'Running', PD stands for 'Pending' (waiting to run)
TIME	Length of time a job has been running
NODES	Number of nodes a job is running on
NODELIST(REASON)	List of nodes a job is running on. For a job that is not yet running, this column instead gives the reason it is waiting, e.g. Resources or Dependency.
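The full squeue listing can be long, so it is often filtered or summarised. The sketch below uses the standard squeue options --partition and --states, plus a small helper that counts jobs per partition and state. The helper name count_jobs is made up for this example, and it assumes the default column order shown above and job names that contain no spaces.

```shell
# Read squeue output on stdin and print "partition state count" lines.
# Column 2 is PARTITION and column 5 is ST in the default squeue layout;
# the first line (the header) is skipped.
count_jobs() {
    awk 'NR > 1 && NF >= 5 { n[$2 " " $5]++ }
         END { for (k in n) print k, n[k] }' | sort
}

# Only call squeue where Slurm is actually installed.
if command -v squeue >/dev/null 2>&1; then
    # Jobs waiting to run in the compute queue.
    squeue --partition=compute --states=PENDING
    # Summary of everything in the queue.
    squeue | count_jobs
fi
```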

squeueme

The squeueme command provides similar information to squeue, but it shows only your own jobs, making it easier to identify what you have running. It also provides more specific information about your jobs:

JOBID           PARTITION  NAME             STATE    START_TIME  TIME_LEFT    CPUS/NODES  NODELIST(REASON)
3619304         compute    KHIrestart       PENDING  N/A         2-00:00:00   28/1        (Resources)

Heading	Description
JOBID	The unique identifier assigned to a job.
PARTITION	The queue or partition the task is running on, which indicates the type of node being used, e.g. compute, highmem, gpu
NAME	Name of job
STATE	The job state, e.g. Running, Pending (waiting to run)
START_TIME	Prediction of when a pending job is likely to start
TIME_LEFT	How long a job has left to run (based on the time requested from Slurm, not on the progress of the work itself)
CPUS/NODES	The number of CPU cores and number of nodes the job is running on
NODELIST(REASON)	List of nodes a job is running on. For a job that is not yet running, this column instead gives the reason it is waiting, e.g. Resources or Dependency.
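squeueme is specific to Viper and its implementation is not shown here. On a system without it, a broadly similar listing can be approximated with standard squeue options, as in the sketch below. The helper name my_jobs and the column widths are assumptions made for this example; the format codes themselves (%i, %P, %j, %T, %S, %L, %C, %D, %R) are standard squeue fields.

```shell
# List one user's jobs with state, expected start time, time left,
# CPUs/nodes and node list or pending reason.
# Defaults to the current user when no argument is given.
my_jobs() {
    user="${1:-${USER:-$(id -un)}}"
    squeue --user="$user" \
           --format="%.10i %.10P %.16j %.8T %.20S %.12L %C/%D %R"
}

# Only call it where Slurm is actually installed.
if command -v squeue >/dev/null 2>&1; then
    my_jobs
fi
```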

scancel

scancel is used to cancel running or pending jobs. Only jobs submitted under your user ID may be cancelled. No output is given by this command, but you can run squeueme to check that the job has gone.

[username@login01 ~]$ scancel 289535
[username@login01 ~]$
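Because scancel prints nothing, the cancel and the follow-up check can be combined. The sketch below is one possible way to do this with standard squeue options rather than squeueme; the helper name cancel_and_check is made up for this example. A cancelled job may remain listed for a short time while it finishes cleaning up.

```shell
# Cancel a job, then report whether it still appears among your jobs.
cancel_and_check() {
    job_id="$1"
    scancel "$job_id" || return 1
    # --noheader and --format="%i" print just the job IDs, one per line;
    # grep -x requires the whole line to match the job ID.
    if squeue --noheader --user="${USER:-$(id -un)}" --format="%i" |
            grep -qx "$job_id"; then
        echo "Job $job_id is still listed (it may take a moment to be removed)"
    else
        echo "Job $job_id is no longer in the queue"
    fi
}
```

For example, cancel_and_check 289535 would cancel the job shown above and then report whether it is still in the queue.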

interactive

The interactive command, available on the Viper login node, starts an interactive session on a compute node, allowing you to start work.
