Difference between revisions of "Quickstart/Slurm"

Latest revision as of 11:54, 17 November 2022

Introduction

One of the core components of an HPC cluster such as Viper is the job scheduler. The basic task of the job scheduler is to manage the allocation of tasks to compute nodes. On Viper we use the Slurm workload manager (originally the Simple Linux Utility for Resource Management), which is one of the most common schedulers used on supercomputers across the world.

The key things to know about the Slurm scheduler are:

It has queues (or partitions, as Slurm calls them) for each dedicated resource type, i.e. standard compute, high memory or GPU

It manages the queue of pending jobs for efficient and fair scheduling, and allocates compute resources (i.e. nodes and cores) to jobs

It manages the execution and monitoring of tasks on the compute nodes

It works on a (mainly) first come, first served basis, so accurate job requests help it schedule efficiently

When you want to run a task, you need to tell the scheduler what resources you need and what you want to do
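A resource request is normally written as #SBATCH directives at the top of a batch script, which is then submitted with sbatch. The sketch below is illustrative only: the job name, time limit and final command are assumptions rather than Viper defaults, while the queue name and core count are taken from the compute queue described under Slurm Queues.

```shell
#!/bin/bash
#SBATCH -J example_job        # job name (assumed), shown in the NAME column of squeue
#SBATCH -p compute            # queue (partition) to run in
#SBATCH -N 1                  # number of nodes
#SBATCH --ntasks-per-node=28  # one task per core on a standard compute node
#SBATCH --time=00:10:00       # wall-clock limit, hh:mm:ss (assumed value)

# What you want to do: here, just report where the job ran.
# SLURM_JOB_ID is set by Slurm inside a job; outside one it falls back to "unset".
message="Job ${SLURM_JOB_ID:-unset} running on $(uname -n)"
echo "$message"
```

The #SBATCH lines are ordinary comments to the shell, so the script can also be run directly as a quick syntax check before it is submitted.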

Slurm Queues

As mentioned above, we use Slurm queues to manage access to the different node types. On Viper we have the following main queues:

Queue Name	Description
compute	The standard compute nodes that make up the majority of Viper's compute resource. Each standard compute node has 28 compute cores and 128GB of memory. Most standard use cases will make use of this queue.
highmem	High memory nodes for any task that is more memory intensive, with nodes that have 40 compute cores and 1TB of memory.
gpu	Nodes with GPU accelerators, useful for specific use cases such as machine learning, where tasks can see a significant performance benefit over running on standard CPUs.

Introduction to Slurm Commands

squeue

One of the most common Slurm commands you will run is squeue, which shows information about jobs in the scheduling queue: every job that is running or waiting to run on Viper, with details such as its queue, owner, state and run time.

[username@login01 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            306414   compute  clasDFT   749094  R      16:36      1 c006
            306413   compute mpi_benc   749094  R      31:02      2 c[005,007]
            306411   compute  orca_1n   465449  R    1:00:32      1 c004
            306410   compute  orca_1n   465449  R    1:04:17      1 c003
            306409   highmem cnv_obit   465449  R   11:37:17      1 c232
            306407   compute  20M4_20   465449  R   11:45:54      1 c012
            306406   compute 20_ML_20   465449  R   11:55:40      1 c012

Heading	Description
JOBID	A unique identifier assigned to a job. If you have an issue with a task on Viper, this unique number will help us identify and get information on the task.
PARTITION	The queue or partition the task is running on, which indicates the type of node being used, e.g. compute, highmem, gpu
NAME	Name of job
USER	User ID of job owner
ST	Job state code e.g. R stands for 'Running', PD stands for 'Pending' (waiting to run)
TIME	Length of time a job has been running
NODES	Number of nodes a job is running on
NODELIST(REASON)	List of nodes a job is running on. For a job that is not running, this column instead gives the reason in brackets, e.g. it is waiting on a dependency.
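Because squeue prints one job per line with the columns described above, its output is easy to summarise with standard tools. The sketch below counts jobs per user from a saved copy of the listing above; sample_squeue and jobs_per_user are hypothetical helper names used only for illustration.

```shell
# Hypothetical stand-in for real squeue output (copied from the listing above).
sample_squeue() {
cat <<'EOF'
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
306414   compute  clasDFT   749094  R      16:36      1 c006
306413   compute mpi_benc   749094  R      31:02      2 c[005,007]
306411   compute  orca_1n   465449  R    1:00:32      1 c004
306410   compute  orca_1n   465449  R    1:04:17      1 c003
306409   highmem cnv_obit   465449  R   11:37:17      1 c232
306407   compute  20M4_20   465449  R   11:45:54      1 c012
306406   compute 20_ML_20   465449  R   11:55:40      1 c012
EOF
}

# Count jobs per user: skip the header line (NR > 1) and tally column 4 (USER).
jobs_per_user() {
    awk 'NR > 1 { count[$4]++ } END { for (u in count) print u, count[u] }' | sort
}

sample_squeue | jobs_per_user
```

For the sample listing this prints "465449 5" and "749094 2". On the cluster, the same summary would come from piping the real command into the helper: squeue | jobs_per_user.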

squeue -u username

Adding the -u flag followed by your username lists only your own jobs.

[749094@login01 ~]$ squeue -u 749094
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            306414   compute  clasDFT   749094  R      16:36      1 c006
            306413   compute mpi_benc   749094  R      31:02      2 c[005,007]
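To avoid typing your username each time, the call can be wrapped in a small shell function. This is only a sketch: myjobs is a hypothetical name, and it assumes $USER holds your Viper username, as it normally does in a login shell.

```shell
# myjobs: list jobs for the given user, defaulting to the current login ($USER).
myjobs() {
    squeue -u "${1:-$USER}"
}
```

Running myjobs with no argument lists your own jobs; myjobs 465449 lists those of user 465449.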

scancel

scancel is used to cancel your jobs; pass it the JOBID of the job to cancel. You can only cancel jobs running under your own user ID. The command gives no output, but you can run squeue -u username to check that the job has gone.

[username@login01 ~]$ scancel 289535
[username@login01 ~]$
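Since scancel prints nothing, cancelling and checking can be combined into one step. The following is a minimal sketch: cancel_and_check is a hypothetical helper name, and it assumes $USER holds your Viper username. A cancelled job can take a moment to leave the queue, so a "still listed" result straight after cancelling is not necessarily an error.

```shell
# cancel_and_check JOBID: cancel a job, then report whether it still
# appears in your own queue listing.
cancel_and_check() {
    jobid=$1
    scancel "$jobid" || return 1
    # -h suppresses the header so only job lines are searched; column 1 is JOBID.
    if squeue -h -u "$USER" | awk -v id="$jobid" '$1 == id { found = 1 } END { exit !found }'; then
        echo "job $jobid is still listed"
    else
        echo "job $jobid is no longer listed"
    fi
}
```

For example, cancel_and_check 289535 cancels job 289535 and then reports whether it is still in your listing.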

interactive

The interactive command, available on the Viper login node, starts an interactive session on a compute node so that you can begin working there directly.


@@ Line 40: / Line 40: @@
 [username@login01 ~]$ squeue
               JOBID  PARTITION     NAME   USER   ST      TIME  NODES NODELIST(REASON)
-   compute  clasDFT   465449 R      16:36      1 c006
+   compute  clasDFT   749094 R      16:36      1 c006
-   compute mpi_benc   465449 R      31:02      2 c[005,007]
+   compute mpi_benc   749094 R      31:02      2 c[005,007]
    compute  orca_1n   465449 R    1:00:32      1 c004
    compute  orca_1n   465449 R    1:04:17      1 c003
@@ Line 77: / Line 77: @@
 |}
-=== squeueme ===
+==== squeue -u username====
+Using the flag -u and your username will provide only your jobs.
+<pre style="background-color: #000000; color: white; border: 2px solid black; font-family: monospace, sans-serif;">
+[749094@login01 ~]$ squeue -u 749094
+             JOBID  PARTITION     NAME   USER   ST      TIME  NODES NODELIST(REASON)
+   compute  clasDFT   749094 R      16:36      1 c006
+   compute mpi_benc   749094 R      31:02      2 c[005,007]
+</pre>
+<!-- === squeueme ===
 The '''squeueme''' command provides similar information as squeue however it shows only jobs you are running making it easier to identify what you have running, and provides more specific information relating to your jobs:
@@ Line 112: / Line 120: @@
 | List of nodes a job is running on also provides a reason a job is not running e.g. a dependency on a node.
 |}
+-->
 ===scancel===
 scancel is used to cancel your jobs. Only jobs running under your userid may be cancelled. No output is given by this command, but you can check squeueme to see if it has gone.