Quickstart/Slurm

Introduction

One of the core components of an HPC cluster such as Viper is the job scheduler. The basic task of the job scheduler is to manage the allocation of tasks to compute nodes. On Viper we use the Slurm (Simple Linux Utility for Resource Management) workload manager, which is one of the most common schedulers used on supercomputers across the world.

The key points about the Slurm scheduler are:

  • It has queues (referred to as partitions) for each dedicated resource type, i.e. standard compute, high memory or GPU
  • It manages the queue of pending jobs for efficient and fair scheduling, and allocates compute resources (i.e. nodes and cores) to jobs
  • It manages the execution and monitoring of tasks on the compute nodes
  • It works on a (mainly) first come, first served basis, so accurate job requests help it schedule efficiently
  • When you want to run a task, you need to tell the scheduler what resources you need and what you want to do
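
As a minimal sketch of what telling the scheduler "what resources you need and what you want to do" can look like, the script below uses #SBATCH comment lines to request resources and ordinary shell commands for the work itself. The job name, time limit and output file name are illustrative assumptions, not Viper defaults; the partition name compute is the standard queue described on this page.

```shell
#!/bin/bash
# Name shown in the NAME column of squeue (illustrative).
#SBATCH --job-name=example
# Queue (partition) to run in.
#SBATCH --partition=compute
# Resources: one node, one task.
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Wall-clock limit (hh:mm:ss); an accurate value helps scheduling.
#SBATCH --time=00:10:00
# Output file; %j is replaced by the job ID.
#SBATCH --output=example-%j.out

# What to do: Slurm sets SLURM_JOB_ID inside a job; fall back to a
# placeholder so the script also runs outside the scheduler.
job_label="${SLURM_JOB_ID:-not-under-slurm}"
echo "Job $job_label running on $(uname -n)"
```

A script like this is normally submitted with sbatch. Run directly in a shell, the #SBATCH lines are treated as plain comments and only the commands at the end execute.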

Slurm Queues

As mentioned above, we use Slurm queues to manage access to the different node types. On Viper we have the following main queues:

Queue Name | Description
compute | The standard compute nodes that make up the majority of Viper's compute resource. Each standard compute node has 28 compute cores and 128GB of memory. Most standard use cases will make use of this queue.
highmem | High memory nodes for any task that is more memory intensive, with nodes that have 40 compute cores and 1TB of memory.
gpu | Nodes with GPU accelerators, useful for specific use cases such as machine learning where the tasks can see a significant performance benefit over using standard CPUs.
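
To see how these queues look on a live system, the standard Slurm sinfo command can summarise a partition. The wrapper below is a hypothetical helper (partition_summary is not a Viper-provided command) and assumes the partition names listed in the table above.

```shell
# Hypothetical helper: summarise node availability for one partition.
# --summarize prints one line per partition; --partition selects which one.
partition_summary() {
    sinfo --summarize --partition="${1:-compute}"
}

# Only call Slurm when it is actually installed (e.g. on a login node).
if command -v sinfo >/dev/null 2>&1; then
    partition_summary highmem
fi
```

Called without an argument, the helper defaults to the compute queue.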

Introduction to Slurm Commands

squeue

One of the most common Slurm commands you will run is squeue, which shows information about the jobs in the scheduling queue, i.e. a list of all jobs currently running or waiting to run on Viper.

[username@login01 ~]$ squeue
             JOBID  PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            306414    compute  clasDFT   749094  R      16:36      1 c006
            306413    compute mpi_benc   749094  R      31:02      2 c[005,007]
            306411    compute  orca_1n   465449  R    1:00:32      1 c004
            306410    compute  orca_1n   465449  R    1:04:17      1 c003
            306409    highmem cnv_obit   465449  R   11:37:17      1 c232
            306407    compute  20M4_20   465449  R   11:45:54      1 c012
            306406    compute 20_ML_20   465449  R   11:55:40      1 c012

Heading | Description
JOBID | A unique identifier assigned to a job. If you have an issue with a task on Viper, this unique number will help us identify the task and get information on it.
PARTITION | The queue or partition the task is running on, which indicates the type of node being used, e.g. compute, highmem, gpu
NAME | Name of the job
USER | User ID of the job owner
ST | Job state code, e.g. R stands for 'Running', PD stands for 'Pending' (waiting to run)
TIME | Length of time the job has been running
NODES | Number of nodes the job is running on
NODELIST(REASON) | List of nodes the job is running on. For a job that is not running, this column instead gives the reason, e.g. a dependency on a node.
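
Because squeue prints plain whitespace-separated columns, its output can be summarised with standard tools. The sketch below counts jobs per partition and state using awk. The function names are hypothetical, and the sample data is simply the listing shown above, embedded so the example runs without Slurm; on Viper you would pipe the real squeue output into count_jobs instead.

```shell
# Sample squeue output copied from the listing above.
sample_squeue_output() {
cat <<'EOF'
             JOBID  PARTITION     NAME   USER   ST      TIME  NODES NODELIST(REASON)
             306414   compute  clasDFT   749094 R      16:36      1 c006
             306413   compute mpi_benc   749094 R      31:02      2 c[005,007]
             306411   compute  orca_1n   465449 R    1:00:32      1 c004
             306410   compute  orca_1n   465449 R    1:04:17      1 c003
             306409   highmem cnv_obit   465449 R   11:37:17      1 c232
             306407   compute  20M4_20   465449 R   11:45:54      1 c012
             306406   compute 20_ML_20   465449 R   11:55:40      1 c012
EOF
}

# Count jobs per (PARTITION, ST) pair: column 2 is the partition and
# column 5 the state code. NR > 1 skips the header line.
count_jobs() {
    awk 'NR > 1 { count[$2 " " $5]++ }
         END { for (k in count) print k, count[k] }' | sort
}

summary=$(sample_squeue_output | count_jobs)
echo "$summary"
```

For the sample listing this reports six running jobs on compute and one on highmem.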

squeue -u username

Using the -u flag with your username lists only your own jobs.

[749094@login01 ~]$ squeue -u 749094
             JOBID  PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            306414    compute  clasDFT   749094  R      16:36      1 c006
            306413    compute mpi_benc   749094  R      31:02      2 c[005,007]
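
If you check your own jobs often, the -u flag can be wrapped in small shell functions. This is only a sketch: my_jobs and my_pending_jobs are hypothetical names, not commands provided on Viper. They default to the current login name and use squeue's standard -t (state) filter to show pending jobs only.

```shell
# Hypothetical helper: list jobs for a user (default: the current user).
my_jobs() {
    squeue -u "${1:-$(id -un)}"
}

# Hypothetical helper: as above, restricted to jobs still waiting to run.
my_pending_jobs() {
    squeue -u "${1:-$(id -un)}" -t PENDING
}

# Only call Slurm when it is actually installed (e.g. on a login node).
if command -v squeue >/dev/null 2>&1; then
    my_jobs
fi
```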

scancel

scancel is used to cancel your jobs. You can only cancel jobs running under your own user ID. The command gives no output, but you can run squeueme (or squeue -u username) afterwards to check that the job has gone.

[username@login01 ~]$ scancel 289535
[username@login01 ~]$
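
Since scancel prints nothing, one way to confirm the result is to query the queue for that job ID straight afterwards. The function below is a hypothetical sketch (cancel_and_check is not a Viper command) built from standard squeue options: -h suppresses the header and -j restricts the listing to one job.

```shell
# Hypothetical helper: cancel a job, then report whether it still
# appears in the queue.
cancel_and_check() {
    scancel "$1"
    # An empty listing means the job is no longer known to the queue.
    # A cancelled job can remain listed briefly while it is cleaned up,
    # so "still listed" is not necessarily an error.
    if [ -z "$(squeue -h -j "$1" 2>/dev/null)" ]; then
        echo "job $1 is no longer in the queue"
    else
        echo "job $1 is still listed"
    fi
}
```

For example, cancel_and_check 289535 would cancel the job from the example above and then report its status.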

interactive

The interactive command, available on the Viper login node, starts an interactive session on a compute node so that you can begin work there.
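
For comparison, generic Slurm can start an interactive shell with srun and its --pty option. The sketch below is an assumption about how such a session is typically requested on a Slurm cluster, not a description of how Viper's interactive command is implemented. The function name start_shell_on is hypothetical, and interactive remains the route described on this page.

```shell
# Hypothetical helper: request one task on one node in the given
# partition (default: compute) and attach an interactive bash shell.
start_shell_on() {
    srun --partition="${1:-compute}" --nodes=1 --ntasks=1 --pty bash -i
}
```

Calling start_shell_on highmem would request the shell on a high memory node instead; the session ends, and the resources are released, when you exit the shell.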



Back / Next (Batch Jobs)