Accessing compute nodes

The Keeling login node is suited to lightweight work that needs little memory and few cores. Programs that use large amounts of memory or many CPUs will be killed if run on the login node.

To avoid this, run such work on the compute nodes. There are two ways to access them: interactively, or through the batch system, which uses Slurm.

Available compute resources

Keeling is divided into partitions, and you choose which one to submit your job script to based on your computing needs. Partitions vary in the number of nodes, the amount of wall time allowed, and the availability of special resources (GPUs or nodes with higher core counts).

Keeling consists of the following types of nodes:

Name              Cores per node   Real memory (MB)       GPU VRAM
a                 4                12000                  -
b                 8                16000, 32000           -
c                 12               32160, 72480, 118000   -
d                 12               64300, 80500           -
e                 12               64300                  -
f                 8                24000                  -
g                 20               128800                 -
gpu (Titan RTX)   20               253000                 24 GB
gpu (L40S)        96               253000                 46 GB
h                 20 or 24         256200                 -
i                 32               192000                 -
j                 48               253000                 -
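As a rough aid for reading the table, the sketch below lists the node types that could hold a given single-node request. The figures are copied from the table; where a type lists several memory sizes or core counts, the largest is used, which is a best-case assumption. The function names and the sample request are hypothetical, GPU nodes are left out, and which nodes a job actually lands on depends on the partition.

```shell
#!/bin/bash
# Node types from the table above: name, cores per node, real memory (MB).
# Assumption: for types with several listed values, the largest is used.
node_types() {
    cat <<'EOF'
a 4 12000
b 8 32000
c 12 118000
d 12 80500
e 12 64300
f 8 24000
g 20 128800
h 24 256200
i 32 192000
j 48 253000
EOF
}

# Print each node type with at least $1 cores and $2 MB of memory.
fits_on() {
    node_types | while read -r name cores mem; do
        if [ "$cores" -ge "$1" ] && [ "$mem" -ge "$2" ]; then
            echo "$name"
        fi
    done
}

# Hypothetical request: 12 cores and 64000 MB on one node.
fits_on 12 64000
```

For this sample request the candidates are the c, d, e, g, h, i, and j types, one per output line.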

Interactive computing

An interactive session reserves dedicated resources on a compute node, allowing you to work on it as you would on the login node. To request an interactive job, use the qlogin command. For example:

qlogin -n 20 --mem-per-cpu=4096 --time=12:00:00 --job-name=interactive

will request 20 cores with 4 GB of memory per core (specified as 4096 MB) and a wall clock limit of 12 hours.
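The arithmetic behind this request can be checked with a small sketch. The helper name total_mem_mb is hypothetical; the numbers are those of the qlogin example.

```shell
#!/bin/bash
# Total memory (MB) implied by a request of $1 cores at $2 MB per core.
total_mem_mb() {
    echo $(( $1 * $2 ))
}

total=$(total_mem_mb 20 4096)
echo "total: ${total} MB ($(( total / 1024 )) GB)"
```

Since a qlogin session is limited to one node, that total has to fit in the real memory of a single node from the table above; 81920 MB is, for example, below the 128800 MB listed for g nodes.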

Note

Interactive jobs launched by qlogin are limited to one node.

To end your interactive job, run:

exit

which will end your job and release the node for others to use.

Alternatively, you can use srun to request an interactive session on other partitions, which gives access to more resources.
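A common way to open an interactive shell with srun is its --pty option. The sketch below only prints the command so it can be inspected before use. The wrapper name is hypothetical, requesting one task with several cores (-n 1 -c) is an assumption rather than Keeling's documented convention, and the partition, core count, and time in the usage line are placeholders, not a statement about which partitions accept interactive jobs.

```shell
#!/bin/bash
# Build (but do not run) an srun command for an interactive shell.
# $1 = partition, $2 = cores for the shell, $3 = wall clock limit
interactive_cmd() {
    # One task (-n 1) with $2 cores (-c), attached to a pseudo-terminal.
    printf 'srun -p %s -n 1 -c %s --time=%s --pty bash\n' "$1" "$2" "$3"
}

# Placeholder values; replace with a partition you have access to.
interactive_cmd node 12 02:00:00
```

As with qlogin, typing exit in the resulting shell ends the job and releases the resources.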

Accessing GPU nodes

Keeling has 1 Titan RTX GPU node and 8 L40S GPUs. To access these resources, you must specify the gpu or l40s partition when requesting an interactive session.

To interactively compute on the Titan RTX GPU node (24 GB VRAM):

qlogin -p gpu -n 20 --gres=gpu:RTX:1 --mem=253000

To interactively compute on a newer L40S GPU node (46 GB VRAM):

qlogin -p l40s -n 96 --gres=gpu:L40S:1 --mem=253000

Note

Once in the interactive session, you must run the following to load the necessary modules to use the L40S GPU:

module purge
module load L40S
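After loading the modules, it can be worth confirming that the session actually sees a GPU. The sketch assumes the NVIDIA driver tool nvidia-smi is on the PATH of the GPU nodes, which this page does not state; the function name is hypothetical.

```shell
#!/bin/bash
# Report the GPUs visible to the current session.
gpu_report() {
    # Slurm normally exports CUDA_VISIBLE_DEVICES for jobs that request --gres=gpu.
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
    if command -v nvidia-smi >/dev/null 2>&1; then
        # Name and total memory of each visible GPU, in CSV form.
        nvidia-smi --query-gpu=name,memory.total --format=csv \
            || echo "nvidia-smi failed; no usable GPU in this session?"
    else
        echo "nvidia-smi not found; is this a GPU node?"
    fi
}

gpu_report
```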

Batch computing

A batch job is controlled by a script written by the user, who submits it to the batch system (Slurm). The batch system then selects resources for the job based on the parameters given by the user and decides where and when to run it. Batch jobs can access more resources than interactive sessions.

Jobs are submitted by:

sbatch <job script>

The job script is a shell script that contains the necessary directives for the batch system to allocate resources for the job. The job script should typically contain the following directives:

Directive              Description
#SBATCH --job-name     Name of the job
#SBATCH -p             Partition
#SBATCH -n             Number of cores (tasks)
#SBATCH --time         Wall clock time
#SBATCH --mem          Memory per node

For GPU batch computing:

Directive              Description
#SBATCH --gres         Generic resource request (GPUs). For L40S: gpu:L40S:1

Additional useful directives:

Directive                     Description
#SBATCH --mail-user           Email address to receive notifications
#SBATCH --mail-type           Events to notify on (e.g. BEGIN, END, FAIL)
#SBATCH --constraint          Node constraint (i.e. required node feature)
#SBATCH --sockets-per-node    Number of sockets per node
#SBATCH --cores-per-socket    Number of cores per socket
#SBATCH --mem-per-cpu         Memory per core. Alternative to --mem

Example scripts

A simple script requesting a single node consisting of 12 cores for 24 hours:

#!/bin/bash

#SBATCH --job-name=<job name>
#SBATCH -p node
#SBATCH -n 12
#SBATCH --time=24:00:00
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=END
#SBATCH --mail-user=<netID>@illinois.edu

A simple script requesting 100 cores across multiple nodes for 12 hours, without regard to node configuration:

#!/bin/bash

#SBATCH --job-name=<job name>
#SBATCH -p sesempi
#SBATCH -n 100
#SBATCH --time=12:00:00
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=END
#SBATCH --mail-user=<netID>@illinois.edu

A simple script for requesting a handful of cores but a large amount of memory per core (16 GB) for 6 hours:

#!/bin/bash

#SBATCH --job-name=<job name>
#SBATCH -p node
#SBATCH -n 4
#SBATCH --time=06:00:00
#SBATCH --mem-per-cpu=16g
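The example scripts so far contain only directives; the commands to run go below them. As a minimal sketch of such a payload, the function below prints a few environment variables that Slurm sets inside a job (SLURM_JOB_ID, SLURM_NTASKS, SLURM_JOB_NODELIST). The function name is hypothetical, and the fallback values only matter when it runs outside a job.

```shell
#!/bin/bash
# Print a one-line summary of the current Slurm job.
# The ":-" fallbacks are used when the variables are not set (outside a job).
job_summary() {
    echo "job=${SLURM_JOB_ID:-none} tasks=${SLURM_NTASKS:-none} nodes=${SLURM_JOB_NODELIST:-none}"
}

job_summary
```

Placing a line like this at the top of the payload makes the job's output file record where and with how many tasks the job ran.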

A more advanced job script that runs WRF on 100 cores with 4 GB of memory per core and 12 hours of wall clock time, requiring nodes with 2 sockets of 10 cores each (i.e. using exclusively keeling-g nodes):

#!/bin/bash

#SBATCH --job-name=<job name>
#SBATCH -p sesempi
#SBATCH -n 100
#SBATCH --sockets-per-node=2
#SBATCH --cores-per-socket=10
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=4096
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=END
#SBATCH --mail-user=<netID>@illinois.edu

mpirun -np $SLURM_NTASKS ./wrf.exe
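The node count implied by this request follows from the directives: each node offers 2 sockets of 10 cores, so 100 cores need 5 such nodes. A small sketch of that arithmetic, with a hypothetical helper name:

```shell
#!/bin/bash
# Nodes needed for $1 cores when each node has $2 sockets of $3 cores.
# Uses integer ceiling division so a partial node counts as a whole one.
nodes_needed() {
    local per_node=$(( $2 * $3 ))
    echo $(( ($1 + per_node - 1) / per_node ))
}

nodes_needed 100 2 10
```

Each of those nodes would then need 20 x 4096 MB = 81920 MB, which is within the 128800 MB listed for g nodes.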

A script for accessing an L40S GPU, requesting 96 cores, 2 days of wall clock time, and 253 GB of memory:

#!/bin/bash

#SBATCH -p l40s
#SBATCH -N 1
#SBATCH -n 96
#SBATCH --gres=gpu:L40S:1
#SBATCH --constraint=tl40s
#SBATCH --mem=253000
#SBATCH --time=48:00:00
#SBATCH --output=batchout
#SBATCH --mail-user=<netID>@illinois.edu

module purge
module load L40S

Helpful Slurm commands

Command    Description
sinfo      View partition and node information for a system. Helpful for viewing node availability
sbatch     Submit a job script
squeue     View information about jobs in the Slurm scheduling queue
scancel    Signal a job to quit
sshare     List the shares of associations on the system
sacct      View accounting data for all jobs
sview      Graphical interface for viewing the Slurm state

Information regarding each command may be found in the Slurm documentation or by viewing each command's man page (for example, man squeue).
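These commands are often chained: submit a job, keep its ID, then query it. The --parsable option of sbatch prints just the job ID, optionally followed by a semicolon and the cluster name. The helper below is a hypothetical sketch that strips the cluster part; my_job.sh and the sample line "12345;keeling" are made up for illustration.

```shell
#!/bin/bash
# Extract the job ID from `sbatch --parsable` output,
# which has the form "jobid" or "jobid;cluster".
job_id_only() {
    cut -d';' -f1
}

# Demonstration with a made-up sbatch output line.
echo "12345;keeling" | job_id_only

# On the cluster (my_job.sh is a placeholder script name):
#   jobid=$(sbatch --parsable my_job.sh | job_id_only)
#   squeue -j "$jobid"     # state while queued or running
#   sacct -j "$jobid"      # accounting record afterwards
#   scancel "$jobid"       # cancel the job if needed
```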

Access compute nodes directly using SSH

Direct SSH access to compute nodes is not allowed except for monitoring jobs that are already running. If you need to monitor a job (or set up an SSH tunnel), first identify the node your job is running on:

squeue -u $USER

which lists information about your running and queued jobs, with the NODELIST column showing the nodes of your running jobs. You may then ssh directly into that node:

ssh keeling-<node letter and number>
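The lookup can also be scripted. The sketch below uses squeue's -h (no header), -t RUNNING, and -o "%N" options to print only the node lists of running jobs, and scontrol show hostnames to expand a compressed node list into one hostname per line. The function name is hypothetical, and outside a Slurm environment it only prints a message.

```shell
#!/bin/bash
# Print the compute nodes used by the current user's running jobs.
my_job_nodes() {
    if ! command -v squeue >/dev/null 2>&1; then
        echo "squeue not found; run this on the cluster" >&2
        return 0
    fi
    # One node list per running job, possibly in compressed bracket form.
    squeue -u "${USER:-$(id -un)}" -h -t RUNNING -o "%N" | while read -r nodelist; do
        # Expand the node list into individual hostnames.
        scontrol show hostnames "$nodelist"
    done
}

my_job_nodes
# Then connect to one of the printed names:
#   ssh <node name>
```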