Using slurm to submit and schedule jobs
Logging in
Log into your HPC cluster using ssh from your terminal:
ssh USERNAME@CLUSTERNAME
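For example, on Stanford’s Sherlock cluster the command looks something like the following; the login hostname varies by cluster, so check your cluster’s documentation for the correct address:
ssh USERNAME@login.sherlock.stanford.edu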
Interactive session
For terminal-level operations such as file management, conda environment setup, and writing bash scripts, it is better not to use the so-called “login nodes,” which are the nodes you land on when you first SSH into the cluster. Instead, you should start what’s known as an “interactive session” on a compute node.
Depending on your cluster’s configuration, this may be done with the srun command:
srun --pty bash
However, some clusters have aliases for this command, such as idev on TACC clusters:
idev
or sdev on the Stanford Sherlock cluster:
sdev
These interactive jobs typically come with a default duration and amount of compute. You can usually modify the defaults, for instance by adding an appropriate dotfile to your home directory; refer to your cluster’s documentation for details.
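Depending on your cluster, you can often also request specific resources for the interactive session directly on the srun command line. The line below is only a sketch; the partition name and limits are placeholders that you should replace with values valid on your cluster:
srun --partition=gpu-a100 --time=02:00:00 --ntasks=1 --cpus-per-task=4 --pty bash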
Creating a batch script
Suppose that the script you want to run is a Python file called ‘montecarlo.py’. In order to manage your job and its resources, you must create a second file that contains the instructions for the job scheduler. This file is called a “batch script” and typically has the extension .sbatch.
Here is a template for a typical batch script, montecarlo.sbatch. Note the different parameters specified at the top of the file; these are the parameters that you will need to modify for your specific job’s resource settings.
#!/bin/bash
# Job name:
#SBATCH --job-name=myjob
#
# Account to charge:
#SBATCH --account=[lab account name]
#
# Pick partition to run on:
#SBATCH --partition=gpu-a100
#
# File where job progress and standard output are written
#SBATCH --output=myjob.out
#
# File where job errors are written
#SBATCH --error=myjob.err
#
# Request only one node:
#SBATCH --nodes=1
#
# Memory per node (uses the full node memory if set to zero):
#SBATCH --mem=0
#
# Number of tasks:
#SBATCH --ntasks=1
#
# Processors per task:
#SBATCH --cpus-per-task=2
#
# Wall clock limit (HH:MM:SS). Max is 48 hours on most nodes:
#SBATCH --time=05:30:00
#
## Command(s) to run
python ./scripts/montecarlo.py
This script will charge the usage to the account [lab account name]. On some clusters this is your username, while on others, like TACC, it is a charge code. Some clusters do not require this field at all; if yours does not, remove the --account line.
Your batch job will output two files, in addition to any output produced by your script. The first is myjob.out, which contains the standard output (anything printed) produced by your script. The second is myjob.err, which contains any errors that your script produces.
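While the job is running, you can follow the standard output as it is written with a command like:
tail -f myjob.out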
Submitting, checking, and cancelling jobs
To submit your job, use the sbatch command:
sbatch montecarlo.sbatch
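If the submission succeeds, sbatch prints the job ID that the scheduler assigned, something like the following (the number here is only a placeholder):
Submitted batch job 123456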
To check the status of your job, use the squeue command:
squeue -u USERNAME
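The default squeue output is a small table of your pending and running jobs; the values below are illustrative only:
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
123456  gpu-a100    myjob USERNAME  R       5:12      1 node-001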
To cancel your job, use the scancel command:
scancel JOBID
You can get the job ID from the output of squeue, or from the initial output of sbatch when you submit your job.
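For example, to cancel the job submitted above (using the placeholder job ID 123456):
scancel 123456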
Loading modules
For many packages, you will need to load a pre-installed module on the HPC system before you can use it. For example, to use the Anaconda Python distribution on some HPC systems, you must first load the anaconda3 module. This is done with the module load command:
module load anaconda3
This bash command should be placed at the top of your batch script, before the command that runs your script. Often you will need to load very specific versions of libraries like cuda or cudnn for GPU computing; likewise, mpi and gcc are particularly version-sensitive modules. Refer to your cluster’s documentation for details on compatibility among different modules.
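To see which modules and versions are available on your cluster, you can usually run module avail, and module list shows what you currently have loaded:
module avail
module list
In the batch script above, the load commands belong in the command section, before the line that runs your script. The sketch below assumes hypothetical module names and versions; use the ones your cluster actually provides:
## Command(s) to run
module load anaconda3
module load cuda/11.8
python ./scripts/montecarlo.py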
Moving files between local and remote
Copy a file from your local machine to your remote $HOME directory:
scp cfgen.py your_username@remotehost.edu:~
Copy files from the remote machine to a local directory:
scp "your_username@remotehost.edu:my_output_files/*results.txt" /some/local/directory
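To copy an entire directory in either direction, add the -r flag, for example (the directory name here is illustrative):
scp -r your_username@remotehost.edu:my_output_files /some/local/directory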