6 Using SLURM

6.1 Introduction

HP XC uses the Simple Linux Utility for Resource Management (SLURM) for system resource management and job scheduling. SLURM is a reliable, efficient, fault-tolerant, open source job and compute resource manager with features that make it suitable for large-scale, high performance computing environments. SLURM can report machine status and perform partition management, job management, and job scheduling.

The SLURM Reference Manual is available on the HP XC Documentation CD-ROM and from the following Web site: http://www.llnl.gov/LCdocs/slurm/.

As a system resource manager, SLURM has the following key functions:

• Allocate exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work

• Provide a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes

• Arbitrate conflicting requests for resources by managing a queue of pending work
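The three functions above can be pictured with a small toy model. The following Python sketch is illustrative only: the class ToyResourceManager, its methods, and the node names are hypothetical, it handles exclusive allocations only, and its strict first-come, first-served policy is an assumption made for simplicity. It does not reproduce SLURM's actual scheduling logic.

```python
from collections import deque


class ToyResourceManager:
    """Illustrative model only; it does not reproduce SLURM's scheduler."""

    def __init__(self, nodes):
        self.free = set(nodes)   # nodes not allocated to any job
        self.running = {}        # job name -> set of nodes it holds exclusively
        self.pending = deque()   # FIFO queue of (job name, node count) requests

    def submit(self, job, count):
        """Queue a request, then start whatever can run."""
        self.pending.append((job, count))
        self._schedule()

    def finish(self, job):
        """Release a finished job's nodes and retry the queue."""
        self.free |= self.running.pop(job)
        self._schedule()

    def _schedule(self):
        # Assumed policy: strict first-come, first-served.
        # Stop at the first queued job that does not fit.
        while self.pending and self.pending[0][1] <= len(self.free):
            job, count = self.pending.popleft()
            allocated = set(sorted(self.free)[:count])
            self.free -= allocated
            self.running[job] = allocated


rm = ToyResourceManager(["n1", "n2", "n3", "n4"])
rm.submit("jobA", 3)   # starts at once on three nodes
rm.submit("jobB", 2)   # waits in the queue: only one node is free
rm.finish("jobA")      # frees three nodes, so jobB starts
print(sorted(rm.running))
```

In this model, jobB stays pending until jobA releases its nodes, which is the queue arbitration described in the third item.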

Section 1.4.3 describes the interaction between SLURM and LSF.

6.2 SLURM Commands

Users interact with SLURM through its command line utilities. SLURM has the following basic commands: srun, scancel, squeue, sinfo, and scontrol, which can run on any node in the HP XC system. These commands are summarized in Table 6-1 and described in the following sections.
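As a rough illustration of how these utilities are invoked, the following Python sketch builds argument lists for srun, squeue, and scancel. The helper function names and the job ID 1234 are hypothetical. The option letters used (-N and -n for srun, -u for squeue, -s for scancel) are assumed from SLURM's command documentation and may differ between SLURM versions, so check the man pages on your system before relying on them.

```python
import shutil
import subprocess


def srun_args(command, nodes=None, ntasks=None):
    """Build an srun command line (assumed: -N node count, -n task count)."""
    args = ["srun"]
    if nodes is not None:
        args += ["-N", str(nodes)]
    if ntasks is not None:
        args += ["-n", str(ntasks)]
    return args + list(command)


def squeue_args(user=None):
    """Build a squeue command line (assumed: -u filters by user name)."""
    args = ["squeue"]
    if user is not None:
        args += ["-u", user]
    return args


def scancel_args(job_id, signal=None):
    """Build a scancel command line (assumed: -s sends the named signal)."""
    args = ["scancel"]
    if signal is not None:
        args += ["-s", signal]
    return args + [str(job_id)]


def run(args):
    """Run a command if it is installed; return its stdout, or None if absent."""
    if shutil.which(args[0]) is None:
        return None
    return subprocess.run(args, capture_output=True, text=True, check=True).stdout


# Print the command lines rather than executing them, so the sketch
# also works on a machine without SLURM installed.
print(" ".join(srun_args(["hostname"], nodes=2, ntasks=4)))
print(" ".join(squeue_args(user="alice")))
print(" ".join(scancel_args(1234, signal="USR1")))
```

On a node where SLURM is installed, passing one of these lists to run() would execute the command and return its output. The lists are built separately so they can be inspected before anything is submitted.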

Table 6-1: SLURM Commands

Command   Function

srun      Submits jobs to run under SLURM management. srun is used to submit a job for execution, allocate resources, attach to an existing allocation, or initiate job steps. srun can:
          • Submit a batch job and then terminate
          • Submit an interactive job and then persist to shepherd the job as it runs
          • Allocate resources to a shell and then spawn that shell for use in running subordinate jobs

squeue    Displays the queue of running and waiting jobs (or "job steps"), including the JobID (used for scancel), and the nodes assigned to each running job. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.

scancel   Cancels a pending or running job or job step. It can also be used to send a specified signal to all processes on all nodes associated with a job. Only job owners or administrators can cancel jobs.
