The squeue command can report on jobs in the job queue according to their state; valid states are pending, running, completing, completed, failed, timeout, and node_fail. Example 6-3 uses the squeue command to report on failed jobs.

Example 6-3: Reporting on Failed Jobs in the Queue

$ squeue --state=FAILED
JOBID  PARTITION  NAME      USER  ST  TIME  NODES  NODELIST
59     amt1       hostname  root  F   0:00  0
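
When several states are of interest at once, the squeue output can be summarized by its ST column. The following is a minimal sketch, assuming the eight-column layout shown in Example 6-3, in which ST is the fifth whitespace-separated field. The function name count_jobs_by_state is hypothetical and is not part of SLURM; a job name containing spaces would shift the columns and give wrong counts.

```shell
# count_jobs_by_state: hypothetical helper, not a SLURM command.
# Reads squeue-style output on stdin and prints "<state code> <count>"
# for each state code found, sorted by state code.
# Assumes the first line is the header and ST is the fifth field.
count_jobs_by_state() {
    awk 'NR > 1 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}
```

On a system where squeue is available, it might be invoked as squeue | count_jobs_by_state.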

6.6 Killing Jobs with the scancel Command

The scancel command cancels a pending or running job or job step. It can also be used to send a specified signal to all processes on all nodes associated with a job. Only job owners or administrators can cancel jobs.

Example 6-4 kills job 415 and all its job steps.

Example 6-4: Killing a Job by Its JobID

$ scancel 415

Example 6-5 cancels all pending jobs.

Example 6-5: Cancelling All Pending Jobs

$ scancel --state=PENDING

Example 6-6 sends the TERM signal to terminate job steps 421.2 and 421.3.

Example 6-6: Sending a Signal to a Job

$ scancel --signal=TERM 421.2 421.3
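
Job steps are addressed as jobid.stepid, so an argument list like the one in Example 6-6 can be generated from a job ID and a list of step numbers. The following is a minimal sketch; the helper name step_cancel_command is hypothetical, and it only prints the command (a dry run) rather than running scancel.

```shell
# step_cancel_command: hypothetical dry-run helper, not part of SLURM.
# Usage: step_cancel_command SIGNAL JOBID STEP [STEP...]
# Prints the scancel command line that would signal the given job steps.
step_cancel_command() {
    sig=$1
    job=$2
    shift 2
    cmd="scancel --signal=$sig"
    for step in "$@"; do
        cmd="$cmd $job.$step"    # build the jobid.stepid form
    done
    printf '%s\n' "$cmd"
}

step_cancel_command TERM 421 2 3
# prints: scancel --signal=TERM 421.2 421.3
```

Printing the command first gives a chance to review which job steps would be signaled before anything is actually cancelled.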

6.7 Getting System Information with the sinfo Command

The sinfo command reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options. It displays a summary of partition and node information (not job information), such as partition names, nodes per partition, and CPUs per node.

Example 6-7: Using the sinfo Command (No Options)

$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
lsf        up     infinite   1      down*  n15
lsf        up     infinite   2      idle   n[14,16]
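
The sinfo output can also be filtered after the fact, for example to list only the nodes that are down. The following is a minimal sketch, assuming the six-column layout shown in Example 6-7, in which STATE is the fifth field and NODELIST the sixth. The function name list_down_nodes is hypothetical and is not part of SLURM.

```shell
# list_down_nodes: hypothetical helper, not a SLURM command.
# Reads sinfo-style output on stdin and prints the NODELIST field of every
# row whose STATE field begins with "down" (this also matches "down*").
# Assumes the first line is the header, STATE is field 5, NODELIST is field 6.
list_down_nodes() {
    awk 'NR > 1 && $5 ~ /^down/ { print $6 }'
}
```

On a system where sinfo is available, it might be invoked as sinfo | list_down_nodes; for the output in Example 6-7 this would print n15.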
