Example 6-8: Reporting Reasons for Downed, Drained, and Draining Nodes

$ sinfo -R
REASON            NODELIST
Memory errors     dev[0,5]
Not Responding    dev8
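
If you need this report in a script, the output can be reshaped with standard tools. The following is a minimal sketch, not part of HP XC or SLURM. It assumes the two-column layout shown in Example 6-8 (a header line, then a reason followed by a node list), and the helper name summarize_reasons is hypothetical. The sample text is copied from the example so the sketch runs without a live system.

```shell
#!/bin/sh
# Sample text copied from Example 6-8; on a live system this could
# instead come from: sinfo -R
sample_output='REASON            NODELIST
Memory errors     dev[0,5]
Not Responding    dev8'

# summarize_reasons (hypothetical helper): reads sinfo -R style text on
# stdin and prints "NODELIST: REASON", one line per entry.
# Assumes the node list is the last whitespace-separated field.
summarize_reasons() {
    awk 'NR > 1 && NF >= 2 {
        nodes = $NF          # node list is the last field
        $NF = ""             # drop it; the rest is the reason text
        sub(/[ \t]+$/, "")   # trim the trailing separator
        print nodes ": " $0
    }'
}

summary=$(printf '%s\n' "$sample_output" | summarize_reasons)
printf '%s\n' "$summary"
```

Because a reason can contain spaces, the sketch takes the last field as the node list instead of splitting on the first space. If your version of sinfo prints additional columns, the field selection would need to be adjusted.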

 

 

6.8 Job Accounting

HP XC System Software provides an extension to SLURM for job accounting. The sacct command displays job accounting data in a variety of forms for your analysis. Job accounting data is stored in a log file; the sacct command filters that log file to report on your jobs, job steps, status, and errors. If job accounting is not configured on your system, see your system administrator.

You can find detailed information on the sacct command and job accounting data in the sacct(1) manpage.
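
As an illustration of querying accounting data for a single job, here is a minimal sketch. The helper name report_job and the job ID 1234 are hypothetical, and the -j option is taken from general sacct usage; confirm the options available on your system in the sacct(1) manpage. The helper first checks that sacct is on your PATH, since job accounting might not be configured.

```shell
#!/bin/sh
# report_job (hypothetical helper): print accounting data for one job ID.
# Returns 1 with a message on stderr if sacct is not on PATH.
report_job() {
    jobid="$1"
    if ! command -v sacct >/dev/null 2>&1; then
        echo "sacct not found; ask your system administrator about job accounting" >&2
        return 1
    fi
    # -j selects the job to report on (assumed; verify in sacct(1)).
    sacct -j "$jobid"
}

# Example call, with 1234 as a placeholder job ID:
# report_job 1234
```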

6.9 Fault Tolerance

SLURM can handle a variety of failure modes without terminating workloads, including crashes of the node running the SLURM controller. User jobs may be configured to continue execution despite the failure of one or more nodes on which they are executing (refer to Section 6.4.5.1 for further information). The command controlling a job may detach from and reattach to the parallel tasks at any time. A node allocated to a job is available for reuse as soon as the job(s) allocated to that node terminate. If some nodes fail to complete job termination in a timely fashion because of hardware or software problems, only the scheduling of those tardy nodes is affected.

6.10 Security

SLURM has a simple security model:

Any user of the system can submit jobs to execute. Any user can cancel his or her own jobs. Any user can view SLURM configuration and state information.

Only privileged users can modify the SLURM configuration, cancel any job, or perform other restricted activities. Privileged users in SLURM include root users and SlurmUser (as defined in the SLURM configuration file).

If other users require permission to modify the SLURM configuration, set-uid programs may be used to grant specific permissions to specific users.

SLURM accomplishes security by means of communication authentication, job authentication, and user authorization.

Refer to SLURM documentation for further information about SLURM security features.
