# chmod a+r /hptc_cluster/slurm/job/jobacct.log

You can find detailed information on the sacct command and job accounting data in the sacct(1) manpage.

Fault Tolerance

SLURM can handle a variety of failure modes without terminating workloads, including crashes of the node running the SLURM controller. User jobs may be configured to continue execution despite the failure of one or more nodes on which they are executing. The command controlling a job may detach from and reattach to the parallel tasks at any time. A node allocated to a job is available for reuse as soon as the job or jobs allocated to that node terminate. If some nodes fail to complete job termination in a timely fashion because of hardware or software problems, only the scheduling of those tardy nodes is affected.
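As an illustration of configuring a job to continue despite a node failure, the following minimal sketch wraps srun with its --no-kill option. The function name launch_tolerant and the SRUN override variable are hypothetical names introduced for this example, and the sketch assumes that the installed srun supports --no-kill; check the srun(1) manpage on your system before relying on it.

```shell
#!/bin/sh
# Sketch: launch a command so that SLURM does not automatically
# terminate the job when one of its allocated nodes fails.
# launch_tolerant and the SRUN override are hypothetical names
# used only in this example.
launch_tolerant() {
    nodes="$1"
    shift
    # --no-kill asks SLURM not to terminate the job on node failure.
    # -N sets the number of nodes requested for the job.
    "${SRUN:-srun}" --no-kill -N "$nodes" "$@"
}
```

For example, launch_tolerant 4 hostname would request four nodes and run hostname on them. The application itself must still be written to cope with the loss of a node; the option only prevents SLURM from ending the job automatically.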

Security

SLURM has a simple security model:

Any user of the system can submit jobs to execute. Any user can cancel his or her own jobs. Any user can view SLURM configuration and state information.

Only privileged users can modify the SLURM configuration, cancel other users' jobs, or perform other restricted activities. Privileged users in SLURM include root users and SlurmUser (as defined in the SLURM configuration file).

If permission to modify SLURM configuration is required by others, set-uid programs may be used to grant specific permissions to specific users.
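The cancellation rule in this model can be sketched as a small shell function. This is an illustrative model only, not SLURM's implementation: the function name may_cancel and its arguments are hypothetical, and it identifies the root user by name rather than by user ID for simplicity.

```shell
#!/bin/sh
# Illustrative model of the cancellation rule; not SLURM's own code.
# Usage: may_cancel REQUESTER JOB_OWNER SLURM_USER
# Returns exit status 0 if REQUESTER may cancel JOB_OWNER's job,
# and 1 otherwise. All names here are hypothetical.
may_cancel() {
    requester="$1"
    job_owner="$2"
    slurm_user="$3"
    # Any user can cancel his or her own jobs.
    [ "$requester" = "$job_owner" ] && return 0
    # Privileged users (root and SlurmUser) can cancel other users' jobs.
    [ "$requester" = "root" ] && return 0
    [ "$requester" = "$slurm_user" ] && return 0
    return 1
}
```

With SlurmUser set to slurm, may_cancel alice alice slurm and may_cancel root alice slurm both succeed, while may_cancel bob alice slurm fails.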

SLURM accomplishes security by means of communication authentication, job authentication, and user authorization.

Refer to SLURM documentation for further information about SLURM security features.
