This bsub command launches a request for four cores (from the -n4 option of the bsub command) spread across four nodes (from the -ext "SLURM[nodes=4]" option), and the job is launched on those cores. The script, myscript, which is shown here, runs the job:

#!/bin/sh
hostname
srun hostname
mpirun -srun ./hellompi
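For reference, the submission line described above might look like the following sketch; only the -n4 and -ext "SLURM[nodes=4]" options come from this description, and the output-file option and script path are assumptions:

# Hypothetical submission; -n4 and -ext "SLURM[nodes=4]" are from the
# description above, the output file option is illustrative only.
bsub -n4 -ext "SLURM[nodes=4]" -o myscript.out ./myscript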

3. LSF-HPC schedules the job and monitors the state of the resources (compute nodes) in the SLURM lsf partition. When the LSF-HPC scheduler determines that the required resources are available, LSF-HPC allocates those resources in SLURM and obtains a SLURM job identifier (jobID) that corresponds to the allocation.

In this example, four cores spread over four nodes (n1, n2, n3, n4) are allocated for myscript, and the SLURM job ID of 53 is assigned to the allocation.
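If you want to confirm the allocation from the SLURM side, you can query SLURM directly with the standard squeue command; the exact columns and node-list format in the output depend on the SLURM configuration:

# Show the SLURM allocation that LSF-HPC created (SLURM job ID 53 here);
# the NODELIST column should list the four allocated nodes.
squeue -j 53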

4. LSF-HPC prepares the user environment for the job on the LSF execution host node and dispatches the job with the job_starter.sh script. This user environment includes standard LSF environment variables and two SLURM-specific environment variables: SLURM_JOBID and SLURM_NPROCS.

SLURM_JOBID is the SLURM job ID of the job. Note that this is not the same as the LSF-HPC jobID. "Translating SLURM and LSF-HPC JOBIDs" describes the relationship between the SLURM_JOBID and the LSF-HPC jobID.

SLURM_NPROCS is the number of processes allocated.

These environment variables are intended for use by the user's job, whether explicitly (user scripts can use these variables as necessary) or implicitly (the srun commands in the user's job use these variables to determine their allocation of resources).

In this example, the value of SLURM_NPROCS is 4 and the value of SLURM_JOBID is 53.
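As a minimal sketch of the explicit use mentioned above, a job script can read these variables directly; the echo lines are illustrative additions and not part of myscript:

#!/bin/sh
# Report the SLURM allocation that LSF-HPC prepared for this job
echo "SLURM job ID:        $SLURM_JOBID"     # 53 in this example
echo "Processes allocated: $SLURM_NPROCS"    # 4 in this example
srun hostname    # srun inherits SLURM_JOBID and SLURM_NPROCS implicitly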

5. The user job myscript begins execution on compute node n1.

The first line in myscript is the hostname command. It executes locally and returns the name of the node, n1.

6. The second line in the myscript script is the srun hostname command. The srun command in myscript inherits SLURM_JOBID and SLURM_NPROCS from the environment and executes the hostname command on each compute node in the allocation.

7. The output of the hostname tasks (n1, n2, n3, and n4) is aggregated back to the srun launch command (shown as dashed lines in Figure 9-1) and is ultimately returned to the srun command in the job starter script, where it is collected by LSF-HPC.
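The aggregated output of the srun hostname step therefore resembles the following; the ordering of the lines is not guaranteed:

n1
n2
n3
n4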

The last line in myscript is the mpirun -srun ./hellompi command. The srun command inside the mpirun command in myscript inherits the SLURM_JOBID and SLURM_NPROCS environment variables from the environment and executes hellompi on each of the allocated compute nodes (n1, n2, n3, and n4).

The output of the hellompi tasks is aggregated back to the srun launch command, where it is collected by LSF-HPC.

When the job finishes, LSF-HPC cancels the SLURM allocation, which frees the compute nodes for use by another job.
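One way to observe this cleanup, assuming you have access to the SLURM commands, is to query SLURM after the job completes; the partition and node names are those of this example:

# After the job finishes, the allocation no longer appears in the queue
squeue -j 53
# and the compute nodes return to the idle state in the lsf partition
sinfo -p lsf -n n[1-4]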

Notes About Using LSF-HPC in the HP XC Environment

This section provides additional information to keep in mind when using LSF-HPC in the HP XC environment.

Job Startup and Job Control

When LSF-HPC starts a SLURM job, it sets SLURM_JOBID to associate the job with the SLURM allocation. While a job is running, all operating-system-enforced resource limits supported by LSF-HPC are honored, including the core limit, CPU time limit, data limit, file size limit, memory limit, and stack limit. If the user kills a job, LSF-HPC propagates signals to the entire job, including the job file running on the local node and all tasks running on remote nodes.
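For example, terminating the job through LSF delivers the signal to the job script on the local node and to all remote tasks; the job ID 99 shown here is hypothetical:

# Kill LSF-HPC job 99; LSF-HPC propagates the termination to the job
# script on the local node and to every task in the SLURM allocation.
bkill 99
# A specific signal can also be sent with the -s option.
bkill -s TERM 99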
