
Example 9-7 Using the bhist Command (Long Output)

$ bhist -l 24

Job <24>, User <lsfadmin>, Project <default>, Interactive pseudo-terminal shell mode, Extsched <SLURM[nodes=4]>, Command </bin/bash>

date and time stamp: Submitted from host <n2>, to Queue <normal>, CWD <$HOME>, 4 Processors Requested, Requested Resources <type=any>;

date and time stamp: Dispatched to 4 Hosts/Processors <4*lsfhost.localdomain>;

date and time stamp: slurm_id=22;ncpus=8;slurm_alloc=n[5-8];

date and time stamp: Starting (Pid 4785);

Summary of time in seconds spent in various states by date and time stamp

PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
11       0        124      0        0        0        135

Translating SLURM and LSF-HPC JOBIDs

LSF-HPC and SLURM are independent resource management components of the HP XC system, and each maintains its own job identifiers (JOBIDs). It can be useful to determine which SLURM JOBID (the value of the SLURM_JOBID environment variable) corresponds to a given LSF JOBID, and vice versa.

When a job is submitted to LSF-HPC, it is given an LSF JOBID, as in this example:

$ bsub -o %J.out -n 8 sleep 300

Job <99> is submitted to default queue <normal>

The following is the sequence of events when a SLURM JOBID is assigned:

No SLURM_JOBID exists while the job is PENDing in LSF-HPC.

After LSF-HPC determines that the resources are available in SLURM for this job, LSF-HPC requests an allocation in SLURM.

After the SLURM allocation is established, there is a corresponding SLURM JOBID for the LSF JOBID. Use the bjobs command to view the SLURM JOBID:

$ bjobs -l 99 | grep slurm

date and time stamp: slurm_id=123;ncpus=8;slurm_alloc=n[13-16];

The SLURM JOBID is 123 for the LSF JOBID 99.
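The slurm_id= field in this output has a fixed format (slurm_id=<id>;ncpus=<n>;slurm_alloc=<nodelist>;), so the translation can also be scripted. The following script is only an illustrative sketch, not part of the HP XC software; the script name is hypothetical, it assumes the LSF JOBID 99 from the preceding example, and it relies on the bjobs output format shown above:

$ cat lsf2slurm.sh
#!/bin/sh
# Illustrative sketch (not part of HP XC): print the SLURM JOBID that
# corresponds to a given LSF JOBID by extracting the slurm_id= field
# from the bjobs -l output.
LSF_JOBID=${1:-99}      # defaults to the LSF JOBID used in the example above
SLURM_ID=`bjobs -l $LSF_JOBID | sed -n 's/.*slurm_id=\([0-9]*\);.*/\1/p'`
echo "LSF JOBID $LSF_JOBID corresponds to SLURM JOBID $SLURM_ID"

Given the bjobs output shown above, running ./lsf2slurm.sh 99 would report SLURM JOBID 123.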

You can also find the allocation information in the output of the bhist command:

$ bhist -l 99 | grep slurm

date and time stamp: slurm_id=123;ncpus=8;slurm_alloc=n[13-16];

When LSF-HPC creates an allocation in SLURM, it constructs a name for the allocation by combining the LSF cluster name with the LSF-HPC JOBID. You can see this name with the scontrol and sacct commands while the job is running:

$ scontrol show job 123 | grep Name

Name=hptclsf@99

$ sacct -j 123

 

 

 

 

Jobstep    Jobname            Partition  Ncpus  Status    Error
---------- ------------------ ---------- ------ --------- -----
123        hptclsf@99         lsf        8      RUNNING   0
123.0      hptclsf@99         lsf        0      RUNNING   0

In these examples, the job name is hptclsf@99 and the LSF JOBID is 99.
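Because the allocation name is the LSF cluster name followed by the LSF JOBID, the translation can also be done in the opposite direction, from a SLURM JOBID back to the LSF JOBID. The following script is a minimal sketch, not part of the HP XC software; the script name is hypothetical, it assumes the SLURM JOBID 123 from the preceding example, and it relies on the Name=hptclsf@99 format shown in the scontrol output above:

$ cat slurm2lsf.sh
#!/bin/sh
# Illustrative sketch (not part of HP XC): recover the LSF JOBID from a
# running SLURM job by stripping the cluster name and '@' from the
# allocation name reported by scontrol.
SLURM_ID=${1:-123}      # defaults to the SLURM JOBID used in the example above
NAME=`scontrol show job $SLURM_ID | sed -n 's/.*Name=\([^ ]*\).*/\1/p' | head -1`
echo "SLURM JOBID $SLURM_ID belongs to LSF JOBID `echo $NAME | sed 's/.*@//'`"

Given the running job shown above, this would report LSF JOBID 99 for SLURM JOBID 123.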

Note that the scontrol show job command retains job information only briefly after a job finishes, then purges it; the bjobs command behaves similarly. The sacct command continues to provide job information after the job has finished, similar to the bhist command.
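For example, after LSF job 99 (SLURM job 123) has finished, it is reported only briefly by bjobs and scontrol show job, but the historical commands continue to report it; reusing the JOBIDs from the examples above:

$ bhist -l 99    # LSF-side history for the finished job
$ sacct -j 123   # SLURM-side accounting record for the same job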
