
This bsub command requests four cores (from the -n4 option) across four nodes (from the -ext "SLURM[nodes=4]" option); the job is launched on those cores. The script, myscript, shown here, runs the job:

#!/bin/sh
hostname
srun hostname
mpirun -srun ./hellompi
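
For reference, a submission of the form shown earlier in this chapter produces the scenario described here; the output file name is illustrative:

$ bsub -n4 -ext "SLURM[nodes=4]" -o output.out ./myscript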

3. LSF-HPC schedules the job and monitors the state of the resources (compute nodes) in the SLURM lsf partition. When the LSF-HPC scheduler determines that the required resources are available, LSF-HPC allocates those resources in SLURM and obtains a SLURM job identifier (jobID) that corresponds to the allocation.

In this example, four cores spread over four nodes (n1, n2, n3, and n4) are allocated for myscript, and a SLURM job ID of 53 is assigned to the allocation.

4. LSF-HPC prepares the user environment for the job on the LSF execution host node and dispatches the job with the job_starter.sh script. This user environment includes standard LSF environment variables and two SLURM-specific environment variables: SLURM_JOBID and SLURM_NPROCS.

SLURM_JOBID is the SLURM job ID of the job. Note that this is not the same as the LSF-HPC JOBID. "Translating SLURM and LSF-HPC JOBIDs" describes the relationship between the SLURM_JOBID and the LSF-HPC JOBID.

SLURM_NPROCS is the number of processes allocated.

These environment variables are intended for use by the user's job, whether explicitly (user scripts can read these variables as necessary) or implicitly (the srun commands in the user's job use these variables to determine their allocation of resources).

The value for SLURM_NPROCS is 4 and the SLURM_JOBID is 53 in this example.
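
As a minimal sketch of explicit use, a job script could report its allocation before launching any tasks; the echo lines below are illustrative additions and are not part of myscript:

#!/bin/sh
# Values set by LSF-HPC when the job is dispatched.
echo "SLURM job ID: $SLURM_JOBID"
echo "Processes allocated: $SLURM_NPROCS"
# srun reads the same variables implicitly to run inside the allocation.
srun hostname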

5. The user job myscript begins execution on compute node n1.

The first command in myscript is hostname. It executes locally and returns the name of the node, n1.

6. The next command in myscript is srun hostname. This srun command inherits SLURM_JOBID and SLURM_NPROCS from the environment and executes the hostname command on each compute node in the allocation.

7. The output of the hostname tasks (n1, n2, n3, and n4) is aggregated back to the srun launch command (shown as dashed lines in Figure 9-1), and is ultimately returned to the srun command in the job starter script, where it is collected by LSF-HPC.
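
Run by hand inside the same allocation, this step is equivalent to the following; the job ID matches the example, and the output order can vary:

$ srun --jobid=53 hostname
n1
n2
n3
n4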

The last command in myscript is mpirun -srun ./hellompi. The srun command inside the mpirun command inherits the SLURM_JOBID and SLURM_NPROCS environment variables and executes hellompi on each compute node in the allocation.

The output of the hellompi tasks is aggregated back to the srun launch command where it is collected by LSF-HPC.

The command executes on the allocated compute nodes n1, n2, n3, and n4.

When the job finishes, LSF-HPC cancels the SLURM allocation, which frees the compute nodes for use by another job.
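
If you want to confirm that the nodes have been released, the SLURM status commands used elsewhere in this manual can be run after the job completes; for example:

$ squeue            # SLURM job 53 no longer appears in the queue
$ sinfo -p lsf      # the compute nodes in the lsf partition report as idle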

Notes About Using LSF-HPC in the HP XC Environment

This section provides additional information to note about using LSF-HPC in the HP XC environment.

Job Startup and Job Control

When LSF-HPC starts a SLURM job, it sets SLURM_JOBID to associate the job with the SLURM allocation. While a job is running, all operating-system-enforced resource limits supported by LSF-HPC are enforced, including the core limit, CPU time limit, data limit, file size limit, memory limit, and stack limit. If the user kills a job, LSF-HPC propagates the signal to the entire job, including the job file running on the local node and all tasks running on remote nodes.
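
For example, limits can be attached to a job at submission time with the standard LSF bsub limit options, and a running job can be terminated with bkill, which propagates the signal to the local job file and all remote tasks; the limit values and job ID below are illustrative:

$ bsub -n4 -ext "SLURM[nodes=4]" -c 10 -M 512000 ./myscript   # CPU-time and per-process memory limits
$ bkill 124                                                   # terminate LSF-HPC job 124 and all of its tasks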
