HP XC System 3.x Software manual Terminating Jobs with the scancel Command, Job Accounting

The squeue command can report on jobs in the job queue according to their state. Possible states are pending, running, completing, completed, failed, timeout, and node_fail. Example 8-3 uses the squeue command to report on failed jobs.

Example 8-3 Reporting on Failed Jobs in the Queue

$ squeue --state=FAILED
JOBID  PARTITION  NAME      USER  ST  TIME  NODES  NODELIST
59     amt1       hostname  root  F   0:00  0
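To see how many jobs are in each state at once, rather than querying one state at a time, the default squeue listing can be summarized with awk. The following is a minimal sketch, not part of the HP XC software. The helper name count_by_state is hypothetical, and the sketch assumes the default column layout of Example 8-3, with the state code in the fifth column and no spaces inside the job name. Job 59 is the row from Example 8-3; jobs 60 through 62 are invented for illustration.

```shell
#!/bin/sh
# count_by_state (hypothetical helper): read squeue output on stdin and
# print "<state code> <job count>" for each state, sorted by state code.
# Assumes the default layout of Example 8-3: one header line, then one job
# per line with the state code (ST) in column 5.
count_by_state() {
    awk 'NR > 1 && NF >= 5 { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Demo on canned text so the sketch runs without a SLURM installation.
# Job 59 comes from Example 8-3; jobs 60-62 are made up for illustration.
count_by_state <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST
59 amt1 hostname root F 0:00 0
60 amt1 hostname root PD 0:00 1
61 amt1 hostname root R 0:05 1 n3
62 amt1 hostname root R 0:07 2 n[4-5]
EOF

# On a live system: squeue | count_by_state
```

For the canned input, the sketch reports one failed job (F), one pending job (PD), and two running jobs (R).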

 

Terminating Jobs with the scancel Command

The scancel command cancels a pending or running job or job step. It can also be used to send a specified signal to all processes on all nodes associated with a job. Only job owners or administrators can cancel jobs.

Example 8-4 kills job 415 and all its jobsteps.

Example 8-4 Killing a Job by Its JobID

$ scancel 415

Example 8-5 cancels all pending jobs.

Example 8-5 Cancelling All Pending Jobs

$ scancel --state=PENDING

Example 8-6 sends the TERM signal to terminate jobsteps 421.2 and 421.3.

Example 8-6 Sending a Signal to a Job

$ scancel --signal=TERM 421.2 421.3
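Example 8-5 matches every pending job that the caller is permitted to cancel, which for an administrator can be every pending job on the system. The following is a minimal sketch of a more cautious variant. It assumes that scancel's --user and --state options can be combined to narrow the match; check the scancel(1) manpage on your system. The helper name cancel_pending and the RUN variable are hypothetical. The helper only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# cancel_pending (hypothetical helper): cancel one user's pending jobs.
# Usage: cancel_pending [username]   (defaults to the current user)
# Prints the scancel command by default; set RUN=1 to execute it.
cancel_pending() {
    user=${1:-$(id -un)}
    # Build the command once so the dry run shows exactly what would run.
    set -- scancel --user="$user" --state=PENDING
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Dry run for the current user; nothing is cancelled.
cancel_pending
```

Reviewing the printed command before setting RUN=1 gives a chance to confirm the user name and state filter.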

Getting System Information with the sinfo Command

The sinfo command reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options. sinfo displays a summary of available partition and node (not job) information, such as partition names, nodes per partition, and cores per node.

Example 8-7 Using the sinfo Command (No Options)

$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
lsf        up     infinite   3      down*  n[0,5,8]
lsf        up     infinite   14     idle   n[1-4,6-7,9-16]

The node STATE codes in these examples may be followed by an asterisk (*); this indicates that the reported node is not responding. See the sinfo(1) manpage for a complete listing and description of STATE codes.
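The asterisk convention makes it straightforward to pick out non-responding nodes from a listing. The following is a minimal sketch; the helper name not_responding is hypothetical, and the sketch assumes the six-column default layout of Example 8-7, with STATE in the fifth column and NODELIST in the sixth.

```shell
#!/bin/sh
# not_responding (hypothetical helper): read default sinfo output on stdin
# and print the NODELIST of every row whose STATE ends in an asterisk.
# Assumes the six-column layout of Example 8-7 (STATE in column 5).
not_responding() {
    awk 'NR > 1 && $5 ~ /\*$/ { print $6 }'
}

# Demo on the output of Example 8-7, so no SLURM installation is needed.
not_responding <<'EOF'
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf up infinite 3 down* n[0,5,8]
lsf up infinite 14 idle n[1-4,6-7,9-16]
EOF
# prints: n[0,5,8]

# On a live system: sinfo | not_responding
```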

Example 8-8 Reporting Reasons for Downed, Drained, and Draining Nodes

$ sinfo -R
REASON          NODELIST
Memory errors   n[0,5]
Not Responding  n8

Job Accounting

HP XC System Software provides an extension to SLURM for job accounting. The sacct command displays job accounting data in a variety of forms for your analysis. Job accounting data is stored in a log file; the sacct command filters that log file to report on your jobs, jobsteps, status, and errors. See your system administrator if job accounting is not configured on your system.

By default, only the superuser is allowed to access the job accounting data. To grant all system users read access to this data, the superuser must change the permissions of the jobacct.log file, as follows:

# chmod a+r /hptccluster/slurm/job/jobacct.log
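A user can check whether read access to the accounting log has been granted before running sacct. The following is a minimal sketch; the helper name check_acct_log is hypothetical, and the log file location is passed as an argument because it depends on how your system is configured.

```shell
#!/bin/sh
# check_acct_log (hypothetical helper): report whether the job accounting
# log at the given path is readable by the current user.
# Usage: check_acct_log /path/to/jobacct.log
check_acct_log() {
    if [ -r "$1" ]; then
        echo "readable: $1"
    else
        echo "not readable: $1"
        return 1
    fi
}

# Demo on a temporary file so the sketch runs on any system.
tmp=$(mktemp)
check_acct_log "$tmp"
rm -f "$tmp"
```

If the log is reported as not readable, see your system administrator.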
