Escali 4.4 manual Appendix B, When things do not work troubleshooting

Appendix B

Troubleshooting

This appendix offers initial suggestions for what to do when something goes wrong with applications running together with SMC. When problems occur, first check the list of common errors and their solutions; an updated list of SMC-related Frequently Asked Questions (FAQ) is posted in the Support section of the Scali website (http://www.scali.com). If you are unable to find a solution to the problem(s) there, please read this appendix before contacting support@scali.com.

Problems and fixes reported to Scali will eventually be included in the appropriate sections of this manual. Please send relevant remarks by e-mail to support@scali.com.

Many problems originate from using the wrong application code, from stopped daemons that Scali MPI Connect relies on, or from an incomplete specification of network drivers. Some typical problems and their solutions are described below. Troubleshooting the DAT functionality is described in C-11.

B-1 When things do not work - troubleshooting

This section is intended to serve as a starting point to help with software and hardware debugging. The main focus is on locating and repairing faulty hardware and software setup, but it can also be helpful when getting started after installing a new system. For a description of the Scali Manage GUI, see the Scali System Guide.

B-1.1 Why does my program not start to run?

mpimon: command not found.

• Include /opt/scali/bin in the PATH environment variable.

mpimon can’t find mpisubmon.

• Set MPI_HOME=/opt/scali or use the -execpath option.

The application has problems loading libraries (libsca*).

• Update the LD_LIBRARY_PATH to include /opt/scali/lib.
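The three environment fixes above can be sketched as a few lines for a shell start-up file. This is only a sketch: it assumes SMC is installed under /opt/scali, as in the paths quoted above, and `SCALI_PREFIX` and `prepend_once` are hypothetical names introduced for illustration.

```shell
# Assumed install prefix, taken from the paths named above.
SCALI_PREFIX=/opt/scali

# Print LIST with DIR prepended, unless DIR is already one of its entries.
prepend_once() {
    dir=$1
    list=$2
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;
        *) printf '%s\n' "$dir${list:+:$list}" ;;
    esac
}

PATH=$(prepend_once "$SCALI_PREFIX/bin" "$PATH")
LD_LIBRARY_PATH=$(prepend_once "$SCALI_PREFIX/lib" "${LD_LIBRARY_PATH:-}")
MPI_HOME=$SCALI_PREFIX
export PATH LD_LIBRARY_PATH MPI_HOME

# Report whether the shell can now locate mpimon.
if command -v mpimon >/dev/null 2>&1; then
    echo "mpimon found at $(command -v mpimon)"
else
    echo "mpimon is still not on PATH; is SMC installed under $SCALI_PREFIX?"
fi
```

Prepending only when the entry is missing keeps the variables from growing each time the start-up file is sourced.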

Incompatible MPI versions.

mpid, mpimon, mpisubmon and the libraries all have version variables that are checked at start-up. To ensure that these are correct, try the following:

1. Set the environment variable MPI_HOME correctly.

2. Restart mpid if a new version of ScaMPI has been installed without restarting mpid.

3. Reinstall SMC if a new version of SMC was not cleanly installed on all nodes.
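Step 1 can be sanity-checked before moving on to the restart and reinstall steps. The sketch below assumes that mpimon lives in the bin directory under MPI_HOME, which the PATH advice earlier in this section implies; `looks_like_smc_prefix` is a hypothetical helper name, not part of SMC.

```shell
# Return success if the given prefix looks like an SMC installation.
# Assumption: mpimon is installed as <prefix>/bin/mpimon.
looks_like_smc_prefix() {
    [ -n "$1" ] && [ -x "$1/bin/mpimon" ]
}

if looks_like_smc_prefix "${MPI_HOME:-}"; then
    echo "MPI_HOME=$MPI_HOME contains bin/mpimon"
else
    echo "MPI_HOME='${MPI_HOME:-}' does not contain bin/mpimon; correct it first"
fi
```

The check only confirms that MPI_HOME points at an installation; it does not compare version variables, which the SMC components do themselves at start-up.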

Set working directory failed

• SMC assumes that there is a homogeneous file structure. If you start mpimon from a directory that is not available on all nodes, you must set SCAMPI_WORKING_DIRECTORY to point to a directory that is available on all nodes.
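One way to apply this is to fall back to a shared directory whenever the current directory lies outside it. This is a sketch under stated assumptions: `SHARED_ROOT` and `pick_workdir` are hypothetical names, and /home merely stands in for whichever tree is actually mounted on every node of your cluster.

```shell
# Hypothetical: a directory tree that is mounted identically on every node.
SHARED_ROOT=${SHARED_ROOT:-/home}

# Print DIR if it lies inside ROOT, otherwise print ROOT itself.
pick_workdir() {
    case "$1/" in
        "$2"/*) printf '%s\n' "$1" ;;
        *) printf '%s\n' "$2" ;;
    esac
}

SCAMPI_WORKING_DIRECTORY=$(pick_workdir "$PWD" "$SHARED_ROOT")
export SCAMPI_WORKING_DIRECTORY
echo "SCAMPI_WORKING_DIRECTORY=$SCAMPI_WORKING_DIRECTORY"
```

The script cannot verify that the chosen directory really exists on every node; it only keeps a node-local directory such as /tmp from being used by accident.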

ScaMPI uses the wrong interface for TCP/IP on a frontend with more than one interface

• Set SCAMPI_NODENAME to the hostname of the correct interface.

MPI_Wtime gives strange values

• SMC uses a hardware-supported high precision timer for MPI_Wtime. This timer can be disabled by using SCAMPI_DISABLE_HPT=1.
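SCAMPI_NODENAME and SCAMPI_DISABLE_HPT can be set for a single run without altering the rest of the shell session. In the sketch below, `frontend-eth1` is a hypothetical hostname and the two wrapper functions are illustrative names, not part of SMC.

```shell
# Hypothetical hostname bound to the interface that the nodes can reach.
FRONTEND_IF_NAME=frontend-eth1

# Run one command with SCAMPI_NODENAME set, leaving the calling shell unchanged.
with_nodename() {
    env SCAMPI_NODENAME="$FRONTEND_IF_NAME" "$@"
}

# Run one command with the high precision timer disabled.
without_hpt() {
    env SCAMPI_DISABLE_HPT=1 "$@"
}
```

A job would then be started as `with_nodename mpimon ...` or `without_hpt mpimon ...`, followed by the usual mpimon arguments. Using `export` instead makes the setting persistent for the whole session.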

Scali MPI Connect Release 4.4 Users Guide
