Escali 4.4 Why can I not start mpid?, Why does my program terminate abnormally?, General problems

B-1.2 Why can I not start mpid?

mpid opens a socket and assigns a predefined mpid port number (see /etc/services for more information) to the end point. If mpid is terminated abnormally, the mpid port number cannot be reused until a system-defined timer has expired. To resolve:

• Use netstat -a | grep mpid to observe when the socket is released. Once the socket is released, restart mpid.
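As an alternative to watching the netstat output, the check can be sketched as a small probe that tries to bind the port itself. This is only an illustration, not part of SMC: the function name port_is_free is hypothetical, and it assumes that mpid fails with the usual "address already in use" error while the system timer is still running. Binding a port below 1024 normally requires root privileges.

```python
import errno
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can currently be bound to `port`.

    SO_REUSEADDR is deliberately not set, so a port that the system
    still holds after an abnormal termination is reported as busy.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            raise  # e.g. permission denied: not a "port busy" condition
    return True


# Possible use, assuming /etc/services has a TCP entry named "mpid":
#   port = socket.getservbyname("mpid", "tcp")
#   while not port_is_free(port):
#       time.sleep(5)
```

When the probe reports the port as free, mpid can be restarted.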

B-1.2.1 Bad clean up

A previous SMC run has not terminated properly.

• Check for MPI processes on the nodes using /opt/scali/bin/scaps.

• Use /opt/scali/sbin/scidle

• Use /opt/scali/bin/scash to check for leftover shared memory segments on all nodes (ipcs for Solaris and Linux).

Note: core dumping takes time.

B-1.2.2 Space overflow

The application has requested too many SCI or shared memory resources.

• The mpimon pool-size specifications are too large and must be reduced.

B-1.3 Why does my program terminate abnormally?

B-1.3.1 Core dump

The application core dumps.

• Use a debugger to locate the point of violation. The application may need to be recompiled to include symbolic debug information (-g for most compilers).

• Define SCAMPI_INSTALL_SIGSEGV_HANDLER=1 and attach to the failing process with the debugger.

B-1.4 General problems

Are you reasonably certain that your algorithms are MPI safe?

• Check that every send has a matching receive.
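What "matching" means can be illustrated with a small model, shown here as a sketch only. It assumes the application's sends and receives have been logged as (source rank, destination rank, tag) tuples; the function name and the sample log are hypothetical, and wildcards such as MPI_ANY_SOURCE or MPI_ANY_TAG and separate communicators are ignored.

```python
from collections import Counter


def unmatched_messages(sends, recvs):
    """Compare logged sends and receives, each a (source, dest, tag) tuple.

    Returns two Counters: sends that no receive consumed, and receives
    for which no send was ever posted.
    """
    sent = Counter(sends)
    received = Counter(recvs)
    # Counter subtraction keeps only the positive differences.
    return sent - received, received - sent


# Hypothetical log: rank 0 sends twice to rank 1, but rank 1 receives once.
sends = [(0, 1, 7), (0, 1, 7), (1, 0, 9)]
recvs = [(0, 1, 7), (1, 0, 9)]

extra_sends, extra_recvs = unmatched_messages(sends, recvs)
print(dict(extra_sends))  # {(0, 1, 7): 1}
print(dict(extra_recvs))  # {}
```

A non-empty first result points at a send that may block or be silently buffered; a non-empty second result points at a receive that will wait forever.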

The program just hangs

• If the application has a large degree of asynchronicity, try increasing the channel-size. Further information is available in “Communication buffer adaption” on page 47: if the communication behaviour of the application is known, explicitly providing buffer-size settings to mpimon to match the requirements of the application will in most cases improve performance. Example: for an application sending only 900-byte messages, set channel_inline_threshold to 964 (64 added for alignment) and increase the channel-size significantly (32-128 k). Set eager_size to 1k and eager_count high (16 or more). If all messages can be buffered, the transporter-{size, count} can be set to low values to reduce shared memory consumption.
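The arithmetic of the 900-byte example can be written out as a short sketch. The function name and the dictionary layout are hypothetical; the fixed values are simply the numbers quoted in the example (lower end of the 32-128 k range, eager_size 1k, eager_count 16), with "k" assumed to mean 1024 bytes. It does not generate mpimon options.

```python
ALIGNMENT_OVERHEAD = 64  # bytes the guide's example adds for alignment
KILO = 1024              # assumption: "k" in the guide means 1024 bytes


def suggest_settings(message_bytes):
    """Reproduce the guide's example values for a fixed message size."""
    return {
        # message size plus the 64 bytes added for alignment
        "channel_inline_threshold": message_bytes + ALIGNMENT_OVERHEAD,
        # lower end of the suggested 32-128 k range
        "channel_size": 32 * KILO,
        "eager_size": 1 * KILO,
        # the guide suggests 16 or more
        "eager_count": 16,
    }


print(suggest_settings(900)["channel_inline_threshold"])  # 964
```

The values are starting points taken from the example only; suitable settings depend on the application's actual message sizes.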

The program terminates without an error message

• Investigate the core file, or rerun the program in a debugger.

Scali MPI Connect Release 4.4 Users Guide
