IBM pSeries Tunables and settings for switch software, MPI tunables for Parallel Environment


2.0 Tunables and settings for switch software

To optimize the HPS, you can set shell variables for Parallel Environment MPI-based workloads and for IP-based workloads. This section reviews the shell variables that are most often used for performance tuning. For a complete list of tunables and their usage, see the documentation listed in section 7 of this paper.

2.1 MPI tunables for Parallel Environment

The following sections list the most common MPI tunables for applications that use the HPS. Along with each tunable is a description of the variable, what it is used for, and how to set it appropriately.

2.1.1 MP_EAGER_LIMIT

The MP_EAGER_LIMIT variable tells the MPI transport protocol to use the "eager" mode for messages less than or equal to the specified size. Under the "eager" mode, the sender sends the message without knowing whether the matching receive has actually been posted by the destination task. For messages larger than MP_EAGER_LIMIT, a rendezvous must be used to confirm that the matching receive has been posted.

The sending task does not have to wait for an okay from the receiver before sending the data, so the effective start-up cost for a small message is lower in "eager" mode. As a result, messages no larger than MP_EAGER_LIMIT are typically faster, especially if the corresponding receive has already been posted. If the receive has not been posted, the transport incurs an extra copy cost on the target, because the data is staged through the early-arrival buffers. Even so, the overall time to send a small message might still be less in "eager" mode. Well-designed MPI applications often try to post each MPI_RECV before the message is expected, but because the tasks of a parallel job are not in lock step, most applications have occasional early arrivals.
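The size test described above can be sketched in a few lines of shell. This is only an illustration of the rule, not Parallel Environment code: the function name choose_protocol, the variable EAGER_LIMIT_BYTES, and the value 4096 are all hypothetical and chosen for the example.

```shell
# Illustrative only: mimics the "less than or equal to" size test.
# EAGER_LIMIT_BYTES is a hypothetical stand-in for the MP_EAGER_LIMIT value;
# 4096 is an arbitrary example, not a documented default.
EAGER_LIMIT_BYTES=4096

# Print the protocol a message of the given size (in bytes) would use.
choose_protocol() {
    if [ "$1" -le "$EAGER_LIMIT_BYTES" ]; then
        echo eager        # sent at once; may be staged in an early-arrival buffer
    else
        echo rendezvous   # sender first confirms the matching receive is posted
    fi
}

choose_protocol 1024   # below the limit
choose_protocol 4096   # exactly at the limit, still eager
choose_protocol 8192   # above the limit
```

A message whose size equals the limit still takes the "eager" path; only strictly larger messages pay for the rendezvous.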

The maximum message size for the "eager" protocol is currently 65536 bytes, although the default value is lower. An application for which a significant fraction of the MPI messages are smaller than 65536 bytes might see a performance benefit from raising MP_EAGER_LIMIT. If MP_EAGER_LIMIT is increased above the default value, it might also be necessary to increase MP_BUFFER_MEM, which determines the amount of memory available for early-arrival buffers. Higher "eager" limits or larger task counts either demand more buffer memory or reduce the number of unlimited "eager" messages that can be outstanding, and therefore can also affect performance.
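As a minimal sketch, the two variables might be set in the job's shell before the application is launched. The values below are examples only: 65536 is the maximum stated above, while the MP_BUFFER_MEM value is an assumption made for illustration. The division is a crude upper bound on how many full-size early arrivals the buffer could hold; it does not reflect how Parallel Environment actually divides buffer memory among tasks.

```shell
# Example values only; suitable settings depend on the application.
export MP_EAGER_LIMIT=65536      # bytes; the maximum "eager" message size
export MP_BUFFER_MEM=67108864    # bytes (64 MB); assumed value for illustration

# Crude, assumed estimate: full-size eager messages that fit in the buffer.
max_early_arrivals=$(( MP_BUFFER_MEM / MP_EAGER_LIMIT ))
echo "$max_early_arrivals"       # prints 1024
```

Doubling the limit without raising the buffer size halves this bound, which is one way to see why the two settings are usually adjusted together.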

2.1.2 MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL

The MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL variables control how often the protocol code checks whether data that was previously sent is assumed to be lost and needs to be retransmitted. When the values are larger, this checking is done less often. There are two different environment variables because the check can be done by an MPI/LAPI service

pshpstuningguidewp040105.doc
