IBM pSeries manual: MP_REXMIT_BUF_SIZE and MP_REXMIT_BUF_CNT, MEMORY_AFFINITY


thread, and from within the MPI/LAPI polling code that is invoked when the application makes blocking MPI calls.

MP_POLLING_INTERVAL specifies the number of microseconds an MPI/LAPI service thread waits (sleeps) before it checks whether any data previously sent by the MPI task needs to be retransmitted. MP_RETRANSMIT_INTERVAL specifies the number of passes through the internal MPI/LAPI polling routine between checks for data that needs to be resent. When the switch fabric, adapters, and nodes are operating properly, data that is sent arrives intact, and the receiver sends the source task an acknowledgment for the data. If the sending task does not receive such an acknowledgment within a reasonable amount of time (determined by MP_RETRANSMIT_INTERVAL), it assumes the data has been lost and tries to resend it.

Sometimes, when many MPI tasks share the switch adapters, the switch fabric, or both, the time it takes to send a message and receive an acknowledgment is longer than the library expects. In this case, data might be retransmitted unnecessarily. Increasing the values of MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL decreases the likelihood of unnecessary retransmission but increases the time a job is delayed when a packet is actually dropped.
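The trade-off can be made concrete with a toy model. The sketch below is not LAPI code: it assumes that a single timeout in microseconds stands in for the combined effect of the two variables, and the function name and the sample numbers are hypothetical.

```python
def send_outcome(ack_latency_us, timeout_us, dropped):
    """Toy model of one send under a retransmit timeout.

    Returns (spurious_retransmit, delay_before_resend_us).
    Assumption: one timeout value stands in for the combined effect of
    MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL.
    """
    if dropped:
        # A genuine loss is only noticed once the timeout expires.
        return (False, timeout_us)
    # The data arrived, but a slow acknowledgment looks like a loss.
    return (ack_latency_us > timeout_us, 0)


# Congested fabric: acknowledgments take 15 ms (illustrative number).
print(send_outcome(15_000, 10_000, dropped=False))  # short timeout
print(send_outcome(15_000, 50_000, dropped=False))  # longer timeout
print(send_outcome(15_000, 50_000, dropped=True))   # genuine drop
```

In this model, the longer timeout removes the unnecessary retransmission in the congested case, but a genuinely dropped packet then waits the full longer timeout before it is resent.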

2.1.3 MP_REXMIT_BUF_SIZE and MP_REXMIT_BUF_CNT

You can improve application performance by allowing a task that is sending a message shorter than the "eager" limit to return the send buffer to the application before the message has reached its destination, rather than forcing the sending task to wait until the data has actually reached the receiving task and the acknowledgment has been returned. To allow immediate return of the send buffer to the application, LAPI attempts to make a copy of the data in case it must be retransmitted later (unlikely but not impossible). LAPI copies the data into a retransmit buffer (REXMIT_BUF) if one is available. The MP_REXMIT_BUF_SIZE and MP_REXMIT_BUF_CNT environment variables control the size and number of the retransmit buffers allocated by each task.
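The buffering behavior described above can be sketched as a small pool model. This is an illustration, not LAPI's implementation: the class and method names are hypothetical, the sizes are arbitrary, and the rule that a message must fit in a single buffer is an assumption.

```python
class RexmitPool:
    """Toy model of a per-task retransmit buffer pool (not LAPI code)."""

    def __init__(self, buf_size, buf_cnt):
        self.buf_size = buf_size  # plays the role of MP_REXMIT_BUF_SIZE
        self.total = buf_cnt      # plays the role of MP_REXMIT_BUF_CNT
        self.free = buf_cnt

    def try_copy(self, msg_len):
        """Return True if the send buffer can be handed back immediately."""
        # Assumption: a message must fit in one buffer to be copied.
        if self.free > 0 and msg_len <= self.buf_size:
            self.free -= 1
            return True
        return False  # the sender waits for the acknowledgment instead

    def acknowledged(self):
        """Release one buffer once the receiver acknowledges the data."""
        self.free = min(self.total, self.free + 1)


pool = RexmitPool(buf_size=4096, buf_cnt=2)   # illustrative values
print(pool.try_copy(1024))  # a buffer is free, so the copy succeeds
print(pool.try_copy(1024))  # second buffer
print(pool.try_copy(1024))  # pool exhausted
```

In this model, a larger buffer count lets more sends return early at the cost of more memory per task, and a buffer size smaller than the message prevents the copy regardless of how many buffers are free.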

2.1.4 MEMORY_AFFINITY

The POWER4™ and POWER4+™ models of the pSeries 690 have more than one multi-chip module (MCM). An MCM contains eight CPUs and frequently has two local memory cards. On these systems, application performance can improve when each CPU and the memory it accesses are on the same MCM.

Setting the AIX MEMORY_AFFINITY environment variable to MCM tells the operating system to attempt to allocate the memory from within the MCM containing the processor that made the request. If memory is available on the MCM containing the CPU, the request is usually granted. If memory is not available on the local MCM, but is available on a remote MCM, the memory is taken from the remote MCM. (Lack of local memory does not cause the job to fail.)
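The local-first policy with a remote fallback can be sketched as follows. This is a simplified model under stated assumptions, not how AIX manages memory: the function name and the per-MCM free-page table are hypothetical, and the whole request is placed on one MCM.

```python
def place_allocation(local_mcm, free_pages, request_pages):
    """Pick the MCM that satisfies a request under a local-first policy.

    free_pages maps MCM index -> free pages (hypothetical bookkeeping).
    Returns the chosen MCM index, or None if no MCM has enough room.
    """
    # Preferred case: memory is available on the requesting CPU's MCM.
    if free_pages.get(local_mcm, 0) >= request_pages:
        return local_mcm
    # Fallback: take the memory from a remote MCM instead of failing.
    for mcm in sorted(free_pages):
        if mcm != local_mcm and free_pages[mcm] >= request_pages:
            return mcm
    return None


free = {0: 100, 1: 5000}                # illustrative page counts
print(place_allocation(1, free, 1000))  # local MCM 1 has room
print(place_allocation(0, free, 1000))  # MCM 0 is short, so MCM 1 is used
```

The second call shows the case the text describes: the request is still satisfied, but the task on MCM 0 now accesses remote memory, which is where the performance cost comes from.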

pshpstuningguidewp040105.doc
