thread, and from within the MPI/LAPI polling code that is invoked when the application makes blocking MPI calls.

MP_POLLING_INTERVAL specifies the number of microseconds an MPI/LAPI service thread should wait (sleep) before it checks whether any data previously sent by the MPI task needs to be retransmitted. MP_RETRANSMIT_INTERVAL specifies the number of passes through the internal MPI/LAPI polling routine between checks of whether any data needs to be resent. When the switch fabric, adapters, and nodes are operating properly, data that is sent arrives intact, and the receiver sends the source task an acknowledgment for the data. If the sending task does not receive such an acknowledgment within a reasonable amount of time (determined by the variable MP_RETRANSMIT_INTERVAL), it assumes the data has been lost and tries to resend it.

Sometimes, when many MPI tasks share the switch adapters, switch fabric, or both, the time it takes to send a message and receive an acknowledgment is longer than the library expects. In this case, data might be retransmitted unnecessarily. Increasing the values of MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL decreases the likelihood of unnecessary retransmission but increases the time a job is delayed when a packet is actually dropped.
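The trade-off described above can be illustrated with a toy model. This is a minimal sketch, not the MPI/LAPI implementation: it collapses both variables into a single hypothetical timeout expressed in microseconds (timeout_us), and the function names and latency figures are invented for illustration only.

```python
def is_spurious_retransmit(ack_latency_us: int, timeout_us: int) -> bool:
    """True if data that arrived intact would be resent anyway.

    The packet was delivered, but its acknowledgment took longer than the
    sender was willing to wait, so the sender resends unnecessarily.
    """
    return ack_latency_us > timeout_us


def recovery_delay_us(timeout_us: int) -> int:
    """Time the sender waits before resending a packet that was really dropped.

    No acknowledgment will ever arrive, so the sender waits out the full
    timeout before it resends.
    """
    return timeout_us


if __name__ == "__main__":
    # Hypothetical figures: a congested adapter delays acknowledgments to 500 us.
    ack_latency_us = 500
    for timeout_us in (400, 800):
        print(
            f"timeout={timeout_us} us: "
            f"spurious resend={is_spurious_retransmit(ack_latency_us, timeout_us)}, "
            f"delay on real drop={recovery_delay_us(timeout_us)} us"
        )
```

In this model, a longer timeout removes the unnecessary resend but lengthens the wait after a genuine packet loss, which is the same tension the two environment variables control.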

2.1.3 MP_REXMIT_BUF_SIZE and MP_REXMIT_BUF_CNT

You can improve application performance by allowing a task that is sending a message shorter than the “eager” limit to return the send buffer to the application before the message has reached its destination, rather than forcing the sending task to wait until the data has actually reached the receiving task and the acknowledgment has been returned. To allow immediate return of the send buffer to the application, LAPI attempts to make a copy of the data in case it must be retransmitted later (unlikely but not impossible). LAPI copies the data into a retransmit buffer (REXMIT_BUF) if one is available. The MP_REXMIT_BUF_SIZE and MP_REXMIT_BUF_CNT environment variables control the size and number of the retransmit buffers allocated by each task.
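The buffering behavior described above can be sketched as a small fixed-size pool. This is a toy model under stated assumptions, not LAPI's actual code: the class and method names are hypothetical, the eager limit is not modeled, and the sketch assumes that a message must fit in a single buffer and that a sender with no buffer available has to wait for the acknowledgment.

```python
class RetransmitPool:
    """Toy model of a per-task pool of retransmit buffers (hypothetical)."""

    def __init__(self, buf_size: int, buf_cnt: int) -> None:
        self.buf_size = buf_size  # stands in for MP_REXMIT_BUF_SIZE
        self.free = buf_cnt       # stands in for MP_REXMIT_BUF_CNT
        self.in_flight = {}       # message id -> copy kept for a possible resend

    def send(self, msg_id: int, data: bytes) -> bool:
        """Return True if the send buffer can be handed back immediately."""
        # Assumption: the message must fit in one buffer to be copied.
        if self.free > 0 and len(data) <= self.buf_size:
            self.in_flight[msg_id] = bytes(data)
            self.free -= 1
            return True
        # No copy was made, so the caller must wait for the acknowledgment.
        return False

    def acknowledge(self, msg_id: int) -> None:
        """Release the buffer once the receiver has acknowledged the message."""
        if self.in_flight.pop(msg_id, None) is not None:
            self.free += 1


if __name__ == "__main__":
    pool = RetransmitPool(buf_size=8, buf_cnt=1)
    print(pool.send(1, b"abcd"))  # a buffer is free, so the send returns at once
    print(pool.send(2, b"efgh"))  # the pool is exhausted, so the sender must wait
    pool.acknowledge(1)           # the acknowledgment frees the buffer
    print(pool.send(2, b"efgh"))
```

The model suggests why both variables matter: the count bounds how many sends can be outstanding without blocking, and the size bounds which messages are eligible for the copy at all.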

2.1.4 MEMORY_AFFINITY

The POWER4™ and POWER4+™ models of the pSeries 690 have more than one multi-chip module (MCM). An MCM contains eight CPUs and frequently has two local memory cards. On these systems, application performance can improve when each CPU and the memory it accesses are on the same MCM.

Setting the AIX MEMORY_AFFINITY environment variable to MCM tells the operating system to attempt to allocate the memory from within the MCM containing the processor that made the request. If memory is available on the MCM containing the CPU, the request is usually granted. If memory is not available on the local MCM, but is available on a remote MCM, the memory is taken from the remote MCM. (Lack of local memory does not cause the job to fail.)
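As an illustration of setting the variable for a job, the following sketch builds a process environment with MEMORY_AFFINITY set to MCM. The helper name job_environment is hypothetical, and the command used to launch the job is deliberately left out because it depends on the installation.

```python
import os


def job_environment(base=None):
    """Return a copy of the environment with MEMORY_AFFINITY set to MCM.

    `base` defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base is None else base)
    env["MEMORY_AFFINITY"] = "MCM"
    return env


if __name__ == "__main__":
    env = job_environment()
    print(env["MEMORY_AFFINITY"])
    # The resulting mapping can be passed as the env argument of
    # subprocess.run() when starting the job with your site's launch command.
```

Because the setting is only a request to the operating system, a job started this way still runs when local memory is exhausted; the memory then comes from a remote MCM.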

pshpstuningguidewp040105.doc

Page 6
